DELEGATE-52
A benchmark that assesses LLMs' ability to carry out tasks and manipulate domain documents without introducing errors.
Simulates the long-horizon document-editing workflows that are typical for knowledge workers.
Contains: 310 work environments across 52 professional domains such as coding, crystallography, genealogy, and sheet music notation.
Each environment has a document 15k tokens in length and 5-10 complex editing tasks of the kind typically delegated to LLMs.
Similar benchmarks are domain-focused: