DELEGATE-52

A benchmark that assesses LLMs' ability to carry out tasks and manipulate domain documents without introducing errors.

Simulates long-horizon document-editing workflows that are typical for knowledge workers.

Contains: 310 work environments across 52 professional domains, such as coding, crystallography, genealogy, and sheet music notation.

Each environment pairs a document 15k tokens in length with 5-10 complex editing tasks of the kind typically delegated to LLMs.
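To make the shape of a work environment concrete, here is a minimal Python sketch. It is not the benchmark's actual schema or harness: the names `WorkEnvironment`, `run_workflow`, and `toy_edit` are hypothetical, and it assumes the editing tasks are applied one after another, with each step receiving the previous step's output.

```python
from dataclasses import dataclass, field


@dataclass
class WorkEnvironment:
    """Hypothetical container for one environment (not the official schema)."""
    domain: str                                     # e.g. "genealogy"
    document: str                                   # the seed document to be edited
    tasks: list[str] = field(default_factory=list)  # editing instructions, in order


def run_workflow(env, edit_fn):
    """Apply each task in sequence, feeding every output into the next step.

    `edit_fn(document, task) -> document` stands in for an LLM call.
    Returns the document state after every step, starting with the seed.
    """
    history = [env.document]
    for task in env.tasks:
        history.append(edit_fn(history[-1], task))
    return history


def toy_edit(document, task):
    """Toy stand-in for a model: appends a marker instead of really editing."""
    return document + f"\n[applied: {task}]"


env = WorkEnvironment(
    domain="genealogy",
    document="Family tree v1",
    tasks=["rename a person", "add a birth date"],
)
history = run_workflow(env, toy_edit)
```

Keeping every intermediate document in `history` is a design choice for this sketch: it lets an evaluator compare states across steps and see where an error was first introduced, rather than only inspecting the final output.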

Similar benchmarks are domain-focused: