LLMs Corrupt Your Documents When You Delegate

A large-scale study on long-horizon document tasks.

My notes on LLMs Corrupt Your Documents When You Delegate by Philippe Laban, Tobias Schnabel and Jennifer Neville from Microsoft Research.

An interesting paper from researchers at Microsoft.

They found that when delegating document manipulation tasks to an LLM (things like splitting a CSV into separate files based on categories, or transposing the key of a music score), the information in the document degraded over time in almost all cases. Even with the latest frontier models, users would see an average of 25% document corruption by the end of a long workflow, while across all tested models the average was 50% (Laban et al., 2026).

Surprisingly, this isn't a gradual decay: models typically fail catastrophically after a certain number of tasks, with frontier models merely delaying the step at which the degradation occurs.

Interestingly, they also found that agentic tools don't prevent degradation and tend to introduce their own unique failure modes.

Three examples of document degradation over 20 interactions: a Linux Kernel Architecture graph diagram losing nodes and edges, a 12-Shaft Twill Diamond textile pattern becoming corrupted, and an ActionBoy Palm Tree 3D object losing geometry. Each shows progressive corruption from interaction 4 to 20.

Figure 1 from Laban et al. (2026) shows examples of document degradation across different domains

Measuring Document Corruption

To measure document corruption, they introduce a domain-specific document similarity measure that parses documents into components. For a recipe, that means ingredients (name, quantity, unit), steps, and tips; for Python code, it's functions, classes, and imports. This lets them compare two parsed instances of a document based on its true contents, including quantifiable values (like ingredient quantities) in the comparison.

Pipeline diagram showing how a raw recipe text file is parsed into structured ingredients, steps and tips, then scored for semantic equivalence against a reference using a weighted formula: 0.4 times ingredient score plus 0.4 times step score plus 0.2 times tip score.

Figure 5 from Laban et al. (2026) - the domain-specific parsing pipeline, with a concrete recipe example showing how ingredients, steps and tips are extracted and compared
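The weighted recipe scoring in Figure 5 can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the field names and the exact-match scoring rule are my own simplifications (the paper scores semantic equivalence, which would need fuzzier matching), but the 0.4/0.4/0.2 weighting is taken from the figure.

```python
# Toy sketch of domain-specific similarity for recipes, assuming
# documents have already been parsed into ingredients/steps/tips.
# Exact matching stands in for the paper's semantic-equivalence check.

def component_score(ref, hyp):
    """Fraction of reference components preserved exactly in the
    hypothesis. Returns 1.0 for an empty reference list."""
    if not ref:
        return 1.0
    matched = sum(1 for item in ref if item in hyp)
    return matched / len(ref)

def recipe_similarity(ref, hyp):
    """Weighted combination from Figure 5:
    0.4 * ingredients + 0.4 * steps + 0.2 * tips."""
    return (0.4 * component_score(ref["ingredients"], hyp["ingredients"])
            + 0.4 * component_score(ref["steps"], hyp["steps"])
            + 0.2 * component_score(ref["tips"], hyp["tips"]))

ref = {
    "ingredients": [("flour", 2, "cups"), ("sugar", 1, "cup")],
    "steps": ["mix dry ingredients", "bake 30 min"],
    "tips": ["use room-temperature butter"],
}
hyp = {
    "ingredients": [("flour", 2, "cups")],  # sugar silently dropped
    "steps": ["mix dry ingredients", "bake 30 min"],
    "tips": [],                             # tip lost entirely
}
print(round(recipe_similarity(ref, hyp), 2))  # 0.4*0.5 + 0.4*1.0 + 0.2*0.0 = 0.6
```

The key design point is that the score is computed over parsed structure, not raw text, so a dropped ingredient costs the same whether or not the surrounding prose still looks plausible.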

Then they used a backtranslation-inspired approach - translating and un-translating a document, a technique used in data augmentation and machine-translation evaluation - where they perform some task that transforms a document and then undo it. Imagine splitting a CSV into separate files by expense category, then merging them back together. Or converting all dollar amounts in an accounting ledger to euros, then converting back.

They use a round-trip relay simulation method in which they assume every task is reversible, defined by a forward instruction and its inverse.

Diagram of the backtranslation round-trip primitive: a seed document s is transformed by a forward edit into document t, then a backward edit reconstructs s-hat, which is compared to s with a similarity function.

Figure 2 from Laban et al. (2026) - the backtranslation round-trip primitive: apply a forward edit to get a transformed document, then apply the inverse to reconstruct the original
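The round-trip primitive can be made concrete with a toy example. In the paper the forward and backward edits are natural-language instructions carried out by an LLM; here they are exact deterministic functions, so the round trip is lossless by construction - which is exactly the point: any similarity loss in the real setup must come from the model, not from the transform itself. The CSV columns are my own invention for illustration.

```python
# Toy round-trip: split a ledger CSV by category (forward edit),
# merge the parts back together (backward edit), compare to the seed.
import csv
import io

SEED = """date,category,amount
2026-01-03,travel,120.00
2026-01-05,food,18.50
2026-01-09,travel,43.20
"""

HEADER = "date,category,amount\n"

def forward_split(doc: str) -> dict[str, str]:
    """Forward edit: produce one CSV per expense category."""
    out: dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(doc)):
        part = out.setdefault(row["category"], HEADER)
        out[row["category"]] = part + f'{row["date"]},{row["category"]},{row["amount"]}\n'
    return out

def backward_merge(parts: dict[str, str]) -> str:
    """Backward edit: merge per-category files, restoring date order."""
    rows = []
    for part in parts.values():
        rows.extend(csv.DictReader(io.StringIO(part)))
    rows.sort(key=lambda r: r["date"])
    return HEADER + "".join(
        f'{r["date"]},{r["category"]},{r["amount"]}\n' for r in rows
    )

reconstructed = backward_merge(forward_split(SEED))
print(reconstructed == SEED)  # True: a perfect round trip scores 1.0
```

In the relay setting, the paper chains many such forward/backward pairs across interactions, so even small per-step losses would compound - though, as the results show, the actual failure pattern is sudden rather than compounding.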

They also tested distractor documents and found that they harm documents more as the interaction length increases.

Degradation severity is exacerbated by document size, interaction length, and the presence of distractor files.

DELEGATE-52

The paper also contributes a benchmark called DELEGATE-52, consisting of long-horizon workflow tasks that require in-depth document editing across 52 professional domains (see Figure 3).

The benchmark consists of seed documents and other content transformed through a sequence of complex editing tasks, designed to resemble the kinds of tasks a worker might delegate to an LLM.
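Conceptually, each work environment pairs a seed document with a set of reversible tasks. A minimal sketch of that structure - with field names and example strings that are my own, not the paper's schema - might look like:

```python
# Hypothetical data model for a DELEGATE-52-style work environment:
# a seed document, distractor files, and reversible edit tasks.
from dataclasses import dataclass

@dataclass(frozen=True)
class EditTask:
    """One reversible task: a forward instruction and its inverse."""
    forward: str
    backward: str

@dataclass(frozen=True)
class WorkEnvironment:
    """Seed document plus distractors and the available edit tasks."""
    seed_document: str
    distractors: tuple[str, ...]
    tasks: tuple[EditTask, ...]

# Illustrative instance loosely based on the accounting example in
# Figure 4 (file names invented for the sketch).
ledger_env = WorkEnvironment(
    seed_document="ledger.csv",
    distractors=("chart_of_accounts.md", "reimbursement_policy.md"),
    tasks=(
        EditTask("split the ledger into one file per expense category",
                 "merge the per-category files back into one ledger"),
        EditTask("convert all dollar amounts to euros",
                 "convert all euro amounts back to dollars"),
    ),
)
print(len(ledger_env.tasks))  # 2
```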

Grid of 52 domain icons organised into five colour-coded categories: Code and Configuration (11 domains including Python, Docker, JSON), Science and Engineering (11 domains including Crystal, Molecule, Quantum), Creative and Media (11 domains including Music Sheet, Screenplay, LaTeX), Structured Records (11 domains including Accounting, Genealogy, Spreadsheet), and Everyday (8 domains including Recipe, Chess, Transit).

Figure 3 from Laban et al. (2026) - the 52 domains across five categories: Code & Configuration, Science & Engineering, Creative & Media, Structured Records, and Everyday

Here's an example of one of the work environments, with its tasks.

Work environment diagram for the accounting domain, showing the Hack Club ledger as the seed document with distractor files including a chart of accounts and expense reimbursement policy. Ten edit tasks branch out, including category split, person split, CSV conversion, euro conversion, and fund accounting, each with a forward and backward instruction.

Figure 4 from Laban et al. (2026) - a work environment from the accounting domain, using a Hack Club ledger as the seed document, with forward/backward edit pairs like splitting by expense category and merging back

Results

They tested 19 models on the benchmark. Every single model degraded documents over the course of the simulation. Even the top performers - Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 - still corrupted an average of 25% of document content after 20 interactions. Weaker models averaged 50% degradation.

Heatmap table of round-trip relay scores for 19 LLMs at workflow lengths 2 through 20. All models show declining scores from left to right, colour-coded from green (high preservation) through yellow to red (severe degradation). Gemini 3.1 Pro scores highest at 80.9 after 20 interactions; GPT 5 Nano scores lowest at 10.0.

Table 1 from Laban et al. (2026) - round-trip relay results for 19 LLMs across 20 interactions, colour-coded by degradation severity. Every model declines monotonically; frontier models delay but don't avoid the collapse.

Python was the only domain where models were genuinely ready for delegation, with 17 of 19 models achieving near-lossless manipulation. Outside of that, even the best model (Gemini 3.1 Pro) was considered "ready" (>=98% reconstruction score) in only 11 of 52 domains.

The failure mode is also worth noting: it's not a slow bleed. Models maintain near-perfect reconstruction for several rounds, then experience a sudden catastrophic failure, typically losing 10–30 points in a single round-trip. These sparse critical failures account for about 80% of the total observed degradation.
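The "sparse critical failures" pattern is easy to see on a score trajectory: a handful of large single-round drops explain most of the total decline. A toy illustration with made-up scores (not data from the paper):

```python
# Made-up score trajectory: near-perfect for several rounds, then one
# catastrophic round, then a slow bleed - the shape the paper reports.
scores = [100, 99, 99, 98, 72, 71, 70, 70, 69, 68]

# Per-round degradation (positive = points lost in that round-trip).
drops = [a - b for a, b in zip(scores, scores[1:])]

total = sum(drops)    # 32 points lost overall
biggest = max(drops)  # 26 points lost in the catastrophic round
print(biggest / total)  # 0.8125: one round accounts for ~80% of the loss
```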


One takeaway is that we need to be careful not to extrapolate model capabilities from one area to all domains. Models follow a Jagged Frontier of LLM Capability, excelling at some tasks while making serious errors in others - here, performing very well at Python coding and very poorly at 3D object manipulation.

It also raises interesting questions about whether we need to decouple the reasoning engine from the state-management system. The way LLMs write mini-scripts to perform analysis and manipulate documents is already an example of this - an LLM on its own is just not a good solution when you need to retain precise information in memory.

References

Philippe Laban, Tobias Schnabel, and Jennifer Neville. LLMs Corrupt Your Documents When You Delegate. April 2026. arXiv:2604.15597, doi:10.48550/arXiv.2604.15597.


Comments

Reply to this post on Bluesky or Mastodon to join the conversation.