“`html
Text Degeneration: A Structural Failure Mode in Production
A self-reinforcing failure mode of autoregressive language models, with measurable consequences for inference cost and throughput.
In our recent work on a small language model specialized for domain-specific Optical Character Recognition (OCR) tasks — detailed in the DharmaOCR paper and available on HuggingFace (with demo space) — we observed that fewer than three percent of pages consumed nearly half of the total wall-clock time.
The requests responsible were those that hit the configured maximum-token limit and exhibited an n-gram repetition pattern at their tail. They did not produce a complete output; instead, they repeated a fragment and continued until the system’s hard limit cut them off.
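The two criteria above (hitting the token cap, and an n-gram repeating at the tail) can be sketched as a post-hoc check. This is a minimal illustration rather than the detector used in the paper: the function names, `max_n`, and `min_repeats` are hypothetical, and real thresholds would need tuning per tokenizer and task.

```python
from typing import Sequence


def tail_repeat_count(tokens: Sequence[int], n: int) -> int:
    """Count how many times the final n-gram repeats back-to-back at the end."""
    if n <= 0 or len(tokens) < n:
        return 0
    last = tuple(tokens[-n:])
    count = 0
    end = len(tokens)
    # Walk backwards in steps of n while each chunk equals the final n-gram.
    while end >= n and tuple(tokens[end - n:end]) == last:
        count += 1
        end -= n
    return count


def looks_degenerate(tokens: Sequence[int], max_tokens: int,
                     max_n: int = 8, min_repeats: int = 4) -> bool:
    """Flag outputs that reached the cap AND end in a repeated n-gram."""
    if len(tokens) < max_tokens:
        return False  # stopped before the cap, so it did not run into the hard limit
    return any(tail_repeat_count(tokens, n) >= min_repeats
               for n in range(1, max_n + 1))
```

Because the check compares the final n-gram against the chunks before it, it still fires when the cap cuts the loop off in the middle of a fragment.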

We re-ran the experiment on a second dataset, and a third. The same shape appeared, with varying intensity. A small minority of requests was responsible for a measurable share of the total wall-clock time of the batches they were in. This phenomenon is called Text Degeneration.
The Anomaly in the Inference Log
What we were measuring was not noise; text degeneration is a known phenomenon, described in the language-modeling literature since Holtzman and colleagues’ 2020 paper, and characterized in subsequent works as a self-reinforcing failure of autoregressive generation.
The shape was always the same: a small number of requests would enter a generation loop. The model would repeat a token, then a fragment, then the same fragment again, until the system’s max-tokens guard cut it off.

The reason is structural. A healthy request ends when the model emits an EOS token — the model’s signal that the output is complete. A degenerate request never reaches that signal; it loops, filling its allocated context with repeated tokens or sentences until the hard max-tokens limit forcibly terminates it.
The difference in output length between the two is not marginal. And because inference time scales directly with the number of tokens produced, a degenerate request occupies a disproportionate share of the available GPU for a disproportionately long time.
The instinct, watching one of these requests in real time, is to treat it as a tuning problem. Raise the repetition penalty. Lower the temperature. Switch the decoder. Add a streaming check that aborts a request once it begins to repeat. These instincts are reasonable and help mitigate some issues, but they do not address the cause.
The cause is older than any of these decoders; it is built into the optimization objective that produced the model in the first place.
Why Degeneration Is Structural, Not Configurable
A language model trained with maximum likelihood (which is to say, almost every model in production today) is trained on a single, narrow imperative: given everything that has come before, assign high probability to whatever came next. Minimize the negative log-likelihood of the reference sequence, token by token, across the entire corpus.
Because the model is autoregressive, it never sees the full sequence it will eventually produce; it only ever predicts one token at a time, conditioned on what precedes it. The objective does not care what the model generates as a whole. It cares only that, at each step, the model assigns high probability to the next token in the reference corpus.
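Written out, the objective described above takes the familiar form below. The notation is assumed here for illustration and is not taken from the paper: $x_1, \dots, x_T$ is a reference sequence, $x_{<t}$ is the prefix before position $t$, and $\theta$ are the model parameters.

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Every term in the sum conditions on the reference prefix, not on the model's own earlier outputs, which is why the objective says nothing directly about the quality of a whole generated sequence.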
This produces models that are extraordinarily good at continuation. It also produces a side effect that has been documented in the literature for years and remains structurally unresolved: the more often a token or a fragment appears in recent context, the more probable it becomes on the next step. Once the model enters such a region, the gradient of probability points back into it, not out of it. The end-of-sequence token, which would normally close the generation, sits at a vanishingly low probability relative to the repeated fragment. The loop sustains itself until something external — a max-tokens cap, a streaming abort, an exhausted KV cache — finally interrupts it.
This is what makes degeneration structural. The loop is not a defect of the decoding strategy; it is a high-probability region of the distribution itself, produced by the training objective, reinforced by repetitive patterns in the empirical training data, and embedded in the geometry of the model’s internal activations — a description supported in successive analyses since 2020 (Source: Holtzman et al, 2020).
Decoding strategies — temperature, top-p, repetition penalties, beam search variants — operate on top of that distribution. They can make the loop less likely to be entered; they cannot remove it. The same is true of specialized models and general-purpose models alike: each inherits this geometry from the optimization that produced it.
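A small numerical sketch shows what "less likely, not removed" means for one such strategy. It uses a commonly seen formulation of the repetition penalty (divide positive logits of already-seen tokens by the penalty, multiply negative ones), assumed here for illustration. The four-token vocabulary and the logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Push down the logits of tokens that were already generated."""
    out = list(logits)
    for i in set(seen_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


# Hypothetical vocabulary: index 0 is the fragment being repeated, index 3 is EOS.
logits = [6.0, 1.0, 0.5, -1.0]
before = softmax(logits)
after = softmax(apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.3))

print(f"P(repeat) before penalty: {before[0]:.3f}")
print(f"P(repeat) after penalty:  {after[0]:.3f}")
print(f"P(EOS) after penalty:     {after[3]:.3f}")
```

With these made-up numbers the penalty lowers the probability of the repeated token, yet that token remains the most likely continuation. The decoder reshapes the distribution it is handed; it does not change where the model put the mass.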
This is the part of the problem that has been discussed in research papers. What is much less discussed — and what we addressed directly in our recent work — is what happens to the rest of the system while one of these loops is running.
The Cost Multiplier Hiding in Plain Sight
To estimate the cost the loops had imposed, we replaced the degenerate requests in our experiment with synthetic requests of average duration. Total inference time fell from 7.3 minutes to 4.2 minutes. In other words, 42.47% of the batch’s wall-clock time was attributable to a small minority of degenerate requests; measured against the loop-free baseline, the batch ran roughly 74% longer than it needed to.
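The figures can be read two ways, and a few lines of arithmetic make the difference explicit. This is only a check on the rounded values quoted above; the variable names are illustrative.

```python
observed_min = 7.3  # total inference time with the degenerate requests (from the text)
baseline_min = 4.2  # total after substituting average-duration requests (from the text)

extra_min = observed_min - baseline_min

# Share of the observed run that disappears in the counterfactual.
share_of_observed = extra_min / observed_min

# Overhead measured against the loop-free baseline.
overhead_vs_baseline = extra_min / baseline_min

print(f"share of observed time attributable to loops: {share_of_observed:.2%}")
print(f"overhead relative to the loop-free baseline:  {overhead_vs_baseline:.2%}")
```

The 42.47% figure is the first quantity: the fraction of the 7.3 minutes that the loops account for. Relative to the 4.2-minute baseline, the same 3.1 minutes is a larger proportion.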
This is not a story about the failed request; the failed request’s runtime was, in some sense, secondary. What mattered was that during the time the loop was alive, every other request running on the same GPU paid for it.
We measured this directly. Across three datasets, the duration distribution of healthy requests shifted whenever a degenerate request was active in parallel. The mean duration of a healthy request rose by at least 15%, and in one dataset by more than 71%, when at least one degenerate sequence was occupying the same machine. The healthy requests had not become more difficult; the system serving them had become measurably slower.
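One way to set up this kind of comparison is to split healthy requests by whether they overlapped in time with any degenerate request, then compare mean durations. The sketch below assumes each request is logged as a `(start_s, end_s)` interval. The function names are hypothetical, and this is not necessarily the procedure used in the paper.

```python
from statistics import mean


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def healthy_slowdown(healthy, degenerate):
    """Relative increase in mean healthy duration when a degenerate request was active.

    Returns None if either group is empty, since no comparison is possible.
    """
    exposed, clean = [], []
    for h in healthy:
        bucket = exposed if any(overlaps(h, d) for d in degenerate) else clean
        bucket.append(h[1] - h[0])
    if not exposed or not clean:
        return None
    return mean(exposed) / mean(clean) - 1.0


# Made-up log: two healthy requests ran alone, two ran next to a long loop.
healthy = [(0, 10), (0, 10), (100, 112), (105, 117)]
degenerate = [(90, 200)]
print(f"slowdown: {healthy_slowdown(healthy, degenerate):.0%}")
```

A comparison like this only shows association. Attributing the shift to the loops also requires that the two groups of healthy requests are otherwise comparable, for example in input size.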

The mechanism is mundane and material. Modern inference servers (vLLM in our experiments) extract throughput by holding many requests in a dynamic batch and serving them in parallel through paged memory. The amount of memory occupied by a sequence grows roughly linearly with the number of tokens it has produced. When a sequence enters a degeneration loop and approaches the configured token cap, it occupies a disproportionate share of the available memory for a disproportionately long time. The scheduler has less room to admit new sequences to the batch. Parallelism falls; throughput across the batch falls with it (Source: Kwon et al, 2023).
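A toy model of paged KV-cache accounting makes the squeeze concrete. Everything numeric here is an assumption for illustration: the block size, the total block budget, and the token counts are made up. The model also ignores prompt tokens, preemption, and every other detail of a real scheduler.

```python
import math

BLOCK_SIZE = 16      # tokens per KV-cache block (assumed value)
TOTAL_BLOCKS = 1024  # hypothetical block budget for the KV cache


def blocks_needed(num_tokens: int) -> int:
    """Blocks held by one sequence; grows linearly with its token count."""
    return math.ceil(num_tokens / BLOCK_SIZE)


def max_concurrent(healthy_tokens: int, degenerate_tokens: list[int]) -> int:
    """How many healthy sequences fit alongside the given degenerate ones."""
    used = sum(blocks_needed(t) for t in degenerate_tokens)
    free = max(TOTAL_BLOCKS - used, 0)
    return free // blocks_needed(healthy_tokens)


# Healthy requests of about 512 tokens, with and without one loop near an 8192-token cap.
print("no loops:", max_concurrent(512, []))
print("one loop:", max_concurrent(512, [8192]))
```

In this toy setting a single looping sequence holds half of the block budget, so the number of healthy sequences that can run side by side is halved for as long as the loop stays alive.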
Which means the cost of a single degenerate request is not paid by the request; it is paid by the queue.
This raises a question about evaluation. If degeneration is a structural property of the training objective and if its production cost is large, sustained, and contagious — why does it not appear on the standard benchmarks used to compare these models?
It does not appear on any benchmark we are aware of, OCR-specific or otherwise. Degeneration rate is simply not among the metrics reported by the standard evaluation suites used to compare these models, and the omission is worth taking seriously.
The Benchmark Blind Spot
The explanation is likely straightforward: benchmark designers focus on measuring output quality, and standard evaluations tend to capture average response quality rather than pathological edge cases. Failure modes fall outside that frame. But in the case of text degeneration, that omission has real consequences: even when it occurs in fewer than 3% of requests, its impact on system throughput is disproportionate enough to matter.
A consequence visible in our results is that two models can produce nearly identical quality scores while differing substantially in degeneration rate, and therefore in production cost. In our Table 1, several pairs of fine-tuned models illustrate this. A model with a marginally higher quality score is not necessarily the better model to deploy; the benchmark cannot tell which is which. It was not designed to.

The argument we make explicitly in the DharmaOCR paper (Source: Cardoso et al, 2026) is that this is a methodological gap. Studies that propose models and benchmarks for autoregressive generation should track degeneration rate as a first-class metric, alongside accuracy and cost. The omission is structural; the consequences are operational and economic.
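Reporting such a metric does not require much machinery. The sketch below assumes each evaluated request has already been labeled as degenerate or not (for instance, because it hit the token cap with a repetitive tail) and has a recorded duration. The `RequestRecord` type and the report fields are hypothetical, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float
    degenerate: bool  # label produced upstream, e.g. cap hit plus repetitive tail


def degeneration_report(records):
    """Fraction of requests that degenerated, and their share of summed request time."""
    n = len(records)
    if n == 0:
        return {"rate": 0.0, "time_share": 0.0}
    bad = [r for r in records if r.degenerate]
    total = sum(r.duration_s for r in records)
    return {
        "rate": len(bad) / n,
        "time_share": sum(r.duration_s for r in bad) / total if total else 0.0,
    }


# Made-up evaluation run: nine healthy requests and one loop.
records = [RequestRecord(2.0, False) for _ in range(9)] + [RequestRecord(18.0, True)]
print(degeneration_report(records))
```

Note that `time_share` here is computed over summed per-request durations, which is not the same as batch wall-clock time when requests run in parallel. It is a rough proxy that can sit next to an accuracy column.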
A reasonable response to all of this is that the benchmark omission may not matter, because the failure mode itself is solvable at the inference layer: detect repetition early, abort the request, retry. The system can be made resilient even if the benchmark cannot be made complete.
It is a reasonable response; it is also, by the paper’s evidence, partial.
Why Mitigation Is Itself a Tax
The mitigations described in the literature operate at the inference layer. Real-time repetition detection screens for loops as they form. Retry mechanisms reissue affected requests, sometimes against a different model or decoding configuration. Both are real interventions, and both reduce the visible footprint of degeneration in production.
Both also have a cost: real-time detection runs on every output, not only the ones that fail; it is an online monitoring mechanism running alongside inference, which carries its own latency and compute overhead. Retries multiply the inference cost of the requests they handle. And degeneration does not always manifest as simple, easily recognizable repetition patterns: heuristics broad enough to catch pathological loops will also penalize legitimate outputs that contain similar structures.
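Both costs show up even in a very small streaming detector. The sketch below is a hypothetical heuristic, not one taken from the paper or from any serving framework: it aborts once the most recent tokens consist of a single n-gram repeated `min_repeats` times. The check runs on every token of every request, and the same rule that catches a loop also fires on a legitimate run of identical table cells.

```python
from collections import deque


class StreamingRepeatDetector:
    """Flags a stream once its last n * min_repeats tokens are one n-gram repeated."""

    def __init__(self, n: int = 2, min_repeats: int = 4):
        self.n = n
        self.window = deque(maxlen=n * min_repeats)

    def feed(self, token) -> bool:
        # Called once per generated token, for healthy and degenerate requests alike.
        self.window.append(token)
        if len(self.window) < self.window.maxlen:
            return False
        w = list(self.window)
        return all(w[i] == w[i % self.n] for i in range(len(w)))


def would_abort(tokens) -> bool:
    """Run a fresh detector over a token stream and report whether it fires."""
    detector = StreamingRepeatDetector()
    return any(detector.feed(t) for t in tokens)


loop = ["the", "same"] * 10             # pathological repetition
table_row = ["0", "|"] * 5              # legitimate OCR of a row of zero-valued cells
prose = "a b c d e f g h i j".split()   # ordinary non-repeating text

print(would_abort(loop), would_abort(table_row), would_abort(prose))
```

Raising `min_repeats` reduces false positives like the table row, but it also lets a real loop run longer before it is cut off, which is the trade-off the paragraph above describes.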
Key Takeaways
- Text degeneration is a structural failure mode in autoregressive language models with significant production costs.
- The cost of degenerate requests is not contained within the request but spreads to other tasks running on the same system.
- Benchmarks often miss this issue, leading to models that look good but have hidden performance impacts in real-world use cases.
“`
Originally published at huggingface.co. Curated by AI Maestro.




