Article by John Tribbia
Modern cars log hundreds of fault codes. When a warning light trips, the code tells the mechanic which system threw the fault, not just that something failed. A shop that responded to every warning light by replacing the air filter would fix almost nothing, spend a lot of time doing it, and never understand why the same light came back. The warning light is not the diagnosis. It is the prompt to begin one.
The previous four posts built the instruments. Model quality drives retention, and category-level variation is what makes the causal identification work. The trajectory of how a model arrives at its answer carries signal the final score misses. That trajectory shape predicts whether users come back. Users reveal quality gaps through their prompt behavior, without ever filing a bug report.
All four tell you that something is failing. None of them tell you where. A retention dip is an outcome, not a cause.
Failure Rates Are Not Edge Cases
MAST (NeurIPS 2025, UC Berkeley) analyzed 1,642 multi-agent execution traces across seven state-of-the-art open-source frameworks. Failure rates ranged from 41% to 86.7%. These are not prototype systems. They are the frameworks being deployed in production today.
A separate study of real production agent failures found that parsing failures alone account for roughly 38% of all observed incidents: malformed JSON, missing schema fields, instruction noncompliance. Not 38% of some narrow subcategory. Thirty-eight percent of everything. That single failure type has a specific, tractable repair target: the output formatting instructions in the system prompt. You do not need a model retrain. You need a better instruction.
That is the structural case for a taxonomy. If 38% of your failures route to the same fix, and you are applying the same fix to 100% of your failures, you are burning resources on the 62% while under-investing in the 38% that would actually move the needle. The taxonomy is the routing layer between the signal and the repair.
Mapping Failures to Layers
A loss taxonomy maps observed failure signals to the system component most likely responsible. It has two functions: it routes known failures to the correct repair layer, and it surfaces failures that have no existing category. That second function is the one most systems skip.
Read it as a routing table, not a checklist.
Failure Taxonomy · Observable Signals · Repair Routing
| Failure Pattern | Observable Signal | Likely Locus | Repair Target |
|---|---|---|---|
| Correct intent, wrong format | Retry with same content, reformatted | System prompt | Output format instructions |
| Hallucinated facts, right structure | User correction follow-up | Base model knowledge | RAG layer or targeted fine-tune |
| Refused valid request | Abrupt session end after refusal | System prompt | Instruction calibration |
| Right answer, wrong step order | Mid-task rephrase, step backtrack | Reasoning scaffold | Chain-of-thought structure |
| Works in eval, breaks in production | Eval-prod quality gap | Eval distribution | Eval set expansion |
| Fails under long context | Quality degrades mid-session | Context management | Chunking / summarization strategy |
| Unclassified | Does not match any row above | Unknown | Extend the taxonomy |
The last row is the most important. When a failure pattern does not cross-reference cleanly, the temptation is to force it into the nearest existing category. That is the mistake. An unclassified failure is a signal that the taxonomy has a blind spot. Patching the symptom without extending the taxonomy guarantees the next variant of that failure also goes unclassified.
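Read as code, the table is a lookup with an explicit fallback. The sketch below is one minimal way to express it, not the implementation behind this post. The loci and repair targets are copied from the table; the snake_case pattern keys and the function names `route` and `route_all` are hypothetical, and the sketch assumes some upstream step has already labeled each failure with a pattern.

```python
from collections import Counter

# Failure pattern -> (likely locus, repair target), copied from the table.
# The snake_case keys are hypothetical labels chosen for this sketch.
TAXONOMY = {
    "correct_intent_wrong_format": ("System prompt", "Output format instructions"),
    "hallucinated_facts_right_structure": ("Base model knowledge", "RAG layer or targeted fine-tune"),
    "refused_valid_request": ("System prompt", "Instruction calibration"),
    "right_answer_wrong_step_order": ("Reasoning scaffold", "Chain-of-thought structure"),
    "works_in_eval_breaks_in_production": ("Eval distribution", "Eval set expansion"),
    "fails_under_long_context": ("Context management", "Chunking / summarization strategy"),
}

# The last row of the table: anything that matches no known pattern.
UNCLASSIFIED = ("Unknown", "Extend the taxonomy")


def route(pattern):
    """Return (locus, repair target); never force a match to the nearest row."""
    return TAXONOMY.get(pattern, UNCLASSIFIED)


def route_all(patterns):
    """Count how many observed failures land on each repair target."""
    return Counter(route(p)[1] for p in patterns)


if __name__ == "__main__":
    observed = [
        "correct_intent_wrong_format",
        "correct_intent_wrong_format",
        "refused_valid_request",
        "tool_call_timeout",  # hypothetical pattern with no row in the table
    ]
    for target, count in route_all(observed).most_common():
        print(f"{count:>3}  {target}")
```

The design choice that matters is the default: an unknown pattern returns the "Extend the taxonomy" row instead of falling through to whichever known row looks closest.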
The residual is a metric. A growing residual means the taxonomy is lagging the system. A shrinking residual means the taxonomy is keeping pace. That number belongs on the same dashboard as your quality scores.
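As a rough illustration, the residual can be computed as the share of observed failures whose label matches no taxonomy row, tracked per reporting period. The function names, the labels, and the weekly data below are all hypothetical, and comparing the last period to the first is only one simple way to define a trend.

```python
def residual_share(labels, known_patterns):
    """Fraction of observed failures whose label is not a taxonomy row."""
    labels = list(labels)
    if not labels:
        return 0.0
    unknown = sum(1 for label in labels if label not in known_patterns)
    return unknown / len(labels)


def residual_trend(shares):
    """Compare the last period to the first: 'growing', 'shrinking', or 'flat'."""
    if len(shares) < 2 or shares[-1] == shares[0]:
        return "flat"
    return "growing" if shares[-1] > shares[0] else "shrinking"


if __name__ == "__main__":
    known = {"format", "refusal", "long_context"}  # hypothetical taxonomy rows
    weekly_labels = [  # hypothetical failure labels, one list per week
        ["format", "format", "refusal", "long_context"],
        ["format", "tool_timeout", "refusal", "long_context"],
        ["tool_timeout", "tool_timeout", "refusal", "format"],
    ]
    shares = [residual_share(week, known) for week in weekly_labels]
    print(shares, residual_trend(shares))
```

A production version would probably smooth over more periods before calling a trend, since a single noisy week can flip the comparison.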
Taxonomy-Routed vs. Naive Repair
The synthetic experiment runs 1,000 simulated production interactions with five injected failure types plus an unclassified residual, with proportions drawn from MAST's published distribution: specification misalignment (~25%), reasoning failure (~20%), context management (~18%), output format error (~15%), retrieval and knowledge gap (~12%), and residual unclassified (~10%). Two repair strategies are applied over three cycles.
Strategy A (Naive) treats all failures as equivalent. Repair is applied uniformly: the same prompt adjustment across every failure type, regardless of which layer the fault originates from.
Strategy B (Taxonomy-Routed) cross-references each failure against the taxonomy before any repair decision is made. Repair is routed to the specific locus identified for that failure type.
Quality is scored per cycle using the weighted-category approach from the model quality post.
Failure type distribution from MAST (NeurIPS 2025). Quality scoring uses the weighted-category method with frozen usage weights from the model quality post. Taxonomy-matched repair recovers 80–90% of injected loss per cycle. Mismatched repair recovers 10–30%.
After cycle 1, both strategies improve. Naive repair picks up the easy wins first: format errors happen to respond to general prompt tightening, so the early numbers look encouraging regardless of which approach you take.
After cycle 2, the divergence starts. The taxonomy-routed strategy continues recovering. The naive strategy plateaus. The failures that did not respond to the uniform fix are still present, and some that initially appeared to improve have re-emerged in slightly different form because the root layer was never addressed.
After cycle 3, Strategy B has recovered 85% of injected quality loss. Strategy A has recovered 45% and is beginning to regress on failure types whose repair target conflicts with the uniform fix applied to other types.
The divergence is not a fluke of the specific numbers chosen. It is a structural property of the problem. When you apply a fix to the wrong layer, you do not merely fail to fix the failure. In some cases you create a new one. That effect is real in the synthetic data, and the mechanism is not synthetic at all.
The Loop in Full
The complete cycle runs seven steps. The first six appear in most implementations. The seventh is where most implementations stop short.
Step 1: Baseline. Establish quality from evals before any user traffic enters the system. Every later repair cycle is measured against this baseline.
Step 2: Collect. Aggregate loss patterns from users. Explicit ratings are only part of the signal. Retries, rephrases, session-ending refusals, and low copy rates all carry information about where the system is underperforming.
Step 3: Cross-reference. Run losses against the taxonomy to identify the repair locus. Everything downstream depends on getting this right.
Step 4: Prioritize. Rank by volume times severity times tractability. Fix the highest-leverage failure first, not the most recently visible one.
Step 5: Repair. Target the specific layer: system prompt, RAG configuration, fine-tune, eval set, or context strategy. Not all of these at once.
Step 6: Test. Measure quality improvement against baseline. Confirm the repair moved the right metric and did not degrade an adjacent one.
Step 7: Update the taxonomy. Classify any failures that did not fit existing categories. Add new rows. Shrink the residual. This is the step most implementations skip.
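The ranking rule in Step 4 is a plain product, so it is easy to sketch. The class name, the 0-to-1 scales for severity and tractability, and every number below are hypothetical; the post does not specify how those factors are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureClass:
    name: str
    volume: int          # failures observed in the window
    severity: float      # assumed scale: 0 (cosmetic) to 1 (session-ending)
    tractability: float  # assumed scale: 0 (needs retrain) to 1 (prompt edit)

    @property
    def priority(self) -> float:
        # Step 4: volume times severity times tractability.
        return self.volume * self.severity * self.tractability


def rank(classes):
    """Highest-leverage failure class first."""
    return sorted(classes, key=lambda c: c.priority, reverse=True)


if __name__ == "__main__":
    backlog = [  # hypothetical numbers for illustration only
        FailureClass("hallucinated_facts", volume=120, severity=0.9, tractability=0.3),
        FailureClass("wrong_format", volume=150, severity=0.4, tractability=0.9),
        FailureClass("refused_valid_request", volume=60, severity=0.7, tractability=0.8),
    ]
    for item in rank(backlog):
        print(f"{item.priority:6.1f}  {item.name}")
```

With these made-up inputs the most severe class ranks last, because its low tractability drags the product down. That is the intended behavior of a multiplicative score: a failure that is costly but hard to repair does not crowd out cheaper wins.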
A system that closes the loop at Step 6 improves on its current known failures but stays blind to the next generation, because the taxonomy still does not know they exist. Skip Step 7 and the loop can only get better at what it already knows, while the residual grows quietly and everything else looks fine.
Back to the Shop
A diagnostic shop fixes the right thing the first time, updates its fault code library when a new failure mode comes in, and gets faster over time. A shop without that library replaces air filters. Both look like they are working. The difference shows up on the third visit.
The earlier posts built the instruments. The taxonomy connects them to a repair. Without that connection, you have signal and no routing. The monitoring loop becomes a thrash loop.
The question left open is what the residual looks like in real user data: how fast new failure modes accumulate, whether a taxonomy can be built to keep pace, and whether the unclassified share stabilizes or grows. The WildChat dataset has the behavioral traces to start answering that.
Methodological Note
N=1,000 synthetic interactions. Five injected failure types plus an unclassified residual (~10%), with proportions drawn from MAST published percentages (NeurIPS 2025, UC Berkeley).
Quality scoring uses the weighted-category method developed in the model quality post. Weights are frozen at pre-period values. The same GAMM specification is used for scoring consistency across the series.
Repair effectiveness: taxonomy-matched repair recovers 80–90% of injected loss per cycle. Mismatched repair recovers 10–30%. When repair targets conflict across failure types (e.g., format tightening applied to a refusal-heavy distribution), net quality may decrease.
Three repair cycles simulated. No partial matching: a repair either routes to the correct locus or it does not. Code: GitHub repository (link added at publication).
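The parameters above are enough for a rough re-creation of the routing logic. The sketch below is not the code behind this post and is not expected to reproduce its reported recovery figures. It rests on several assumptions the note does not state: recovery rates apply multiplicatively to whatever loss remains in each failure type, the unclassified residual gets no repair under routing, the naive uniform fix happens to match output format errors only, and the regression caused by conflicting repair targets is left out. The function name `simulate` and its arguments are hypothetical.

```python
import numpy as np

# Failure shares as stated in the post (fractions of all injected loss).
SHARES = {
    "specification_misalignment": 0.25,
    "reasoning_failure": 0.20,
    "context_management": 0.18,
    "output_format_error": 0.15,
    "retrieval_knowledge_gap": 0.12,
    "unclassified": 0.10,
}

MATCHED = (0.80, 0.90)     # per-cycle recovery when repair hits the right locus
MISMATCHED = (0.10, 0.30)  # per-cycle recovery when it does not


def simulate(strategy, cycles=3, seed=0, naive_target="output_format_error"):
    """Return cumulative share of injected loss recovered after each cycle."""
    if strategy not in ("routed", "naive"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    rng = np.random.default_rng(seed)
    remaining = dict(SHARES)
    history = []
    for _ in range(cycles):
        for ftype in remaining:
            if strategy == "routed":
                # Assumption: no repair target exists for unclassified failures.
                bounds = None if ftype == "unclassified" else MATCHED
            else:
                # Assumption: the uniform fix matches exactly one failure type.
                bounds = MATCHED if ftype == naive_target else MISMATCHED
            if bounds is not None:
                remaining[ftype] *= 1.0 - rng.uniform(*bounds)
        history.append(1.0 - sum(remaining.values()))
    return history


if __name__ == "__main__":
    for name in ("naive", "routed"):
        print(name, [round(x, 3) for x in simulate(name)])
```

Under these assumptions the routed strategy is capped at the classified share of failures, which is one way to see why the residual deserves its own line on the dashboard.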
This Series
| Post | Core Claim |
|---|---|
| Model Quality | Quality drives retention; category-level variation enables causal identification |
| ARC | Trajectory shape carries signal the final score misses |
| ARC-Retention | Trajectory shape predicts 7-day return |
| WildChat | Users reveal quality gaps through prompt behavior |
| This post | Taxonomy routes loss signals to the right repair layer |
| Next | Emergent failures: measuring taxonomy lag in real user data |
Prior work. The within-version quality exposure framework, identification strategy, frozen-weights design, and GAMM specification are described in Does Making AI Smarter Actually Make People Use It More? The ARC trajectory evaluation framework is described in The Shape of a Good Answer. The ARC-retention extension is in The Shape Predicts the Return. Behavioral inference from prompt topic shifts is in How Users Vote With Their Prompts.
Data note. All data is synthetic. No real users, no proprietary models, no production systems. Failure type proportions are drawn from MAST (Cemri et al., NeurIPS 2025). The repair effectiveness parameters (80–90% recovery for taxonomy-matched repair, 10–30% for mismatched repair) are specified in the data-generating process and are not estimated from the data.
Software. Python 3.11; numpy, pandas, matplotlib. Code available at the GitHub link above at publication.
Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.