The Value of Failure Taxonomy

Sequel to How Users Vote With Their Prompts, The Shape Predicts the Return, The Shape of a Good Answer, and Does Making AI Smarter Actually Make People Use It More? · N=1,000 synthetic interactions · 5 injected failure types · 3 repair cycles · Failure distribution from MAST (NeurIPS 2025)

Article by John Tribbia

My recent trip to the mechanic at the Toyota dealership was the last free visit under my ToyotaCare plan, two years after I bought a new car. I wanted to make sure everything that needed fixing got fixed. The mechanics ran diagnostics and the system flagged several faults: an aged air filter, low brake fluid, and something off with the steering. With these systems, when a warning light trips, the code tells the mechanic which system threw the fault, not just that something failed. A shop that responded to every warning light by replacing the air filter would fix almost nothing, spend a lot of time doing it, and never understand why the same light came back. The warning light is not the diagnosis. It is the prompt to begin one.

The previous four posts built the instruments. Model quality drives retention, and category-level variation is what makes the causal identification work. The trajectory of how a model arrives at its answer carries signal the final score misses. That trajectory shape predicts whether users come back. Users reveal quality gaps through their prompt behavior, without ever filing a bug report.

All four tell you that something is failing. None of them tell you where. A retention dip is an outcome, not a cause.

Failure Rates Are Not Edge Cases

MAST (NeurIPS 2025, UC Berkeley) analyzed 1,642 multi-agent execution traces across seven state-of-the-art open-source frameworks. Failure rates ranged from 41% to 86.7%. These are not prototype systems. They are the frameworks being deployed in production today.

A separate study of real production agent failures found that parsing failures alone account for roughly 38% of all observed incidents: malformed JSON, missing schema fields, instruction noncompliance. Not 38% of some narrow subcategory, but 38% of everything. That single failure type has a specific, tractable repair target: the output formatting instructions in the system prompt. You do not need a model retrain. You need a better instruction.

That is the structural case for a taxonomy. If 38% of your failures route to one specific fix, and you instead apply a single uniform fix to 100% of your failures, you burn resources on the 62% it cannot help while under-investing in the 38% where a targeted fix would actually move the needle. The taxonomy is the routing layer between the signal and the repair.

Mapping Failures to Layers

A loss taxonomy maps observed failure signals to the system component most likely responsible. It routes known failures to the correct repair layer, and surfaces failures that have no existing category. The second function is the one most systems skip.

Read it as a routing table, not a checklist.

Failure Taxonomy · Observable Signals · Repair Routing

Failure Pattern	Observable Signal	Likely Locus	Repair Target
Correct intent, wrong format	Retry with same content, reformatted	System prompt	Output format instructions
Hallucinated facts, right structure	User correction follow-up	Base model knowledge	RAG layer or targeted fine-tune
Refused valid request	Abrupt session end after refusal	System prompt	Instruction calibration
Right answer, wrong step order	Mid-task rephrase, step backtrack	Reasoning scaffold	Chain-of-thought structure
Works in eval, breaks in production	Eval-prod quality gap	Eval distribution	Eval set expansion
Fails under long context	Quality degrades mid-session	Context management	Chunking / summarization strategy
Unclassified	Does not match any row above	Unknown	Extend the taxonomy

The last row is the most important. When a failure pattern does not cross-reference cleanly, the temptation is to force it into the nearest existing category. That is the mistake. An unclassified failure is a signal that the taxonomy has a blind spot. Patching the symptom without extending the taxonomy guarantees the next variant of that failure also goes unclassified.
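The routing-table reading can be sketched in a few lines of Python. This is an illustration, not the code behind the experiment: the loci and repair targets are copied from the table, while the dictionary keys (`retry_reformatted`, `user_correction`, and so on) and the `Route` and `route_failure` names are hypothetical labels invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    locus: str
    repair_target: str


# Loci and repair targets come from the taxonomy table above.
# The signal keys are hypothetical labels chosen for this sketch.
TAXONOMY = {
    "retry_reformatted": Route("System prompt", "Output format instructions"),
    "user_correction": Route("Base model knowledge", "RAG layer or targeted fine-tune"),
    "session_end_after_refusal": Route("System prompt", "Instruction calibration"),
    "mid_task_rephrase": Route("Reasoning scaffold", "Chain-of-thought structure"),
    "eval_prod_gap": Route("Eval distribution", "Eval set expansion"),
    "quality_degrades_mid_session": Route("Context management", "Chunking / summarization strategy"),
}

# The last row of the table: no match means the taxonomy needs a new row.
UNCLASSIFIED = Route("Unknown", "Extend the taxonomy")


def route_failure(signal: str) -> Route:
    """Look up the repair route for an observed signal.

    The lookup is exact on purpose: there is no nearest-category
    fallback, so an unknown signal surfaces as UNCLASSIFIED instead of
    being forced into an existing row.
    """
    return TAXONOMY.get(signal, UNCLASSIFIED)
```

Calling `route_failure("retry_reformatted")` returns the system-prompt route, while an unseen label such as `"tool_call_loop"` returns `UNCLASSIFIED`. The design choice that matters is the absence of fuzzy matching: the lookup would rather admit a blind spot than guess.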

The residual is a metric. A growing residual means the taxonomy is lagging the system. A shrinking residual means the taxonomy is keeping pace. That number belongs on the same dashboard as your quality scores.
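One way to put the residual on a dashboard is to track the unclassified share per reporting window and compare the latest window to the first. The sketch below assumes each failure has already been labeled with a taxonomy row name. The function names, the `"Unclassified"` label, and the `tolerance` threshold are assumptions made for illustration.

```python
from typing import Sequence


def residual_share(labels: Sequence[str], unclassified: str = "Unclassified") -> float:
    """Fraction of observed failures that matched no taxonomy row."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == unclassified) / len(labels)


def residual_trend(shares: Sequence[float], tolerance: float = 0.01) -> str:
    """Compare the latest residual share to the first one.

    `tolerance` is an assumed noise band, not a value from the post.
    """
    if len(shares) < 2:
        return "insufficient data"
    delta = shares[-1] - shares[0]
    if delta > tolerance:
        return "lagging"        # growing residual: taxonomy is behind the system
    if delta < -tolerance:
        return "keeping pace"   # shrinking residual
    return "stable"
```

For example, `residual_trend([0.10, 0.12, 0.15])` reports `"lagging"`. A real dashboard would likely want a smoothed trend instead of a two-point comparison; this version only shows the shape of the metric.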

Taxonomy-Routed vs. Naive Repair

This synthetic experiment runs 1,000 simulated production interactions with five injected failure types plus an unclassified residual, proportions drawn from MAST's published distribution: specification misalignment (~25%), reasoning failure (~20%), context management (~18%), output format error (~15%), retrieval and knowledge gap (~12%), and residual unclassified (~10%). Two repair strategies are applied over three cycles.

Strategy A (Naive) treats all failures as equivalent. Repair is applied uniformly: the same prompt adjustment across every failure type, regardless of which layer the fault originates from.

Strategy B (Taxonomy-Routed) cross-references each failure against the taxonomy before any repair decision is made. Repair is routed to the specific locus identified for that failure type.

Quality is scored per cycle using the weighted-category approach from the model quality post.
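The setup above can be sketched as a small simulation. This is not the experiment's code, and it will not reproduce the reported numbers. It uses the stated proportions and recovery ranges, and adds several assumptions of its own: one unit of injected loss per interaction, recovery applied multiplicatively to the remaining loss each cycle, a uniform fix that only matches output format errors, and no repair for unclassified failures under routing. It also leaves out the conflict effect where a fix for one type degrades another, and it skips the weighted-category scoring.

```python
import random

# Proportions as listed in the text; the keys are labels chosen for this sketch.
FAILURE_MIX = {
    "specification_misalignment": 0.25,
    "reasoning_failure": 0.20,
    "context_management": 0.18,
    "output_format_error": 0.15,
    "retrieval_knowledge_gap": 0.12,
    "unclassified": 0.10,
}

MATCHED = (0.80, 0.90)     # per-cycle recovery range, repair hits the right locus
MISMATCHED = (0.10, 0.30)  # per-cycle recovery range, repair hits the wrong locus


def simulate(strategy: str, n: int = 1000, cycles: int = 3, seed: int = 0) -> list[float]:
    """Return the share of injected loss recovered after each cycle."""
    rng = random.Random(seed)
    types = rng.choices(list(FAILURE_MIX), weights=list(FAILURE_MIX.values()), k=n)
    remaining = [1.0] * n  # assumption: one unit of injected loss per interaction
    recovered = []
    for _ in range(cycles):
        for i, failure_type in enumerate(types):
            if strategy == "routed":
                # Assumption: unclassified failures have no known locus,
                # so routing applies no repair to them.
                if failure_type == "unclassified":
                    continue
                low, high = MATCHED
            else:
                # Assumption: the uniform fix is prompt tightening, which
                # only matches the locus of output format errors.
                matched = failure_type == "output_format_error"
                low, high = MATCHED if matched else MISMATCHED
            # Assumption: recovery applies to whatever loss is still remaining.
            remaining[i] *= 1.0 - rng.uniform(low, high)
        recovered.append(1.0 - sum(remaining) / n)
    return recovered
```

Running `simulate("routed")` and `simulate("naive")` gives two three-element recovery curves to compare. Under these assumptions the routed curve sits above the naive one in every cycle, which matches the direction of the result, though not its magnitude.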

[Figure: Quality Recovery by Repair Strategy. Synthetic experiment, N=1,000, 3 repair cycles. Y-axis: % of injected quality loss recovered across three repair cycles. Series: Taxonomy-Routed, Naive (Uniform). Reference line: Full Recovery (100%).]

Failure type distribution from MAST (NeurIPS 2025). Quality scoring uses the weighted-category method with frozen usage weights from the model quality post. Taxonomy-matched repair recovers 80–90% of injected loss per cycle. Mismatched repair recovers 10–30%.

After cycle 1, both strategies improve. Naive repair picks up the easy wins first: format errors happen to respond to general prompt tightening, so the early numbers look encouraging regardless of which approach you take.

After cycle 2, the divergence starts. The taxonomy-routed strategy continues recovering. The naive strategy plateaus. The failures that did not respond to the uniform fix are still present, and some that initially appeared to improve have re-emerged in slightly different form because the root layer was never addressed.

After cycle 3, Strategy B has recovered 85% of injected quality loss. Strategy A has recovered only 45% and is beginning to regress on failure types whose repair target conflicts with the uniform fix applied to other types.

Format instruction tightening that fixes output errors can over-constrain the system prompt and increase valid-request refusals. A repair that helps one failure type makes a different failure type worse. Routing repairs through the taxonomy avoids this kind of unintended side effect.

The crossover is not a fluke of the specific numbers chosen. It is a structural property of the problem. When you apply a fix to the wrong layer, you do not merely fail to fix the failure. In some cases you create a new one. The effect is measured here on synthetic data, but the mechanism is not synthetic.

Feedback Loop in Full

The complete cycle runs seven steps. The first six appear in most implementations. The seventh is where most implementations stop short.

Step 1: Baseline. Establish quality from evals before any user traffic enters the system. The repair cycle measures against it.

Step 2: Collect. Aggregate loss patterns from users. Explicit ratings are only part of the signal. Retries, rephrases, session-ending refusals, and low copy rates all carry information about where the system is underperforming.

Step 3: Cross-reference. Run losses against the taxonomy to identify the repair locus. Everything downstream depends on getting this right.

Step 4: Prioritize. Rank by volume times severity times tractability. Fix the highest-leverage failure first, not the most recently visible one.

Step 5: Repair. Target the specific layer: system prompt, RAG configuration, fine-tune, eval set, or context strategy. Not all of these at once.

Step 6: Test. Measure quality improvement against baseline. Confirm the repair moved the right metric and did not degrade an adjacent one.

Step 7: Update the taxonomy. Classify any failures that did not fit existing categories. Add new rows. Shrink the residual. This is the step most implementations skip.
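Steps 4 and 6 are the two most mechanical steps in the loop, and a minimal sketch makes them concrete. Everything here is illustrative: the `FailureClass` fields, the 0 to 1 scales for severity and tractability, the metric names, and the `tolerance` band are assumptions, not definitions from the series.

```python
from dataclasses import dataclass


@dataclass
class FailureClass:
    name: str
    volume: int          # failures observed in the collection window
    severity: float      # assumed 0-1 scale
    tractability: float  # assumed 0-1 scale; 1.0 means a cheap, well-understood fix

    @property
    def leverage(self) -> float:
        # Step 4: volume times severity times tractability.
        return self.volume * self.severity * self.tractability


def prioritize(classes):
    """Step 4: rank failure classes, highest leverage first."""
    return sorted(classes, key=lambda c: c.leverage, reverse=True)


def check_repair(baseline, before, after, target, tolerance=0.01):
    """Step 6: did the repair move the target metric without hurting others?

    Each argument is a dict of metric name -> quality score (higher is
    better). `tolerance` is an assumed noise band for adjacent metrics.
    """
    gap = baseline[target] - before[target]
    gap_closed = (after[target] - before[target]) / gap if gap > 0 else 0.0
    degraded = sorted(
        metric for metric in before
        if metric != target and after[metric] < before[metric] - tolerance
    )
    improved = after[target] > before[target]
    return {
        "target_improved": improved,
        "gap_closed": gap_closed,
        "degraded": degraded,
        "accept": improved and not degraded,
    }
```

With a hypothetical backlog of format (150 failures, severity 0.4, tractability 0.9), hallucination (120, 0.9, 0.3), and refusal (40, 0.8, 0.8), `prioritize` puts format first even though hallucination is more severe, because the format fix is far more tractable. `check_repair` rejects a format fix that lifts the format score while dropping the refusal score beyond the tolerance, which is the conflict case described earlier.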

A system that closes the loop at step 6 improves its current known failures. It stays blind to the next generation because the taxonomy still does not know they exist. Skip Step 7 and the loop can only get better at what it already knows. Without it, the residual grows quietly while everything else looks fine.
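Step 7 can be sketched as a promotion rule: when the same unclassified signal recurs often enough, it gets its own taxonomy row. The function names, the `min_count` threshold, and the `"TBD"` placeholders are assumptions made for this illustration. A recurring signal only earns a row here; someone still has to diagnose its locus and repair target.

```python
from collections import Counter


def promote_unclassified(taxonomy: dict, unclassified_signals: list, min_count: int = 5) -> dict:
    """Return a copy of the taxonomy with a placeholder row for each
    unclassified signal seen at least `min_count` times."""
    updated = dict(taxonomy)
    for signal, count in Counter(unclassified_signals).items():
        if count >= min_count and signal not in updated:
            # Locus and repair target are left open until diagnosed.
            updated[signal] = {"locus": "TBD", "repair_target": "TBD"}
    return updated


def residual(taxonomy: dict, signals: list) -> float:
    """Share of observed signals with no matching taxonomy row."""
    if not signals:
        return 0.0
    return sum(1 for s in signals if s not in taxonomy) / len(signals)
```

The threshold guards against adding a row for every one-off. Recurring patterns shrink the residual once promoted, and the isolated cases stay in it until they recur.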

Back to the Shop

A diagnostic shop fixes the right thing the first time, updates its fault code library when a new failure mode comes in, and gets faster over time. A shop without that library replaces air filters. Both look like they are working. The difference shows up on the third visit.

The earlier posts built the instruments. The taxonomy connects them to a repair. Without that connection, you have signal and no routing. The monitoring loop becomes a thrash loop.

The question left open is what the residual looks like in real user data: how fast new failure modes accumulate, whether a taxonomy can be built to keep pace, and whether the unclassified share stabilizes or grows. The WildChat dataset has the behavioral traces to start answering that, and a forthcoming analysis will take it up.

Methodological Note

N=1,000 synthetic interactions. Five injected failure types with proportions drawn from MAST published percentages (NeurIPS 2025, UC Berkeley).

Quality scoring uses the weighted-category method developed in the model quality post. Weights are frozen at pre-period values. The same GAMM specification is used for scoring consistency across the series.

Repair effectiveness: taxonomy-matched repair recovers 80–90% of injected loss per cycle. Mismatched repair recovers 10–30%. When repair targets conflict across failure types (e.g., format tightening applied to a refusal-heavy distribution), net quality may decrease.

Three repair cycles simulated. No partial matching: a repair either routes to the correct locus or it does not. Code: GitHub repository (link added at publication).

This Series

Post	Core Claim
Model Quality	Quality drives retention; category-level variation enables causal identification
ARC	Trajectory shape carries signal the final score misses
ARC-Retention	Trajectory shape predicts 7-day return
WildChat	Users reveal quality gaps through prompt behavior
This post	Taxonomy routes loss signals to the right repair layer
Next	Emergent failures: measuring taxonomy lag in real user data

Prior work. The within-version quality exposure framework, identification strategy, frozen-weights design, and GAMM specification are described in Does Making AI Smarter Actually Make People Use It More? The ARC trajectory evaluation framework is described in The Shape of a Good Answer. The ARC-retention extension is in The Shape Predicts the Return. Behavioral inference from prompt topic shifts is in How Users Vote With Their Prompts.

Data note. All data is synthetic. No real users, no proprietary models, no production systems. Failure type proportions are drawn from MAST (Cemri et al., NeurIPS 2025). The repair effectiveness parameters (80–90% recovery for taxonomy-matched repair, 10–30% for mismatched repair) are specified in the data-generating process and are not estimated from the data.

AI Usage

Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.