Article by John Tribbia
I have run Bear Peak in Boulder so many times, I have lost count. The trail from Cragmoor Trailhead gains 2,800 feet in under three miles. I know exactly what a good run on that climb feels like from the inside (spoiler: it hurts regardless). You resist the urge to push hard off the gun. You find a rhythm in the first quarter-mile up the Cragmoor stairs with a shorter stride, consistent breathing, and reading the grade. You build through the middle section where the trail pitches and the less-experienced runners are already blown up. And at the top, when it genuinely hurts and every instinct says to ease off, you hold. Form intact. Effort peaking exactly when it needs to.
That heart rate curve is what I’m talking about. Not the finish time. Not the average. The shape of that curve, rising and peaking and holding, is the thing that tells you whether you managed your effort well or just survived it. A runner who goes out too hard shows a different heart rate profile entirely: a spike in the first ten minutes, then a slow collapse. Maybe even the same finish time, but a completely different story.
A mountain climb is unforgiving. There is no flat section where you recover from a bad start. No downhill to coast on. The terrain compresses the whole problem of pacing into one continuous test. You cannot fake fitness on a climb. The Strava upload will show you the shape of your effort, not just where you finished.
Plot the two runs as heart rate over time and the difference is obvious: the well-paced run peaks at ~163 bpm near the summit and holds; the blowup run spikes to 173 bpm in the first 12 minutes and slowly collapses. Same finish time, same average heart rate, opposite story.
Watch a language model work through a difficult multi-step task and you notice that this shape is exactly what is missing from how we evaluate it. The quality of its outputs often does not follow that curve. It might go out too fast and commit confidently to a flawed assumption in the first two steps, then spend six more building coherent-sounding reasoning on a wrong foundation. Or it might hold form through the first two-thirds and then fall apart when the task pitches steep and the context window (the model’s working memory) is saturated. Or it might stumble and recover in ways that look more like luck than controlled effort.
The difference between the runner’s rising arc and the model’s erratic path is not just a performance gap. It is a measurement gap. We are not yet good at describing the shape of how agents succeed or fail across the course of a task. We ask whether they got the answer right at the end. That is like judging a mountain race by finish time alone.
Where This Fits in the Literature
The idea that step-level evaluation beats outcome-only evaluation is not new. Lightman et al. (2023) made this case rigorously in “Let’s Verify Step by Step,” showing that models trained to score each individual reasoning step (process reward models) substantially outperform models that score only the final answer (outcome reward models) on hard math problems.1 Cobbe et al. (2021) laid some of the groundwork with outcome supervision in “Training Verifiers to Solve Math Word Problems."2 If you are doing serious RLHF (reinforcement learning from human feedback, the technique used to train language models on human preferences) or working on reasoning chain quality, you are probably already thinking in terms of step-level feedback.
ARC is not trying to reinvent that. The distinction worth drawing is that Lightman et al. are solving a training problem: how do you generate the right signal to improve a model during training? ARC is solving a diagnostic problem: once you have a deployed agent producing multi-step outputs, how do you figure out what is actually wrong with it, where in the task it breaks down, and which intervention is likely to fix it? The frameworks address different questions. One shapes the model. The other tells you what shape the model is in and what shape the next training run should target.
The practitioner gap here is real. Most eval tooling in production follows an outcome-only logic even when teams know better. You set up a benchmark, you check final answers, you track a score. The score goes up over time and you ship. What gets missed is how the score goes up: whether the model is genuinely getting better at reasoning through hard tasks, or just getting better at pattern-matching to common task structures.
A recent example from Anthropic makes this concrete. Bloom is a behavioral elicitation framework that takes a single behavior (sycophancy, self-preservation, self-preferential bias) and automatically generates many scenarios to measure how often it occurs.3 The output is an elicitation rate: does this behavior appear, and how often? That is still an outcome-level measurement. ARC asks the next question downstream: when in the task trajectory does it appear, and what shape does the failure take? Bloom might tell you that a model is sycophantic 34% of the time. ARC would tell you whether that sycophancy shows up in the orientation steps (Early Collapse, where the model builds on a flawed assumption from the start) or whether it holds and then drifts sycophantic as the context grows heavy (Late Drift). Those are different problems with different fixes, and both measurements are needed.
Bloom integrates with Weights and Biases for experiments at scale, which means the logging infrastructure for step-level signals already exists in most teams running it. ARC-style diagnostics could be layered directly on top of what Bloom already produces. ARC is an attempt to make that diagnostic layer cheap enough that teams actually run it.
ARC
An arc is a shape. Rising, peaking, holding. It is also the name of the framework I’m introducing here for evaluating agents on complex, multi-step tasks: Assess, Repair, Calibrate.
What Goes Wrong
Most agent failures are failures of arc, not ability. The capability is there, but it doesn’t deploy in the right shape. When you plot step-by-step quality scores instead of just checking the final answer, a small set of failure patterns emerges, all producing similar aggregate scores while calling for completely different interventions.
The four patterns below are a logical partition of what can happen to a sequence of step scores across a task: it starts high and collapses early, it holds and then drops late, it declines steadily, or it dips and recovers. They are derived from the geometry of trajectories, not from a large empirical study. Whether these four are the right and complete taxonomy for your domain is precisely what the Calibrate phase is designed to tell you, by comparing eval-side pattern distributions against production transcripts.
To make these patterns concrete, the examples below use synthetic scores constructed to illustrate each shape clearly. They are thought experiments, not measured model outputs. The point is the diagnostic logic, not the numbers:
| Model | Aggregate | Step Scores |
|---|---|---|
| A | 0.74 | 0.90, 0.91, 0.88, 0.60, 0.55, 0.58, 0.61, 0.62, 0.60, 0.58 |
| B | 0.74 | 0.60, 0.58, 0.62, 0.78, 0.82, 0.85, 0.88, 0.87, 0.55, 0.45 |
| C | 0.74 | 0.85, 0.82, 0.78, 0.74, 0.71, 0.68, 0.65, 0.62, 0.58, 0.57 |
| D | 0.74 | 0.88, 0.90, 0.85, 0.40, 0.38, 0.72, 0.85, 0.87, 0.86, 0.89 |
The aggregate score is 0.74 for every one of the four models. The right intervention for each is completely different. Treating them as equivalent wastes a sprint.
Model A collapses early and grinds through the task on a broken foundation. Model B builds well and then falls apart at the end. Model C is slowly bleeding quality across every step. Model D hits a wall, catches itself, and recovers. A grounding prompt helps Model A. Context management helps Model B. Model D might not need intervention at all.
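To see the shapes rather than read the numbers, a few lines of matplotlib (my own sketch, not part of ARC) redraw what the table flattens. The pattern labels follow the Model A–D descriptions above, and the dashed line is the shared 0.74 aggregate:
import matplotlib.pyplot as plt

trajectories = {
    "A (early collapse)":     [0.90, 0.91, 0.88, 0.60, 0.55, 0.58, 0.61, 0.62, 0.60, 0.58],
    "B (late drift)":         [0.60, 0.58, 0.62, 0.78, 0.82, 0.85, 0.88, 0.87, 0.55, 0.45],
    "C (steady degradation)": [0.85, 0.82, 0.78, 0.74, 0.71, 0.68, 0.65, 0.62, 0.58, 0.57],
    "D (recovery)":           [0.88, 0.90, 0.85, 0.40, 0.38, 0.72, 0.85, 0.87, 0.86, 0.89],
}

for label, scores in trajectories.items():
    plt.plot(range(1, len(scores) + 1), scores, marker="o", label=label)
plt.axhline(0.74, color="gray", linestyle="--", label="aggregate = 0.74")  # identical for all four
plt.xlabel("Step")
plt.ylabel("Step score")
plt.legend()
plt.show()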
Early Collapse (Model A)
The model makes a flawed assumption in the first few steps, then produces locally coherent outputs that all build on a wrong foundation. This is the most dangerous pattern because the model is not confused. It is confidently wrong. This is the runner who goes out too hard and does not know it yet.
Intervention: Improve how the model grounds its initial assumptions before committing to an approach.
Late Drift (Model B)
Performance holds through the first two-thirds of the task and then falls sharply. Usually context saturation: language models have a fixed context window, a limit on how much text they can hold in active memory at once. As the task grows more complex, that window fills and the model loses track of constraints established early on. The runner with strong legs and no fuel left for the ridge.
Intervention: Context management, summarization at intermediate steps, or chunking strategies.
Steady Degradation (Model C)
Performance declines incrementally across every step. Nothing breaks catastrophically, which is precisely why aggregate scores miss it. Exactly the kind of thing that Strava's average pace hides and the split-by-split view reveals.
Intervention: Requires failure mode classification first. Root cause determines which intervention applies.
Recovery (Model D)
A dip followed by self-correction. Recovery capability is undervalued. A model that catches and corrects its own mistakes under real conditions may be more deployment-ready than one with a cleaner trajectory on easier evaluations. The runner who catches a root, stumbles, and drives back into rhythm.
Intervention: Measure recovery rate explicitly as a first-class metric.
Score each step against its sub-goal with a zero-to-one judgment, plot the scores, look at the shape. The rubric, who applies it, and how to verify the score is real rather than a format artifact are the subject of the Assess section below.
The Measurement Side
Knowing that an arc broke is only useful if you know the break was real, know what caused it, and know that the thing you're measuring actually matters in the real world. ARC addresses all three.
Assess: Did Something Real Happen?
The goal of Assess is to produce a trustworthy trajectory: a vector of per-step scores that reflects genuine model behavior, not format recognition. It has two components: building the trajectory and verifying it is real.
Building the Trajectory
Select a task suite of 30–50 multi-step tasks representative of your deployment domain. Each task should be decomposable into 4–10 discrete sub-goals with verifiable outputs.
Here is what the scoring protocol looks like for a single task. Say you are evaluating a customer-service agent that handles a four-step support workflow: understand the user's issue, look up their account, propose a resolution, and draft the reply. Before you run the agent, you write a one-sentence success definition for each step, for example: for step 1, "correctly identifies the issue type and accounts for any stated constraints," for step 2, "retrieves the correct account record and surfaces the relevant fields," and so on. You run the agent and capture its output at each step. You then pass each output to a separate AI judge, along with only that step's success definition, and ask it to score from 0 to 1. The judge never sees the other steps or the final answer. Four steps, four scores, one trajectory. That is the protocol. Everything that follows (the rubric, the calibration check, the surface-feature verification) is about making sure those four numbers are trustworthy.
1. Score each step independently. Use a No Look Eval (NLE) judge model with a rubric grounded in the sub-goal, not the final answer. The "No Look" means the judge scores each step live, without seeing how the task ends, so it cannot be influenced by knowing whether the final answer was right or wrong. Prompt it to return a score between 0.0 and 1.0 with a one-sentence rationale. Never pass the full conversation history to the NLE. Score each step in isolation to avoid halo effects (the tendency to rate later steps more generously because earlier steps went well, even if those later steps have independent flaws). Human annotation is not required for routine runs; it is reserved for calibration (see the calibration check below).
# NLE prompt template
nle_prompt = """Given this sub-goal: {subgoal}
And this model output: {step_output}
Score the output 0.0-1.0 for how well it achieves the sub-goal.
Return JSON: {"score": float, "rationale": str}"""
The rubric anchors the score to the sub-goal, not to style or length. A four-band scale covers most task types:
| Score | Criterion | Typical signals |
|---|---|---|
| 0.0–0.2 | Sub-goal not achieved | Output ignores the sub-goal, reaches a wrong conclusion, or hallucinates required facts. |
| 0.3–0.5 | Partial progress | Correct direction but a required element is missing, or the sub-goal is met with a flawed assumption that will propagate downstream. |
| 0.6–0.8 | Sub-goal achieved with caveats | All required elements present, with a minor gap in precision, completeness, or constraint adherence. |
| 0.9–1.0 | Sub-goal fully achieved | All required elements present, constraints respected, output verifiably correct against the sub-goal spec. |
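With the template and rubric in place, the mechanical part is one judge call per step and nothing else. Here is a minimal sketch of that loop, assuming a judge callable that wraps whatever judge-model client you use; the function names are illustrative, not part of ARC:
import json

def score_step(subgoal, step_output, judge):
    """Score a single step against its sub-goal, blind to every other step."""
    # Fill the template with replace() rather than str.format(), so the literal
    # JSON braces in nle_prompt are left untouched.
    prompt = (nle_prompt
              .replace("{subgoal}", subgoal)
              .replace("{step_output}", step_output))
    raw = judge(prompt)                      # judge() returns the model's text reply
    result = json.loads(raw)                 # expects {"score": ..., "rationale": ...}
    result["score"] = max(0.0, min(1.0, float(result["score"])))  # clamp to [0, 1]
    return result

def score_trajectory(subgoals, step_outputs, judge):
    """One independent NLE call per step; no call ever sees the final answer."""
    return [score_step(g, o, judge)["score"]
            for g, o in zip(subgoals, step_outputs)]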
Calibrate the NLE judge before trusting it in production: score five human-labeled examples per task category across the full rubric range, and tune the NLE prompt until its scores correlate with human scores at r ≥ 0.80 (a strong correlation meaning the automated judge ranks outputs in nearly the same order a human reviewer would). This takes roughly 30 minutes per task category and should be re-run whenever you change the judge model or the prompt template.
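The correlation check itself is small enough to live next to the eval harness. A minimal sketch, with made-up score pairs standing in for your five human-labeled calibration examples (the pearson_r helper is mine, not part of ARC):
import math

def pearson_r(judge_scores, human_scores):
    """Pearson correlation between NLE judge scores and human calibration labels."""
    n = len(judge_scores)
    mj = sum(judge_scores) / n
    mh = sum(human_scores) / n
    cov = sum((j - mj) * (h - mh) for j, h in zip(judge_scores, human_scores))
    sj = math.sqrt(sum((j - mj) ** 2 for j in judge_scores))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human_scores))
    return cov / (sj * sh)

# Re-run whenever the judge model or the prompt template changes.
r = pearson_r(judge_scores=[0.2, 0.5, 0.7, 0.9, 1.0],   # made-up calibration pairs
              human_scores=[0.1, 0.4, 0.8, 0.9, 1.0])
assert r >= 0.80, "judge is not tracking human judgment; tune the NLE prompt"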
2. Compute the trajectory vector. Store scores as a list indexed by step. Compute indicators for the early third (steps 1–N/3), middle third (N/3–2N/3), and late third (2N/3–N) of the trajectory; the classifier below uses segment means plus the slope of the final segment. These few numbers alone will classify most trajectories into one of the four failure patterns.
def classify_trajectory(scores):
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean = sum(s[t:2*t]) / t
    late_mean = sum(s[2*t:]) / (n - 2*t)
    slope_l = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)
    # Check recovery first: dip > 0.20, then returns > 0.10 above the dip
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'
    # Early collapse: early section strong, drops and stays low through the end
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'
    # Late drift: final segment slopes clearly negative
    if slope_l < -0.12:
        return 'late_drift'
    # Steady degradation: overall score declined without a structural break
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'
    return 'healthy'
In plain terms: this function reads the score sequence in thirds, looks for structural patterns (a strong start that collapses, a sharp late drop, a dip followed by recovery, or a slow overall slide), and returns a label. That label determines which intervention the Repair phase selects.
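Running the classifier over the four synthetic trajectories from the comparison table (the same trajectories dict as in the plotting sketch above) recovers the intended labels:
for label, scores in trajectories.items():
    print(label, "->", classify_trajectory(scores))
# A (early collapse)     -> early_collapse
# B (late drift)         -> late_drift
# C (steady degradation) -> steady_degradation
# D (recovery)           -> recovery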
Verifying the Trajectory is Real
3. Run surface-feature variation. Paraphrase each failing task three times with structurally identical content but different wording, examples, and formatting. Re-score. If the spread of scores across the variants (highest minus lowest) exceeds 0.15, the model is responding to surface features, not underlying capability.
Here is what that looks like in practice. Suppose your agent is failing at step 3 of a customer support routing task. Specifically, it misidentifies the urgency tier. You write two paraphrases that ask for exactly the same thing:
| Variant | Phrasing | Step 3 Score |
|---|---|---|
| Original | "Classify the urgency of this ticket: a user says their account is locked and they cannot access payroll." | 0.31 |
| Paraphrase A | "A customer reports being locked out of their account and unable to run payroll. Assign a priority level." | 0.87 |
| Paraphrase B | "Ticket summary: payroll access blocked due to account lockout. Determine the urgency category." | 0.82 |
Score spread across the three variants (highest minus lowest): 0.56. The agent is not failing to recognize an urgent payroll lockout. It is tripping on the word "locked" in the original phrasing, which the model associates with a lower-tier security issue. The failure is surface-level format sensitivity, not a capability gap. Sending this to Repair and spending a sprint on a grounding prompt would be the wrong call. The fix is prompt wording, not model behavior.
A spread below 0.08 across the variants means the failure is consistent regardless of how you ask. That is a genuine capability gap worth repairing.
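The check is small enough to inline in the harness. A sketch; the 0.15 and 0.08 thresholds come from the text above, while the function name and verdict strings are mine:
def surface_sensitivity(variant_scores, high=0.15, low=0.08):
    """Spread of step scores across the original task and its paraphrases."""
    spread = max(variant_scores) - min(variant_scores)
    if spread > high:
        return spread, "surface sensitivity: fix the prompt wording, not the model"
    if spread < low:
        return spread, "consistent failure: genuine capability gap, send to Repair"
    return spread, "ambiguous: add more paraphrases before deciding"

print(surface_sensitivity([0.31, 0.87, 0.82]))   # (0.56, 'surface sensitivity: ...')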
Repair: Which Failure, and What Fix?
Repair has three sequential steps: sub-goal weighting, failure localization, and disambiguation. All three must complete before an intervention is selected. Skipping to the intervention based on the symptom alone is the single most common way teams waste a sprint on the wrong fix.
Step 1: Weight Sub-Goals by Downstream Impact
Not all steps matter equally. A flawed assumption in step 1 of a 7-step task is much worse than the same flaw in step 6, because the downstream steps all build on it. Weighting makes that explicit so you are prioritizing the right break points.
Consider a coding assistant tasked with implementing a feature: (1) understand the requirement, (2) identify affected files, (3) write the core logic, (4) write tests, (5) handle edge cases, (6) update documentation, (7) summarize changes. Steps 1 and 3 are critical gates: if the agent misreads the requirement or writes broken core logic, everything downstream is wrong regardless of how well it tests or documents. Steps 4–7 are isolated or shared dependencies.
# Downstream impact weighting
# 1 = isolated step (failure affects only this step)
# 2 = shared dependency (failure propagates to 2-3 downstream steps)
# 3 = critical gate (failure invalidates all downstream steps)
# Coding assistant task: understand → find files → core logic → tests → edge cases → docs → summary
weights = [3, 2, 3, 2, 1, 1, 1]

# Example: agent scores well everywhere but misreads the requirement (step 1)
scores = [0.30, 0.85, 0.80, 0.88, 0.82, 0.90, 0.91]
weighted_score = sum(s*w for s, w in zip(scores, weights)) / sum(weights)
# Unweighted average: 0.78 (looks fine)
# Weighted average: 0.72 (gate failure surfaces)
The unweighted average here is 0.78, which might pass a quality bar. The weighted average is 0.72, which drags the score down because the most load-bearing step failed. Assign weights before you score. Doing it after introduces hindsight bias: you will unconsciously weight the steps that happened to fail in this run rather than the steps that are genuinely most critical regardless of outcome.
Step 2: Localize the Break Point
The trajectory score tells you a failure happened. Localization tells you where. But score alone is often not enough: a step that looks mediocre in isolation might be fine given its difficulty, while an apparently acceptable step might have introduced a subtle error that takes three steps to surface. Adding cost and latency alongside trajectory gives you a richer signal and often surfaces the break before the score does.
Think of it like this: if a step's score drops, that is the model telling you it struggled. If its latency spikes at the same step, the model is also telling you it was uncertain: it took longer to generate the output. If both happen together, you have a high-confidence break point. If latency spikes but score holds, the model was uncertain but recovered, and that is a step worth watching in future runs. Clustering these three signals together (score drop, latency spike, and token cost increase) is more diagnostic than any one alone.
def find_break_point(scores, latency_ms=None, token_cost=None):
    baseline = sum(scores[:len(scores)//3]) / (len(scores)//3)
    signals = []
    for i in range(1, len(scores)):
        step_drop = scores[i-1] - scores[i]
        base_drop = baseline - scores[i]
        lat_spike = (latency_ms[i] / latency_ms[i-1] > 1.5) if latency_ms else False
        cost_spike = (token_cost[i] / token_cost[i-1] > 1.4) if token_cost else False
        n_signals = sum([step_drop > 0.20, base_drop > 0.20, lat_spike, cost_spike])
        signals.append((i, n_signals, step_drop, base_drop))
    # Step with the most converging signals is the most likely break point
    signals.sort(key=lambda x: -x[1])
    if not signals or signals[0][1] == 0:
        return None   # no step shows a break signal; do not invent one
    return signals[0][0]
In practice: you are running a document summarization agent across a 6-step pipeline. Steps 1–3 look fine. At step 4, score drops from 0.82 to 0.58, latency jumps from 1.2s to 3.8s, and token count doubles. All three signals converge: step 4 is the break. Without latency and cost data, step 4's score drop alone would still flag it, but the convergence removes ambiguity: the model is not just producing a worse output, it is genuinely struggling to produce any output at all. That is useful information for choosing the intervention.
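Here is that scenario fed through the function. The arrays are made up to match the narrative above, not measured data:
# 6-step document summarization pipeline; step 4 (index 3) is the break
scores     = [0.84, 0.82, 0.82, 0.58, 0.61, 0.60]
latency_ms = [1150, 1230, 1200, 3800, 2100, 1900]
token_cost = [ 900,  950,  940, 1900, 1300, 1250]

print(find_break_point(scores, latency_ms, token_cost))   # -> 3 (step 4, zero-indexed)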
Step 3: Run the Disambiguation Protocol
Once you have the break point, you have a location but not a cause. Several failure types produce overlapping symptoms. The table below gives each one a concrete probe (a targeted re-run that changes exactly one thing) to separate them. The "Disambiguator" column is the test. The "Intervention" column is only valid after the probe confirms the cause.
A brief example of why this matters: suppose your document agent fails at step 4. The symptom is a dropped constraint: the model stops tracking a key requirement introduced in step 1. Two completely different causes produce that symptom: context loss (the model has simply run out of working memory for early context) and a capability gap (the model never reliably propagates multi-step constraints, regardless of context length). Context loss is fixed with chunking. A capability gap is not. That requires curriculum work or a different model. Applying chunking to a capability gap makes the task slower without fixing anything. The probe is what tells you which problem you actually have.
| Symptom | Likely Cause | Disambiguator | Intervention |
|---|---|---|---|
| Wrong answer, step 1–2 | Hallucination or goal misread | Re-run with the correct constraint injected explicitly ("The account type is enterprise, not SMB"). Persists? Hallucination: the model is fabricating. Repairs? Goal misread: it understood the wrong task from the original phrasing. | Grounding or clarification prompt |
| Correct mid-steps, wrong late | Context loss or capability gap | Summarize the mid-point context into a fresh prompt and re-run only the failing tail. Improves? Context loss: the model had the capability but lost the thread. Flat? Capability gap: the model cannot do the late-task reasoning regardless of context freshness. | Chunking or summarization for context loss. Curriculum or model upgrade for capability gap. |
| Wrong tool called | Tool selection or goal misread | Swap the tool descriptions in the system prompt, renaming the tools but preserving their function. Persists? Goal misread: it is calling whatever it thinks fits the intent. Repairs? Tool selection: it was pattern-matching on the tool name, not the task. | Tool description rewrite or goal clarification |
| Gradual decline, all steps | Context accumulation or capability gradient | Inject a fresh context summary at the midpoint of the task and re-run from there. Improves? Accumulated context is the drag. Flat? The model simply performs worse as task complexity compounds. That is a capability gradient, not a context problem. | Chunking or periodic summarization for context. Structured task decomposition for capability gradient. |
Log format:
date | task_id | symptom | probe | result | classification | intervention
The cost of misclassification is asymmetric. A context management fix applied to a capability gap wastes a sprint. A grounding prompt applied to a context loss problem masks the symptom. Verify before you ship.
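Appending one row per probe keeps the disambiguation history auditable. A minimal sketch, assuming a CSV sink; the field names mirror the log format above and the values are hypothetical:
from dataclasses import dataclass, asdict
import csv
import datetime

@dataclass
class ProbeLog:
    date: str
    task_id: str
    symptom: str
    probe: str
    result: str
    classification: str
    intervention: str

entry = ProbeLog(
    date=datetime.date.today().isoformat(),
    task_id="doc-agent-047",                                   # hypothetical id
    symptom="dropped constraint at step 4",
    probe="summarize mid-point context, re-run failing tail",
    result="score recovered 0.58 -> 0.84",
    classification="context loss",
    intervention="chunking + mid-task summary",
)

with open("probe_log.csv", "a", newline="") as f:
    csv.writer(f).writerow(asdict(entry).values())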
Calibrate: Can We Trust These Answers Over Time?
Calibrate has three disciplines, each building on the previous. The first two keep the eval signal internally honest. Most systems skip behavioral grounding. That is where the gap between controlled evaluation and real-world behavior opens.
Discipline 1: Internal Signal Integrity
1. Build and protect a held-out eval set. Reserve 20% of your task suite, stratified by task type and difficulty, as a held-out set that never touches training data. Rescore this held-out set after every training run alongside your main eval set.
2. Track the held-out gap. A stable or narrowing gap means the eval signal is trustworthy. A widening gap means the model has learned to score well on familiar eval tasks without genuinely improving on new ones: surface performance inflating faster than real capability.
gap = in_distribution_score - held_out_score
# gap < 0.05 -> signal healthy
# gap 0.05-0.10 -> watch closely
# gap > 0.10 -> surface inflation likely, pause and investigate
3. Retire contaminated benchmarks proactively. Contamination happens when a model was trained or fine-tuned on data that resembles your eval tasks, causing scores to inflate artificially even without genuine improvement. Flag any benchmark where the model scores above 0.92 on three consecutive runs. Assume contamination risk and retire it.
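The contamination flag is a streak check over benchmark history. A sketch; the thresholds come from the rule above and the function name is mine:
def flag_contaminated(benchmark_scores, threshold=0.92, streak=3):
    """True if the model scored above threshold on `streak` consecutive runs."""
    run = 0
    for score in benchmark_scores:       # scores in chronological order
        run = run + 1 if score > threshold else 0
        if run >= streak:
            return True
    return False

flag_contaminated([0.88, 0.93, 0.94, 0.95])   # True -> retire the benchmark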
Discipline 2: Recovery Rate Tracking
Recovery rate is a leading indicator of deployment robustness that most eval systems do not measure. Build it in from the start.
4. Inject known errors into your eval suite. For 5–10 tasks per eval run, deliberately introduce a wrong answer or flawed reasoning step at position 3 or 4 in the task sequence. Measure whether it self-corrects within two subsequent steps.
recovery_rate = corrected_tasks / error_injection_tasks
# Improving -> model getting more robust
# Degrading -> potential regression, investigate
A model that achieves 0.85 on clean trajectories but a 0.30 recovery rate is less deployment-ready than one with 0.80 on clean trajectories and a 0.65 recovery rate. Novel production tasks will always introduce errors. Recovery rate tells you what clean-run scores cannot.
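A sketch of the recovery check. The two-step window comes from the protocol above; the text does not pin down what counts as a correction, so the rebound threshold here is an assumption made only to keep the sketch concrete:
def recovered(scores, inject_at, window=2, rebound=0.15):
    """Did the trajectory self-correct within `window` steps of the injected error?"""
    pre = scores[inject_at - 1]                           # quality before the injected error
    post = scores[inject_at + 1:inject_at + 1 + window]   # the two steps that follow
    return any(p >= pre - rebound for p in post)          # back near pre-error quality (assumed threshold)

def recovery_rate(runs):
    """`runs` is a list of (scores, inject_at) pairs from the error-injection tasks."""
    return sum(recovered(s, i) for s, i in runs) / len(runs)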
Discipline 3: Behavioral Grounding
This is the boundary between evaluation and in-the-wild measurement. Run this check monthly.
5. Sample production transcripts. Pull 50–100 real user interactions that map to your eval task categories. If you do not have direct transcript access, proxy behavioral signals work: did the user regenerate? Abandon? Continue without editing?
6. Classify transcripts using ARC failure patterns. Use the same taxonomy. A single reviewer with the rubric takes roughly two hours for 100 transcripts.
7. Compare distributions and close the loop. If distributions diverge by more than 15–20 percentage points on a given pattern, either your eval tasks are not representative of production or your failure taxonomy needs revision.
eval: {early_collapse: 0.35, late_drift: 0.28, recovery: 0.22, steady_deg: 0.15}
prod: {early_collapse: 0.12, late_drift: 0.31, recovery: 0.19, steady_deg: 0.38}
steady_deg and early_collapse both diverge 23pp -> add steady-degradation tasks to the eval suite before the next run
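The comparison itself is a dictionary diff. A sketch using the distributions above and the upper end of the 15–20 point threshold; the function name is mine:
def pattern_divergence(eval_dist, prod_dist, threshold=0.20):
    """Flag every failure pattern whose eval and production frequencies diverge."""
    return {p: round(abs(eval_dist[p] - prod_dist[p]), 2)
            for p in eval_dist
            if abs(eval_dist[p] - prod_dist[p]) >= threshold}

eval_dist = {"early_collapse": 0.35, "late_drift": 0.28, "recovery": 0.22, "steady_deg": 0.15}
prod_dist = {"early_collapse": 0.12, "late_drift": 0.31, "recovery": 0.19, "steady_deg": 0.38}
print(pattern_divergence(eval_dist, prod_dist))
# {'early_collapse': 0.23, 'steady_deg': 0.23}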
Behavioral grounding also validates your intervention decisions. If your eval says late-drift is the dominant failure pattern and you apply context management fixes, production should show late-drift declining in subsequent months. An eval can be perfectly calibrated and still measure the wrong thing. Behavioral grounding is how you find out.
The Flywheel
ARC is not a one-time audit. Each rotation deposits something that makes the next rotation faster and more precise.
The framework describes what a good trajectory looks like and how to classify deviations from it. But a classification system is itself an instrument, and instruments fail in specific ways. The next post does what you should do with any new tool before trusting it in production: it deliberately tries to break this one. The classifier gets stress-tested against synthetic trajectories at increasing noise levels, and the failure modes are documented precisely. Knowing the operating envelope of your measurement tool is part of trusting it.
References
- Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). "Let's Verify Step by Step." arXiv preprint arXiv:2305.20050. https://arxiv.org/abs/2305.20050
- Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168
- Anthropic (2025). "Bloom: A Framework for Behavioral Elicitation." Anthropic Research. https://www.anthropic.com/research/bloom
Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.