
Most agent evaluations only check whether the final answer was right. That is like judging effort in a mountain race by finish time alone. ARC is a framework for evaluating the full trajectory.

The Shape of a Good Answer
Assess - Repair - Calibrate (ARC). A framework for long-horizon model diagnostics.

Article by John Tribbia

I have run Bear Peak in Boulder so many times that I have lost count. The trail from Cragmoor Trailhead gains 2,800 feet in under three miles. I know exactly what a good run on that climb feels like from the inside (spoiler: it hurts regardless). You resist the urge to push hard off the gun. You find a rhythm in the first quarter-mile up the Cragmoor stairs with a shorter stride, consistent breathing, and an eye on the grade. You build through the middle section where the trail pitches steeper and the less-experienced runners are already blown up. And at the top, when it genuinely hurts and every instinct says to ease off, you hold. Form intact. Effort peaking exactly when it needs to.

That heart rate curve is what I’m talking about. Not the finish time. Not the average. The shape of that curve, rising and peaking and holding, is the thing that tells you whether you managed your effort well or just survived it. A runner who goes out too hard shows a different heart rate profile entirely: a spike in the first ten minutes, then a slow collapse. Maybe even the same finish time, but a completely different story.

A mountain climb is unforgiving. There is no flat section where you recover from a bad start. No downhill to coast on. The terrain compresses the whole problem of pacing into one continuous test. You cannot fake fitness on a climb. The Strava upload will show you the shape of your effort, not just where you finished.

[Chart: Bear Peak, Cragmoor Trailhead to Summit. Heart rate profile over ~40 min and 2,800 ft of ascent. Two runs, same finish time, completely different story: "the arc" (good run) vs. "went out too hard".]

The arc run peaks at ~163 bpm near the summit and holds. The blowup run spikes to 173 in the first 12 minutes and slowly collapses. Same finish time, same average heart rate, opposite story.

Watch a language model work through a difficult multi-step task and you notice something similar is missing from how we evaluate it. The quality of its outputs often does not follow that curve. It might go out too fast and commit confidently to a flawed assumption in the first two steps, then spend six more building coherent-sounding reasoning on a wrong foundation. Or it might hold form through the first two-thirds and then fall apart when the task pitches steep and the context window is saturated. Or it might stumble and recover in ways that look more like luck than controlled effort.

The difference between the runner’s rising arc and the model’s erratic path is not just a performance gap. It is a measurement gap. We are not yet good at describing the shape of how agents succeed or fail across the course of a task. We ask whether they got the answer right at the end. That is like judging a mountain race by finish time alone.


Where This Fits in the Literature

The idea that step-level evaluation beats outcome-only evaluation is not new. Lightman et al. (2023) made this case rigorously in “Let’s Verify Step by Step,” showing that process reward models trained to score individual reasoning steps rather than just final answers substantially outperform outcome reward models on hard math problems [1]. Cobbe et al. (2021) laid some of the groundwork with outcome supervision in “Training Verifiers to Solve Math Word Problems” [2]. If you are doing serious RLHF or working on reasoning chain quality, you are probably already thinking in terms of step-level feedback.

ARC is not trying to reinvent that. The distinction worth drawing is that Lightman et al. are solving a training problem: how do you generate the right signal to improve a model during training? ARC is solving a diagnostic problem: once you have a deployed agent producing multi-step outputs, how do you figure out what is actually wrong with it, where in the task it breaks down, and which intervention is likely to fix it? The frameworks address different questions. One shapes the model. The other tells you what shape the model is in and what shape the next training run should target.

The practitioner gap here is real. Most eval tooling in production follows an outcome-only logic even when teams know better. You set up a benchmark, you check final answers, you track a score. The score goes up over time and you ship. What gets missed is how the score goes up: whether the model is genuinely getting better at reasoning through hard tasks, or just getting better at pattern-matching to common task structures. ARC is an attempt to make the diagnostic layer cheap enough that teams actually run it.


ARC

An arc is a shape. Rising, peaking, holding. It is also the name of the framework I’m introducing here for evaluating agents on complex, multi-step tasks.

A - Assess
As a target: Build understanding before committing. Orient, gather context, figure out what the problem actually requires.
As a measurement: Detect what happened across the trajectory, verify it was a genuine failure, and confirm the signal is trustworthy before investing in a fix.

R - Repair
As a target: Execute with peak quality. The understanding built in Assess pays off here.
As a measurement: Identify which failure was load-bearing, classify it unambiguously using a verified protocol, and select the correct intervention.

C - Calibrate
As a target: Hold quality through to completion without degrading.
As a measurement: Keep the eval signal honest over time and verify it is pointing at the right thing in the real world.

[Diagram: the arc as both target and measurement. A - Assess: build understanding first. R - Repair: peak execution, fix the break. C - Calibrate: hold quality, trust the signal.]

The ideal agent trajectory and the measurement framework are the same shape. A good agent traces an arc. ARC tells us when it did not, where it broke, and whether we can trust what we are seeing.

What Goes Wrong

Most agent failures are failures of arc, not ability. The capability is there, but it doesn’t deploy in the right shape. When you plot step-by-step quality scores instead of just checking the final answer, four failure patterns emerge repeatedly, all producing similar aggregate scores while calling for completely different interventions.

To make this concrete, imagine four models each scoring exactly 0.74 on a ten-step task:

Model | Aggregate | Step scores
A | 0.74 | 0.92, 0.90, 0.88, 0.64, 0.66, 0.65, 0.67, 0.68, 0.70, 0.70
B | 0.74 | 0.64, 0.62, 0.66, 0.80, 0.85, 0.88, 0.91, 0.90, 0.62, 0.52
C | 0.74 | 0.89, 0.86, 0.82, 0.78, 0.75, 0.72, 0.69, 0.66, 0.62, 0.61
D | 0.74 | 0.86, 0.88, 0.84, 0.38, 0.36, 0.70, 0.84, 0.86, 0.85, 0.83
[Chart: per-step quality score (0–1) across the 10-step task for all four models. Four models, same aggregate score, four completely different problems. A dashed gray line marks the shared aggregate of 0.74.]

The right intervention for each pattern is completely different. Treating them as equivalent wastes a sprint.

Model A collapses early and grinds through the task on a broken foundation. Model B builds well and then falls apart at the end. Model C is slowly bleeding quality across every step. Model D hits a wall, catches itself, and recovers. A grounding prompt helps Model A. Context management helps Model B. Model D might not need intervention at all.

Early Collapse

The model makes a flawed assumption in the first few steps, then produces locally coherent outputs that all build on a wrong foundation. This is the most dangerous pattern because the model is not confused. It is confidently wrong. This is the runner who goes out too hard and does not know it yet.

Intervention: Improve how the model grounds its initial assumptions before committing to an approach.

Late Drift

Performance holds through the first two-thirds of the task and then falls sharply. Usually context saturation: the model loses track of earlier constraints as complexity compounds. The runner with strong legs and no fuel left for the ridge.

Intervention: Context management, summarization at intermediate steps, or chunking strategies.

Steady Degradation

Performance declines incrementally across every step. Nothing breaks catastrophically, which is precisely why aggregate scores miss it. Exactly the kind of thing that Strava's average pace hides and the split-by-split view reveals.

Intervention: Requires failure mode classification first. Root cause determines which intervention applies.

Recovery After Error

A dip followed by self-correction. Recovery capability is undervalued. A model that catches and corrects its own mistakes under real conditions may be more deployment-ready than one with a cleaner trajectory on easier evaluations. The runner who catches a root, stumbles, and drives back into rhythm.

Intervention: Measure recovery rate explicitly as a first-class metric.

Score each step with a zero-to-one judgment, plot the vector, look at the shape.
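To see why the shape matters more than the mean, here is a minimal sketch. The two vectors are illustrative (not the model trajectories from the table above), and the first-third vs. last-third comparison is one simple way to read the shape:

```python
# Two illustrative trajectories: same mean, opposite shapes.
arc    = [0.60, 0.70, 0.80, 0.90, 0.80]  # builds and holds: the good run
blowup = [0.90, 0.90, 0.80, 0.65, 0.55]  # fast start, slow collapse

def mean(xs):
    return sum(xs) / len(xs)

# The aggregate cannot tell them apart...
assert abs(mean(arc) - mean(blowup)) < 1e-9  # both average 0.76

# ...but comparing the last third against the first third can.
def early_late_delta(scores):
    t = max(len(scores) // 3, 1)
    return mean(scores[-t:]) - mean(scores[:t])

print(early_late_delta(arc))     # positive: quality rising
print(early_late_delta(blowup))  # negative: quality collapsing
```

A positive delta is the runner's arc; a negative one is the blowup. The aggregate is identical either way.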


The Measurement Side

Knowing that an arc broke is only useful if you know the break was real, know what caused it, and know that the thing you're measuring actually matters in the real world. ARC addresses all three.

Assess: Did Something Real Happen?

The goal of Assess is to produce a trustworthy trajectory: a vector of per-step scores that reflects genuine model behavior, not format recognition. It has two components: building the trajectory and verifying it is real.

Building the Trajectory

Select a task suite of 30–50 multi-step tasks representative of your deployment domain. Each task should be decomposable into 4–10 discrete sub-goals with verifiable outputs.

1. Score each step independently. Use a privacy-preserving No Look Eval (NLE) LLM with a rubric grounded in the sub-goal, not the final answer. Prompt it to return a float from 0.0 to 1.0 with a one-sentence rationale. Never pass the full conversation history to the NLE; score each step in isolation to avoid halo effects from earlier correct steps.

# NLE prompt template (JSON braces doubled so str.format leaves them literal)
nle_prompt = """Given this sub-goal: {subgoal}
And this model output: {step_output}
Score the output 0.0-1.0 for how well it achieves the sub-goal.
Return JSON: {{"score": float, "rationale": str}}"""
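A standalone sketch of rendering that template and validating the judge's reply. The sub-goal, step output, and raw reply below are all illustrative; in practice the reply comes from your NLE model call:

```python
import json

# Template restated here so the snippet runs on its own.
# Braces around the JSON schema are doubled so str.format leaves them literal.
nle_prompt = (
    "Given this sub-goal: {subgoal}\n"
    "And this model output: {step_output}\n"
    "Score the output 0.0-1.0 for how well it achieves the sub-goal.\n"
    'Return JSON: {{"score": float, "rationale": str}}'
)

prompt = nle_prompt.format(
    subgoal="Extract the three constraints from the ticket",
    step_output="Constraints: budget under $5k, deadline June 1, no external vendors",
)

# Illustrative judge reply; a real one comes back from the NLE call.
raw_reply = '{"score": 0.9, "rationale": "All three constraints captured."}'
parsed = json.loads(raw_reply)
score = float(parsed["score"])
assert 0.0 <= score <= 1.0  # reject out-of-range judgments before storing
```

Validating the range on ingest is cheap insurance: a judge that returns 9 instead of 0.9 will silently wreck a trajectory vector otherwise.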

2. Compute the trajectory vector. Store scores as a list indexed by step. Compute three slope indicators: early slope (steps 1–N/3), mid slope (steps N/3–2N/3), and late slope (steps 2N/3–N). These three numbers alone will classify most trajectories into one of the four failure patterns.

def classify_trajectory(scores):
    """Label a per-step score vector with one of the four failure patterns."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least 3 steps to segment the trajectory")
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean   = sum(s[t:2*t]) / t
    late_mean  = sum(s[2*t:]) / (n - 2*t)
    slope_l    = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)

    # Check recovery first: dip > 0.20, then returns > 0.10 above the dip
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'

    # Early collapse: early section strong, drops and stays low through the end
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'

    # Late drift: final segment slopes clearly negative
    if slope_l < -0.12:
        return 'late_drift'

    # Steady degradation: overall score declined without a structural break
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'

    return 'healthy'

Verifying the Trajectory is Real

3. Run surface-feature variation. Paraphrase each failing task three times with structurally identical content but different wording, examples, and formatting. Re-score. If score variance across variants exceeds 0.15, the model is responding to surface features, not underlying capability.

Threshold: variance above 0.15 across three paraphrase variants means flag as a format artifact. Variance below 0.08 means a genuine failure, so proceed to Repair. In between, add more variants before deciding. Skipping Assess looks free until you discover a format artifact three sprints later. It is non-optional.
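The variance check can be sketched in a few lines. This assumes "variance" means population variance over the variant scores; the thresholds are the ones stated above:

```python
from statistics import pvariance

# Score variance across paraphrase variants of one failing task.
# Assumes population variance; thresholds per the Assess protocol.
def variance_verdict(variant_scores):
    v = pvariance(variant_scores)
    if v > 0.15:
        return "format_artifact"   # model reacts to wording, not capability
    if v < 0.08:
        return "genuine_failure"   # proceed to Repair
    return "ambiguous"             # gather more variants before deciding

print(variance_verdict([0.05, 0.95, 0.20]))  # -> format_artifact
print(variance_verdict([0.35, 0.40, 0.33]))  # -> genuine_failure
```

The "ambiguous" branch is worth keeping explicit: a variance between the two thresholds is exactly the case where a fourth or fifth paraphrase is cheaper than a wrong classification.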

Repair: Which Failure, and What Fix?

Repair has three sequential steps: sub-goal weighting, failure localization, and disambiguation. All three must complete before an intervention is selected. Skipping to the intervention based on the symptom alone is the single most common way teams waste a sprint on the wrong fix.

Step 1: Weight Sub-Goals by Downstream Impact

# Downstream impact weighting
# 1 = isolated step      (failure affects only this step)
# 2 = shared dependency  (failure propagates to 2-3 downstream steps)
# 3 = critical gate      (failure invalidates all downstream steps)
weights = [3, 1, 2, 1, 3, 1, 2]                       # assign per task
scores  = [0.90, 0.85, 0.80, 0.75, 0.40, 0.70, 0.65]  # per-step NLE scores
weighted_score = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

Step 2: Localize the Break Point

def find_break_point(scores):
    baseline = sum(scores[:len(scores)//3]) / (len(scores)//3)
    for i in range(1, len(scores)):
        step_drop = scores[i-1] - scores[i]
        base_drop = baseline    - scores[i]
        if step_drop > 0.20 or base_drop > 0.20:
            return i
    return None
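As a quick sanity check, the detector can be exercised on a synthetic early-collapse trajectory. The function is restated here so the snippet runs on its own; the scores are illustrative:

```python
# Restated from above so this check is standalone.
def find_break_point(scores):
    baseline = sum(scores[:len(scores)//3]) / (len(scores)//3)
    for i in range(1, len(scores)):
        step_drop = scores[i-1] - scores[i]
        base_drop = baseline - scores[i]
        if step_drop > 0.20 or base_drop > 0.20:
            return i
    return None

# Synthetic early-collapse trajectory: strong start, break at step 3.
collapse = [0.90, 0.91, 0.88, 0.60, 0.55, 0.58, 0.61, 0.62, 0.60, 0.58]
print(find_break_point(collapse))  # -> 3 (0-indexed step where the drop lands)

# A flat, healthy trajectory produces no break point.
print(find_break_point([0.80] * 9))  # -> None
```

Note the index is relative to the score vector, so keep your step numbering consistent between the NLE scoring pass and this localization pass.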

Step 3: Run the Disambiguation Protocol

Once you have the break point, run the appropriate probe before assigning a root cause. Five failure types produce overlapping symptoms. The probe is what separates them. Never classify based on symptom alone.

Disambiguation protocol - verify before prescribing
Symptom | Likely cause | Disambiguator | Intervention
Wrong answer, step 1–2 | Hallucination or goal misread | Re-run with constraint injection. Persists? Hallucination. Repairs? Goal misread. | Grounding or clarification prompt
Correct mid-steps, wrong late | Context loss or capability gap | Summarize context at midpoint, re-run tail. Improves? Context loss. Flat? Capability gap. | Chunking or summarization
Wrong tool called | Tool selection or goal misread | Swap tool descriptions. Persists? Goal misread. Repairs? Tool selection. | Tool description rewrite or goal clarification
Gradual decline, all steps | Context loss or capability gap | Inject fresh context mid-run. Improves? Context loss. Flat? Capability gap. | Chunking or curriculum gap
Minimum probe n: 5 tasks per symptom before classification.
Log format: date | task_id | symptom | probe | result | classification | intervention
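A small helper can enforce the pipe-delimited log format above. The field names follow the spec; every value below is illustrative, including the task id:

```python
from datetime import date

# Emit one disambiguation-log entry: date | task_id | symptom | probe |
# result | classification | intervention. All values here are illustrative.
def log_entry(task_id, symptom, probe, result, classification, intervention,
              day=None):
    fields = [str(day or date.today()), task_id, symptom, probe, result,
              classification, intervention]
    return " | ".join(fields)

entry = log_entry(
    task_id="task_017",
    symptom="correct mid-steps, wrong late",
    probe="summarize context at midpoint, re-run tail",
    result="tail improved",
    classification="context_loss",
    intervention="chunking",
    day="2025-01-15",
)
print(entry)
```

Keeping the log machine-parseable is what lets the Repair deposits described later compound: a flat text file of these entries is enough to answer "which probe settled this symptom last quarter?"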

The cost of misclassification is asymmetric. A context management fix applied to a capability gap wastes a sprint. A grounding prompt applied to a context loss problem masks the symptom. Verify before you ship.

Calibrate: Can We Trust These Answers Over Time?

Calibrate has three disciplines, each building on the previous. The first two keep the eval signal internally honest. Most systems skip behavioral grounding. That is where the gap between controlled evaluation and real-world behavior opens.

Discipline 1: Internal Signal Integrity

1. Build and protect a held-out eval set. Reserve 20% of your task suite, stratified by task type and difficulty, as a held-out set that never touches training data. Rescore this set after every training run alongside your in-distribution set.
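A minimal sketch of the stratified split, assuming tasks are dicts with a "type" key. It stratifies by type only for brevity (the text also calls for difficulty), and the seed, fraction, and task names are illustrative:

```python
import random

def stratified_holdout(tasks, frac=0.20, seed=7):
    """Split tasks into (in_distribution, held_out), ~frac held out per type."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_type = {}
    for task in tasks:
        by_type.setdefault(task["type"], []).append(task)
    in_dist, held_out = [], []
    for group in by_type.values():
        group = group[:]           # don't mutate the caller's lists
        rng.shuffle(group)
        k = max(1, round(len(group) * frac))
        held_out.extend(group[:k])
        in_dist.extend(group[k:])
    return in_dist, held_out

# Illustrative 40-task suite with two task types.
suite = [{"id": i, "type": "retrieval" if i % 2 else "planning"}
         for i in range(40)]
in_dist, held_out = stratified_holdout(suite)
print(len(in_dist), len(held_out))  # -> 32 8
```

The fixed seed matters: the held-out set must stay the same across training runs, or the gap metric in the next step measures the split, not the model.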

2. Track the held-out gap. A stable or narrowing gap means the eval signal is trustworthy. A widening gap means surface performance is inflating faster than real capability.

gap = in_distribution_score - held_out_score
# gap < 0.05    -> signal healthy
# gap 0.05-0.10 -> watch closely
# gap > 0.10    -> surface inflation likely, pause and investigate

3. Retire contaminated benchmarks proactively. Flag any benchmark where the model scores above 0.92 on three consecutive runs. Assume contamination risk and retire it.
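The retirement rule reduces to a streak check over a benchmark's run history. A sketch, with the 0.92 ceiling and three-run streak from above as defaults:

```python
# Flag a benchmark for retirement after `streak` consecutive runs
# above `ceiling`. Defaults follow the rule stated above.
def should_retire(run_scores, ceiling=0.92, streak=3):
    consecutive = 0
    for s in run_scores:
        consecutive = consecutive + 1 if s > ceiling else 0
        if consecutive >= streak:
            return True
    return False

print(should_retire([0.88, 0.93, 0.94, 0.95]))  # -> True
print(should_retire([0.93, 0.89, 0.94, 0.95]))  # -> False (streak broken)
```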

Discipline 2: Recovery Rate Tracking

Recovery rate is a leading indicator of deployment robustness that most eval systems do not measure. Build it in from the start.

4. Inject known errors into your eval suite. For 5–10 tasks per eval run, deliberately introduce a wrong answer or flawed reasoning step at position 3 or 4 in the task sequence. Measure whether the model self-corrects within two subsequent steps.

recovery_rate = corrected_tasks / error_injection_tasks
# Improving -> model getting more robust
# Degrading  -> potential regression, investigate

A model that achieves 0.85 on clean trajectories but a 0.30 recovery rate is less deployment-ready than one with 0.80 on clean trajectories and a 0.65 recovery rate. Novel production tasks will always introduce errors. Recovery rate tells you what clean-run scores cannot.
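One way to operationalize "self-corrects within two steps" is a score threshold on the post-injection window. The 0.70 recovery threshold and both score vectors below are illustrative assumptions, not part of the protocol:

```python
# Did the model self-correct within `window` steps of an injected error?
# Assumes recovery = a post-injection step climbing back above `threshold`.
def recovered(scores, inject_at, window=2, threshold=0.70):
    tail = scores[inject_at + 1 : inject_at + 1 + window]
    return any(s >= threshold for s in tail)

run_a = [0.85, 0.88, 0.84, 0.30, 0.75, 0.86]  # corrects one step later
run_b = [0.85, 0.88, 0.84, 0.30, 0.35, 0.40]  # never climbs back
injected = [(run_a, 3), (run_b, 3)]
recovery_rate = sum(recovered(s, i) for s, i in injected) / len(injected)
print(recovery_rate)  # -> 0.5
```

Whatever definition you pick, freeze it: recovery rate is a trend metric, and a moving definition destroys the trend.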

Discipline 3: Behavioral Grounding

This is the boundary between evaluation and in-the-wild measurement. Run this check monthly.

5. Sample production transcripts. Pull 50–100 real user interactions that map to your eval task categories. If you do not have direct transcript access, proxy behavioral signals work: did the user regenerate? Abandon? Continue without editing?

6. Code transcripts for ARC failure patterns. Use the same taxonomy. A single coder working from the rubric needs roughly two hours for 100 transcripts.

7. Compare distributions and close the loop. If distributions diverge by more than 15–20 percentage points on a given pattern, either your eval tasks are not representative of production or your failure taxonomy needs revision.

eval: {early_collapse: 0.35, late_drift: 0.28, recovery: 0.22, steady_deg: 0.15}
prod: {early_collapse: 0.12, late_drift: 0.31, recovery: 0.19, steady_deg: 0.38}
early_collapse and steady_deg each diverge 23pp -> rebalance the suite before the next run
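The comparison itself is a one-liner over the two distributions, using the 15-percentage-point floor of the threshold range above:

```python
# Eval vs production failure-pattern distributions (values from above).
eval_dist = {"early_collapse": 0.35, "late_drift": 0.28,
             "recovery": 0.22, "steady_deg": 0.15}
prod_dist = {"early_collapse": 0.12, "late_drift": 0.31,
             "recovery": 0.19, "steady_deg": 0.38}

THRESHOLD_PP = 0.15  # flag divergence above 15 percentage points

flagged = {p: abs(eval_dist[p] - prod_dist[p])
           for p in eval_dist
           if abs(eval_dist[p] - prod_dist[p]) > THRESHOLD_PP}
print(flagged)  # both flagged patterns diverge by ~0.23
```

Note the two flags point in opposite directions: early collapse is overrepresented in the eval suite, steady degradation underrepresented. Both need fixing before the next run's numbers mean anything.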

Behavioral grounding also validates your intervention decisions. If your eval says late-drift is the dominant failure pattern and you apply context management fixes, production should show late-drift declining in subsequent months. An eval can be perfectly calibrated and still measure the wrong thing. Behavioral grounding is how you find out.


The Flywheel

ARC is not a one-time audit. Each rotation deposits something that makes the next rotation faster and more precise.

[Diagram: the ARC flywheel: Assess, Repair, Calibrate.]

A deposits: a growing library of classified trajectory patterns, a bank of adversarial paraphrase probes, and variance benchmarks by task type. Failure shapes that required careful manual diagnosis in run one get recognized automatically in run ten.

R deposits: a validated disambiguation log: a record of which probes worked, which symptoms mapped to which root causes in your domain, and which interventions held up on held-out sets. The protocol gets faster and more accurate with every entry.

C deposits: a progressively sharper eval signal and a growing body of behavioral validation evidence. The gap metric builds a baseline. The retirement log shows which benchmarks have a shelf life. Recovery rate trends tell you whether the model is getting more robust or just more accurate on easy paths.
Each time you go around the arc, the loop gets faster and harder to fool. Checklists reset; ARC compounds. It gets better at the hardest part: knowing which lever to pull and whether the signal points at the right thing.

The framework describes what a good trajectory looks like and how to classify deviations from it. But a classification system is itself an instrument, and instruments fail in specific ways. The next post does what you should do with any new tool before trusting it in production: deliberately try to break it. The classifier gets stress-tested against synthetic trajectories at increasing noise levels, and the failure modes are documented precisely. Knowing the operating envelope of your measurement tool is part of trusting it.

References

  1. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). "Let's Verify Step by Step." arXiv preprint arXiv:2305.20050. https://arxiv.org/abs/2305.20050
  2. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168