Article by John Tribbia
There is a second problem on Bear Peak that I did not mention in the first post. You can run a perfect arc — controlled start, building rhythm, holding form through the summit — and still not know it if your heart rate monitor is lying to you. The Garmin spikes to 190 on a flat section. The GPS trace shows a detour you did not take. Trusting the instrument is a separate problem from running the right shape, and it requires its own kind of rigor. You calibrate a GPS watch against known distances. You verify the heart rate sensor against a chest strap. You do not ship the data until you understand what the errors look like.
The same separation applies to ARC. The previous post described what a good agent trajectory looks like and how to classify deviations from it. This one does what you should do with any new instrument before trusting it in production: deliberately tries to break it. The classifier gets stress-tested against synthetic trajectories at increasing noise levels. The failure modes are documented precisely. Knowing the operating envelope of your measurement tool is part of trusting the measurement.
Testing the Framework
The most honest thing you can do with a new framework is try to break it. We generated 1,500 synthetic trajectories (300 per pattern), labeled each one with its true failure type, ran them through the classifier, and checked how often it got the right answer. The test used the four trajectories defined in the previous post as base signals, then added noise at increasing levels to map the performance envelope.
It took three versions to get a clean classifier. The path there is worth documenting, because it validates the framework in a way that a single clean result would not.
Step 1: Run the Classifier on Its Own Examples
Before touching noise levels or bulk trials, the obvious first test is: does classify_trajectory() correctly label the four example trajectories shown in the first post?
| Model | Expected | V1 Returns | Status |
|---|---|---|---|
| A | early_collapse | steady_degradation | ✗ wrong |
| B | late_drift | healthy | ✗ wrong |
| C | steady_degradation | steady_degradation | ✓ correct |
| D | recovery | recovery | ✓ correct |
Two of four fail. Both have the same root cause. The original classifier checks the slope within each segment to detect early_collapse and late_drift. But both patterns, as they appear in the example data, are cross-segment drops — invisible to within-segment arithmetic.
Model A: early = [0.90, 0.91, 0.88], within-segment slope = −0.007. The collapse happens at step 4, after the early segment ends. The slope check reaches −0.007, far from the −0.15 threshold, and falls through to steady_degradation.
Model B: late = [0.88, 0.87, 0.55, 0.45], per-step slope = −0.1075. The threshold is −0.15. Close but not there. Falls through every branch and returns healthy.
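Both misses reproduce in a few lines. The endpoint-difference-over-length slope below is one plausible reading of the V1 check — it is what recovers the post's −0.007 and −0.1075 — since the V1 source itself is not shown here:

```python
# Model A's early segment and Model B's late segment, from the post
early_a = [0.90, 0.91, 0.88]
late_b = [0.88, 0.87, 0.55, 0.45]

# V1-style within-segment slope: endpoint difference over segment length
slope_a = (early_a[-1] - early_a[0]) / len(early_a)  # ≈ -0.007
slope_b = (late_b[-1] - late_b[0]) / len(late_b)     # = -0.1075

THRESHOLD = -0.15  # V1's collapse/drift threshold: neither value clears it
```

Model A's collapse lives entirely between segments, so no within-segment slope can see it; Model B's drift is real but diluted by the two flat scores at the start of its late segment.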
Step 2: Three Rounds to Get It Right
V1 → V2: The obvious fix for early_collapse is to compare segment means instead of within-segment slope: if the early-third mean drops substantially by the mid-third, flag it. That works. early_collapse classification jumps from 0% to 67% accuracy. But it immediately breaks recovery. Recovery trajectories also have a large early-to-mid drop before they correct. The mean comparison cannot tell them apart because it is not looking at what happens in the final third.
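The V2 confusion is easy to demonstrate. The check below is a sketch of the V2-style mean comparison, and the two score sequences are invented stand-ins (the article's original trajectories are not reproduced here):

```python
def v2_early_collapse(s):
    # V2-style check: flag early collapse when the early-third mean
    # drops sharply by the mid-third -- it never looks at the final third
    t = len(s) // 3
    return (sum(s[:t]) / t - sum(s[t:2*t]) / t) > 0.20

collapse = [0.90, 0.91, 0.88, 0.45, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28]
recovery = [0.85, 0.88, 0.87, 0.40, 0.55, 0.65, 0.75, 0.80, 0.82, 0.85]
# both sequences trip the check, even though the second one recovers
```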
V2 → V3: The fix requires two changes together. First, check recovery before early_collapse: recovery has to clear the gate first. Second, add a confirming condition to early_collapse: the early mean must exceed both the mid mean and the late mean by more than 0.20. A recovery trajectory has a high late mean, so it does not get caught. Early collapse stays low through the end, so it does. Both conditions together resolve the confusion entirely.
The iteration matters because it demonstrates that the framework caught a real bug rather than a contrived one. Fixing it required understanding the structural difference between the patterns, not just adjusting a threshold.
V3: The Validated Classifier
```python
def classify_trajectory(scores):
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean = sum(s[t:2*t]) / t
    late_mean = sum(s[2*t:]) / (n - 2*t)
    slope_l = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)
    # 1. Recovery first: dip > 0.20, then climbs back > 0.10 above the dip
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'
    # 2. Early collapse: early strong, drops and stays low through mid AND late
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'
    # 3. Late drift: final segment slopes clearly negative
    if slope_l < -0.12:
        return 'late_drift'
    # 4. Steady degradation: overall decline without a structural break
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'
    return 'healthy'
```
All four article trajectories now classify correctly. At N=1,500 trajectories across five patterns, V3 achieves 94% overall accuracy and a macro F1 of 0.94 — compared to 76% and 0.69 for V1.
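A standalone smoke test makes the same point without the full simulation harness. The five score sequences below are invented stand-ins, one per pattern (the article's originals are not reproduced here), and the classifier is repeated so the snippet runs on its own:

```python
def classify_trajectory(scores):
    # V3 classifier, repeated here so this snippet is self-contained
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean = sum(s[t:2*t]) / t
    late_mean = sum(s[2*t:]) / (n - 2*t)
    slope_l = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'
    if slope_l < -0.12:
        return 'late_drift'
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'
    return 'healthy'

# one invented trajectory per pattern
examples = {
    'early_collapse':     [0.90, 0.91, 0.88, 0.45, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28],
    'late_drift':         [0.85, 0.87, 0.88, 0.89, 0.90, 0.89, 0.88, 0.87, 0.55, 0.45],
    'steady_degradation': [0.85, 0.82, 0.79, 0.76, 0.73, 0.70, 0.67, 0.64, 0.61, 0.58],
    'recovery':           [0.85, 0.88, 0.87, 0.40, 0.55, 0.65, 0.75, 0.80, 0.82, 0.85],
    'healthy':            [0.70, 0.75, 0.80, 0.82, 0.85, 0.86, 0.88, 0.89, 0.90, 0.90],
}
results = {label: classify_trajectory(s) for label, s in examples.items()}
# every label round-trips: results[label] == label for all five patterns
```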
| Pattern | V1 F1 | V3 F1 | What changed |
|---|---|---|---|
| early_collapse | 0.00 | 0.98 | Segment mean comparison replaces within-segment slope |
| late_drift | 0.89 | 0.85 | Slight drop — mild drifts overlap with healthy at low noise |
| steady_degradation | 0.66 | 0.97 | No longer absorbs early_collapse misclassifications |
| recovery | 1.00 | 0.99 | Recovery-first ordering protects this branch |
| healthy | 0.93 | 0.90 | Mild late drifts occasionally read as healthy — acceptable |
The late_drift F1 is worth a note. A mild late drift that builds toward a peak and then softens gradually can genuinely look like a healthy arc when the decline is below the slope threshold. That 13% leak into healthy is not a bug to chase: those are trajectories where the drop is too gentle to warrant a Repair sprint, and classifying them as healthy and moving on is probably the right call. The cases worth catching, such as sharp late collapses, are already classified at near-perfect precision.
How Noise Degrades Classification
The noise sweep uses the four article trajectories as base signals, adds Gaussian noise at five levels from σ = 0.02 to σ = 0.15, and runs 5,000 trials per cell. σ represents the combined measurement noise from your NLE scorer plus natural step-to-step variation — the ambient noise floor of any real evaluation run.
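One cell of that sweep can be sketched in a few lines. The base trajectory here is an invented stand-in for one of the four base signals, and clipping scores back to [0, 1] is an assumption about the harness:

```python
import random

random.seed(0)  # deterministic for illustration

# hypothetical base signal standing in for one of the sweep's base trajectories
base = [0.85, 0.87, 0.88, 0.89, 0.90, 0.89, 0.88, 0.87, 0.55, 0.45]

def noisy_copy(traj, sigma):
    # Gaussian measurement noise per step, clipped to the valid score range
    return [min(1.0, max(0.0, x + random.gauss(0.0, sigma))) for x in traj]

trial = noisy_copy(base, sigma=0.08)  # one of the 5,000 trials in this cell
```

Each cell then classifies its 5,000 noisy copies and records how often the label matches the base pattern.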
An accuracy of 80% is a practical reliability floor for production use. All three non-recovery patterns drop below it at σ = 0.08, while recovery stays above 80% even at σ = 0.15. At σ = 0.08 the gap between the strongest and weakest patterns is 25 percentage points. Measure your NLE scorer's own variance before treating classifier output as actionable.
Three Boundary Conditions
Steady degradation is the pattern most sensitive to measurement noise. At σ = 0.05 accuracy is 95%; by σ = 0.08 it drops to 76%, and by σ = 0.11 it is at 58%. The detection logic — s[0] − s[-1] > 0.15 — has the smallest signal margin of the four branches. Any trajectory with a higher start than finish qualifies, so noisy early_collapse and late_drift trajectories bleed into this bucket under measurement variance.
Fix: Run 10–15 duplicate scorings on the same step outputs to estimate your NLE's σ before committing to a task suite. If σ > 0.06, steady_degradation results are not reliably actionable; flag them as unresolved and pull a manual sample before intervening.
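Estimating σ is one line once the duplicate scorings are in hand. The scores below are invented, as is the variable naming; only the 0.06 gate comes from the text:

```python
from statistics import stdev

# hypothetical: 12 duplicate NLE scorings of the same step output
repeats = [0.78, 0.81, 0.75, 0.80, 0.77, 0.79, 0.82, 0.76, 0.78, 0.80, 0.79, 0.77]

sigma_hat = stdev(repeats)  # sample standard deviation as the noise estimate

# gate from the text: above 0.06, steady_degradation labels are not actionable
steady_degradation_actionable = sigma_hat <= 0.06
```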
Recovery is the most robust pattern across all noise levels, staying above 80% even at σ = 0.15. But that robustness comes from the size of a single-step drop in the base trajectory (0.45 points), not from anything general about detecting self-correction. A genuine recovery that unfolds across two smaller drops (−0.12, then −0.12) produces no entry in dips and gets silently classified as steady_degradation. Real agent self-correction often unfolds gradually. The classifier is detecting the presence of a cliff, not the act of recovering.
Fix: Replace the pairwise dip detector with a 2-step rolling minimum: min(s[i-1], s[i]) < s[i-2] - 0.15, which catches recoveries that compress across two steps without requiring a single-step cliff.
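A minimal sketch of that replacement, run against an invented gradual recovery with two drops of 0.12 each:

```python
def rolling_dips(s, drop=0.15):
    # proposed detector: flag step i when the 2-step rolling minimum
    # falls `drop` below the score two steps back
    return [i for i in range(2, len(s)) if min(s[i-1], s[i]) < s[i-2] - drop]

# invented gradual recovery: 0.88 -> 0.76 -> 0.64, then a climb back
gradual = [0.85, 0.88, 0.88, 0.76, 0.64, 0.70, 0.78, 0.83, 0.86, 0.88]

# the original pairwise detector needs a single-step cliff and finds none
pairwise = [i for i in range(1, len(gradual) - 1)
            if gradual[i] < gradual[i-1] - 0.20]
# pairwise == []; rolling_dips(gradual) flags step 4, the bottom of the dip
```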
Short trajectories break the segment arithmetic. The simulation used n = 10. At n = 5, n//3 = 1, so the early and mid segments are single scores and the late segment holds three. early_mean rests on one observation and slope_l reduces to the difference of two endpoints, so a single noisy score can swing the entire result. Classification accuracy at n = 5 falls roughly 25 percentage points below the n = 10 baseline for all four patterns.
Fix: Require n ≥ 7 before running the classifier. If the task naturally decomposes into fewer steps, aggregate adjacent step pairs to produce a longer effective sequence before scoring.
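The guard is a few lines. segment_sizes mirrors the classifier's n // 3 split; the MIN_STEPS constant and the sentinel-style guard function are naming assumptions for this sketch:

```python
MIN_STEPS = 7  # below this, the thirds shrink to 1-2 samples each

def segment_sizes(n):
    # mirror of the classifier's split: (early, mid, late) sample counts
    t = n // 3
    return t, t, n - 2 * t

def classifiable(scores):
    # hypothetical guard: refuse trajectories too short for stable segment stats
    return len(scores) >= MIN_STEPS

# segment_sizes(10) -> (3, 3, 4); segment_sizes(5) -> (1, 1, 3)
```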
The validated classifier operates reliably within a specific envelope: trajectories of 7 or more steps, NLE scorer variance below σ = 0.08, and failure events that compress into a single step. Outside that envelope it degrades in predictable ways. Knowing the failure mode of your measurement tool is part of trusting it. A heart rate monitor that drifts at altitude is still useful — you just need to know where it starts lying.
What Comes Next
The simulation above validates classifier behavior and maps its operating envelope on synthetic data. The next post builds on the same validation strategy used in earlier work on model quality and user retention — baking a known causal effect into a synthetic dataset so the estimator can be checked against ground truth.
The goal is the same: show that the ARC estimator recovers the right answer, characterize where it falls short, and build something teams can actually instrument in production. The disambiguation protocol will be validated against ground-truth failure labels. The behavioral grounding discipline will be tested empirically by checking whether the quality dimensions ARC identifies as load-bearing in eval actually surface as predictive factors in production engagement data.
That last test is the one that closes the loop between controlled evaluation and the real world.
No training data, model weights, or proprietary systems are involved in this analysis. Full technical specification is in the companion document.