
A classification framework is itself an instrument, and instruments fail in specific ways. Before putting ARC into production, the right move is to deliberately try to break it and document exactly where it does.

Stress-Testing the Arc
This is the second post in the ARC series. The first post, The Shape of a Good Answer, introduces the framework — what a good agent trajectory looks like and how to classify when it breaks. This post stress-tests the classifier itself.

Article by John Tribbia

There is a second problem on Bear Peak that I did not mention in the first post. You can run a perfect arc — controlled start, building rhythm, holding form through the summit — and still not know it if your heart rate monitor is lying to you. The Garmin spikes to 190 on a flat section. The GPS trace shows a detour you did not take. Trusting the instrument is a separate problem from running the right shape, and it requires its own kind of rigor. You calibrate a GPS watch against known distances. You verify the heart rate sensor against a chest strap. You do not ship the data until you understand what the errors look like.

The same separation applies to ARC. The previous post described what a good agent trajectory looks like and how to classify deviations from it. This one does what you should do with any new instrument before trusting it in production: deliberately tries to break it. The classifier gets stress-tested against synthetic trajectories at increasing noise levels. The failure modes are documented precisely. Knowing the operating envelope of your measurement tool is part of trusting the measurement.


Testing the Framework

The most honest thing you can do with a new framework is try to break it. We generated 1,500 synthetic trajectories (300 per pattern, across the four failure patterns plus healthy), labeled each one with its true failure type, ran them through the classifier, and checked how often it got the right answer. The test used the four trajectories defined in the previous post as base signals, then added noise at increasing levels to map the performance envelope.

It took three versions to get a clean classifier. The path there is worth documenting, because it validates the framework in a way that a single clean result would not.

Step 1: Run the Classifier on Its Own Examples

Before touching noise levels or bulk trials, the obvious first test is: does classify_trajectory() correctly label the four example trajectories shown in the first post?

Model   Expected             V1 returns           Status
A       early_collapse       steady_degradation   ✗ wrong
B       late_drift           healthy              ✗ wrong
C       steady_degradation   steady_degradation   ✓ correct
D       recovery             recovery             ✓ correct

Two of four fail. Both have the same root cause. The original classifier checks the slope within each segment to detect early_collapse and late_drift. But both patterns, as they appear in the example data, are cross-segment drops — invisible to within-segment arithmetic.

Model A: early = [0.90, 0.91, 0.88], within-segment slope = −0.007. The collapse happens at step 4, after the early segment ends. The slope check reaches −0.007, far from the −0.15 threshold, and falls through to steady_degradation.

Model B: late = [0.88, 0.87, 0.55, 0.45], per-step slope = −0.1075. The threshold is −0.15. Close but not there. Falls through every branch and returns healthy.

Step 2: Three Rounds to Get It Right

V1 → V2: The obvious fix for early_collapse is to compare segment means instead of within-segment slope: if the early-third mean drops substantially by the mid-third, flag it. That works. early_collapse classification jumps from 0% to 67% accuracy. But it immediately breaks recovery. Recovery trajectories also have a large early-to-mid drop before they correct. The mean comparison cannot tell them apart because it is not looking at what happens in the final third.
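The confusion is easy to reproduce. A minimal sketch of the V2-style check, using an illustrative recovery trajectory (a 0.45 cliff followed by a climb back; the article's exact series is not reproduced here):

```python
# Illustrative recovery trajectory: collapse at step 4, then self-correction.
recovery = [0.90, 0.88, 0.89, 0.44, 0.50, 0.60, 0.70, 0.78, 0.82, 0.85]

t = len(recovery) // 3
early_mean = sum(recovery[:t]) / t                             # ~0.89
mid_mean = sum(recovery[t:2*t]) / t                            # ~0.51: the dip sits in the middle third
late_mean = sum(recovery[2*t:]) / (len(recovery) - 2*t)        # ~0.79: the final third has recovered

# V2-style check: early-to-mid mean drop alone, never looking at the late third.
if early_mean - mid_mean > 0.20:
    print("V2 says: early_collapse")                           # fires, despite the late-third climb

# The signal V2 ignores: the early-to-late gap is small for a recovery.
print(f"early-to-late gap: {early_mean - late_mean:.2f}")
```

The snippet shows why the mean comparison alone cannot separate the two patterns: both have a large early-to-mid drop, and only the late third tells them apart.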

V2 → V3: The fix requires two changes together. First, check recovery before early_collapse: recovery has to clear the gate first. Second, add a confirming condition to early_collapse: the early mean must exceed both the mid mean and the late mean by more than 0.20. A recovery trajectory has a high late mean, so it does not get caught. Early collapse stays low through the end, so it does. Both conditions together resolve the confusion entirely.

The iteration matters because it demonstrates that the framework caught a real bug rather than a contrived one. Fixing it required understanding the structural difference between the patterns, not just adjusting a threshold.

V3: The Validated Classifier

def classify_trajectory(scores):
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean   = sum(s[t:2*t]) / t
    late_mean  = sum(s[2*t:]) / (n - 2*t)
    slope_l    = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)

    # 1. Recovery first: dip > 0.20, then climbs back > 0.10 above the dip
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'

    # 2. Early collapse: early strong, drops and stays low through mid AND late
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'

    # 3. Late drift: final segment slopes clearly negative
    if slope_l < -0.12:
        return 'late_drift'

    # 4. Steady degradation: overall decline without a structural break
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'

    return 'healthy'
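
As a sanity check, V3 can be run against four illustrative trajectories shaped to match the segment values quoted above (the article's exact score series are not reproduced here, so these lists are assumptions):

```python
def classify_trajectory(scores):
    # V3 classifier, repeated verbatim so this snippet runs standalone.
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean   = sum(s[t:2*t]) / t
    late_mean  = sum(s[2*t:]) / (n - 2*t)
    slope_l    = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'
    if slope_l < -0.12:
        return 'late_drift'
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'
    return 'healthy'

# Illustrative stand-ins for Models A-D, built to match the segment
# values quoted in the text; the article's exact series may differ.
examples = {
    'early_collapse':     [0.90, 0.91, 0.88, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28, 0.25],
    'late_drift':         [0.90, 0.91, 0.89, 0.90, 0.88, 0.89, 0.88, 0.87, 0.55, 0.45],
    'steady_degradation': [0.90, 0.87, 0.84, 0.80, 0.77, 0.74, 0.70, 0.67, 0.64, 0.60],
    'recovery':           [0.90, 0.88, 0.89, 0.44, 0.50, 0.60, 0.70, 0.78, 0.82, 0.85],
}

for expected, scores in examples.items():
    print(f"{expected:<20} -> {classify_trajectory(scores)}")
```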

All four article trajectories now classify correctly. Across N = 1,500 independently generated synthetic trajectories spanning five patterns, V3 achieves 80% overall accuracy and a macro F1 of 0.79. Recovery classifies with near-perfect precision; the weaker patterns — early_collapse, late_drift, and steady_degradation — cluster around F1 = 0.68–0.75, reflecting genuine boundary ambiguity between categories rather than classifier bugs.

Pattern              V1 F1   V3 F1   What changed
early_collapse       0.00    0.68    Segment mean comparison replaces within-segment slope; mild collapses still bleed into steady_degradation
late_drift                   0.70    Gradual late drops fall below the slope threshold and read as healthy
steady_degradation           0.75    Acts as a catch-all; absorbs misclassified early_collapse and late_drift cases
recovery                     0.98    Recovery-first ordering protects this branch; large cliff signal is distinctive
healthy                      0.84    Mild late drifts below the slope threshold land here — acceptable by design

The F1 scores for early_collapse, late_drift, and steady_degradation cluster in the 0.68–0.75 range on independently generated synthetic trajectories. The common failure mode is boundary ambiguity: a mild early collapse that doesn't drop far enough reads as steady_degradation; a gradual late drift that stays above the slope threshold reads as healthy; a noisy steady decline gets misattributed to whichever structural pattern the noise mimics. These are not classifier bugs to chase — they reflect cases where the signal is genuinely below the detection threshold. Recovery is the exception: its large single-step cliff (>0.20) is structurally distinctive and classifies at F1 = 0.98 even on hard independently drawn samples. The noise sweep in the next section shows that these same patterns hold under measurement noise, with recovery remaining robust through σ = 0.15 while the other three cross the 80% reliability floor around σ = 0.08.

How Noise Degrades Classification

The noise sweep uses the four article trajectories as base signals, adds Gaussian noise at five levels from σ = 0.02 to σ = 0.15, and runs 5,000 trials per cell. σ represents the combined measurement noise from your NLE scorer plus natural step-to-step variation — the ambient noise floor of any real evaluation run.

[Figure: Classification accuracy vs. NLE measurement noise (σ). Synthetic validation of the V3 classifier: article example trajectories as base signals, Gaussian noise, 5,000 trials per cell, one curve per failure pattern (early collapse, late drift, steady degradation, recovery).]

The dashed line marks 80% accuracy — a practical reliability floor for production use. All three non-recovery patterns cross it at σ = 0.08. Recovery stays above 80% even at σ = 0.15. At σ = 0.08 the gap between the strongest and weakest patterns is 25 percentage points. Measure your NLE scorer's own variance before treating classifier output as actionable.
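One cell of the sweep can be sketched as follows. The base trajectory here is an illustrative early-collapse series (the article's exact values are assumed), and the classifier is the V3 function from above, repeated so the snippet runs on its own:

```python
import random

def classify_trajectory(scores):
    # V3 classifier, repeated so the sweep runs standalone.
    n = len(scores)
    s = list(scores)
    t = n // 3
    early_mean = sum(s[:t]) / t
    mid_mean = sum(s[t:2*t]) / t
    late_mean = sum(s[2*t:]) / (n - 2*t)
    slope_l = (s[-1] - s[2*t]) / max(n - 2*t - 1, 1)
    dips = [i for i in range(1, n-1) if s[i] < s[i-1] - 0.20]
    if dips and s[-1] > s[dips[0]] + 0.10:
        return 'recovery'
    if (early_mean - mid_mean) > 0.20 and (early_mean - late_mean) > 0.20:
        return 'early_collapse'
    if slope_l < -0.12:
        return 'late_drift'
    if (s[0] - s[-1]) > 0.15:
        return 'steady_degradation'
    return 'healthy'

def sweep_cell(base, true_label, sigma, trials=5000, seed=0):
    """Accuracy of the classifier on one (pattern, sigma) cell of the sweep."""
    rng = random.Random(seed)
    hits = sum(
        classify_trajectory([x + rng.gauss(0.0, sigma) for x in base]) == true_label
        for _ in range(trials)
    )
    return hits / trials

# Illustrative early-collapse base signal (the article's exact series is assumed).
base = [0.90, 0.91, 0.88, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28, 0.25]
for sigma in (0.02, 0.05, 0.08, 0.11, 0.15):
    print(f"sigma={sigma:.2f}  accuracy={sweep_cell(base, 'early_collapse', sigma):.3f}")
```

Running the same loop for each base trajectory fills in one row of the accuracy-vs-noise figure.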

Three Boundary Conditions

Boundary 1 — Steady degradation requires a low-noise scorer

Steady degradation is the pattern most sensitive to measurement noise. At σ = 0.05 accuracy is 95%; by σ = 0.08 it drops to 76%, and by σ = 0.11 it is at 58%. The detection logic — s[0] − s[-1] > 0.15 — has the smallest signal margin of the four branches. Any trajectory with a higher start than finish qualifies, so noisy early_collapse and late_drift trajectories bleed into this bucket under measurement variance.

Fix: Run 10–15 duplicate scorings on the same step outputs to estimate your NLE's σ before committing to a task suite. If σ > 0.06, steady_degradation results are not reliably actionable; flag them as unresolved and pull a manual sample before intervening.
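The estimate itself is one line of statistics. A sketch, with hypothetical duplicate scores (the list below is invented for illustration):

```python
import statistics

# Hypothetical: the same step output scored 12 times by the NLE scorer.
duplicate_scores = [0.78, 0.81, 0.76, 0.80, 0.79, 0.77,
                    0.82, 0.78, 0.80, 0.76, 0.79, 0.81]

sigma_hat = statistics.stdev(duplicate_scores)  # sample standard deviation
print(f"estimated scorer sigma: {sigma_hat:.3f}")

# Per the boundary above: past sigma = 0.06, treat steady_degradation
# calls as unresolved rather than actionable.
if sigma_hat > 0.06:
    print("steady_degradation results unreliable: flag and sample manually")
else:
    print("scorer noise within the actionable range for steady_degradation")
```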

Boundary 2 — Recovery detection measures cliff size, not recovery

Recovery is the most robust pattern across all noise levels, staying above 80% even at σ = 0.15. But that robustness comes from the size of a single-step drop in the base trajectory (0.45 points), not from anything general about detecting self-correction. A genuine recovery that unfolds across two smaller drops (−0.12, then −0.12) produces no entry in dips and gets silently classified as steady_degradation. Real agent self-correction often unfolds gradually. The classifier is detecting the presence of a cliff, not the act of recovering.

Fix: Replace the pairwise dip detector with a 2-step rolling minimum: min(s[i-1], s[i]) < s[i-2] - 0.15, which catches recoveries that compress across two steps without requiring a single-step cliff.
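A sketch of that replacement, checked against a two-step recovery that the pairwise detector misses (the trajectory is illustrative):

```python
def dips_pairwise(s, drop=0.20):
    """Original detector: a single step must fall by more than `drop`."""
    return [i for i in range(1, len(s) - 1) if s[i] < s[i - 1] - drop]

def dips_rolling(s, drop=0.15):
    """2-step rolling minimum: catches drops spread across two steps."""
    return [i for i in range(2, len(s)) if min(s[i - 1], s[i]) < s[i - 2] - drop]

# Illustrative gradual recovery: two -0.12 drops, then a climb back.
gradual = [0.90, 0.89, 0.88, 0.76, 0.64, 0.70, 0.78, 0.84, 0.88, 0.90]

print(dips_pairwise(gradual))   # [] -> the single-step cliff detector sees nothing
print(dips_rolling(gradual))    # [4] -> the compressed two-step drop is caught
```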

Boundary 3 — Task length below 7 steps breaks segment estimation

The simulation used n = 10. At n = 5, n//3 = 1, so the early segment is a single score and the late segment is two scores. Both early_mean and slope_l are computed from samples of size 1–2, where one noisy observation can swing the entire result. Classification accuracy at n = 5 falls roughly 25 percentage points below the n = 10 baseline for all four patterns.

Fix: Require n ≥ 7 before running the classifier. If the task naturally decomposes into fewer steps, split composite steps into sub-steps (for example, score each tool call rather than each task step) to produce a longer effective sequence before scoring.
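The length check itself is a one-line precondition. A sketch (the wrapper name is an assumption, not part of the framework):

```python
def check_trajectory_length(scores, min_steps=7):
    """Refuse to classify sequences too short for stable segment estimates.

    Below min_steps, n // 3 yields segments of 1-2 scores and a single
    noisy observation can swing the whole classification.
    """
    if len(scores) < min_steps:
        raise ValueError(
            f"need at least {min_steps} step scores, got {len(scores)}: "
            "too few for reliable segment estimation"
        )
    return scores

check_trajectory_length([0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4])  # ok: n = 7
try:
    check_trajectory_length([0.9, 0.7, 0.5, 0.4, 0.3])        # n = 5: rejected
except ValueError as e:
    print(e)
```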

The validated classifier operates reliably within a specific envelope: trajectories of 7 or more steps, NLE scorer variance below σ = 0.08, and failure events that compress into a single step. Outside that envelope it degrades in predictable ways. Knowing the failure mode of your measurement tool is part of trusting it. A heart rate monitor that drifts at altitude is still useful — you just need to know where it starts lying.


What Comes Next

The simulation above validates classifier behavior and maps its operating envelope on synthetic data. The next post builds on the same validation strategy used in earlier work on model quality and user retention — baking a known causal effect into a synthetic dataset so the estimator can be checked against ground truth.

The goal is the same: show that the ARC estimator recovers the right answer, characterize where it falls short, and build something teams can actually instrument in production. The disambiguation protocol will be validated against ground-truth failure labels. The behavioral grounding discipline will be tested empirically by checking whether the quality dimensions ARC identifies as load-bearing in eval actually surface as predictive factors in production engagement data.

That last test is the one that closes the loop between controlled evaluation and the real world.

No training data, model weights, or proprietary systems are involved in this analysis. Full technical specification is in the companion document.

AI Usage

Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.