Article by John Tribbia
The model quality framework established that within-version quality variation predicts user engagement, and showed an estimator that recovers 90% of a known injected effect. It used two measurement choices that were serviceable starting points: scalar 1–5 quality ratings per prompt category, and active days per week as the engagement outcome.
Each choice compresses information that matters. Scalar ratings collapse a trajectory into a single number. A category can score 3.5 because it held steady across all 10 steps of a task, or because it opened at 4.5 and collapsed. Those are different problems with different fixes, as described in The Shape of a Good Answer. Active days is coarser than it needs to be: whether a user returns next week is a sharper decision boundary than how many days they were active in the current week.
This post replaces both. ARC trajectory scores replace scalar ratings; 7-day return probability replaces active days. The identification strategy and model specification are untouched. Mean score alone recovers 89% of the injected effect, consistent with the prior work. But early trajectory slope as a second predictor improves model fit and shifts the investment map in a way scalar ratings cannot produce.
What ARC Adds to the Quality Score
The prior framework computed each user's experienced quality as a weighted average of category-level ratings:
$$Q_{i,t} = \sum_c w_{i,c} \cdot q_{c,v(t)}$$

where $w_{i,c}$ is the user's frozen pre-period category mix and $q_{c,v(t)}$ is the offline quality rating for category $c$ under model version $v(t)$.
The ARC version replaces $q_{c,v(t)}$ with $\bar{s}_{c,v(t)}$: the mean per-step score from the full ARC trajectory evaluation for that category under that version. Step scores run from 0 to 1, scored by a calibrated NLE judge against per-step sub-goal definitions. The composite quality exposure formula is otherwise identical:
$$\text{ARC}_{i,t} = \sum_c w_{i,c} \cdot \bar{s}_{c,v(t)}$$

The mean score occupies the same structural role as the original scalar rating. The 10-step vector also exposes slope features: the early slope (linear trend across steps 1–3) and the late slope (linear trend across steps 8–10). Early slope is the focus here because Early Collapse (the ARC failure pattern where a model commits to a flawed direction before fully orienting) is the most consequential trajectory failure for user experience. Confidently wrong early steps, followed by locally coherent work built on that bad foundation, are a different failure from a steady 0.70 throughout.
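As a minimal sketch of how the mean and slope features could be extracted from one 10-step trajectory, the snippet below fits a least-squares line over steps 1 to 3 and steps 8 to 10. The function name and both example trajectories are hypothetical; the article does not specify how its linear trend is fitted, so ordinary least squares via `np.polyfit` is an assumption.

```python
import numpy as np

def trajectory_features(scores):
    """Return (mean score, early slope, late slope) for a 10-step trajectory.

    Assumption: "linear trend" means an ordinary least-squares slope.
    """
    s = np.asarray(scores, dtype=float)
    steps = np.arange(1, len(s) + 1)
    # Slope of the fitted line over steps 1-3 and over steps 8-10.
    early = np.polyfit(steps[:3], s[:3], 1)[0]
    late = np.polyfit(steps[7:10], s[7:10], 1)[0]
    return s.mean(), early, late

# Two hypothetical trajectories with the same mean (0.70) but different shapes.
steady = [0.70] * 10
collapse = [0.90, 0.85, 0.80, 0.70, 0.65, 0.65, 0.62, 0.62, 0.61, 0.60]

for name, traj in [("steady", steady), ("collapse", collapse)]:
    mean, early, late = trajectory_features(traj)
    print(f"{name:9s} mean={mean:.3f} early_slope={early:+.3f} late_slope={late:+.3f}")
```

Both trajectories average 0.70, so a scalar score cannot separate them; only the early slope does.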
The slope exposure is constructed using the same frozen weights applied to the early slope values instead of mean scores:
$$\text{EarlySlope}_{i,t} = \sum_c w_{i,c} \cdot \text{esl}_{c,v(t)}$$

Both composites are centered within version before entering the model, removing deployment-boundary jumps exactly as in the prior work.
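The composite and the within-version centering can be sketched with a small matrix product. The two users and their weights are hypothetical; the category scores are the v1.0 and v1.1 means for three categories from the article's metrics table. Centering here is over users within a version, a simplification of centering over all user-week observations in the real panel.

```python
import pandas as pd

# Hypothetical frozen pre-period category weights (each row sums to 1).
weights = pd.DataFrame(
    {"Coding": [0.6, 0.1], "Creative Writing": [0.1, 0.6], "General QA": [0.3, 0.3]},
    index=pd.Index(["u1", "u2"], name="user_id"),
)

# Category-level mean step scores by version (three categories only, for brevity).
mean_scores = pd.DataFrame(
    {"Coding": [0.700, 0.822], "Creative Writing": [0.558, 0.658], "General QA": [0.700, 0.776]},
    index=pd.Index(["v1.0", "v1.1"], name="version"),
)

# ARC_{i,t} = sum_c w_{i,c} * s_bar_{c,v(t)}: one composite per user per version.
composite = weights.dot(mean_scores.T)

# Center within version so deployment-boundary jumps drop out.
centered = composite - composite.mean(axis=0)

print(composite.round(4))
print(centered.round(4))
```

The same two lines apply to the early slope composite by swapping `mean_scores` for a table of per-category early slopes.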
Setup
Same structure as the prior work: 100,000 users, three model versions over 26 weeks, ~1.65 million session records. The offline evaluation (50,000 records) now produces a 10-step trajectory per category-version pair. Mean score and early slope are extracted from each.
Mean ARC scores and early slopes by category and version:
ARC trajectory metrics — mean step score (0–1) and early slope (Δ score per step, steps 1–3)
| Category | v1.0 mean | v1.0 slope | v1.1 mean | v1.1 slope | v1.2 mean | v1.2 slope |
|---|---|---|---|---|---|---|
| Coding | 0.700 | +0.011 | 0.822 | +0.018 | 0.882 | +0.022 |
| Creative Writing | 0.558 | −0.028 | 0.658 | −0.015 | 0.754 | −0.006 |
| General QA | 0.700 | +0.008 | 0.776 | +0.012 | 0.834 | +0.016 |
| Math/Logic | 0.582 | −0.019 | 0.700 | −0.008 | 0.826 | +0.004 |
| Scientific | 0.658 | +0.005 | 0.720 | +0.009 | 0.812 | +0.014 |
The mean scores are the prior work's 1–5 ratings normalized to 0–1. The early slope column is a separate story. Math/Logic was the most severe Early Collapse category in v1.0 after Creative Writing (slope −0.019); by v1.2 it has crossed into positive territory (+0.004). Creative Writing has not made that crossing: its mean score improved 0.196 points across three versions, the second-largest absolute gain (behind Math/Logic at 0.244), yet its early slope is still negative in v1.2 (−0.006). The model is better at creative writing but still tends to commit before it has oriented.
A product team tracking only mean scores would read the v1.2 Creative Writing column as a success. A team tracking trajectory shape would see a persistent orientation problem that improvements in raw output quality have not resolved.
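The comparison above can be checked directly from the table. This is a small sketch that copies the v1.0 and v1.2 columns and ranks categories by mean-score gain; the variable names are illustrative only.

```python
# (v1.0, v1.2) mean step scores and early slopes, copied from the metrics table.
metrics = {
    "Coding":           {"mean": (0.700, 0.882), "slope": (0.011, 0.022)},
    "Creative Writing": {"mean": (0.558, 0.754), "slope": (-0.028, -0.006)},
    "General QA":       {"mean": (0.700, 0.834), "slope": (0.008, 0.016)},
    "Math/Logic":       {"mean": (0.582, 0.826), "slope": (-0.019, 0.004)},
    "Scientific":       {"mean": (0.658, 0.812), "slope": (0.005, 0.014)},
}

# Absolute gain in mean score from v1.0 to v1.2, largest first.
gains = sorted(
    ((round(m["mean"][1] - m["mean"][0], 3), cat) for cat, m in metrics.items()),
    reverse=True,
)

# Categories whose early slope is still negative in v1.2.
still_collapsing = [cat for cat, m in metrics.items() if m["slope"][1] < 0]

for gain, cat in gains:
    print(f"{cat:17s} mean gain = {gain:.3f}")
print("negative early slope in v1.2:", still_collapsing)
```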
In v1.0, Coding and General QA trace a flat or gently rising arc: both orient before the task steepens. Creative Writing and Math/Logic open high and fall, a signature of Early Collapse. Coding (0.700) and General QA (0.700) have identical mean scores; their trajectories are indistinguishable by the scalar. An NLE scoring each step separately resolves that ambiguity.
Outcome: 7-Day Return Probability
The prior work used active days per week modeled as a bounded proportion (0–7, binomial link). This analysis uses a cleaner binary: given that user $i$ was active in week $t$, did they return in week $t+1$? The 7-day return indicator $r_{i,t} \in \{0,1\}$ is modeled with a logistic link.
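One way the return indicator might be derived from weekly activity records is a self-join on the following week. The toy activity log and column names below are hypothetical, not the schema of the article's session file.

```python
import pandas as pd

# Hypothetical activity log: one row per (user, week) in which the user was active.
active = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],
    "week":    [1,   2,   4,   1,   3,   2],
})

# Shift each active week back by one so it lines up with the week it "answers".
next_week = active.assign(week=active["week"] - 1, returned_next_week=1)

# r_{i,t} = 1 if the same user also has a row for week t + 1, else 0.
panel = active.merge(next_week, on=["user_id", "week"], how="left")
panel["returned_next_week"] = panel["returned_next_week"].fillna(0).astype(int)

print(panel)
```

In a real panel the final observed week has no known outcome and would need to be dropped rather than coded as 0.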
The binary framing maps directly to weekly retention as a business concept. An active-days model asks how engaged users were this week; a return probability model asks whether they came back at all, the decision that compounds into long-run retention and lifetime value.
The known causal effect injected into the DGP is $\beta_{\text{true}} = 10.0$ log-odds per unit of centered composite ARC score (on the 0–1 scale). With a within-version SD of approximately 0.019, the marginal per-SD effect is 0.19 log-odds. At a baseline return probability of 65%, that translates to roughly 4 percentage points of retention per standard deviation of quality exposure, the same magnitude as the prior work expressed in return-probability units.
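The conversion from log-odds to percentage points can be worked through in a few lines. This sketch uses only the numbers quoted above and compares the exact logistic shift with the first-order approximation $\beta \cdot \sigma \cdot p(1-p)$.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

beta_true = 10.0    # log-odds per unit of centered composite ARC score
sd_within = 0.019   # approximate within-version SD of the composite
baseline = 0.65     # baseline return probability

per_sd_log_odds = beta_true * sd_within
exact_lift = expit(logit(baseline) + per_sd_log_odds) - baseline
linear_lift = per_sd_log_odds * baseline * (1 - baseline)

print(f"per-SD effect: {per_sd_log_odds:.2f} log-odds")
print(f"exact lift:    {100 * exact_lift:.1f} pp")
print(f"linear approx: {100 * linear_lift:.1f} pp")
```

Both routes land near 4 percentage points per standard deviation.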
The model:
```r
library(mgcv)

# panel_sub is a placeholder name for the stratified subsample described below
gamm1 <- bam(returned_next_week ~
    s(ARC_it_c, bs = "tp", k = 10) +   # within-version mean ARC (key predictor)
    version_f +                        # absorbs deployment-boundary shifts
    s(week, bs = "tp", k = 10) +       # residual time trends
    s(user_id_factor, bs = "re") +     # user-level random intercepts
    user_type +                        # Consumer vs Enterprise
    pre_project_engagement_score,      # baseline historical engagement
  data = panel_sub,
  family = binomial(), method = "fREML", discrete = TRUE)
```
The model is fit on a stratified 2,000-user subsample (~30,000 observations), preserving the 70/30 Consumer/Enterprise split from the full panel.
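A stratified user-level subsample of this kind could be drawn as follows. This is a sketch on a hypothetical 1,000-user table, not the article's sampling code; the point is that users are sampled within each `user_type` stratum at the same rate, and then all of their session rows are kept.

```python
import pandas as pd

# Hypothetical user table with a 70/30 Consumer/Enterprise split.
users = pd.DataFrame({
    "user_id": range(1000),
    "user_type": ["Consumer"] * 700 + ["Enterprise"] * 300,
})

# Sample the same fraction inside each stratum so the split is preserved.
sampled_users = users.groupby("user_type").sample(frac=0.1, random_state=42)

# Toy session panel; keep every row belonging to a sampled user.
panel = pd.DataFrame({"user_id": [0, 0, 1, 999], "week": [1, 2, 1, 1]})
panel_sub = panel[panel["user_id"].isin(sampled_users["user_id"])]

print(sampled_users["user_type"].value_counts())
```

Sampling users rather than rows keeps each user's full history together, which the random intercept and the cluster bootstrap both rely on.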
Results
Mean ARC Score Predicts Return Probability
| Term | Estimate / edf | Test Stat | p-value |
|---|---|---|---|
| s(ARC_it_c) | edf = 1.58 | chi-sq = 318.44 | < 2 × 10⁻¹⁶ |
| version_f v1.1 | B = 0.201 | z = 10.82 | < 2 × 10⁻¹⁶ |
| version_f v1.2 | B = 0.384 | z = 11.27 | < 2 × 10⁻¹⁶ |
| s(week) | edf = 1.44 | chi-sq = 0.47 | 0.814 |
| pre_project_engagement_score | B = 0.024 | z = 92.18 | < 2 × 10⁻¹⁶ |
| user_type (Enterprise) | B = 0.011 | z = 1.04 | 0.298 |
Deviance explained: 26.8% | Adj. R² = 0.284
The within-version quality effect is highly significant (chi-sq = 318.44, p < 2 × 10⁻¹⁶). The version-level shifts (B = 0.201 and B = 0.384) are larger in absolute magnitude than the within-version smooth, the same structure as the prior work. Those jumps are causally uninterpretable for the same reason: everything changes at a deployment boundary. The smooth on ARC_it_c is doing the interpretable work.
The residual time trend is flat (p = 0.814). The edf of 1.58 indicates slight curvature, matching the prior work (edf = 1.62), but the dominant relationship is linear. Pre-period engagement is the strongest individual predictor; Enterprise user type adds nothing once baseline engagement is controlled.
Calibration: Does the Estimator Recover the Right Effect?
| Method | β̂ | 95% CI | Recovery |
|---|---|---|---|
| Linear parametric (ARC_it_c) | 8.87 | [7.93, 9.81] | 89% |
| GAM smooth (effective slope) | 8.71 | — | 87% |
| Cluster bootstrap (B=100) | 8.84 | [7.66, 10.02] | 88% |
The estimator recovers 89% of the true effect (β_true = 10.0). The 11% attenuation is the same mechanism as before: observed usage proportions are noisy estimates of each user's true category preferences, and that measurement error attenuates the exposure coefficient toward zero, textbook errors-in-variables bias. The cluster bootstrap CI [7.66, 10.02] contains the true value; the single-model parametric CI [7.93, 9.81] narrowly excludes it. That asymmetry replicates the prior finding exactly, and the recommendation is unchanged: use cluster-robust inference.
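User-level block resampling can be sketched as below. Several things are assumptions: a plain two-parameter logistic regression fitted by Newton-Raphson stands in for the article's linear parametric model, the data are a toy panel generated in the snippet, and the exposure SD is set to 0.1 (much larger than the article's 0.019) so that the small example is stable.

```python
import numpy as np

def fit_logit_slope(x, y, iters=25):
    """Newton-Raphson fit of logit P(y=1) = a + b*x; returns the slope b."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta[1]

def cluster_bootstrap(user_ids, x, y, n_boot=100, seed=0):
    """Resample whole users with replacement and refit on each draw."""
    rng = np.random.default_rng(seed)
    users = np.unique(user_ids)
    rows = {u: np.flatnonzero(user_ids == u) for u in users}
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(users, size=len(users), replace=True)
        idx = np.concatenate([rows[u] for u in sampled])
        draws.append(fit_logit_slope(x[idx], y[idx]))
    draws = np.array(draws)
    return draws.mean(), np.percentile(draws, [2.5, 97.5])

# Toy panel: 300 users x 10 weeks, user-level intercepts, true slope 10.0.
rng = np.random.default_rng(1)
n_users, n_weeks = 300, 10
user_ids = np.repeat(np.arange(n_users), n_weeks)
x = rng.normal(0.0, 0.1, size=n_users * n_weeks)
u = np.repeat(rng.normal(0.0, 0.5, size=n_users), n_weeks)
prob = 1.0 / (1.0 + np.exp(-(0.6 + u + 10.0 * x)))
y = (rng.random(n_users * n_weeks) < prob).astype(float)

est, (lo, hi) = cluster_bootstrap(user_ids, x, y)
print(f"bootstrap mean slope = {est:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Because whole users are resampled, within-user correlation carries into every replicate, which is why this interval tends to be wider than the single-model one.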
The falsification check uses user-weight permutation: shuffle quality exposure across users within each version period, breaking all correlation between the composite score and user identity while preserving marginal distributions. The permuted model returns β = −0.82, p = 0.411. No signal where none was planted.
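The permutation step might look like the sketch below, which shuffles an exposure column within each version. The function and the toy frame are hypothetical, and the frame is simplified to one row per user per version; the article's check permutes at the user-weight level.

```python
import numpy as np
import pandas as pd

def permute_within_version(df, col, seed=0):
    """Shuffle `col` across rows within each version, leaving other columns intact."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[col] = df.groupby("version")[col].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    return out

# Toy data: one centered exposure value per user per version.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 1, 2, 3, 4],
    "version": ["v1.0"] * 4 + ["v1.1"] * 4,
    "ARC_it_c": [-0.02, -0.01, 0.01, 0.02, -0.03, 0.00, 0.01, 0.02],
})

shuffled = permute_within_version(df, "ARC_it_c", seed=1)
print(shuffled)
```

Each version keeps exactly the same set of exposure values, so only the link between exposure and user identity is broken. Refitting the model on the shuffled column should then return a coefficient near zero.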
Early Slope as a Second Predictor
Adding early slope alongside mean ARC score tests whether trajectory shape predicts retention independently of average quality level.
| Model | Predictors | Dev. Explained | ΔAIC |
|---|---|---|---|
| GAMM-1 | mean ARC score only | 26.8% | — |
| GAMM-2 | mean ARC + early slope | 28.4% | −34.2 |
Early slope adds 1.6 percentage points of deviance explained and cuts AIC by 34. Both predictors are significant (mean ARC: chi-sq = 276.1, p < 10⁻¹⁶; early slope: chi-sq = 88.7, p = 5.2 × 10⁻¹²) and only modestly correlated (r = 0.34).
Holding mean ARC score constant, a user whose category mix skews toward Early Collapse-prone categories is less likely to return next week than a user with the same mean score but better orientation quality.
Investment Map in Retention Units
Same counterfactual structure as the prior work: recompute each user's composite ARC score with frozen weights, propagate through the fitted model (recovered coefficient 8.87), and read off the P(return) change at the v1.0 baseline of 65%. A "+0.05 improvement" for a category means raising its mean step score by 0.05, for example from 0.700 to 0.750, roughly the scale of a targeted fine-tuning pass. The early slope scenario is separate: "+0.015" means shifting the category's early slope up by 0.015 per step, moving an Early Collapse pattern toward neutral without changing mean score.
Counterfactual retention lift — v1.0 baseline, P(return) = 0.65
| Scenario | Mean ΔARC | ΔP(return) | Per 100K users/week |
|---|---|---|---|
| All categories +0.02 | +0.0200 | +4.0 pp | +4,000 |
| Coding +0.05 | +0.0123 | +2.4 pp | +2,400 |
| Math/Logic +0.05 | +0.0107 | +2.1 pp | +2,100 |
| Creative Writing: mean +0.05 + fix early slope (+0.015) | +0.0076 | +2.1 pp | +2,100 |
| Creative Writing +0.05 (mean score only) | +0.0076 | +1.5 pp | +1,500 |
| Creative Writing: fix early slope only (+0.015) | — | +0.6 pp | +600 |
The combined Creative Writing scenario (mean improvement + early slope fix) reaches parity with Math/Logic mean-only. Scalar quality rankings make Creative Writing look like the worst investment; trajectory-aware analysis reveals it's undervalued.
The targeted improvement order follows usage weights (Coding at 24.6%, Math/Logic at 21.4%, Creative Writing at 15.1%), but the magnitudes aren't derivable from the ordering. You need the fitted model.
Creative Writing ranks last as a single-category target at 1,500 users per week, matching the prior analysis. But that ranking rests entirely on mean score. When early slope contributes separately, a Creative Writing investment addressing both mean quality and orientation quality returns 2,100 users, on par with Math/Logic. The trajectory penalty was invisible to scalar ratings.
The broad investment ("All categories +0.02") returns 4,000 users per week, 67% more than the best single-category scenario. Each category gets only a 0.02-point improvement, but it lifts every user rather than only those whose usage mix overlaps the targeted category. The argument for breadth holds across both frameworks.
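The mean-score rows of the investment map can be approximately reproduced from three quoted numbers: the recovered coefficient (8.87), the 65% baseline, and the usage weights. The sketch applies the shift at the population-average composite rather than user by user, which is a simplification of the per-user recomputation described above. The early slope scenarios are omitted because the fitted slope coefficient is not reported.

```python
import math

BETA_HAT = 8.87   # recovered coefficient from the linear parametric fit
BASELINE = 0.65   # v1.0 baseline P(return)

def retention_lift(delta_arc, beta=BETA_HAT, baseline=BASELINE):
    """Change in P(return) for a shift of delta_arc in the composite ARC score."""
    eta = math.log(baseline / (1 - baseline)) + beta * delta_arc
    return 1 / (1 + math.exp(-eta)) - baseline

# Average composite shift = usage weight x per-category improvement.
usage_weight = {"Coding": 0.246, "Math/Logic": 0.214, "Creative Writing": 0.151}
scenarios = {f"{cat} +0.05": w * 0.05 for cat, w in usage_weight.items()}
scenarios["All categories +0.02"] = 0.02  # weights sum to 1, so the shift is 0.02

for name, d_arc in scenarios.items():
    lift = retention_lift(d_arc)
    print(f"{name:24s} dARC={d_arc:.4f}  dP={100 * lift:.1f} pp  "
          f"per 100K={lift * 100_000:,.0f}")
```

The single-category rows match the table to one decimal place; the broad scenario comes out slightly under the table's 4.0 pp under this averaged approximation.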
Limitations
This is a proof of concept on synthetic data. The 89% recovery validates the estimator against a known answer; it is not a claim about any deployed product. On real data the effect could be larger, smaller, or absent.
The early slope finding also requires infrastructure most teams don't have. Per-step evaluation scores require either a step-level NLE scoring pipeline (as described in the ARC article) or a retrospective decomposition of existing evaluations into early and late phases. Without that, the trajectory features are unobservable.
Late slope is a natural third predictor but was excluded here: on 10-step tasks, adding both slopes introduces multicollinearity. On longer tasks with more heterogeneous trajectories, a three-predictor model is a reasonable extension.
The novelty-vs.-durability problem remains. A quality improvement in mean score or trajectory shape might spike retention for the first cohort and taper as the new level becomes expected. Distinguishing a durable gain from a novelty effect requires cohort tracking that can't be derived from the within-version structure used here.
Technical Appendix
- Analysis code: `aqr_analysis.R` (R 4.5.2, mgcv, dplyr, ggplot2, patchwork)
- Data generation: `aqr_generate_data.py` (Python 3.9, pandas, numpy; generates per-step ARC trajectories with `TRUE_BETA_MEAN = 10.0`, `TRUE_BETA_SLOPE = 12.0`, binary retention outcome)
- Model fitting: `mgcv::bam()`, fREML, `discrete = TRUE`, 2,000-user stratified subsample (~30K obs)
- Cluster bootstrap: B=100, user-level block resampling, linear parametric model
| File | Records | Description |
|---|---|---|
| `aqr_trajectories.csv` | 50,000 | Per-step ARC scores: category × version × task |
| `aqr_user_demographics.csv` | 100,000 | User characteristics, subscription tier, baseline engagement |
| `aqr_session_logs.csv` | ~1.65M | Weekly session records with category counts and 7-day return indicator |
Prior work. The within-version quality exposure framework, identification strategy, frozen-weights design, and GAMM specification are described in full in Does Making AI Smarter Actually Make People Use It More? (Feb. 2026). The ARC trajectory evaluation framework — including the NLE scoring protocol, four failure patterns, and calibration procedure — is described in The Shape of a Good Answer (Feb. 2026).
Data note. All data is synthetic. No real users, no proprietary models, no production systems. The injected causal effects are β_true = 10.0 log-odds per unit centered mean ARC score and β_true = 12.0 log-odds per unit centered early slope exposure. The estimator recovers 89% of the mean score effect; early slope recovery was not separately validated against its own β_true in this analysis.
Software. R 4.5.2; mgcv (Wood 2017), dplyr, ggplot2, patchwork. Python 3.9; pandas, numpy.
Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.