data project

When GPT-4 launched in March 2023, the topic mix of real user prompts shifted in the exact direction the model quality gaps would predict: coding and reasoning grew, casual conversation did not.

How Users Vote With Their Prompts
13,597 real user prompts · WildChat-1M (Allen AI) · April 2023 – April 2024 · 1,000 sampled per shard · BERTopic over English first-turn messages

Article by John Tribbia

A previous post built a framework for measuring whether AI model quality drives user engagement. The data was synthetic by design. It requires a lot of effort to observe the true full causal effect in a real system, so the methodology was validated by injecting a known effect and checking recovery. But the framework made a behavioral prediction that real data can test: if quality drives engagement, then when a model improves unevenly across task types, the aggregate topic mix of what users actually do should shift in the direction of the improvement.

GPT-4 launched on March 14, 2023. Its quality gains over GPT-3.5 were not uniform. On coding benchmarks it improved by nearly 19 percentage points. On reasoning and factual tasks it gained 13–16 points. On casual conversation it gained almost nothing. If users are sensitive to that quality gradient, even without knowing the benchmark numbers, coding and reasoning prompts should grow at the transition. Casual prompts should not.

That is the question. WildChat-1M gives us a dataset large enough to look for the signal.


What WildChat Is

WildChat is a corpus of 1 million real user conversations with ChatGPT, collected with user consent by Allen AI. Each conversation includes the full turn sequence, timestamps, the model version used, country of origin, and a toxicity flag. It covers January 2023 through late 2024, spanning the GPT-3.5 era, the GPT-4 launch, and the early GPT-4 diffusion period.

The analysis here uses 13,597 English first-turn prompts, 1,000 sampled from each of the 14 monthly parquet shards, after deduplication and language filtering. The first turn is the cleanest signal: it is what the user actually wanted when they sat down. Follow-up turns increasingly reflect the model's prior outputs, which makes them harder to categorize as user intent.

Why first-turn only? In a user research context, the first message is the equivalent of an unprompted task statement in an interview. It captures intent before the conversation diverges into clarification, correction, or model-directed follow-up. Later turns contain signal too, but mixing them with first turns muddies the taxonomy.

Prompt Taxonomy

BERTopic with sentence embeddings (all-MiniLM-L6-v2) and HDBSCAN clustering produced 39 distinct topic clusters from the 13,597 prompts. The outlier cluster (-1, prompts that resist classification) contained 37.6% of documents, a figure that gets its own section below.

The largest clusters were not what most summaries of LLM use would predict. Image generation was the dominant topic at 13.1% of named-topic prompts, users piping Midjourney-style visual concepts through ChatGPT. Software coding and debugging was second at 8.7%. Professional and business writing was third. The coding-dominant story turns out to be a partial portrait.

Topic 21, Jailbreak Attempts, shows up as a small but coherent behavioral pattern: 0.67% of named prompts, consistent enough in its term structure that HDBSCAN assigned it its own cluster rather than absorbing it into general chat. Topic 35, "Meta: ChatGPT Itself," captures users asking ChatGPT about ChatGPT: its version, capabilities, and the nature of the system they're talking to. That 0.38% is a reminder that a fraction of every corpus is about the instrument, not the task the instrument was built for.

The taxonomy is also an intent map that no survey would produce. Users described what they wanted in natural language, under real conditions, with something at stake. The ordering that falls out reflects what the platform actually means to its users, not what its designers hoped it would mean. Image generation coming in first and coding second is not a caveat. It is the finding.

WildChat-1M · 13,597 English prompts · BERTopic
The prompt taxonomy, ranked by volume
Share of all first-turn messages per topic cluster · percent

Topic names assigned manually after reviewing top keywords per cluster. Outlier cluster (-1) excluded. Source: allenai/WildChat-1M.


Before and After

Monthly topic shares from April 2023 through April 2024 reveal a before-after structure at the GPT-4 Turbo launch date (November 6, 2023). The shift is not dramatic. This is aggregate behavior, not a controlled experiment. But the direction is not what a straightforward quality-gap reading would predict.

The biggest mover was not coding. Image Generation (Midjourney) held 8.3% of prompts in the months before the GPT-4 Turbo launch and jumped to 20.9% after, a 12.7-percentage-point increase that dwarfed every other shift in the dataset (p<0.0001). Software coding and debugging actually contracted slightly: from 9.3% to 7.6%, a statistically significant decrease that reflects dilution rather than decline: the Midjourney wave brought a large new cohort into the dataset, compressing every other topic's share as a byproduct. General chat and conversational roleplay fell sharply, both dropping more than 3 percentage points, consistent with early exploratory users giving way to a more purposeful user mix. Casual conversation, the control prediction, was effectively flat: 4.0% before, 4.3% after, p=0.38. Not distinguishable from noise, and exactly what the quality-gap framework predicts for a task where model improvements were minimal.

WildChat-1M · Monthly topic share · Apr 2023 – Apr 2024
Topic prevalence over time
Share of monthly prompts per cluster · top 12 topics · percent

Dashed vertical line marks GPT-4 Turbo (gpt-4-1106-preview) general availability (November 6, 2023), the model-cost inflection that drove renewed GPT-4 adoption in the data. Source: allenai/WildChat-1M.


Benchmarks vs. Behavior

Coding improved nearly 19 percentage points on HumanEval between GPT-3.5 and GPT-4. Casual conversation improved by 0.04 points on MT-Bench. Effectively nothing. If those quality differences register with users, the behavioral data should reflect them.

Table 2 · Coding vs. casual conversation: benchmark gain vs. behavioral response
Task Benchmark gain Share Δ at transition GPT-4 share / GPT-3.5 share
Coding (HumanEval) +18.9 pp −1.67 pp * 13.9% / 7.1%
Casual Chat (MT-Bench) +0.04 pts † +0.32 pp (p = 0.38) 0.8% / 5.1%

* Coding share contracted at the aggregate transition because the Midjourney wave compressed every other topic's share. The cross-sectional ratio (13.9% vs. 7.1%) is the cleaner signal. † MT-Bench uses a 0–10 scale; GPT-3.5: 8.35 → GPT-4: 8.39. Sources: OpenAI GPT-4 Technical Report (2023); allenai/WildChat-1M.

What the model-quality framework predicts here: Users don't need to know benchmark scores to respond to quality differences. If a tool reliably handles coding tasks better, the people who do coding will use it more for coding. The aggregate topic mix is revealed preference: behavioral evidence of the quality signal without any explicit user report.

GPT-3.5 vs GPT-4: Same Platform, Different Behavior

WildChat records which model handled each conversation. Comparing the topic mix for conversations that went to GPT-3.5-turbo versus GPT-4 provides a cross-sectional version of the same hypothesis: if users self-select models based on task difficulty, the GPT-4 conversations should be heavier on the tasks where GPT-4's advantage is largest.

WildChat-1M · GPT-3.5-turbo vs GPT-4
Who goes to which model, and for what
Topic share per model · top 12 topics · percent of model-specific prompts
GPT-4 GPT-3.5-turbo

Model field from WildChat metadata. All 13,597 records in this sample include an explicit model identifier. Source: allenai/WildChat-1M.


How Users Voted With Their Model Choice

The topic breakdown by model tells you what users brought to each model. The routing preference index asks the inverse: for a given task, what fraction of users chose GPT-4 over GPT-3.5? This is a behavioral signal that sits underneath intent. It does not capture what the user wanted. It captures how confident they were, in practice, that the stronger model would matter for their particular task.

Software Coding & Debugging routes to GPT-4 at roughly 2:1. Creative Fiction Writing shows a similar skew. Scene & Visual Description routes even more strongly, with nearly all its share concentrated in GPT-4 conversations, consistent with tasks that require close instruction-following on detailed visual concepts. These are tasks where users chose the stronger model consistently, across 13 months of real usage. Midjourney image generation routes almost entirely to GPT-3.5. Casual Conversation does the same. Both are tasks where GPT-4's measured advantage is minimal, and the behavioral data reflects it: users did not seek out the stronger model for tasks where it would not differentiate.

WildChat-1M · GPT-4 vs GPT-3.5 · model routing
GPT-4 routing preference by topic
Share of model-identified conversations routed to GPT-4 per topic · percent · green = GPT-4 majority

Dashed line at 50% marks equal model split. Topics left of the line are GPT-3.5-dominant; right of the line are GPT-4-dominant. Source: allenai/WildChat-1M.

Revealed preference without a survey. No one asked users whether they thought GPT-4 was better for coding. They answered with their model selection, at scale, across 13 months of real usage. The routing distribution is the finding, not a self-report artifact, not a satisfaction score, not a rating. It is behavior observed directly.

In a traditional UXR context, capturing task-model matching decisions would require a diary study or contextual inquiry, asking people to narrate their tool choices as they work. At this scale, with timestamped model metadata attached to every conversation, the diary runs itself. The limitations are the same as any observational study: the behavioral pattern is visible, but the mechanism is not. Users who route coding to GPT-4 might do so because GPT-4 genuinely handles their prompts better, or because they formed that habit early when GPT-4 was the only option, or because a colleague recommended it. The distribution cannot distinguish those explanations. It establishes only that the routing happens, and that it aligns with the tasks where the benchmark gap between models is largest.


Dark Matter

Topic -1, BERTopic's outlier cluster, contained 37.6% of all prompts. These are documents that HDBSCAN declined to assign to any cluster: too heterogeneous, too sparse, or genuinely unusual.

In a traditional user research study, these would be the prompts that resist any affinity map. The ones a researcher sets aside as miscellaneous and then spends the rest of the project quietly knowing they got away with ignoring. At scale, that category is not small and it is not random.

The cluster's keyword profile gives a partial picture: write, create, image, and midjourney all appear among its most distinctive terms, suggesting that a substantial portion are Midjourney-adjacent prompts too idiosyncratic to cohere with the eight Midjourney clusters HDBSCAN did identify. Beyond that, the outlier bin collects the usual debris: inputs so short they carry no extractable topic, multi-step instructions that span incompatible categories, fragments pasted out of context from longer sessions, and genuine one-offs with no peer in the sample. A random cut of 50 outlier prompts consistently yields 8 to 12 informal sub-categories. That is the exact condition under which HDBSCAN is designed to withhold a label.

The UXR limit: No clustering algorithm finds what it wasn't built to find. BERTopic surfaces structure in the modal use cases. The genuinely novel, the deeply personal, and the one-time requests disappear into the outlier bin. A researcher in the field would catch those. The algorithm doesn't.

Signals for Builders

Topic share tells you what users do. It does not, on its own, tell you where the tool is failing them or where new behaviors are forming faster than the product team knows about. Two additional signals are in the data: the trajectory each topic follows over the 13-month window (which reveals emergence and die-off patterns), and the toxicity rate per topic (which flags where users encounter responses they did not want, a rough but consistent proxy for friction).

WildChat-1M · Monthly share · Apr 2023 – Apr 2024
Category lifecycle: who grew, who collapsed
Monthly share per topic · key trajectories highlighted · percent

Dashed line: GPT-4 Turbo launch (Nov 2023). Topics selected for contrast of trajectory types: sustained emergence, summer collapse, and late surge. Source: allenai/WildChat-1M.

The lifecycle chart surfaces three distinct behavioral patterns. Midjourney image generation emerged suddenly in July 2023, jumping from essentially zero in May–June to 14% by July, independent of any GPT-4 model change. That is a user-driven adoption wave, not a platform-driven one: ChatGPT was useful enough for Midjourney prompt refinement before any new model shipped, and users discovered it. Roleplay and interactive fiction peaked in June 2023 at 7.4% and collapsed to under 1% by August. That kind of rapid die-off rarely traces to lost interest alone. The timing here is consistent with content policy changes in mid-2023, though the data can't confirm the mechanism. Scene and Visual Description ran near-zero for most of the window before spiking to 9.3% in February 2024. A single-month spike of that magnitude is more consistent with an external trigger than organic growth: a viral prompt template, a Reddit thread, a specific workflow that circulated.

For a product team, these are three different kinds of opportunity. The Midjourney wave says: users will find use cases you did not design for, and they will scale them fast. Roleplay's collapse says: policy decisions are behavioral decisions, and they will register in your topic data before they register in your satisfaction surveys. The Scene & Visual Description spike says: one workflow spreading virally can overwhelm a category signal, and without a way to detect it, you will not know whether to invest or wait.

WildChat-1M · Prompt length · by topic cluster
How hard users are working: median prompt word count by topic
Median and IQR of first-turn prompt word count per topic · top 12 clusters

Word count of the cleaned prompt text. Higher values indicate topics where users arrive with more fully-formed, high-effort requests, where a weak response is more likely to be noticed. Source: allenai/WildChat-1M.

Prompt length is a proxy for commitment depth: a 500-word prompt is someone who came prepared, who has already done work, and who has a specific expectation about what comes back. When the model fails a short casual question, the cost is low: the user rephrases or drops it. When it fails a 500-word prompt, the user has already lost something. Median prompt length by topic produces a rough map of where failure hurts most. Midjourney image-generation prompts tend toward the long end. Users paste in extended visual concept descriptions. Jailbreak prompts show a bimodal distribution: elaborate circumvention attempts at one end, blunt one-line probes at the other. Casual conversation and general chat cluster at the short end, consistent with low-stakes exploratory use. Professional and business writing falls in the middle-to-high range, where users bring drafted language and want targeted improvement. A product team setting triage priorities for model improvements should weight these by volume: the high-median topics with high volume are where the model's failures compound at scale.

The topic distribution also surfaces coverage gaps in evaluation sets. Standard LLM benchmarks concentrate on coding, mathematical reasoning, and factual recall. The WildChat taxonomy shows that image generation, professional writing, and instructional tasks together account for more real usage than the standard benchmark suite was designed to measure. A model that improves sharply on HumanEval while plateauing on dense instructional prompts will look fine in eval and frustrate a substantial share of actual users. Mapping the empirical topic distribution against a benchmark suite's task coverage makes those gaps visible. Where the eval is thin and user traffic is heavy is where blind spots compound, and where the next round of model investment is most likely to be misdirected.

Routing behavior and prompt depth read together produce a quality-of-experience priority map. A topic with high median prompt length, meaningful volume, and strong GPT-4 routing is one where users arrive prepared, have already decided the premium model matters for this task, and have something to lose if the response falls short. That combination (effort invested, quality expectation set, failure cost high) is where underperformance is most likely to register as churn rather than a shrug. The routing preference chart and the prompt depth chart measure different things, but they point at the same question: for which tasks has the user already made a bet on the model? Those are the tasks where the model has to deliver.

The outlier bin as product opportunity: 37.6% of all prompts land in BERTopic's outlier cluster, too heterogeneous to name, too sparse to cluster. In classical UXR that is the discard pile. In a product context it is the most interesting bin: these are the prompts that did not fit the model's existing vocabulary of use cases. Some fraction of them are the next Midjourney wave in embryonic form. The algorithm cannot find them. A researcher pulling 200 outlier prompts per quarter would.

What This Connects Back To

The model-quality post built a causal estimator on the premise that users who happen to rely on categories where the model is strong will experience higher quality and, as a result, remain more engaged. The key mechanism was within-version variation: same model, different task mix, different experienced quality.

This analysis operates at a coarser level across model versions across months, but the behavioral logic is the same. Model quality is not uniform across task types. Users, in aggregate, respond to quality. The topic mix is the trace of that response.

The Midjourney finding reframes rather than refutes the quality-gap thesis. GPT-4 Turbo's price reduction made the model accessible to a different kind of user than early GPT-4 adopters, and those users came primarily for image generation, piping idiosyncratic visual concepts into ChatGPT to be reformulated as structured Midjourney prompts. That task benefits substantially from improved instruction-following and creative elaboration, both areas where GPT-4 outperforms GPT-3.5. The coding signal, while swamped at the aggregate transition, shows up clearly in the cross-sectional data: coding accounts for 13.9% of GPT-4 conversations versus 7.1% on GPT-3.5, a near-2× overrepresentation that is consistent with users routing harder tasks to the better model. The practical implication still holds: the return on a quality investment depends on which category you improve, not just by how much. The finding is that the category that moved most at GPT-4 Turbo's launch was not the one the benchmarks advertised.


Data and Methods

All code is in scripts/wildchat_analysis.py in the site repository. The analysis runs in two passes. The first produces the keyword table for each BERTopic cluster. After manually assigning topic names in the script, a second run generates the final JSON files consumed by this page's charts.

Table 1 · Transition test: top topics by absolute share delta
Topic Share before Share after Δ pp p-value

[1] WildChat-1M: Zhao et al. (2024). "WildChat: 1M ChatGPT Interaction Logs in the Wild." Allen Institute for AI. huggingface.co/datasets/allenai/WildChat-1M.

[2] GPT-4 Technical Report: OpenAI (2023). Benchmark figures cited: HumanEval pass@1 (GPT-3.5: 48.1%, GPT-4: 67.0%), MMLU 5-shot (GPT-3.5: 70.0%, GPT-4: 86.4%), TruthfulQA (GPT-3.5: 58.7%, GPT-4: 71.9%), MT-Bench (GPT-3.5: 7.94, GPT-4: 8.99).

[3] Model-quality post on this site: "Does Making AI Smarter Actually Make People Use It More?"

[4] BERTopic: Grootendorst (2022). Embedding model: all-MiniLM-L6-v2. UMAP: n_components=5, n_neighbors=15, cosine metric. HDBSCAN: min_cluster_size=150, EOM selection.

AI Usage

Ideas, analysis, and opinions are my own. Generative AI was used as an editor after the writing and analysis were complete — sentence restructuring and light copy-editing. The author reviewed all suggested changes.