Four frontier models, one year of office sensor data, four very different dashboards
I have an Adafruit Clue
sensor sitting on the desk in my home office. It has been quietly logging temperature, humidity, pressure, light, sound, and color every thirty seconds for a little over a year. 425,000 readings. About 33 megabytes of CSV. (I wrote up the build itself
a year ago — CircuitPython on the device, a Python pywebview gateway on the Mac, USB Serial in between.)
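For context, the on-device half of that capture path is roughly the loop below. This is a from-memory sketch against the standard `adafruit_clue` CircuitPython library, not the firmware from that write-up; the real build also logs light, and the timestamping and CSV writing happen on the Mac side of the serial link.

```python
# Rough sketch only -- not the firmware from the original build.
# Uses the standard Adafruit CLUE CircuitPython library; print() goes out over
# the USB serial console, where (per the build write-up) a Mac-side gateway
# picks each line up and appends it to the CSV.
import time
from adafruit_clue import clue

INTERVAL = 30  # seconds, the cadence from the post

while True:
    r, g, b, clear = clue.color  # RGBC reading from the color sensor
    reading = (
        clue.temperature,   # °C
        clue.humidity,      # % relative humidity
        clue.pressure,      # hPa
        clue.sound_level,   # raw microphone level
        r, g, b, clear,
    )
    print(",".join(str(v) for v in reading))
    time.sleep(INTERVAL)
```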
I finally fed that year of data to four frontier models and asked each of them to build me a single self-contained HTML dashboard. Same prompt. Same data. Four very different answers.
This is not a benchmark. It’s a field report from one afternoon at one desk, using the tools that happened to be on it. By next quarter the rankings will probably invert. That’s the point.
The setup
One Adafruit Clue. About a year of readings at a thirty-second cadence. The prompt I used is one I’ve been refining across many of these “give a model a dataset, get back a dashboard” runs. It’s verbatim at the bottom of this post if you want to steal it.
The contestants:
- Gemini CLI (model unspecified by the tool at the time I ran it)
- Codex with GPT-5.4
- Codex with GPT-5.5 (medium thinking)
- Claude Cowork with Opus 4.7
Each one got the same CSV, the same prompt, the same instruction to discover what was in the data before charting anything. The dashboards each one produced are embedded live below — not screenshots. You can scroll through them, hover the charts, read the findings.
Contestant 1 — Gemini
Five sections: Workday Rhythm, Seasonal Solar Shift, Thermal Comfort Curve, The Silence Baseline, Chromatic Calendar. Clean, restrained, on-brand for the dark Felton-flavored aesthetic the prompt asked for. Pretty.
But thin. The headlines describe categories, not findings. “Thermal Comfort Curve” tells me there is a chart about thermal comfort; it does not tell me anything about my room. The whole thing reads like a competent designer producing a template that any room’s data could slot into. The prompt explicitly asked for “the genuinely interesting patterns in the data — not a generic ‘count things and make a bar chart’ report.” Gemini gave me the bar charts.
That outcome is what triggered everything that followed. I looked at this and thought: is the dataset boring, or is the model?
Contestant 2 — Codex with GPT-5.4
Now we’re getting somewhere. Nine sections, editorial headlines, real findings. “The room has a heartbeat, and it’s 1 pm.” “38% black, 15% red-brown, and a story of day vs night.” “21 reboots, one 53-day run, and a strong preference for rebooting at 10 a.m.”
GPT-5.4 found the device-reboot artifacts the prompt warned about and turned them into a section instead of a footnote. It pulled outdoor weather data into the analysis to compare indoor sensor readings against actual storm fronts. It noticed three rows that looked like garbage and gave them their own callout: “Three rows the dataset didn’t want you to see.”
A caveat I owe you up front: I gave 5.4 the outdoor weather file. It didn't ask for it. The reason that file existed at all comes down to the last contestant.
Contestant 3 — Codex with GPT-5.5
“Hot, dry, and intermittently awake.” The headline number — 56.5% of readings at or above 85°F — is a punch in the face, in a good way. Six sections, large light KPI numerals, a confident hand on typography.
But the section headers describe categories again — Heat Was The Default State, Dryness Is A Regime Not A Moment, Pressure Moves Are Visible But Not Causal — and the sections are numbered 1, 2, 3, 5, 7, 8. Whatever 4 and 6 were, they didn’t make it. That’s a tell: this is a model that produced an outline and didn’t tighten it.
No outdoor weather correlation, even though by this run the data was sitting in the workspace. The dashboard treats the Clue’s record as a closed system. Gorgeous closed system. Closed.
Contestant 4 — Claude Opus 4.7
“One year of office weather through a matchbox-sized sensor.”
Same nine-section depth as 5.4. Same editorial instinct: “A dry office, a warmer summer, and a winter that drains the humidity.” “The room doesn’t cycle — it drifts.” “A $40 dev board caught 87 weather fronts.” And, in a section that is genuinely the most interesting thing in the whole bake-off: “The $40 dev board caught X% of real storms.”
That section exists because Claude, in Phase 1, came back and asked me a question. It had inventoried the data, understood what the Clue was, inferred the room and the cadence, and then said: before I synthesize this, would you let me pull a year of outdoor weather for your zip code so I can compare your indoor readings against actual conditions? It handed me a curl command for a free Open-Meteo endpoint. I ran it. The dataset went from one source to two, and the analysis went from “what was happening in the room” to “what the room knew about the world outside it.”
That single behavior — enrich the dataset before synthesizing — is the punchline of the whole post. The prompt did not tell it to. The other three didn’t.
The weather JSON that 5.4 later used? That came from this run.
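If you want to reproduce that enrichment step on your own data, the fetch looks roughly like this. It's a sketch against Open-Meteo's public historical archive endpoint, not the exact command Claude handed me; the coordinates, date range, and variable list are placeholders you'd swap for your own.

```python
# Sketch of the outdoor-weather enrichment, not Claude's actual curl command.
# Hits Open-Meteo's free historical archive API and drops the JSON next to the
# Clue CSV so the model has both sources to join against. Every parameter value
# below is a placeholder.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "latitude": 37.77,             # your location, not mine
    "longitude": -122.42,
    "start_date": "2025-04-01",    # match the span of the indoor CSV
    "end_date": "2026-03-31",
    "hourly": "temperature_2m,relative_humidity_2m,surface_pressure,precipitation",
    "temperature_unit": "fahrenheit",
    "timezone": "auto",
})
url = "https://archive-api.open-meteo.com/v1/archive?" + params

with urllib.request.urlopen(url) as resp:
    outdoor = json.load(resp)

# Save it into the same workspace as the Clue CSV so the model can join
# indoor readings against outdoor conditions.
with open("outdoor_weather.json", "w") as f:
    json.dump(outdoor, f)
```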
The workflow nobody is talking about
This Claude run did not happen in a chat window with me hand-holding it. It happened in Claude Cowork, Anthropic's local agent mode, which I started and then walked away from. When I came back, the workspace held a finished dashboard, a verification script, a weather-analysis script, and three different render passes saved as PNGs for me to compare.
The effort-to-output ratio matters as much as raw model quality. Set it up, walk away, come back to a builder-grade artifact. This is closer to delegating a one-day task to a junior data scientist than it is to “prompt engineering.”
What this is actually a snapshot of
It’s a snapshot of what it is like to live in this world right now, on one specific day in April 2026, with one specific dataset, on one specific desk. Capture has been a solved problem on that desk for over a year — a $40 board has been quietly doing its job. The interesting work is happening at synthesis. And the gap between a model that produces a competent template and a model that asks for the second dataset is, right now, the gap that decides whether the thing you get back is interesting.
Rankings, for the record: Claude first by a clear margin. GPT-5.4 and GPT-5.5 a toss-up for second — they made different trades, neither one obviously ahead of the other. Gemini fourth. Claude takes it on both fronts — editorial sensibility and raw visual polish — which is what makes the “asked for more data” moment land harder, not softer. It wasn’t a model trading depth for prettiness. It was the same model going further on both axes than anything else in the bake-off.
By the time I write the next one of these, that order will probably be different. Run your own bake-off. The dashboards are right above this paragraph; the prompt is right below it.
The prompt, verbatim
# Quantified Self Dataset → Dashboard
You are analyzing a personal dataset I've placed in this project's workspace.
Your job is to produce a single self-contained HTML dashboard that surfaces
the genuinely interesting patterns in the data — not a generic "count things
and make a bar chart" report.
## Optional context from me
I've had an Adafruit Clue device in my office for over a year, ambiently capturing environmental data. That data is what you will find in the attached CSV file.
Treat this as hints, not ground truth. Verify against the actual data.
## Phase 1 — Discover what's here
Do NOT start charting. First:
1. Inventory the workspace. List every file, its size, and its apparent format
(CSV, JSON, SQLite, XML, GPX, TCX, parquet, mixed archive, etc.). If there
are many files, group them and sample.
2. For each relevant file, report the schema: every column/field, its type,
example values, null rate, unique cardinality, min/max for numerics, date
range for temporals.
3. Infer what the dataset represents in plain English — device, capture
cadence, likely collection method, what each row/record means. State your
inference so I can correct it before you build on a wrong model.
4. Flag data-quality issues: duplicates, impossible values (negative durations,
GPS at 0,0, heart rate of 3), suspicious clusters of nulls, time gaps,
timezone ambiguity, unit ambiguity (miles vs km, °F vs °C, J vs kcal),
device-reboot artifacts, known-bad ranges.
5. List 5–8 candidate questions worth asking of THIS dataset specifically.
Not "what are the totals" — things like "does my resting heart rate trend
with season," "which intersections do I idle at longest," "which store
accounts for the long tail of sub-$5 purchases." Rank by
likely-interestingness and feasibility with the fields available.
Pause here and show me the inventory + inferred meaning + ranked questions.
I'll tell you which to pursue, or say "your call, pick the best 4–6." If I
say your call, proceed.
## Phase 2 — Analysis
For each chosen question:
- Compute the answer. Show the aggregation logic so I can sanity-check it.
- Identify the KPI or headline number that summarizes it.
- Choose the visualization form honestly: a sparkline beats a pie chart,
small-multiples beat a crowded legend, a number-with-context beats a
single-bar chart. If a chart adds nothing over a sentence, write the sentence.
- Call out anomalies, outliers, and changes-in-regime. Those are usually the
most interesting thing in personal data.
## Phase 3 — Dashboard
Produce ONE self-contained HTML file. Constraints:
- **Single file.** Inline CSS and JS. No build step. No network fetches at
runtime — embed the aggregated data as JSON literals. (Raw data stays in the
workspace; the page ships only the rollups needed to render.)
- **Charting:** prefer vanilla SVG or a single CDN-pinned library (Chart.js or
Observable Plot). Pick one and stick with it.
- **Aesthetic — Data-as-Design** (Felton-inspired):
- Dark background (`#1c1917` stone-900), off-white text
- Typography-first: large light numbers for KPIs, small-caps labels
- Semantic color palette by data type:
amber=time, teal=people, rose=blocker/urgent, emerald=progress/positive,
cyan=places/location, indigo=connections, purple=future/ideas, gray=inactive
- Brand orange accents: `#FF8F3B`, `#FDAD57`, `#FFCE9D`
- No decoration for decoration's sake — every visual element encodes data
- **Structure:** Headline stat at top. KPI strip. Then one section per question
with a chart + a 1–2 sentence plain-English finding underneath (not a caption
restating the axes — an insight).
- **Include a "Provenance" footer:** source file(s), row count after cleaning,
date range, any rows dropped and why.
## Phase 4 — Deliverable
When done, tell me:
1. Filename of the dashboard and how to open it
2. The 3 most surprising findings in one sentence each
3. What you'd want to pull in next to make this richer (other datasets I might
have, joinable keys, enrichment APIs)
## Ground rules
- Units and timezones explicit, always.
- Correlation ≠ causation — never imply it.
- If the data can't answer a question well, say so and drop the question.
Don't pad the dashboard with weak charts.
- Personal data: nothing leaves the workspace. No uploads, no external APIs
unless I explicitly approve one.
Steal it. Run it on your own data. Send me what comes back.