A general-purpose MoE multimodal beat every dedicated vision model on my father's handwriting

For context: I’ve been transcribing a multi-generational archive of handwritten family letters on my own hardware. The first two posts covered why I’m doing this at all and how to set it up on your own machine. This post is the surprising finding — the one I didn’t expect going in.

I assumed the right tool for a vision task was a vision model. That’s the obvious reach. If you’re reading handwriting, you reach for something labeled “VL.” If you can find one fine-tuned for OCR and handwriting specifically, even better.

I was wrong about that. Or at least, I was wrong about it at the sizes I can run locally in 2026.

The short version: on a head-to-head across five local-capable models on a set of deliberately hard pages from my archive, the winner was Qwen3.6-35B-A3B — a general-purpose, mixture-of-experts multimodal model. It beat every dedicated vision model I tested, including one that had been specifically fine-tuned for OCR and handwriting.

That finding is the whole post. The rest is how I got there.


What I was running before the bake-off

Baseline was Qwen2.5-VL-7B. A perfectly respectable dense vision model. Fast on my Mac Studio — about twelve seconds per page — and usually fine.

“Usually fine” was the problem.

It was producing confident misreads on names and everyday words. A first name I was trying to read came out as Asm. A different first name came out as joli. The same first name from a second page came out as Asin. The word playhouse came out as flaghouse. The place name Arbor Lodge came out as Abacadige.

What’s funny — and I noticed this before I sat down to actually benchmark — is that my personal experience had shown the 2.5 model was better at handwritten letters than the newer 3 model. I kept telling people that and not quite trusting it.

The misreads are diagnostic if you stare at them long enough. Asin only resolves to the correct first name if your model has enough language and world knowledge to prefer the coherent reading over the letter-by-letter one. flaghouse is what you get when a model is reading pixels; playhouse is what you get when a model is reading a sentence. Handwriting recognition, especially cursive, is a language problem wearing a vision problem’s costume.
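To make that concrete, here’s a toy sketch of a language prior rescoring candidate readings. The word list, candidates, and scores are all invented for illustration; no real OCR system is this simple.

```python
# Toy illustration: choosing between candidate readings with a language prior.
# The word list and the visual scores are invented for illustration only.

PLAUSIBLE_WORDS = {"playhouse", "lodge", "arbor", "adam", "jodi"}

def resolve(candidates):
    """candidates: list of (reading, visual_score) from an imagined OCR pass.
    Prefer the visually best reading that is also a plausible word or name;
    fall back to the raw visual winner if none qualify."""
    plausible = [c for c in candidates if c[0].lower() in PLAUSIBLE_WORDS]
    pool = plausible or candidates
    return max(pool, key=lambda c: c[1])[0]

# The pixel-level winner is "flaghouse", but it isn't a real word, so the
# language prior pulls the answer to "playhouse".
print(resolve([("flaghouse", 0.62), ("playhouse", 0.58)]))  # playhouse
```

A pixels-only reader stops at the first line of `resolve`; a model with enough world knowledge effectively runs the whole function.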

That reframing is what led me to actually run the bake-off.

The bake-off

Five models. Six deliberately hard pages. Same prompt for each. All local, all on the same 64 GB M1 Max Mac Studio.

| Model | Architecture | Active params | Total params | Per-image |
|---|---|---|---|---|
| Qwen2.5-VL-7B | Dense vision | 7B | 7B | ~11.7 s |
| Qwen3-VL-8B | Dense vision | 8B | 8B | ~42.6 s |
| Qwen3.6-35B-A3B | MoE multimodal | ~3B | 35B | ~31.6 s |
| Gemma-4-26B-A4B | MoE multimodal (Google) | ~4B | 26B | ~13.8 s |
| Chandra (Qwen3-VL-9B fine-tune for OCR/handwriting) | Dense vision, specialized | 9B | 9B | ~73.5 s |
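A harness like the following is enough to reproduce per-image timings, assuming a local OpenAI-compatible server such as the ones LM Studio or llama.cpp expose. The endpoint URL, model names, and prompt here are placeholders, not the exact script behind the table.

```python
# Minimal per-page timing harness for a local vision model. Assumes an
# OpenAI-compatible chat endpoint; URL, model name, and prompt are
# placeholders, not the actual benchmark script.
import base64, json, time, urllib.request

ENDPOINT = "http://localhost:1234/v1/chat/completions"  # placeholder
PROMPT = "Transcribe this handwritten page exactly as written."

def build_request(model: str, image_bytes: bytes) -> dict:
    """OpenAI-style chat payload with the page image attached inline."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    }

def transcribe(model: str, image_path: str) -> tuple[str, float]:
    """Returns (transcription, elapsed seconds) for one page."""
    with open(image_path, "rb") as f:
        payload = build_request(model, f.read())
    req = urllib.request.Request(
        ENDPOINT, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        text = json.load(resp)["choices"][0]["message"]["content"]
    return text, time.perf_counter() - start
```

Same prompt, same pages, wall-clock per page: that’s all the per-image column is measuring.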

A quick primer if you’re meeting MoE for the first time. “Mixture of experts” means the model is made of many small specialized sub-networks, and each input token only passes through a few of them — the router picks which experts to fire. So a 35B MoE like Qwen3.6-35B-A3B has 35 billion parameters sitting in memory, but only about 3 billion of them activate per token. You get the knowledge of a 35B model with something closer to the speed of a 3B one.

*Dense models fire all weights for every token; MoE models use a router to fire only a small subset of experts per token, giving large-model knowledge at small-model compute.*

That is the whole trick, and it turns out to matter enormously for handwriting.
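If you want the trick in code, here is a toy top-k MoE layer in plain numpy. The dimensions, expert count, and random weights are made up for illustration; this is not Qwen’s actual architecture.

```python
# Toy top-k mixture-of-experts layer: every expert's weights sit in memory,
# but each token only runs through the k experts its router scores highest.
# Sizes and expert count are made up; this is not Qwen's architecture.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                     # hidden size, experts, active
experts = rng.normal(size=(n_experts, d, d))   # all weights resident in memory
router = rng.normal(size=(d, n_experts))       # scores each expert per token

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                        # (n_experts,) routing logits
    top = np.argsort(scores)[-k:]              # pick the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # softmax over the chosen experts
    # Only k of the n_experts matmuls actually run for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=d))
print(out.shape)  # (16,)
```

A dense layer would run all eight matmuls per token; here only two fire, which is the whole speed story in miniature.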

Here’s a simplified view of how each model handled the worst readings from baseline. Same passages, just showing the name/word in question and whether the model resolved it:

| Known-bad reading | 2.5-VL-7B | 3-VL-8B | 3.6-A3B | Gemma-4 | Chandra |
|---|---|---|---|---|---|
| Asm → Adam | Asm ✗ | Asm ✗ | Adam ✓ | Adam ✓ | Adam ✓ |
| flaghouse → playhouse | flaghouse ✗ | Playhouse ✓ | playhouse ✓ | flaghouse ✗ | — |
| joli → Jodi | joli ✗ | Jodi ✓ | Yodi ✗ | “July” ✗ | Yodi ✗ |
| Asin → Adam | Asin ✗ | Astin ✗ | Adam ✓ | “Ron” ✗ | Asan ✗ |
| Abacadige → Arbor Lodge | Abacadige ✗ | “Arise Lodge” ✗ | “see a lodge” ✗ | — | “ABA LODE” + Nebraska City anchor ≈ |

A few things jump out.

Qwen3.6-35B-A3B — the MoE — resolved the most ambiguous names correctly. It did it while running faster than the dense 8B vision model above it on the list. A 35B model that fires 3B weights per token outperformed a 9B dense model that fires all 9B, on the task that 9B model was specifically trained for.

Chandra is the most instructive loss. Chandra is a Qwen3-VL-9B fine-tune built specifically for OCR and handwriting. On paper, it’s the model you’d pick for this job. In practice, it missed first names it had no business missing. It did pull off one genuinely impressive move — on the Abacadige → Arbor Lodge page, it didn’t land the exact words but it anchored on “Nebraska City” elsewhere in the text and got close. That’s real reasoning. It just wasn’t enough to overcome the capacity gap.

The general lesson I take from that row: at the 9B dense tier, handwriting-specific fine-tuning cannot substitute for the broader language-and-world priors you get from a much larger general model. A fine-tune can’t know things the base model didn’t know.

The failure mode that scared me

Independent of which model won, one thing happened during the bake-off that I want to name clearly.

Gemma-4-26B-A4B — the fastest MoE multimodal in the lineup — produced a confident, fluent transcription of a passage that included details that were not on the page. Not a misread. A fabrication. A coherent paragraph that bore no relationship to the actual letter.

One letter, one sample; I’m not indicting Gemma on a single bad read, and I’d expect every model in this lineup to produce something like this given enough pages. But the shape of the failure is what worried me.

I’m keeping the specifics of what it wrote private because the letter involves private family material. The thing worth publishing is the shape of the failure.

This is the risk profile that separates usable local transcription from unusable. A slow, awkward, obviously-wrong transcription is fine — you see it’s wrong and you move on. A fast, fluent, plausibly-wrong transcription is the dangerous one. You can’t tell it’s fabricated without comparing against the source image, and if you’re processing hundreds of pages you’re not going to compare every one.

If you take one thing from this post into your own local-model work on high-stakes content, take this: spot-check against the source. Every batch. The models that are best at sounding right are also best at sounding right when they’re wrong.
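A spot-check doesn’t have to be elaborate; a fixed-seed random sample per batch is enough to make review reproducible. The file layout and the ten-percent rate in this sketch are assumptions about workflow, not a prescription.

```python
# Pick a reproducible random sample of pages from each batch for manual
# side-by-side review against the source scans. The directory layout and
# the 10% rate are assumptions, not a prescription.
import random
from pathlib import Path

def spot_check_sample(batch_dir: str, rate: float = 0.10, seed: int = 0):
    pages = sorted(Path(batch_dir).glob("*.txt"))  # one transcript per page
    n = max(1, round(len(pages) * rate))           # always check at least one
    return random.Random(seed).sample(pages, n)

# for page in spot_check_sample("transcripts/batch_07"):  # hypothetical path
#     open the transcript next to its scan and compare
```

The fixed seed matters: it means you and future-you review the same pages, so a batch that passed once stays auditable.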

Why this matters past my project

The counterintuitive result here — general-purpose MoE multimodal beats specialized dense vision, at the sizes you can run locally — is the part worth sharing widely.

For handwriting and other context-heavy vision tasks, the frontier right now seems to be general capability at scale, not task-specific fine-tuning at smaller dense sizes. If you’re evaluating local models for OCR, historical archive work, or document processing — don’t assume “VL” in the name is the thing that matters most. Size and generation are winning over specialization, at least for now.

This pattern isn’t unique to me. In July 2025, Imperial War Museums announced a project with Capgemini and Google Cloud that used Google’s Gemini models to transcribe 20,000 hours of oral history — roughly 8,000 interviews with veterans and civilians recorded between 1945 and the early 2000s. A task estimated at 22 years of manual transcription was completed in weeks, at 99% word accuracy and 94% speaker diarization. That’s what the institutional version of this looks like. What’s new is that an individual with a four-year-old machine can now play at a similar level for their own records.

Closing

So here’s where I am. Somewhere around fifteen hundred pages of handwritten family letters, transcribable in hours, on a Mac Studio that isn’t new anymore. Private content never leaves the house. Out-of-pocket cost is electricity.

That’s a snapshot, not a forecast. Local-model quality is moving fast enough that the winner today won’t be the winner six months from now, and I think that’s the most important thing about this moment. It isn’t that Qwen3.6-35B-A3B is some final answer. It’s that the ground has moved far enough, at the small-enough-to-run-locally tier, that a project like this is finally doable at all.

If you missed them, the first post in this series covered why I’m doing this at all, and the second covered how to set up the hardware and software side on your own machine.

The letters are still there. That part hasn’t changed. What’s changed is that I can finally sit with them.