A general-purpose MoE multimodal beat every dedicated vision model on my father's handwriting

For context: I’ve been transcribing a multi-generational archive of handwritten family letters on my own hardware. The first two posts covered why I’m doing this at all and how to set it up on your own machine. This post is the surprising finding — the one I didn’t expect going in.

I assumed the right tool for a vision task was a vision model. That’s the obvious reach. If you’re reading handwriting, you reach for something labeled “VL.” If you can find one fine-tuned for OCR and handwriting specifically, even better.

I was wrong about that. Or at least, I was wrong about it at the sizes I can run locally in 2026.

The short version: on a head-to-head across five local-capable models on a set of deliberately hard pages from my archive, the winner was Qwen3.6-35B-A3B — a general-purpose, mixture-of-experts multimodal model. It beat every dedicated vision model I tested, including one that had been specifically fine-tuned for OCR and handwriting.

That finding is the whole post. The rest is how I got there.


What I was running before the bake-off

Baseline was Qwen2.5-VL-7B. A perfectly respectable dense vision model. Fast on my Mac Studio — about twelve seconds per page — and usually fine.

“Usually fine” was the problem.

It was producing confident misreads on names and everyday words. A first name I was trying to read came out as Asm. A different first name came out as joli. The same first name from a second page came out as Asin. The word playhouse came out as flaghouse. The place name Arbor Lodge came out as Abacadige.

What’s funny — and I noticed this before I sat down to actually benchmark — is that my personal experience had shown the 2.5 model was better at handwritten letters than the newer 3 model. I kept telling people that and not quite trusting it.

The misreads are diagnostic if you stare at them long enough. Asin only resolves to the correct first name if your model has enough language and world knowledge to prefer the coherent reading over the letter-by-letter one. flaghouse is what you get when a model is reading pixels; playhouse is what you get when a model is reading a sentence. Handwriting recognition, especially cursive, is a language problem wearing a vision problem’s costume.
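To make that concrete, here’s a toy sketch of a language prior rescoring candidate readings. The word list, candidates, and scores are all invented for illustration; no real OCR system is this simple.

```python
# Toy illustration: choosing between candidate readings with a language prior.
# The word list and the visual scores are invented for illustration only.

PLAUSIBLE_WORDS = {"playhouse", "lodge", "arbor", "adam", "jodi"}

def resolve(candidates):
    """candidates: list of (reading, visual_score) from an imagined OCR pass.
    Prefer the visually best reading that is also a plausible word or name;
    fall back to the raw visual winner if none qualify."""
    plausible = [c for c in candidates if c[0].lower() in PLAUSIBLE_WORDS]
    pool = plausible or candidates
    return max(pool, key=lambda c: c[1])[0]

# The pixel-level winner is "flaghouse", but it isn't a real word, so the
# language prior pulls the answer to "playhouse".
print(resolve([("flaghouse", 0.62), ("playhouse", 0.58)]))  # playhouse
```

A pixels-only reader stops at the first line of `resolve`; a model with enough world knowledge effectively runs the whole function.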

That reframing is what led me to actually run the bake-off.

The bake-off

Five models. Six deliberately hard pages. Same prompt for each. All local, all on the same 64 GB M1 Max Mac Studio.

| Model | Architecture | Active params | Total params | Per-image |
|---|---|---|---|---|
| Qwen2.5-VL-7B | Dense vision | 7B | 7B | ~11.7 s |
| Qwen3-VL-8B | Dense vision | 8B | 8B | ~42.6 s |
| Qwen3.6-35B-A3B | MoE multimodal | ~3B | 35B | ~31.6 s |
| Gemma-4-26B-A4B | MoE multimodal (Google) | ~4B | 26B | ~13.8 s |
| Chandra (Qwen3-VL-9B fine-tune for OCR/handwriting) | Dense vision, specialized | 9B | 9B | ~73.5 s |
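A harness like the following is enough to reproduce per-image timings, assuming a local OpenAI-compatible server such as the ones LM Studio or llama.cpp expose. The endpoint URL, model names, and prompt here are placeholders, not the exact script behind the table.

```python
# Minimal per-page timing harness for a local vision model. Assumes an
# OpenAI-compatible chat endpoint; URL, model name, and prompt are
# placeholders, not the actual benchmark script.
import base64, json, time, urllib.request

ENDPOINT = "http://localhost:1234/v1/chat/completions"  # placeholder
PROMPT = "Transcribe this handwritten page exactly as written."

def build_request(model: str, image_bytes: bytes) -> dict:
    """OpenAI-style chat payload with the page image attached inline."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    }

def transcribe(model: str, image_path: str) -> tuple[str, float]:
    """Returns (transcription, elapsed seconds) for one page."""
    with open(image_path, "rb") as f:
        payload = build_request(model, f.read())
    req = urllib.request.Request(
        ENDPOINT, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        text = json.load(resp)["choices"][0]["message"]["content"]
    return text, time.perf_counter() - start
```

Same prompt, same pages, wall-clock per page: that’s all the per-image column is measuring.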

A quick primer if you’re meeting MoE for the first time. “Mixture of experts” means the model is made of many small specialized sub-networks, and each input token only passes through a few of them — the router picks which experts to fire. So a 35B MoE like Qwen3.6-35B-A3B has 35 billion parameters sitting in memory, but only about 3 billion of them activate per token. You get the knowledge of a 35B model with something closer to the speed of a 3B one.

*Dense models fire all weights for every token; MoE models use a router to fire only a small subset of experts per token, giving large-model knowledge at small-model compute.*

That is the whole trick, and it turns out to matter enormously for handwriting.
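If you want the trick in code, here is a toy top-k MoE layer in plain numpy. The dimensions, expert count, and random weights are made up for illustration; this is not Qwen’s actual architecture.

```python
# Toy top-k mixture-of-experts layer: every expert's weights sit in memory,
# but each token only runs through the k experts its router scores highest.
# Sizes and expert count are made up; this is not Qwen's architecture.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                     # hidden size, experts, active
experts = rng.normal(size=(n_experts, d, d))   # all weights resident in memory
router = rng.normal(size=(d, n_experts))       # scores each expert per token

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                        # (n_experts,) routing logits
    top = np.argsort(scores)[-k:]              # pick the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # softmax over the chosen experts
    # Only k of the n_experts matmuls actually run for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=d))
print(out.shape)  # (16,)
```

A dense layer would run all eight matmuls per token; here only two fire, which is the whole speed story in miniature.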

Here’s a simplified view of how each model handled the worst readings from baseline. Same passages, just showing the name/word in question and whether the model resolved it:

| Known-bad reading | 2.5-VL-7B | 3-VL-8B | 3.6-A3B | Gemma-4 | Chandra |
|---|---|---|---|---|---|
| Asm → Adam | Asm ✗ | Asm ✗ | Adam ✓ | Adam ✓ | Adam ✓ |
| flaghouse → playhouse | flaghouse ✗ | Playhouse ✓ | playhouse ✓ | flaghouse ✗ | — |
| joli → Jodi | joli ✗ | Jodi ✓ | Yodi ✗ | “July” ✗ | Yodi ✗ |
| Asin → Adam | Asin ✗ | Astin ✗ | Adam ✓ | “Ron” ✗ | Asan ✗ |
| Abacadige → Arbor Lodge | Abacadige ✗ | “Arise Lodge” ✗ | “see a lodge” ✗ | — | “ABA LODE” + Nebraska City anchor ≈ |

A few things jump out.

Qwen3.6-35B-A3B — the MoE — resolved the most ambiguous names correctly. It did it while running faster than the dense 8B vision model above it on the list. A 35B model that fires 3B weights per token outperformed a 9B dense model that fires all 9B, on the task that 9B model was specifically trained for.

Chandra is the most instructive loss. Chandra is a Qwen3-VL-9B fine-tune built specifically for OCR and handwriting. On paper, it’s the model you’d pick for this job. In practice, it missed first names it had no business missing. It did pull off one genuinely impressive move — on the Abacadige → Arbor Lodge page, it didn’t land the exact words but it anchored on “Nebraska City” elsewhere in the text and got close. That’s real reasoning. It just wasn’t enough to overcome the capacity gap.

The general lesson I take from that row: at the 9B dense tier, handwriting-specific fine-tuning cannot substitute for the broader language-and-world priors you get from a much larger general model. A fine-tune can’t know things the base model didn’t know.

The failure mode that scared me

Independent of which model won, one thing happened during the bake-off that I want to name clearly.

Gemma-4-26B-A4B — the fastest MoE multimodal in the lineup — produced a confident, fluent transcription of a passage that included details that were not on the page. Not a misread. A fabrication. A coherent paragraph that bore no relationship to the actual letter.

One letter, one sample; I’m not indicting Gemma on a single bad read, and I’d expect every model in this lineup to produce something like this given enough pages. But the shape of the failure is what worried me.

I’m keeping the specifics of what it wrote private because the letter involves private family material. The thing worth publishing is the shape of the failure.

This is the risk profile that separates usable local transcription from unusable. A slow, awkward, obviously-wrong transcription is fine — you see it’s wrong and you move on. A fast, fluent, plausibly-wrong transcription is the dangerous one. You can’t tell it’s fabricated without comparing against the source image, and if you’re processing hundreds of pages you’re not going to compare every one.

If you take one thing from this post into your own local-model work on high-stakes content, take this: spot-check against the source. Every batch. The models that are best at sounding right are also best at sounding right when they’re wrong.
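A spot-check doesn’t have to be elaborate; a fixed-seed random sample per batch is enough to make review reproducible. The file layout and the ten-percent rate in this sketch are assumptions about workflow, not a prescription.

```python
# Pick a reproducible random sample of pages from each batch for manual
# side-by-side review against the source scans. The directory layout and
# the 10% rate are assumptions, not a prescription.
import random
from pathlib import Path

def spot_check_sample(batch_dir: str, rate: float = 0.10, seed: int = 0):
    pages = sorted(Path(batch_dir).glob("*.txt"))  # one transcript per page
    n = max(1, round(len(pages) * rate))           # always check at least one
    return random.Random(seed).sample(pages, n)

# for page in spot_check_sample("transcripts/batch_07"):  # hypothetical path
#     open the transcript next to its scan and compare
```

The fixed seed matters: it means you and future-you review the same pages, so a batch that passed once stays auditable.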

Why this matters past my project

The counterintuitive result here — general-purpose MoE multimodal beats specialized dense vision, at the sizes you can run locally — is the part worth sharing widely.

For handwriting and other context-heavy vision tasks, the frontier right now seems to be general capability at scale, not task-specific fine-tuning at smaller dense sizes. If you’re evaluating local models for OCR, historical archive work, or document processing — don’t assume “VL” in the name is the thing that matters most. Size and generation are winning over specialization, at least for now.

This pattern isn’t unique to me. In July 2025, Imperial War Museums announced a project with Capgemini and Google Cloud that used Google’s Gemini models to transcribe 20,000 hours of oral history — roughly 8,000 interviews with veterans and civilians recorded between 1945 and the early 2000s. A task estimated at 22 years of manual transcription was completed in weeks, at 99% word accuracy and 94% speaker diarization. That’s what the institutional version of this looks like. What’s new is that an individual with a four-year-old machine can now play at a similar level for their own records.

Closing

So here’s where I am. Somewhere around fifteen hundred pages of handwritten family letters, transcribable in hours, on a Mac Studio that isn’t new anymore. Private content never leaves the house. Out-of-pocket cost is electricity.

That’s a snapshot, not a forecast. Local-model quality is moving fast enough that the winner today won’t be the winner six months from now, and I think that’s the most important thing about this moment. It isn’t that Qwen3.6-35B-A3B is some final answer. It’s that the ground has moved far enough, at the small-enough-to-run-locally tier, that a project like this is finally doable at all.

If you missed them, the first post in this series covered why I’m doing this at all, and the second covered how to set up the hardware and software side on your own machine.

The letters are still there. That part hasn’t changed. What’s changed is that I can finally sit with them.