Running real models locally on a Mac Studio that isn't new anymore

I have a multi-generational archive of handwritten family letters sitting in my house, and I wanted to read it without sending the private content to a cloud provider. The first post in this series is the why. This one is the how — on the specific hardware I already own, with the specific software I landed on.

My Mac Studio is almost four years old now, but with 64 GB of RAM and an M1 Max chip it has proven quite capable of running large language models locally for personal use. That keeps private material private, and it costs far less than any cloud provider.

That summary is the whole post in one paragraph. The rest is detail.


The aging-hardware angle

The reason any of this works on a four-year-old machine is unified memory. On Apple Silicon, the CPU and GPU share the same pool of RAM, so a 64 GB machine can actually hand 40-something gigabytes to a model without copying it across a bus. On a comparably priced PC, you’re generally limited by the VRAM on your GPU, which is a much smaller number.

That one architectural choice is why my Mac Studio — not a new one, not a top-spec one — can load a 35-billion-parameter multimodal model and transcribe a page of cursive in under a minute. Apple Silicon has been quietly excellent for local inference since the M1, and most people with one of these machines don’t realize what’s already sitting on their desk.

What doesn’t work so well: the very newest, very largest frontier models. I can’t run a 200B parameter model at home, and I don’t need to. For reading handwriting, a well-chosen 30-ish-billion-parameter model is more than enough.

Privacy and cost are the actual reasons

What do I need to do to run any of this stuff on my local hardware? What kind of hardware do I need? What kind of software do I need to install? Are there any GUIs available that I can use?

Those are the questions I had when I started, and I’ll answer them below. But before the mechanics, the motivation.

Running inference on my own machine means private family content — letters, medical references, names of people who aren’t alive to consent — never leaves the house. Cloud inference for a project this size is affordable but not free, and for a private archive the privacy is the deciding factor on its own. Electricity is the only real cost, and this machine was already sitting in my office for other work.

LM Studio, in practice

If you’re starting from zero, use LM Studio. It’s a free desktop app that runs local models behind an OpenAI-compatible API. “OpenAI-compatible” is the important part: any tool, script, or library that knows how to talk to OpenAI’s API can talk to LM Studio with a URL change.

Here’s the flow.

Install LM Studio. Download, drag to Applications, launch. Nothing unusual.

Browse the model catalog in the GUI. LM Studio has a built-in search that shows you what’s available and — crucially — whether a given model will actually fit on your hardware. That saves you from downloading a 60 GB model only to discover it won’t load.

Download a GGUF-quantized build. GGUF is the efficient file format used by local inference tools. Quantization trades a small amount of precision for a much smaller memory footprint — a Q4 quantization is roughly a quarter of the size of the full-precision weights and, for most tasks, you can’t tell the difference in the output. For a 35B model, you’re looking at something in the 20 GB range after quantization.
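Those sizes are easy to sanity-check with back-of-envelope arithmetic: parameters times bits per weight, divided by eight bits per byte. The ~4.5 effective bits per weight for a Q4 GGUF is an assumption; the exact figure varies by quant recipe.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# FP16 full-precision weights vs. a Q4 quantization. The 4.5 effective
# bits per weight is an approximation; mixed-precision layers and GGUF
# metadata shift the real number a little either way.
full = quantized_size_gb(35, 16)   # ~70 GB, won't fit in 64 GB of unified memory
q4 = quantized_size_gb(35, 4.5)    # ~20 GB, plenty of headroom
print(f"FP16: {full:.0f} GB, Q4: {q4:.0f} GB")
```

That one calculation is worth keeping handy: it tells you instantly whether a model on the catalog page has any chance of fitting before you start a download.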

Load the model. Click load. Wait a minute. The model I settled on for this project is Qwen3.6-35B-A3B, a multimodal mixture-of-experts model. Mixture of experts (MoE) means the model has 35 billion total parameters, but only about 3 billion of them fire for any given token. So it reads pages about as fast as a 3B model would, while having the knowledge of a 35B model. That’s why it fits and why it’s fast enough to be useful.

Start the local server. One toggle in the LM Studio UI. It exposes an OpenAI-compatible endpoint at http://localhost:1234/v1. Any script I point at that URL can now do inference on the loaded model. My transcription pipeline is a plain Python script calling that endpoint in a loop.
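Calling that endpoint needs nothing beyond Python's standard library. This is a minimal sketch rather than the pipeline script itself; the model string should match the ID LM Studio shows for the loaded model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local endpoint

def build_chat_payload(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the assistant's reply."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Any OpenAI client library works just as well; the point of the stdlib version is that there is genuinely nothing special about the wire format.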

Load one model at a time. LM Studio will happily let you load several models into memory at once, but it serializes GPU work, so extra loaded models just consume RAM without speeding anything up. Unless you have a specific reason, keep it to one.

Ollama is the obvious alternative: also OpenAI-compatible, at http://localhost:11434/v1, but command-line-first rather than GUI-first. I tried both. LM Studio was faster in my setup, and the GUI made swapping models enough easier that it became my primary. My pipeline still supports Ollama as a fallback.
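Supporting both backends is mostly a matter of picking a base URL. A sketch of that fallback logic, with the probe function injectable so the selection can be tested without a live server; I believe both servers answer a GET on /models, but treat that as an assumption to verify against your versions.

```python
import urllib.error
import urllib.request

# Preference order: LM Studio first, Ollama as the fallback.
BACKENDS = [
    ("lmstudio", "http://localhost:1234/v1"),
    ("ollama", "http://localhost:11434/v1"),
]

def reachable(base_url: str, timeout: float = 2.0) -> bool:
    """True if an OpenAI-compatible server answers at base_url/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

def pick_backend(probe=reachable) -> tuple[str, str]:
    """Return the first live backend in preference order, or fail loudly."""
    for name, url in BACKENDS:
        if probe(url):
            return name, url
    raise RuntimeError("no local inference server found")
```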

Two non-obvious gotchas

These are the things I wish someone had told me before I burned a weekend on them.

“Enable Thinking” silently eats your output budget. Qwen3.6 and other reasoning-capable models can emit <think>...</think> blocks before answering. LM Studio hides those blocks from the visible output by default, which is fine for chat. It is not fine for long-form generation, because the tokens inside those blocks still count against your max_tokens budget.

Symptom: your transcription stops mid-word, or a generated chapter cuts off halfway through a paragraph, and you can’t figure out why — the model burned its output budget on hidden reasoning you never saw. Fix: turn Thinking off for transcription and any other long-form work.

There’s a deeper point here, too. Transcribing handwriting and describing an image are perception tasks, not reasoning tasks. Chain-of-thought doesn’t help you read cursive. It just costs tokens.
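Turning Thinking off is the real fix; the tokens are spent whether or not you ever see them. But if you're archiving raw transcripts, it is also worth scrubbing any reasoning blocks that leak into the saved text. A small helper for that cleanup step:

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
# DOTALL lets the block span multiple lines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove reasoning blocks from a model reply before saving it."""
    return THINK_BLOCK.sub("", text)
```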

A text-only model handed an image will happily hallucinate a full transcription. This is the scariest failure mode I hit in the whole project. An auto-detect routine in my pipeline picked a text-only MoE from my loaded models and sent it an image. Instead of erroring out — which is what you’d want and expect — the model produced a confident, fluent, completely fabricated transcription. No visual input, just invented handwriting. Fast, fluent, wrong.

The practical defense: verify that your vision runs actually touched a vision-capable model. One easy check is the model ID — the vision-capable Qwen builds are tagged VL (vision-language). If the ID you’re using doesn’t signal vision capability and you’re expecting it to read an image, something is off. Don’t trust the output.
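That check is worth automating so the pipeline refuses to send an image to anything whose ID doesn't at least claim vision capability. A heuristic sketch; the marker list is an assumption based on common naming conventions, not an exhaustive rule, so the model card is still the authority.

```python
# Naming-convention markers only: Qwen vision builds are tagged "vl",
# some other families use "vision". An assumption, not a guarantee.
VISION_MARKERS = ("vl", "vision")

def looks_vision_capable(model_id: str) -> bool:
    """Heuristic: does the model ID advertise vision capability?"""
    parts = model_id.lower().replace("_", "-").split("-")
    return any(marker in parts for marker in VISION_MARKERS)

def require_vision(model_id: str) -> None:
    """Fail loudly before an image reaches a text-only model."""
    if not looks_vision_capable(model_id):
        raise ValueError(
            f"{model_id!r} does not look vision-capable; refusing to send an image"
        )
```

Failing loudly here is the whole point: a raised exception costs seconds, while a fabricated transcription can silently poison an archive.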

The pipeline, briefly

The full pipeline runs in three stages, all local, all against http://localhost:1234/v1.

[Diagram: three-stage local pipeline. Transcribe page scans to markdown sidecars, analyze with a text model across four passes, build a landscape PDF with Typst.]

Stage one is transcription: transcribe_images.py walks a directory of page scans and, for each image, asks the vision model for a verbatim transcription. It writes a Markdown sidecar per page. On Qwen3.6-35B-A3B this runs around 45–50 seconds a page, which puts the remaining archive — something north of a thousand pages — at roughly 18 hours of wall-clock time. I leave it running overnight.
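The shape of stage one is easy to sketch: map each scan to a Markdown sidecar, skip pages that already have one so overnight runs can resume, and sanity-check the wall-clock estimate. These helpers are illustrative, not lifted from transcribe_images.py.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def sidecar_path(image: Path) -> Path:
    """scan_0042.jpg -> scan_0042.md, written next to the original scan."""
    return image.with_suffix(".md")

def pending_pages(scan_dir: Path) -> list[Path]:
    """Scans without a sidecar yet, so an interrupted run resumes cleanly."""
    return sorted(
        p for p in scan_dir.iterdir()
        if p.suffix.lower() in IMAGE_EXTS and not sidecar_path(p).exists()
    )

def wall_clock_hours(pages: int, secs_per_page: float) -> float:
    """Estimate total transcription time for a run."""
    return pages * secs_per_page / 3600

# At ~48 s/page, ~1,350 pages lands right around the 18-hour mark.
```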

Stage two is analysis: analyze_letters.py makes four text-only passes over the transcriptions. One pass pulls per-letter metadata (dates, people, places). One drafts a narrative chapter for each year. One captures voice and recurring catchphrases. One fills structured reference tables.
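Structurally, stage two is just four prompts applied to the same transcripts. A sketch with placeholder prompts (not the ones the real script uses) and the completion call injected so the orchestration is testable on its own:

```python
# The four passes as data: a name plus an illustrative prompt each.
# These prompts are placeholders, not analyze_letters.py's actual wording.
PASSES = [
    ("metadata", "Extract the date, people, and places in this letter as JSON."),
    ("narrative", "Draft a short narrative chapter covering this year's letters."),
    ("voice", "List recurring catchphrases and describe the writer's voice."),
    ("tables", "Fill a structured reference table of people and places."),
]

def run_passes(transcripts: list[str], complete) -> dict[str, list[str]]:
    """Apply every pass to every transcript via an injected completion function."""
    return {
        name: [complete(f"{prompt}\n\n{text}") for text in transcripts]
        for name, prompt in PASSES
    }
```

Keeping the passes as data means adding a fifth pass later is a one-line change, and the same loop works against LM Studio or Ollama.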

Stage three is typesetting: build_year_book.py generates Typst source and compiles it into a landscape-format PDF, scan on the left page, transcription on the right. Typst is a modern typesetting system — think of it as a cleaner LaTeX. The output is a keepsake I can actually hold.
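For flavor, here is roughly what generating one Typst spread looks like. This is a hypothetical simplification of what build_year_book.py emits, and it skips the escaping a real transcription would need before being dropped into Typst markup.

```python
def year_book_page(scan_path: str, transcription: str) -> str:
    """Emit Typst source for one spread: scan on the left, transcription on the right.

    A minimal sketch of the idea, not the real script's output. The
    transcription is inserted unescaped, which real input would break.
    """
    return (
        '#set page(paper: "us-letter", flipped: true)\n'  # landscape orientation
        "#grid(\n"
        "  columns: (1fr, 1fr),\n"
        "  gutter: 1cm,\n"
        f'  image("{scan_path}", width: 100%),\n'
        f"  [{transcription}],\n"
        ")\n"
    )
```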

The code lives in the project repo. I’m not going to walk through the scripts themselves here — the repo is where that belongs.

Close

This runs end-to-end on my desk. No cloud, no API bill, no content leaving the house. The hardware is almost four years old and I’m doing work on it that wasn’t possible at any price a few years ago.

In the next post I’ll get into the surprising part of this whole project: which model actually turned out to be best at reading my father’s cursive, and why it wasn’t the one I expected.