Designing a local-first ASR pipeline — engineering principles

These are working notes from building a local-first ASR pipeline for desktop. We do not claim them as universal rules. They are six principles that we keep returning to, often after losing time by violating them. We are writing them down before we forget which mistakes produced which lessons.

Principle one: keep the hot path narrow and synchronous. Speech-to-text on the desktop is an interactive system. The user is talking and watching the screen at the same time. Anything that introduces a wait — a network call, a model load, a serialization step — has to live outside the per-utterance loop. The loop should do as little as possible: capture audio, run recognition, emit text. Everything else (term learning, evaluation, telemetry) is a side path.
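A minimal sketch of that split, assuming a hypothetical `recognize` stand-in and a bounded in-process queue for the side paths (none of these names come from our codebase). The point it illustrates is that the hot path never blocks on side work: if the side queue is full, the event is dropped rather than making the user wait.

```python
import queue
import threading
from typing import Callable

# Hypothetical stand-in for the real recognizer: pretends the audio bytes
# already contain the transcript.
def recognize(audio: bytes) -> str:
    return audio.decode("utf-8")

# Bounded queue so a stalled side path cannot grow memory without limit.
side_queue: queue.Queue = queue.Queue(maxsize=256)
side_log: list = []

def side_worker() -> None:
    """Drain side-path events (term learning, evaluation, telemetry)."""
    while True:
        event = side_queue.get()
        if event is None:  # sentinel: shut down the worker
            break
        side_log.append(event)

def handle_utterance(audio: bytes, emit: Callable[[str], None]) -> None:
    """Hot path: recognize, emit, then hand everything else off without waiting."""
    text = recognize(audio)
    emit(text)
    try:
        side_queue.put_nowait({"text": text})
    except queue.Full:
        pass  # drop side-path work rather than block the user

worker = threading.Thread(target=side_worker, daemon=True)
worker.start()

emitted: list = []
handle_utterance(b"hello world", emitted.append)

side_queue.put(None)  # ask the worker to stop, then wait for it to drain
worker.join()
```

Dropping events on a full queue is a design choice, not a requirement: it trades completeness of telemetry for a hot path whose latency does not depend on the side paths.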

Principle two: prefer single-write, verified insertion. Voice input only feels reliable if the words actually land where the user expected them to land. We treat insertion as a critical operation: there is one path that writes the text into the focused field, the path is verified to have succeeded, and "approximately succeeded" is not a completion state. If the verification fails, the system reports the failure rather than silently retrying with a slightly different result. A user who cannot trust the cursor will stop using the tool.
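The shape of a single-write, verified insertion can be sketched as follows. `FakeField` is a hypothetical stand-in for the focused text field, and the append-only model is a simplification of a real clipboard-based insert; the sketch only shows one write, one exact-match check, and an explicit failure result instead of a retry.

```python
from dataclasses import dataclass

@dataclass
class InsertResult:
    ok: bool
    reason: str = ""

class FakeField:
    """Hypothetical focused field; can simulate a lossy write for testing."""
    def __init__(self, drop_last_char: bool = False) -> None:
        self.value = ""
        self.drop_last_char = drop_last_char

    def write(self, text: str) -> None:
        self.value += text[:-1] if self.drop_last_char else text

    def read(self) -> str:
        return self.value

def insert_verified(field: FakeField, text: str) -> InsertResult:
    """Write once, read back, and accept only an exact match."""
    before = field.read()
    field.write(text)  # the single write path
    after = field.read()
    if after == before + text:
        return InsertResult(ok=True)
    # No silent retry: the caller is told the insertion did not land.
    return InsertResult(ok=False, reason="field content does not match expected text")

good = insert_verified(FakeField(), "hello")
bad = insert_verified(FakeField(drop_last_char=True), "hello")
```

Here `good.ok` is true and `bad.ok` is false; what the UI does with a failed result (a toast, a sound, leaving the text on the clipboard) is outside the scope of this sketch.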

Figure 01. The local-first ASR pipeline: six stages, one tight loop

Every per-utterance step lives in the hot loop. Everything else (term learning, evaluation, telemetry) gets pushed out to side paths.

Hot path · per utterance

Capture: AVFoundation · 48 kHz local buffer
ASR: SenseVoice via Sherpa runtime (int8)
Correct: pre-punctuation + lexicon rewrite
Enhance: Qwen3-4B (local MLX, ~420 ms)
Insert: clipboard tx + paste-over
Verify: strong-verify; report on failure

Side paths · async

dictionary updates · eval / regression harness · opt-in telemetry

Published targets: 0.82 s write latency · 97.4% strong-verify accuracy.

Principle three: make corrections deterministic before adding learned behavior. The pipeline has several stages where text can be modified after recognition: punctuation restoration, term substitution from a user dictionary, history-aware correction, and a model-based refinement pass. We learned the hard way that the deterministic stages have to be reliable and explainable on their own before the learning layer is allowed to do anything. If the user does not trust the simple stages, they will not trust the smarter ones.
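As an illustration of what "reliable and explainable" can mean for a deterministic stage, here is a sketch of a lexicon rewrite that returns an audit trail alongside the corrected text. The function name, the lexicon contents, and the longest-match-first rule are assumptions made for the example, not a description of our implementation.

```python
import re

def apply_lexicon(text: str, lexicon: dict) -> tuple:
    """Apply whole-word, case-insensitive substitutions in a fixed order.

    Returns the rewritten text plus a list of (heard, replacement, count)
    entries, so every change can be shown to the user.
    """
    applied = []
    # Longest keys first so multi-word entries win over shorter ones;
    # sorted() is stable, so equal-length keys keep their insertion order.
    for heard in sorted(lexicon, key=len, reverse=True):
        replacement = lexicon[heard]
        pattern = re.compile(r"\b" + re.escape(heard) + r"\b", re.IGNORECASE)
        # A callable replacement avoids backslash interpretation in the term.
        text, count = pattern.subn(lambda _m, r=replacement: r, text)
        if count:
            applied.append((heard, replacement, count))
    return text, applied

lexicon = {"sherpa": "Sherpa", "sense voice": "SenseVoice"}
fixed, audit = apply_lexicon("we run sense voice on sherpa", lexicon)
```

The same input and lexicon always produce the same output and the same audit list, which is the property the learned stages are allowed to build on.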

Principle four: refinement should be conservative by default. The last stage of the pipeline — model-based polish — is where it is tempting to be ambitious. Resist this. The refinement layer's job is to fix obvious mistakes (clear mishearings, swapped homophones, wrong tense markers) without rewriting the user's intent. We give it tight prompts, low temperature, and explicit instructions to preserve negation, technical terms, and proper nouns. Aggressive refinement produces a tool that "improves" what the user said into something they did not say. That is worse than not refining at all.
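Prompts and temperature are the first line of defense; one way to back them up is a post-check that rejects a refinement outright when it breaks the rules. The guard below is a sketch under assumptions: the negation word list, the `max_change` threshold of 0.3, and the function name are all illustrative, and it only handles ASCII apostrophes.

```python
import difflib
import re
from collections import Counter

# Illustrative list only; a real one would be language-specific and longer.
NEGATIONS = frozenset({"not", "no", "never", "don't", "doesn't", "can't", "won't", "isn't"})

def _tokens(text: str) -> list:
    return re.findall(r"[a-z0-9']+", text.lower())

def _negations(tokens: list) -> Counter:
    return Counter(t for t in tokens if t in NEGATIONS)

def accept_refinement(raw: str, refined: str, protected_terms: list,
                      max_change: float = 0.3) -> str:
    """Return `refined` only if it is a small, intent-preserving edit of `raw`."""
    raw_tokens, refined_tokens = _tokens(raw), _tokens(refined)
    # 1. Negation words must survive unchanged.
    if _negations(raw_tokens) != _negations(refined_tokens):
        return raw
    # 2. Technical terms and proper nouns must appear exactly as often as before.
    if any(raw.count(term) != refined.count(term) for term in protected_terms):
        return raw
    # 3. The overall edit must be small at the word level.
    similarity = difflib.SequenceMatcher(None, raw_tokens, refined_tokens).ratio()
    if 1.0 - similarity > max_change:
        return raw
    return refined

kept = accept_refinement("their is a bug in the parser",
                         "There is a bug in the parser.", [])
rejected = accept_refinement("do not deploy the Qwen3 build",
                             "Deploy the Qwen3 build.", ["Qwen3"])
```

In this example the homophone fix is accepted, while the version that dropped "not" falls back to the raw text. Falling back to the unrefined text is the conservative default: a missed polish costs less than a changed meaning.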

Principle five: the dictionary belongs to the user. As the system gets used, it sees patterns in the user's vocabulary that the base model does not know: domain terms, project names, frequent collocations. We use these to improve later recognition, but new terms enter through a "pending" gate: the system proposes, the user approves. Words never enter the user's dictionary silently. This is partly about correctness (model proposals are sometimes wrong) and partly about the relationship between the user and the tool. The tool should always be answerable to the user.
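The pending gate amounts to a small state machine. The class below is a hypothetical sketch of it (the names `UserDictionary`, `propose`, `approve`, and `reject` are ours for this example); what matters is that the only route into the approved set goes through an explicit user action.

```python
class UserDictionary:
    """Terms move pending -> approved only through an explicit approve() call."""

    def __init__(self) -> None:
        self.approved: set = set()
        self.pending: set = set()

    def propose(self, term: str) -> None:
        """Called by the system when it notices a candidate term."""
        if term not in self.approved:
            self.pending.add(term)

    def approve(self, term: str) -> None:
        """Called only in response to a user action."""
        if term not in self.pending:
            raise KeyError(f"{term!r} was never proposed")
        self.pending.remove(term)
        self.approved.add(term)

    def reject(self, term: str) -> None:
        """User declined the proposal; the term is dropped."""
        self.pending.discard(term)

d = UserDictionary()
d.propose("SenseVoice")
d.propose("sensevoyce")      # a wrong proposal from the model
proposed_only = set(d.approved)  # still empty: nothing entered silently
d.approve("SenseVoice")
d.reject("sensevoyce")
```

After this sequence only "SenseVoice" is approved and nothing remains pending. There is deliberately no method that adds to `approved` directly.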

Figure 02. ORDO home dashboard: six modules a click away

Shipping v3 layout. Status, today, dictionary, history, pipeline detail, and quick settings all live one tap from the home screen.

Engine status: ● Ready. Mainline + Enhance + Selection, all hot, all local.
Today's activity: 236. 246 utterances · 4 hours hands-free, +8.3% vs 7-day average.
Adaptive dictionary: 847. Code / Project / Daily, auto-learned from replays, gated by user approval.
Utterance history: ∞. Raw ASR · final text · per-stage timing preserved per entry.
Pipeline detail: 5 → 1. Capture → ASR → Correct → Enhance → Insert. Each stage swappable.
Quick settings: ⌥⌘. Hotkeys, engines, refinement model; every change applies instantly.

Layout principle: nothing critical is more than one click deep. The home screen is a status panel, not a launcher.

Principle six: evaluation runs on your own material. We have an internal harness that replays recorded fixtures through the entire pipeline and tracks accuracy on materials that look like our actual users' work. It is not a competitive benchmark — it is a regression net. Public benchmarks are useful for picking models, but the question that matters in production is "did our last change make the experience worse for our users." Only your own material can answer that.
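A regression net of this kind can be quite small. The sketch below assumes word error rate as the metric and a per-fixture baseline stored from the last accepted run; the fixture format, the `find_regressions` name, and the choice of metric are assumptions for illustration, not a description of our harness.

```python
from typing import Callable

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

def find_regressions(fixtures: dict, pipeline: Callable[[str], str],
                     baseline: dict, tolerance: float = 0.0) -> list:
    """Replay each fixture and report those that got worse than their baseline."""
    regressions = []
    for name, (audio, reference) in fixtures.items():
        wer = word_error_rate(reference, pipeline(audio))
        # Fixtures without a baseline are never flagged; they set one next run.
        if wer > baseline.get(name, float("inf")) + tolerance:
            regressions.append((name, wer))
    return regressions

# Hypothetical fixture: an audio identifier paired with its reference transcript.
fixtures = {"standup-01": ("standup-01.wav", "insert the text")}
baseline = {"standup-01": 0.0}
result = find_regressions(fixtures, lambda audio: "insert text", baseline)
```

The harness answers only one question, whether a fixture got worse than it was, which is why it compares against the project's own baseline rather than a public leaderboard number.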

A few honest caveats. None of these principles rule out using cloud models — they argue for keeping the user-facing loop tight, regardless of where individual stages run. None of them are original to us. Most of them appear, in different vocabulary, in good signal-processing systems and in well-built editors. We are writing them down because the temptation to violate each of them comes up at every design review, and having the list out where everyone can see it is cheaper than relearning the same lesson three times.

If you take one principle: the user has to keep trusting the cursor. Everything else follows from that.
