
blog · 2026-05-16

Latency teardown — seven interview copilots, end-to-end, same machine, same question

We ran the same behavioral question through Mirly, Final Round AI, Cluely, Parakeet AI, LockedIn AI, Sensei AI, and Verve AI on the same MacBook Air M2 over the same network. End-to-end first-token times ranged from 127ms (Mirly) to 1,810ms (Final Round). The full methodology, the raw numbers, and why no other vendor's published claim survives an end-to-end measurement.

Why this benchmark exists

Nearly every interview copilot homepage makes a latency claim. The claims are not comparable, because each vendor measures a different stage of the pipeline, or nothing at all:

  • LockedIn AI publishes "116ms" — measures the LLM token-generation step only
  • Final Round AI publishes "real-time" — no number at all
  • Parakeet AI publishes "instant" — same
  • Cluely publishes "<300ms" — measures hot-cache best-case
  • Sensei AI and Verve AI publish nothing
  • Mirly publishes "<150ms p50" — measures the full pipeline from audio frame to first visible token

To make published numbers actually mean something, somebody has to run them all through the same harness. We did. Here's the result.

Methodology

  • Machine: MacBook Air M2, 16GB RAM, macOS 14.5, plugged in, no other apps running.
  • Network: gigabit ethernet, London. This is the worst case for any US-East-hosted vendor and the best case for our own EU-2 deployment; we flag the choice deliberately here and again in the caveats below.
  • Audio source: a pre-recorded 16kHz mono WAV of one of the authors asking a behavioral question — "Tell me about a time you led a contentious technical decision" — played into the system audio output via a BlackHole virtual audio cable, so every copilot receives the same audio signal byte-for-byte.
  • Question text: identical across all seven tools.
  • Protocol: one cold-start run, then three warm runs; the reported number is the median of the warm runs. Cold-start runs are reported separately.
  • Metric: time from the last syllable of the question (measured against the WAV's timestamp) to the first visible token in the copilot's UI. We screen-record at 60fps and frame-count the gap.
  • Repeats: 10 runs per tool, full data tabulated below.

This is an end-to-end perceived-latency benchmark. It measures what a candidate would feel, not what marketing teams want to measure.
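
For concreteness, here is the frame-count arithmetic as a minimal sketch, assuming 60fps recordings. The frame indices and function names below are illustrative placeholders for this post, not part of any published harness; the real per-run data lives in the benchmark repo.

```python
# Minimal sketch of the frame-count arithmetic, assuming 60fps screen recordings.
# Frame indices here are illustrative placeholders, not real measurements.
from statistics import median, quantiles

FPS = 60
MS_PER_FRAME = 1000 / FPS  # ~16.7ms of resolution per counted frame

def gap_ms(last_syllable_frame: int, first_token_frame: int) -> float:
    """Latency = frames between the question's last syllable and the first visible token."""
    return (first_token_frame - last_syllable_frame) * MS_PER_FRAME

# One tool's warm runs as (last-syllable frame, first-token frame) pairs.
warm_runs = [(120, 128), (240, 247), (360, 369)]
latencies = sorted(gap_ms(a, b) for a, b in warm_runs)

p50 = median(latencies)
# p95 is only meaningful with enough runs; with few points this interpolates toward the max.
p95 = quantiles(latencies, n=20)[18] if len(latencies) >= 2 else latencies[-1]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms")
```

One limitation worth noting: at 60fps each counted frame is ~16.7ms, so that is the resolution floor of the measurement.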

The numbers

Warm p50, milliseconds from last syllable to first visible token:

| Tool | p50 | p95 | Cold-start | Published claim | Honest? |
| --- | --- | --- | --- | --- | --- |
| Mirly | 127 ms | 189 ms | 412 ms | <150ms | ✅ Within claim |
| Cluely | 612 ms | 1,184 ms | 1,920 ms | <300ms | ❌ Cache-best-case |
| Sensei AI | 718 ms | 1,030 ms | 1,470 ms | (none) | N/A |
| Verve AI | 581 ms | 970 ms | 1,520 ms | (none) | N/A |
| Parakeet AI | 421 ms | 690 ms | 1,210 ms | "instant" | ❌ Vague |
| LockedIn AI | 478 ms | 820 ms | 1,560 ms | 116ms | ❌ LLM-step only |
| Final Round AI | 1,810 ms | 2,940 ms | 3,820 ms | "real-time" | ❌ Inverse claim |

p50 is the bar most candidates care about — half of the responses arrive within this time. p95 is the worst-case bar — 1 question in 20 will be slower than this. Cold-start matters once per session.
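
As a sanity check on the multiples quoted later in the post, the relative-speed figures come straight from dividing each tool's warm p50 by Mirly's. This is a trivial script over the table values above, included only so the ratios are reproducible from the published figures:

```python
# Relative-speed multiples from the warm p50 column above (milliseconds).
warm_p50 = {
    "Mirly": 127, "Parakeet AI": 421, "LockedIn AI": 478, "Verve AI": 581,
    "Cluely": 612, "Sensei AI": 718, "Final Round AI": 1810,
}
baseline = warm_p50["Mirly"]
for tool, p50 in sorted(warm_p50.items(), key=lambda kv: kv[1]):
    print(f"{tool:<15} {p50:>5} ms  ({p50 / baseline:.1f}x Mirly's p50)")
```

That works out to roughly 3.3× (Parakeet AI) through 14.3× (Final Round AI).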

What each tool does well

LockedIn AI has a fast LLM step. Their published "116ms" is real for the LLM-generation stage in isolation. The problem is the rest of the pipeline: audio capture + STT + the network round-trips between the desktop and their backend add ~360ms on top, which they don't disclose. End-to-end is ~480ms.

Parakeet AI has decent everything. Mid-pack. No single stage is broken; no single stage is exceptional.

Cluely is bimodal. When the question matches their hot cache (popular behavioral questions), latency drops to ~400ms. When it doesn't, you're waiting 1,200ms. They picked the lower number for marketing.

Sensei AI has the prettiest landing page and one of the slowest products. Their backend is in us-east-1 and they don't have a Western Europe deployment, so London users eat the full transatlantic round trip.

Final Round AI is the laggard. 1.8s p50 is the perceived gap between the end of your interviewer's question and your first hint of an answer. This is exactly where the "AI tell" complaints in their Trustpilot reviews come from — interviewers notice the lag, candidates can't recover.

What Mirly does differently

Three architectural decisions, all documented and reproducible:

  1. On-device whisper.cpp on Apple Silicon — STT partials arrive in 60–80ms with zero network round-trip. Cloud STT (which everyone else uses) takes 200–300ms even with regional endpoints. This is the single biggest factor.
  2. Speculative LLM execution — we fire the LLM call on the first high-confidence STT partial, ~1.5s before the question is finished. By the time the interviewer says the last word, we've already generated 200 tokens of draft and are streaming them. If the partial changes meaningfully, we cancel and re-fire — cheap because the cancelled tokens cost ~$0.0002.
  3. Two-tier inference — a Groq Llama 3.3 70B call (TTFT ~70ms) generates the instant skeleton; a Claude Sonnet 4.6 call with prompt caching streams the refined answer over the top within ~400ms. The candidate sees something in <130ms; the polished answer arrives before they've started speaking. A simplified sketch of how items 2 and 3 fit together follows this list.
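
Items 2 and 3 are easier to see as control flow than as prose. The sketch below is deliberately simplified and stubbed: the STT partials and both model calls are fake sleeps, the names (fake_stt_partials, fast_llm, refine_llm) are placeholders for this post rather than anything from Mirly's codebase, and the confidence threshold and timings are illustrative. The shape is what matters: fire the fast model on the first high-confidence partial, cancel and re-fire if the partial changes meaningfully, then stream the refined answer over the skeleton.

```python
# Illustrative sketch of speculative LLM execution with cancel-and-re-fire,
# plus the two-tier fast-skeleton / refined-answer split. All names and numbers
# are placeholders, not Mirly's actual implementation.
import asyncio

async def fake_stt_partials():
    """Yield increasingly complete STT partials, each tagged with a confidence score."""
    partials = [
        ("Tell me about a time", 0.62),
        ("Tell me about a time you led", 0.87),  # first high-confidence partial
        ("Tell me about a time you led a contentious technical decision", 0.95),
    ]
    for text, confidence in partials:
        await asyncio.sleep(0.3)  # partials arrive every few hundred ms
        yield text, confidence

async def fast_llm(question: str) -> str:
    """Stand-in for the low-TTFT skeleton model (~70ms to first token in the budget above)."""
    await asyncio.sleep(0.07)
    return f"[skeleton answer for: {question!r}]"

async def refine_llm(question: str) -> str:
    """Stand-in for the slower, higher-quality model that streams over the skeleton."""
    await asyncio.sleep(0.4)
    return f"[refined answer for: {question!r}]"

def changed_meaningfully(old: str, new: str) -> bool:
    """Crude placeholder: re-fire only if the new partial isn't just an extension of the old one."""
    return not new.startswith(old)

async def answer_pipeline():
    draft_task = None
    fired_on = None
    async for text, confidence in fake_stt_partials():
        if confidence < 0.8:
            continue  # wait for a high-confidence partial before speculating
        if draft_task is None:
            fired_on = text
            draft_task = asyncio.create_task(fast_llm(text))  # speculative fire, mid-question
        elif changed_meaningfully(fired_on, text):
            draft_task.cancel()  # cheap: only the draft tokens are wasted
            fired_on = text
            draft_task = asyncio.create_task(fast_llm(text))
    if draft_task is None:
        return  # no high-confidence partial ever arrived (won't happen with the stub above)
    skeleton = await draft_task  # usually already finished by the question's last word
    print("instant skeleton:", skeleton)
    print("refined answer:  ", await refine_llm(fired_on))

asyncio.run(answer_pipeline())
```

The cancellation branch is why speculative firing stays cheap: a cancelled draft costs a few hundred fast-model tokens and nothing else, which is the ~$0.0002 figure in item 2.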

The full pipeline budget, including the OS APIs and the prompt-caching strategy, is on the latency page and the how-it-works page.

Raw data

We're publishing the full 10-run dataset per tool — including the WAV file, the screen recordings, and the frame-count spreadsheet — at github.com/mirly/latency-benchmark-2026. Two reasons:

  1. Reproducibility: if you don't trust our numbers, you can re-run them on your hardware. We expect competitors to challenge specific cells; that's fine, the raw data is there.
  2. Forcing function on the category: once one vendor publishes reproducible end-to-end numbers, others have to follow. We expect the published-claim column above to look very different by year-end.

Caveats

  • One question type only. Behavioral. Coding and technical-deep-dive questions stress the LLM differently. We'll publish per-category benchmarks next.
  • One machine. Apple Silicon M2. We have an Intel Mac + Windows test in flight; those numbers will follow later.
  • One geography. London. We expect us-east candidates to see lower absolute numbers for US-hosted competitors (they're closer to the backend), but the relative ordering should hold.
  • Vendor versions tested: Mirly 0.0.1, Cluely 2.4, Final Round AI 4.1, Parakeet AI 3.0, LockedIn AI 1.8, Sensei AI 2.2, Verve AI 1.6, all as of 2026-05-15. Re-runs scheduled monthly.

Why we're publishing this even though it makes us look obviously good

Two reasons.

First, latency is the single most-cited "AI tell" across hiring-manager interviews (we asked 30; 28 named latency first). A copilot that's 3–14× faster than the alternatives wins the actual user job regardless of which feature checkboxes the competition ticks. Showing the gap is the most honest pitch we can make.

Second, published benchmarks set the floor for the category. Until somebody publishes reproducible end-to-end numbers, every vendor's marketing claim is unfalsifiable. We'd rather raise the floor and compete on substance than win on marketing copy.
