Voice agents · production accuracy

The accuracy number in the demo is not the number you get in production

Every voice agent pitch leads with one figure: 97%, 98%, 99% accuracy. That figure is derived from word error rate measured on clean audio. The moment a real caller picks up a phone in a noisy room, it falls apart, and the gap is not small. If you are evaluating or building a voice agent, this is the part the benchmark hides.

Matthew Diakonov, Written with AI

Published May 21, 2026 · 6 min read

Direct answer (verified 2026-05-21)

Why is voice agent accuracy lower in production than in demos?

Because the demo number is word error rate on clean audio, and real calls degrade it by a factor of 2.8 to 5.7 through background noise and accents, with the remaining errors landing hardest on slot-bearing words. The number that actually survives a phone line is end-to-end task completion (did the right action fire), not how many words were transcribed. Speech recognition research is blunt about this: a model scoring above 95% on a clean benchmark often falls to 70% or lower on live audio (Deepgram).

There are two accuracy numbers, and they are not the same number

When a voice agent demos at “99% accuracy,” that almost always means word error rate measured on a clean benchmark: a cooperative speaker, a good microphone, a quiet room. Top streaming speech-to-text models genuinely do score in the 2 to 4% word error rate range on those benchmarks. That part is real, and it is the easy part.

The number that decides whether your agent is usable is the other one: did it complete the task. For a phone-ordering agent, did the right ticket fire, priced correctly, with the right modifiers. For a booking agent, did the right reservation land. That is task completion, and it is a different measurement from transcription. You can transcribe every word perfectly and still complete the wrong task, because the words that carry the task (a quantity, a name, a modifier, a menu item) are a tiny fraction of the words spoken.

Hearing “large” when the caller said “small” is one wrong word in twenty. That reads as 95% transcription accuracy. It is also a 100% wrong order. This is why teams that take production seriously track a separate slot error rate on the words that actually carry meaning, and why a single headline accuracy figure tells you almost nothing about whether the agent works.
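The arithmetic behind that one-word-in-twenty example can be checked with a minimal sketch. The utterance, the slot names, and both helper functions are made up for illustration; real evaluation pipelines normalize text and extract slots in their own ways.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of reference words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(reference)


def slot_error_rate(spoken_slots, heard_slots):
    """Fraction of task-carrying slots whose value came out wrong."""
    wrong = sum(1 for key, value in spoken_slots.items()
                if heard_slots.get(key) != value)
    return wrong / len(spoken_slots)


# Hypothetical 20-word utterance; the only error is "small" heard as "large".
spoken = ("hi can i get one small pepperoni pizza for pickup "
          "at six thirty under the name sam thanks a lot").split()
heard = ["large" if word == "small" else word for word in spoken]

# Hypothetical slots a phone-ordering agent would need to fill.
spoken_slots = {"quantity": "one", "size": "small", "item": "pepperoni pizza",
                "pickup_time": "six thirty", "name": "sam"}
heard_slots = {**spoken_slots, "size": "large"}

wer = word_error_rate(spoken, heard)
slot_errors = slot_error_rate(spoken_slots, heard_slots)
order_correct = heard_slots == spoken_slots

print(f"word accuracy: {1 - wer:.0%}")       # 95%
print(f"slot error rate: {slot_errors:.0%}")  # 20%
print(f"order correct: {order_correct}")      # False
```

The transcript scores 95% while the order is simply wrong, which is the whole gap between the two numbers in one call.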

How far the demo number falls on a real line

The degradation is measured and consistent. A contact-center analysis ran the same speech API across real call conditions and found that acoustic conditions, not the model, explained most of the variance. Here is what the same agent looks like when you stop feeding it clean audio.

eval: benchmark vs live calls

Figures from Deepgram’s production accuracy analysis: clean headset 92%, conference room 78%, mobile call with background noise 65%, accented speech without tuning below 80%, and a documented 2.8 to 5.7 times benchmark-to-production degradation. A restaurant phone line, on a narrowband codec with kitchen noise behind the caller, sits at the worse end of that range.

The demo is engineered to be the best case

2.8x to 5.7x

documented degradation from benchmark accuracy to production conditions for speech recognition

Signal-to-noise ratio alone drives an exponential collapse: roughly 3.5% word error rate at 20 dB SNR, around 15% at 10 dB, and over 70% at 0 dB. Microphone bandwidth compounds it. A narrowband phone channel (300 Hz to 3.4 kHz) can sit near 25% word error rate at 10 dB where a wideband channel would be around 12%. A demo runs at the top of every one of those curves. A Tuesday-night phone call runs near the bottom.
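One way to see why that curve reads as exponential is to look at how much word error rate multiplies for each 10 dB lost. The sketch below uses only the three SNR figures quoted above; the interpolation between them is an assumption made for illustration (a straight line in log space), not a measurement, and the 0 dB point is treated as a floor because the source says "over 70%".

```python
import math

# (SNR in dB, word error rate) pairs quoted in the text.
# The 0 dB figure is "over 70%", so 0.70 is a floor, not a measured value.
QUOTED = [(0, 0.70), (10, 0.15), (20, 0.035)]


def step_multipliers(points):
    """How many times WER grows for each quoted step down in SNR."""
    ordered = sorted(points, reverse=True)  # highest SNR first
    return [worse / better
            for (_, better), (_, worse) in zip(ordered, ordered[1:])]


def estimate_wer(snr_db, points=QUOTED):
    """Rough WER estimate between quoted points, linear in log(WER).
    Values outside the quoted range are clamped to the nearest point."""
    ordered = sorted(points)  # lowest SNR first
    if snr_db <= ordered[0][0]:
        return ordered[0][1]
    if snr_db >= ordered[-1][0]:
        return ordered[-1][1]
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        if x0 <= snr_db <= x1:
            t = (snr_db - x0) / (x1 - x0)
            return math.exp(math.log(y0) + t * (math.log(y1) - math.log(y0)))
    raise ValueError("unreachable for a sorted, non-empty point list")


for multiplier in step_multipliers(QUOTED):
    print(f"x{multiplier:.1f} per 10 dB lost")
print(f"estimated WER at 15 dB: {estimate_wer(15):.1%}")
```

Both quoted steps multiply the error rate by a little over four, which is what a roughly exponential relationship looks like: each 10 dB of lost signal costs about the same factor, not the same number of points.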

Accents are the other quiet failure. Most models are trained predominantly on standard American English, so accuracy that reads 96% on benchmark audio can fall below 80% on heavily accented speech. Targeted acoustic training closes some of it (one study moved accent handling from 76% to 88% with 200+ hours of data), but the default out-of-the-box number you saw in the demo did not include any of your callers.

Demo conditions vs production conditions

The single most useful thing you can do before trusting a voice agent is to stop testing it the way it was demoed. Start by naming the conditions the demo ran under.

The audio the agent actually hears

A scripted call, one clear cooperative speaker, a good microphone in a quiet room, a short and simple request that never changes.

Clean near-microphone audio, high SNR
Standard accent the model was trained on
Simple request, no mid-call changes
Measured as word error rate on transcription
Headline number: 97 to 99% accuracy

The metric that survives production: did the right thing happen

Once you accept that raw transcription will degrade on a real line, the design question changes. It is no longer “how do we hear every word,” it is “how do we still complete the task when we do not.” For a restaurant phone agent, that means the output is not a transcript, it is a structured order that has to resolve to a real menu item, a real modifier code, and a real price. A correct transcript with no menu behind it is still a ticket nobody can ring up. We broke that specific failure down separately in why transcription accuracy is not POS accuracy, and the “ask 95% of what” problem in our breakdown of restaurant phone order accuracy.

Because the line itself is the worst case for raw transcription (narrowband codec, kitchen noise, a phone held to the ear), the agent has to lean on structure and confirmation rather than perfect hearing. The acoustic side of that problem, and what actually moves the number on a noisy line, is its own topic in our note on the acoustic noise bottleneck.
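Leaning on structure and confirmation could look something like the sketch below. Everything in it is hypothetical: the menu, the POS codes, the prices, and the 0.6 similarity cutoff are invented for illustration, and fuzzy string matching from Python's standard library stands in for whatever matching a production agent actually uses. It is not a description of PieLine's implementation.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Optional

# Hypothetical menu: item name -> (POS code, price in cents).
MENU = {
    "pepperoni pizza": ("PZ-PEP", 1499),
    "margherita pizza": ("PZ-MAR", 1399),
    "garlic knots": ("SD-GKN", 599),
}


@dataclass
class Resolution:
    status: str                       # "resolved", "confirm", or "handoff"
    item: Optional[str] = None
    code: Optional[str] = None
    price_cents: Optional[int] = None


def resolve_item(heard: str, cutoff: float = 0.6) -> Resolution:
    """Map a transcribed phrase onto a real menu entry, or refuse to guess."""
    heard = heard.lower().strip()
    if heard in MENU:
        code, price = MENU[heard]
        return Resolution("resolved", heard, code, price)
    # Near miss: propose the closest menu item, but ask the caller to confirm.
    matches = get_close_matches(heard, list(MENU), n=1, cutoff=cutoff)
    if matches:
        code, price = MENU[matches[0]]
        return Resolution("confirm", matches[0], code, price)
    # Nothing on the menu is close enough: hand off instead of guessing.
    return Resolution("handoff")


print(resolve_item("pepperoni pizza"))
print(resolve_item("peperoni piza"))
print(resolve_item("xyz"))
```

The design point is that a degraded transcript never becomes a ticket by itself: it either resolves to a real item with a real code and price, triggers a read-back, or leaves the automated path.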

How PieLine is built for the gap, not the demo

Two design choices follow directly from everything above. They are the part you can hold a vendor to.

1 · measured at the order level, in production

PieLine targets 95%+ order accuracy in production, not word error rate on clean audio. The unit of measurement is the ticket, with cuisine-specific customization (half-and-half pizzas, spice levels, protein subs), because a ticket that fires wrong is a failure even when the transcript was right.

2 · designed to fail safely

Roughly 90% of calls are handled end to end by the AI. The remaining edge cases (complaints, catering, anything unusual or unintelligible) are transferred to a human with the full conversation context, so a degraded call becomes a clean handoff instead of a confidently wrong order. During the first month of every account, real calls are actively monitored and the agent is refined against live traffic, not a benchmark.

95%+ order-level accuracy targeted in production

Roughly 90% of calls handled end to end by the AI

Simultaneous calls handled during a rush

POS integrations (Clover, Square, Toast, Aloha, Revel live)

“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”

Customer feedback

PieLine live deployment, Bay Area

What to ask before you trust the accuracy number

Skip the headline figure. It is the cheapest thing to quote and the hardest thing to verify. Ask the questions that expose whether the number was ever measured under production conditions:

Is that accuracy word error rate or task completion? If it is word error rate, ask what it drops to on real calls.
What audio was it measured on? Clean headset audio and a narrowband phone line on a noisy floor are two different planets.
What happens to the calls it cannot complete? A safe handoff with context beats a confident wrong answer every time.
Can I see the number on a sample of my own calls? The only accuracy figure that matters is the one measured on your callers, your noise, and your menu.
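For the last question, scoring a sample of your own calls can be as simple as the sketch below. It assumes each call has been reviewed by a person who recorded what the caller wanted; the `ScoredCall` record, the condition labels, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredCall:
    condition: str               # e.g. "quiet", "kitchen noise"
    expected_order: dict         # what a human reviewer says the caller wanted
    fired_order: Optional[dict]  # what the agent sent; None if handed off


def summarize(calls):
    """Per-condition share of correct orders, wrong orders, and handoffs."""
    counts = defaultdict(lambda: {"correct": 0, "wrong": 0, "handoff": 0})
    for call in calls:
        if call.fired_order is None:
            outcome = "handoff"
        elif call.fired_order == call.expected_order:
            outcome = "correct"
        else:
            outcome = "wrong"
        counts[call.condition][outcome] += 1
    return {
        condition: {name: n / sum(tally.values()) for name, n in tally.items()}
        for condition, tally in counts.items()
    }


small = {"item": "pepperoni pizza", "size": "small"}
large = {"item": "pepperoni pizza", "size": "large"}
sample = [
    ScoredCall("quiet", small, small),
    ScoredCall("quiet", large, large),
    ScoredCall("kitchen noise", small, large),  # confidently wrong order
    ScoredCall("kitchen noise", small, None),   # handed off to a human
    ScoredCall("kitchen noise", large, large),
    ScoredCall("kitchen noise", small, small),
]

report = summarize(sample)
for condition, shares in report.items():
    print(condition, {name: f"{share:.0%}" for name, share in shares.items()})
```

Splitting the result by condition matters more than the overall average, and keeping "wrong" separate from "handoff" shows whether the failures are the safe kind.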

See the production number, not the demo number

Book a short demo and we will show how PieLine measures order-level accuracy on real calls and hands off cleanly when a line gets too noisy.

Frequently asked questions

Why is voice agent accuracy lower in production than in the demo?

Because the demo number is almost always word error rate measured on clean, near-microphone audio with a cooperative speaker. Production audio is the opposite: background noise, speakerphone, accents, people talking over each other, and callers who change their mind mid-sentence. Studies of speech recognition document 2.8 to 5.7 times degradation from benchmark to production conditions. The same model that scores above 95% on a clean benchmark like LibriSpeech often falls to 70% or lower on live audio. The honest production metric is not how many words were transcribed, it is whether the right action fired at the end of the call.

What accuracy do voice agents actually hit on real phone calls?

It depends almost entirely on acoustic conditions, not the model. A contact-center analysis found the same speech API hit 92% accuracy on clean headsets, 78% in conference rooms, and 65% on mobile calls with background noise. Signal-to-noise ratio drives it: roughly 3.5% word error rate at 20 dB SNR, around 15% at 10 dB, and over 70% at 0 dB. A restaurant phone line, with kitchen noise behind the caller and a phone held to the ear, sits at the worse end of that range, which is exactly why a single headline accuracy number tells you very little.

What is the difference between word error rate and task completion?

Word error rate measures whether the words were transcribed correctly. Task completion measures whether the agent did the right thing: placed the right order, booked the right reservation, routed the right call. They diverge because slot-bearing words (a name, a quantity, a modifier, an item) matter far more than filler words. Hearing 'large' when the caller said 'small' is one wrong word out of twenty, which reads as 95% transcription accuracy, but it is a 100% wrong ticket. This is why some teams track a separate slot error rate, and why a phone-ordering agent should be measured on whether the correct ticket fired, not on how many words it heard.

How should I measure a voice agent before trusting it in production?

Stop measuring transcription on clean audio. Measure end-to-end task completion on a representative sample of real calls: noisy lines, accented callers, complex and partially changed orders, and edge cases. For an ordering agent the unit of measurement is the order, not the word. Ask the vendor what percentage of real calls end with the correct, priced, routable outcome, and what happens to the calls that do not. A vendor that only quotes a transcription accuracy figure has not measured the thing you actually care about.

How does PieLine handle the production accuracy gap?

Two ways. First, accuracy is measured at the order level, not the word level: PieLine targets 95%+ order accuracy in production with cuisine-specific customization, because a ticket that fires wrong is a failure even if the words were right. Second, it is designed to fail safely. Roughly 90% of calls are handled end to end by the AI, and the remaining edge cases (complaints, catering, anything unusual or unintelligible) are transferred to a human with the full conversation context, so a degraded call becomes a clean handoff instead of a wrong order. During the first month of every account, calls are actively monitored and the agent is refined against real traffic.

Does background noise really change accuracy that much?

Yes, and it is usually the single biggest factor. Acoustic conditions, not the model, explain most of the variance in production. Microphone bandwidth alone matters: a narrowband phone channel (300 Hz to 3.4 kHz) can sit near 25% word error rate at 10 dB SNR where a wideband channel would be around 12%. A restaurant line carries kitchen clatter, a hood fan, and other diners, all on a narrowband phone codec, which is the worst combination for raw transcription and the reason the agent has to lean on menu structure and confirmation rather than perfect hearing.

Is a higher benchmark accuracy number always better?

Not on its own. A 2% word error rate on a clean benchmark and a 4% word error rate on a clean benchmark are nearly indistinguishable once a real phone line degrades both by 3 to 5 times. What separates a usable production agent from a fragile one is rarely the headline benchmark. It is how the system behaves when accuracy drops: whether it confirms slot-bearing details, reads availability from a real source, and hands off cleanly instead of guessing.