Voice models, evaluated for restaurants
New AI voice models for restaurant phone ordering, April 2026: what actually survives a Friday rush
OpenAI moved gpt-realtime to general availability with two new voices and a 20% price cut. Cartesia shipped Sonic-3 with sub-100 ms time-to-first-byte. ElevenLabs opened the public API for Eleven v3 with audio tags and 70+ languages. Vendor pages benchmark them on 24 kHz studio audio. Restaurant phone calls arrive at 8 kHz G.711 mono with kitchen noise behind them. Below is the per-model scorecard against a 102.36 second reference call PieLine ships in production, and which model we actually swapped to default for which leg of the pipeline.
Direct answer (verified 2026-04-29)
Which new voice models matter for restaurant phones?
- OpenAI gpt-realtime (now generally available, voices Cedar and Marin, 20% price drop versus gpt-4o-realtime-preview): wins the speech-to-speech and tool-calling axis.
- Cartesia Sonic-3 (time-to-first-byte under 100 ms standard, ~40 ms Turbo, fine-grained emotion control): wins the menu read-back and barge-in axis.
- ElevenLabs Eleven v3 (public API, audio tags, multi-speaker dialogue, 70+ languages): wins the cuisine-name pronunciation axis on multilingual menus.
For restaurant phones the rank order does not match any vendor leaderboard. The rest of this page is why.
The 24 kHz vs 8 kHz problem
Every demo on every voice model vendor page is recorded at 22 kHz or 24 kHz. The audio sounds great because there is nothing wrong with it. The speaker is in a quiet room with a real microphone, the model is trained on clean speech, and the playback bandwidth is far wider than a phone line will ever carry. That is fine if you are building a podcast, an audiobook, or a YouTube voiceover.
A restaurant phone call is not that. It arrives over the public switched telephone network as 8 kHz mono, encoded in G.711 (mu-law on US carriers, A-law in most of Europe). Every frequency above 3.4 kHz is gone before the model ever sees the audio. Much of the fricative energy that separates “peppers” from “feathers” lives in that band. Pairs like “mild” and “wild” differ in lower-frequency cues, but those are exactly the ones a noisy kitchen masks.
On top of that, the caller is often in a moving car on a Bluetooth headset, there is a child in the back seat, the dishwasher is running on the restaurant side, and a server is reading a different ticket out loud three feet from the phone. The MOS score on a vendor page does not survive any of this. The right way to evaluate is to take whatever the vendor ships, downsample it to 8 kHz mu-law, mix in restaurant ambient noise, and only then score the model.
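The downsample, mix, and encode steps can be sketched in pure Python. This is a minimal illustration, not PieLine's harness: the block-average decimator stands in for a proper anti-aliased resampler, the 10 dB default SNR is an arbitrary assumption, and every function name here is hypothetical. The mu-law encoder and decoder follow the standard G.711 segment layout.

```python
import math

BIAS = 0x84   # G.711 mu-law bias (132)
CLIP = 32635  # largest magnitude allowed before the bias is added


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode an 8-bit mu-law byte back to a signed PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def decimate_by_3(samples):
    """Crude 24 kHz -> 8 kHz conversion: average each block of three samples."""
    usable = len(samples) - len(samples) % 3
    return [round(sum(samples[i:i + 3]) / 3) for i in range(0, usable, 3)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def mix_at_snr(speech, noise, snr_db):
    """Add looped noise to speech, scaled so speech RMS over noise RMS equals snr_db."""
    noise_rms = rms(noise)
    if noise_rms == 0 or not speech:
        return list(speech)
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    mixed = (s + gain * noise[i % len(noise)] for i, s in enumerate(speech))
    return [max(-32768, min(32767, round(m))) for m in mixed]


def telephonize(studio_24k, kitchen_noise_8k, snr_db=10.0):
    """Vendor studio audio in, something closer to a restaurant phone line out."""
    narrow = decimate_by_3(studio_24k)
    noisy = mix_at_snr(narrow, kitchen_noise_8k, snr_db)
    # Round-trip through mu-law so the quantization loss is baked in.
    return [ulaw_to_linear(linear_to_ulaw(s)) for s in noisy]
```

Scoring the model on the output of `telephonize` rather than on the vendor's original file is the whole point: the same sentence, with the bandwidth and noise floor a real caller would have.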
What gets evaluated
24 kHz mono studio capture, single speaker, no background noise, prepared script, modern handset.
- Reads like a podcast intro
- Word error rate under 2%
- Time-to-first-byte numbers come from the same studio
- No menu items, no modifiers, no POS
The four axes that decide whether a new model ships
When a vendor announces a new release, the question is not whether the new model is better in general. The question is whether it improves any of the four axes that map to a restaurant's revenue. Everything else is noise.
Restaurant phone evaluation axes
- Multichannel ASR accuracy on 8 kHz G.711 audio (customer and AI on separate channels), against a downsampled reference call, not vendor 24 kHz studio capture.
- Barge-in latency under simulated kitchen ambient noise. Under 250 ms feels natural, over 500 ms feels like a robot.
- Cuisine-name and modifier accuracy on the actual restaurant's menu. Half-and-half pizzas, spice levels, protein subs, custom sushi rolls, allergen flags.
- Round-trip POS commit latency to Clover, Square, Toast, NCR Aloha, or Revel. From 'that is everything' to a fired ticket.
- Underneath all four sits a prerequisite rather than an axis: a per-call structured record (audio, captions, envelope, ontology, POS receipt). If you cannot reconstruct what was heard, you cannot roll a model back.
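The barge-in axis is the easiest one to turn into a number. The sketch below assumes you already have two timestamps per interruption (when the caller started speaking and when the agent's audio actually stopped); how those are detected is out of scope here. The 250 ms and 500 ms thresholds come from the list above, while the "borderline" label for the gap between them and all class and function names are assumptions for illustration.

```python
import statistics
from dataclasses import dataclass

NATURAL_BELOW_MS = 250  # under this, the interruption feels natural
ROBOTIC_ABOVE_MS = 500  # over this, it feels like a robot


@dataclass(frozen=True)
class BargeIn:
    caller_start_ms: float  # caller begins speaking over the agent
    agent_stop_ms: float    # agent audio actually goes silent

    @property
    def latency_ms(self) -> float:
        return self.agent_stop_ms - self.caller_start_ms


def classify(latency_ms: float) -> str:
    if latency_ms < NATURAL_BELOW_MS:
        return "natural"
    if latency_ms > ROBOTIC_ABOVE_MS:
        return "robotic"
    return "borderline"


def summarize(events):
    """Report the median and the worst case; the verdict follows the worst case."""
    latencies = [e.latency_ms for e in events]
    worst = max(latencies)
    return {
        "median_ms": statistics.median(latencies),
        "worst_ms": worst,
        "verdict": classify(worst),
    }
```

Judging on the worst interruption rather than the median is a deliberate choice in this sketch: a caller remembers the one time the agent kept talking.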
The actual call config we use to evaluate
The eval reuses the literal Deepgram call signature from PieLine's voice activity pipeline. We run it against every new model swap, with the same flags, the same caption grouping thresholds, and the same 60 Hz amplitude envelope sampler. The reference call at /public/audio/dennys-order.mp3 is 102.36 seconds long, recorded as 16-bit stereo with the customer on the left channel and the AI on the right, and downsampled to 8 kHz before the eval starts.
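For readers who have not built one, a 60 Hz amplitude envelope is just one loudness value per 1/60 of a second, per channel. The following is a rough sketch under stated assumptions: it takes the peak of each frame (the production sampler may well use RMS or something else), it expects interleaved 16-bit stereo samples as plain integers, and the function names are hypothetical.

```python
def split_stereo(interleaved):
    """Interleaved L/R samples -> (left, right). Left is the customer, right is the AI."""
    return interleaved[0::2], interleaved[1::2]


def amplitude_envelope(samples, sample_rate=8000, envelope_hz=60):
    """One normalized peak value (0.0 to 1.0) per 1/envelope_hz seconds of audio."""
    # Ceiling division so a trailing partial frame still gets a value.
    n_frames = -(-len(samples) * envelope_hz // sample_rate)
    envelope = []
    for i in range(n_frames):
        # 8000 / 60 is not an integer, so frame edges are computed per frame
        # instead of using a fixed window length that would drift.
        start = i * sample_rate // envelope_hz
        end = min(len(samples), (i + 1) * sample_rate // envelope_hz)
        envelope.append(max(abs(s) for s in samples[start:end]) / 32768)
    return envelope
```

At 8 kHz the 102.36 second reference call is 818,880 samples per channel, which this sketch turns into 6,142 envelope points per channel.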
Holding the harness fixed is the point. When OpenAI ships gpt-realtime, when Cartesia ships Sonic-3, when ElevenLabs opens v3, we run all three through the same pipeline against the same recording. The output is the same five-field record per call, so any regression is visible immediately and any improvement is attributable.
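One way to make "any regression is visible immediately" concrete is to treat the five fields as an immutable record and diff two runs field by field. The field names follow the list in this article (audio, captions, envelope, ontology, POS receipt); the types, the example values, and the `changed_fields` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CallRecord:
    audio_path: str     # where the recording lives
    captions: tuple     # (channel, start_s, end_s, text) tuples
    envelope: tuple     # amplitude envelope values
    ontology: tuple     # menu items and modifiers the call resolved to
    pos_receipt: str    # what was actually committed to the POS


def changed_fields(baseline: CallRecord, candidate: CallRecord) -> list:
    """Names of the fields where the candidate model's run differs from baseline."""
    return [
        f.name
        for f in fields(baseline)
        if getattr(baseline, f.name) != getattr(candidate, f.name)
    ]
```

Because the audio and envelope are identical across runs on the same recording, a swap that only changes `captions` points at the ASR leg, and one that changes `ontology` or `pos_receipt` points further downstream.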
One run through the harness, end to end
A run through the harness reports three things when a model swap clears the four axes: word error rate on the multichannel ASR pass, time-to-first-byte on each TTS candidate, and round-trip POS commit latency on the three POS systems most of our customers run.
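Word error rate is the one number here with a standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. A minimal sketch follows; the lowercase-and-split normalization is a simplifying assumption, and a real scorer would also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

On a multichannel pass you would call this once per channel, so a model that transcribes the AI side perfectly cannot hide errors on the customer side.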
CLEARED for shadow rollout means the new model runs on a small percentage of live calls, on a copy of the audio path, with output compared against the production model. If shadow output matches production within tolerance for two weeks, we promote it. If not, we hold.
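The shadow rule reduces to two small decisions: which calls get shadowed, and when the candidate has earned promotion. The sketch below is one way to express both. The hash-based sampling, the 2% tolerance, and the idea of tracking a daily mismatch rate are all assumptions for illustration; only the two-week window and the promote-or-hold outcome come from the text.

```python
import hashlib


def in_shadow_sample(call_id: str, percent: int) -> bool:
    """Deterministically place a call in the shadow group by hashing its ID."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def shadow_decision(daily_mismatch_rates, tolerance=0.02, required_days=14):
    """Promote only after required_days consecutive days within tolerance."""
    if len(daily_mismatch_rates) < required_days:
        return "keep shadowing"
    recent = daily_mismatch_rates[-required_days:]
    return "promote" if all(rate <= tolerance for rate in recent) else "hold"
```

Hashing the call ID instead of rolling a random number means the same call always lands in the same group, which makes a disputed call reproducible later.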
What we shipped, what we held
The April 2026 wave changed the default on two legs of the pipeline and left the others where they were. The ASR leg did not move because Deepgram nova-3 multichannel still wins on per-channel word timestamps under crosstalk. The speech-to-speech leg is moving to gpt-realtime for restaurants on annual plans, for the tool-call precision improvement on multi-modifier orders. The menu read-back leg moved to Cartesia Sonic-3 because the latency difference is audible to callers. Eleven v3 became an option, not a default, because the menus that benefit from it are a subset.
| Feature | Typical 'one model fits all' setup | PieLine, after the April 2026 model wave |
|---|---|---|
| Released or GA window | Varies, often previous-generation | All three integrated and benchmarked |
| OpenAI gpt-realtime (GA, voices Cedar and Marin) | Often left at gpt-4o-realtime-preview | Default speech-to-speech path on annual plans |
| Cartesia Sonic-3 (sub-100 ms TTFB, ~40 ms Turbo) | Common to keep older Sonic or Sonic-2 | Default TTS for menu read-back path |
| ElevenLabs Eleven v3 (audio tags, 70+ languages) | Marketed as 'all v3 all the time' | Optional TTS for multilingual menus |
| Multichannel ASR (separate customer + AI channels) | Single-channel diarization, prone to crosstalk errors | Deepgram nova-3 multichannel, fixed |
| Eval harness against an 8 kHz G.711 reference call | Studio MOS scores, no PSTN downsample | Yes, /public/audio/dennys-order.mp3 (102.36s) |
| Per-call structured record for rollback | Call duration and a transcript blob | Audio, envelope, captions, ontology, POS receipt |
| Operator pricing changes when models change | Per-minute pricing reprices on every SKU | No, $350/month for 1,000 calls is fixed |
What this looks like to a real caller
A model swap that moves the right axis is not visible to the operator as a feature toggle. It is visible to the caller as a phone call that does not feel like a phone call. What the caller actually notices is that the agent stops talking the moment they start (barge-in latency under 250 ms), pronounces the dishes correctly the first time, and reads back the order without dragging.
“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”
That is the whole pitch. The caller does not know which model answered. The operator does not have to know either. The job of the harness is to make sure calls keep feeling that way regardless of which model the vendor ships next month.
What an operator should ask any vendor in April 2026
If you are evaluating a restaurant phone AI vendor right now, the new model wave gives you a sharp tool. Ask which version of gpt-realtime, which Sonic SKU, and whether they have integrated Eleven v3 yet. The honest answer is that almost everyone is on a previous-generation default, because rebenchmarking, shadow rollout, and per-customer tuning take real engineering time. A vendor that says “we always use the latest version” is either lying or skipping the shadow phase, and you do not want either.
Then ask which axis they evaluate on. If the answer is “our customers love it,” that is a warning sign. If the answer is something like “multichannel ASR word error rate on a downsampled 8 kHz reference call, plus barge-in latency under simulated kitchen noise, plus POS commit latency on your specific POS,” you are talking to a team that has done the work.
Finally, ask whether the price changes when the underlying model changes. If yes, you are buying volatility. The model SKUs will change four times a year for the foreseeable future. Your monthly bill should not.
“Mylapore (11 South Indian restaurants in the Bay Area) projects $500 additional revenue per location per day from eliminating the phone bottleneck. The numbers do not depend on which model SKU is current.”
Mylapore, Bay Area
Have a menu, a POS, and a phone line that gets slammed at 6:45?
We will run your actual menu through gpt-realtime, Sonic-3, and Eleven v3 on a downsampled call before you commit. Same-day onboarding, $350/month, money-back the first month.
Frequently asked questions
Which new voice models actually matter for restaurant phone ordering as of April 2026?
Three released between December 2025 and April 2026 are worth attention. OpenAI gpt-realtime is now generally available with two new voices (Cedar and Marin) and a 20% price drop versus gpt-4o-realtime-preview. Cartesia Sonic-3 is the latency leader, with time-to-first-byte under 100 milliseconds and a Sonic Turbo variant at 40 ms. ElevenLabs opened the public API for Eleven v3, with audio tags ([excited], [whispers]), multi-speaker dialogue, and 70+ languages. The rank order in studio benchmarks does not match the rank order on a real restaurant phone line.
Why do studio benchmarks not transfer to restaurant phones?
Restaurant phone calls arrive over the public switched telephone network as 8 kHz G.711 mono. Vendor demo audio is 22 or 24 kHz studio capture. The frequency content above 3.4 kHz is gone before the model ever sees it. On top of that, the audio is mixed with kitchen ambient noise (hood vents, ticket printers, dishwashers), the caller is often in a moving car or on a Bluetooth headset, and there is almost always a side conversation in the background. A model that produces flawless prosody on a 24 kHz read-through of a paragraph can still mis-hear 'mild' as 'wild' on a $24 large two-topping order.
What does PieLine actually evaluate when a new model drops?
Four axes that map to revenue, not to vendor leaderboards. (1) Multichannel ASR accuracy on 8 kHz G.711 audio with the customer and AI on separate channels (Deepgram nova-3 multichannel is the current default). (2) Barge-in latency: the time between the caller starting to speak and the AI yielding the floor, measured under simulated kitchen noise. (3) Cuisine-name and modifier accuracy on the actual restaurant's menu (half-and-half pizzas, spice levels, protein subs, custom sushi rolls). (4) Round-trip POS commit latency: from caller saying 'that is everything' to a fired ticket on Clover, Square, Toast, NCR Aloha, or Revel. The reference for all four is a 102.36 second stereo recording at /public/audio/dennys-order.mp3 that ships with this site.
What is gpt-realtime good at on restaurant calls, and where does it struggle?
Strong: instruction-following on multi-turn modifications ('half pepperoni half mushroom, light cheese on the mushroom side, no garlic'), tool-call precision against a structured menu schema, and new MCP support that reduces glue code on the agent side. Weak: ambient noise sensitivity is higher than Cartesia in our barge-in tests, and the speech-to-speech path can over-confirm when the caller is curt. The new Cedar and Marin voices sound notably more natural than the older Alloy and Sage, but the price is still $32 per 1M audio input tokens and $64 per 1M audio output tokens, which adds up at one large pizza order per minute on a Friday.
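To see how the token prices add up, the arithmetic can be written out. The two rates below are the ones quoted above; the token counts are left as inputs on purpose, because how many audio tokens a given call consumes depends on its length and on the vendor's tokenization, and no figure for that is asserted here. The function name is hypothetical.

```python
INPUT_USD_PER_MILLION = 32   # audio input tokens, as quoted above
OUTPUT_USD_PER_MILLION = 64  # audio output tokens, as quoted above


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Audio-token cost of a call (or a whole shift) at the quoted rates."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000
```

Feed it measured token counts from your own usage logs, summed over a Friday rush, to see what the speech-to-speech leg costs before any margin.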
What is Cartesia Sonic-3 good at on restaurant calls, and where does it struggle?
Strong: time-to-first-byte under 100 ms is the difference between a phone call that feels like a real conversation and one where the caller talks over the agent. Sonic-3 also adds fine-grained control on volume, speed, and emotion, and the new audio-tag-style markup lets you bias intonation per item read-back. Weak: it is a TTS model, not a speech-to-speech model. You still need a separate ASR (we use Deepgram nova-3 multichannel) and an LLM brain. That is fine for a restaurant pipeline because we want each component swappable, but it is not a one-model drop-in.
What is Eleven v3 good at on restaurant calls, and where does it struggle?
Strong: cuisine-name pronunciation across languages. 'Dosa', 'pho', 'al pastor', 'crudo', 'okonomiyaki' are pronounced correctly without per-word phoneme overrides. Multi-speaker dialogue is interesting on hold-music style scripts, less relevant on a one-on-one phone call. Weak: latency is higher than Sonic-3, and the audio-tag DSL is expressive but easy to over-tune. We use Eleven v3 selectively for restaurants whose menu sits in two or more languages and whose customer base reads it in either.
Did the April 2026 releases change which model PieLine ships by default?
We rebenchmarked all three. The default ASR remains Deepgram nova-3 multichannel because per-channel word timestamps are load-bearing in our monitoring pipeline (the same pipeline that produces the 60 Hz amplitude envelope on the home page). The default speech-to-speech path is being moved to gpt-realtime for restaurants on annual plans because the tool-call precision improvement on multi-modifier orders is real. Sonic-3 is the default TTS for the menu read-back path, where the latency improvement is most audible to callers. Eleven v3 is offered as an option for multilingual menus.
Can a restaurant operator just plug in gpt-realtime and call it done?
No. The API is a substrate, not a product. To turn it into something a restaurant can answer Friday's calls with, you need: a menu ontology mapped to your POS item IDs (dish names, modifiers, allergens, half-and-half rules), a barge-in policy tuned for kitchen noise, a confirmation-phrase library for your cuisine, a transfer policy for edge cases, payment handling for delivery, a structured per-call record so a manager can audit, and a same-day onboarding loop. PieLine builds all of that on top of these models so the operator does not have to.
How do these new models change pricing for a restaurant?
They do not, on PieLine. Standard pricing is $350/month for up to 1,000 calls, $0.50 per call after that, with a money-back guarantee for the first month. Underlying model cost is our problem, not the operator's. Per-minute pricing from some competitors is unpredictable for restaurants and tends to spike exactly when the line is busiest. We absorb the model swap so the operator's bill does not change when OpenAI or Cartesia ships a new SKU.
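The pricing rule above is simple enough to write down. This sketch encodes only the numbers stated in this answer ($350 covering the first 1,000 calls, $0.50 for each call beyond that); the function name is hypothetical, and taxes, annual-plan terms, and the money-back guarantee are not modeled.

```python
BASE_USD = 350.00       # monthly price
INCLUDED_CALLS = 1000   # calls covered by the base price
OVERAGE_USD = 0.50      # per call beyond the included volume


def monthly_bill_usd(calls: int) -> float:
    """Monthly bill for a given call volume under the standard plan."""
    extra_calls = max(0, calls - INCLUDED_CALLS)
    return BASE_USD + extra_calls * OVERAGE_USD
```

A location that takes 1,400 calls in a month pays $550. Nothing in the function takes call duration or the model SKU as an input, which is the point of the answer above.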
What about Burger King's 'Patty' and the ConverseNow + Deliverect partnership announced in April 2026?
Both are real and both are good signals for the category. Burger King began piloting an OpenAI-based voice assistant called 'Patty' inside employees' headsets across 500 restaurants. ConverseNow and Deliverect announced an integration on April 16, 2026 to flow phone and drive-thru voice orders into Deliverect's unified order management. PieLine sits in a different lane: independent restaurants and mid-market chains where same-day onboarding and per-call pricing matter more than enterprise procurement. The model wave lifts every boat, but the buyer profile is different.
Related guides
24/7 Phone Monitoring for Restaurants: Why Counting Calls Is Not Monitoring
The per-call shape PieLine writes, traced from the same 102.36 second reference recording.
Restaurant Voice AI Landscape 2026: Companies, Funding & What Operators Should Evaluate
An honest map of the restaurant voice AI market and how to evaluate vendors.
Best AI Voice Agents for Restaurants in 2026: The 30-Day Refinement Loop Every Roundup Skips
What separates a vendor that survives 200 real calls from one that survives a single demo.