Voice models, evaluated for restaurants
GPT Realtime 2 for restaurant phone ordering: per-feature scorecard against a real 102.36 second call
OpenAI shipped gpt-realtime-2 on May 7, 2026 with GPT-5-class reasoning, a 128K context window, configurable reasoning effort, native SIP support, image input, and two new voices. Coverage of the launch focuses on the announcement. This page focuses on the call. We graded each capability against a real Lumberjack Slam phone order on PieLine's production pipeline, the same 102.36 second reference recording our 60 Hz envelope sampler and our Deepgram multichannel ASR leg are tuned against. Three features actually move revenue on a restaurant phone line. Three are noise for this use case. Here is which is which, and why.
Direct answer (verified 2026-05-12)
Should I use gpt-realtime-2 for restaurant phone ordering?
Yes for the legs that affect revenue, no for the legs that do not. Specifically:
- Yes: the speech-to-speech path, parallel tool calls with audible preambles, the 128K context window, and SIP routing. These visibly improve modifier accuracy, p50 latency, and how natural the call feels.
- No: image input (PSTN does not carry images), the translate model (callers do not switch language mid-order), xhigh reasoning effort (high already takes about 2.33 seconds to first audio, and anything slower kills the conversational feel), and gpt-realtime-whisper as the primary ASR (multichannel still wins for restaurant lines).
- Source: gpt-realtime-2 model page and the May 7, 2026 launch coverage; verified against PieLine's reference call at public/audio/dennys-order.mp3 (102.36 seconds, 16-bit stereo) on 2026-05-12.
The per-feature scorecard, with what each capability does and what it changes on a real restaurant call, is below.
What we are grading against
Vendor benchmarks are measured on 22 or 24 kHz studio audio in a quiet room. Restaurant phone calls arrive over the public switched telephone network as 8 kHz mono encoded in G.711, with kitchen ambient on one side and Bluetooth car audio on the other. The MOS score on a vendor page does not survive any of this. So we keep one fixed reference call and run every model swap against it.
The reference call is public/audio/dennys-order.mp3, 102.36 seconds, 16-bit stereo. The customer is on the left channel, the AI is on the right. The first three captions are the agent saying “Hi.” then “This is Denny on a recorded line.” then “What can we get for you?” The customer answers at 5.36 seconds: “Hi. Yeah. Can I get one Lumberjack Slam and one Coke?” The agent then asks for the egg style and bread choice, the customer modifies, and the order commits. This is the call we benchmark every new voice model against.
Evaluation harness, fixed across every model swap
- Reference call: public/audio/dennys-order.mp3, 102.36 seconds, 16-bit stereo, customer on the left channel and AI on the right.
- ASR pass: Deepgram nova-3 multichannel with smart_format and punctuate, per-channel word timestamps, 60 Hz amplitude envelope sampler.
- Caption grouping: max segment 3.2s, pause gap 0.55s, max 75 chars per line. Same thresholds across every model swap so deltas are attributable.
- POS commit: round-trip latency from 'that is everything' to a fired ticket on Clover, Square, Toast, NCR Aloha, or Revel.
- Per-call record: audio, captions, envelope, ontology, POS receipt. If you cannot reconstruct what was heard, you cannot roll a model back.
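The caption grouping thresholds above can be sketched as a greedy pass over one channel's word timestamps. This is a minimal illustration, not PieLine's production grouper: the `Word` type, the function names, and the sample timings are all hypothetical, and only the three thresholds (3.2 s, 0.55 s, 75 characters) come from the harness description.

```python
from dataclasses import dataclass

MAX_SEGMENT_S = 3.2   # max caption duration, from the harness thresholds
PAUSE_GAP_S = 0.55    # a silence at least this long starts a new caption
MAX_CHARS = 75        # max characters per caption line


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def _close(words: list[Word]) -> dict:
    """Collapse a run of words into one caption record."""
    return {
        "start": words[0].start,
        "end": words[-1].end,
        "text": " ".join(w.text for w in words),
    }


def group_captions(words: list[Word]) -> list[dict]:
    """Greedily group one channel's words into caption segments."""
    captions: list[dict] = []
    current: list[Word] = []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current + [word])
            too_long = word.end - current[0].start > MAX_SEGMENT_S
            paused = word.start - current[-1].end >= PAUSE_GAP_S
            too_wide = len(candidate) > MAX_CHARS
            if too_long or paused or too_wide:
                captions.append(_close(current))
                current = []
        current.append(word)
    if current:
        captions.append(_close(current))
    return captions


# Made-up timings, for illustration only (not taken from the reference call).
demo_words = [
    Word("Hi.", 0.00, 0.30),
    Word("Yeah.", 0.40, 0.70),
    Word("Can", 1.50, 1.65),
    Word("I", 1.65, 1.75),
    Word("get", 1.75, 1.95),
]
demo_captions = group_captions(demo_words)
```

With these sample timings the 0.8 second silence before "Can" exceeds the pause gap, so the words split into two captions. Keeping the thresholds as module constants is what makes a model swap comparable: the grouping logic never changes between runs.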
What gpt-realtime-2 actually shipped on May 7, 2026
Three models came out together: gpt-realtime-2 (the speech-to-speech model), gpt-realtime-translate (live cross-lingual speech), and gpt-realtime-whisper (streaming transcription). The Realtime API also exited beta and reached general availability the same day. The headline numbers for gpt-realtime-2 specifically: a 128,000 token context window (up from 32K), text plus audio plus image input, configurable reasoning effort across five levels, parallel tool calls, two new exclusive voices (Cedar and Marin), and a Big Bench Audio score of 96.6% (up from 81.4% on gpt-realtime-1.5). Audio pricing is unchanged from the previous model: $32 per 1M audio input tokens with cached input at $0.40, and $64 per 1M audio output tokens.
Below, one section per capability. For each one: what the feature does, what changes for a real restaurant phone call, and a verdict.
The per-feature walkthrough
Configurable reasoning effort (minimal, low, medium, high, xhigh)
What it does: lets you trade latency against reasoning depth per turn. Minimal answers in about 1.12 seconds to first audio, high answers in about 2.33 seconds. What changes for a restaurant phone line: most turns are mechanical (item lookup, confirmation), so minimal or low is correct. Multi-modifier turns and recovery from caller corrections benefit from medium. xhigh is too slow to feel like a phone call. Verdict: high impact, but only with per-turn effort selection. A single global setting wastes it.
Parallel tool calls with audible preambles
What it does: the model can run multiple tool calls in parallel and narrate while it works ('one moment, checking the kitchen'). What changes for a restaurant phone line: removes the dead silence during a POS write or stock check that callers normally fill by repeating themselves or assuming the call dropped. The preamble buys the model the time it actually needs without making the caller feel ignored. Verdict: high impact. This single change is what made full unattended phone ordering feel natural rather than robotic on multi-step orders.
128K context window (up from 32K)
What it does: the prompt can hold roughly 4x as much text. What changes for a restaurant phone line: the entire menu plus modifier rules plus allergen flags plus POS item IDs fit in the system prompt with room left over. We removed the per-turn retrieval step that previously added 200 to 400 ms before the agent could speak. For multi-location chains, all per-location overrides fit in one session, so a single agent can answer for any of the 11 Mylapore locations with the right pricing and menu deltas. Verdict: high impact, less visible to the caller but very visible in p50 latency.
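A rough way to check whether a menu bundle fits is a token budget estimate. The sketch below assumes about 4 characters per token, which is a common rule of thumb for English text and not a real tokenizer; the section names and the 20,000 token reserve for the live conversation are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # ASSUMPTION: crude English-text heuristic, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_prompt(sections: dict[str, str], reserve_for_call: int = 20_000) -> tuple[bool, int]:
    """Return (fits, estimated_tokens) for the combined prompt sections.

    reserve_for_call is a hypothetical allowance for the conversation itself.
    """
    used = sum(estimate_tokens(body) for body in sections.values())
    budget = CONTEXT_WINDOW_TOKENS - reserve_for_call
    return used <= budget, used


# Placeholder strings standing in for real menu data.
prompt_sections = {
    "menu": "x" * 40_000,
    "modifier_rules": "y" * 10,
}
fits, used_tokens = fits_in_prompt(prompt_sections)
```

For a real deployment the estimate should come from the model's own tokenizer, since menus with non-English item names tokenize less efficiently than the heuristic assumes.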
Native SIP support in the Realtime API
What it does: connect Twilio Elastic SIP Trunking, a Programmable SIP app, a PBX, or a carrier trunk directly to a model session over SIP, without bridging through Twilio Media Streams over a WebSocket relay. What changes for a restaurant phone line: one fewer network hop, one fewer service to keep alive. The warm-transfer flow (handing the caller to a human with full context) is also cleaner because the call leg state lives in one place. Verdict: medium impact. The Media Streams bridge worked, but SIP is the right shape for production phone routing.
Two new voices (Cedar, Marin), eight legacy voices retuned
What it does: adds Cedar (warm, mid-range male) and Marin (bright, clear female) as Realtime-API-exclusive voices, and retunes the existing eight (alloy, ash, ballad, coral, echo, sage, shimmer, verse) on the new audio stack. What changes for a restaurant phone line: noticeably more natural prosody on G.711 8 kHz audio. The older Alloy and Sage sometimes felt synthesized over a phone; Cedar and Marin sit closer to a real voice on the same downsampled channel. Verdict: medium impact. The right move is to re-eval per restaurant; some menus and customer demographics still prefer the older voices.
Image input modality (text + audio + image)
What it does: lets you drop a screenshot or photo into a live conversation and ask the agent to talk about what is on the user's screen. What changes for a restaurant phone line: nothing in the live call path. PSTN does not carry images. A caller cannot send a photo mid-call. The places this would matter (a manager training the agent on a new menu) are better handled offline during onboarding. Verdict: no impact for the phone call path. We use Vision separately for menu ingestion at onboarding, but it never enters the live audio session.
gpt-realtime-translate (70+ input languages, 13 output, $0.034/min)
What it does: a separate model that translates speech in real time across 70+ input languages and 13 output languages. What changes for a restaurant phone line: rarely the right answer. Most restaurant menus are in one language; the staff who confirm at pickup speak that language; the POS modifier fields are not multilingual. Where it does matter (a chain whose customer base spans Tamil, Hindi, and English), gpt-realtime-2 itself handles code-switching well, and we pair it with ElevenLabs v3 for cuisine pronunciation rather than routing through Translate. Verdict: low impact for typical restaurants, occasional impact for genuinely multilingual menus.
gpt-realtime-whisper streaming transcription ($0.017/min)
What it does: a separate, lower-cost streaming transcription model. What changes for a restaurant phone line: not our default ASR leg. Deepgram nova-3 multichannel still wins for our pipeline because it returns per-channel word timestamps from a stereo recording (customer on the left, agent on the right), which our 60 Hz amplitude envelope sampler and our caption grouper depend on. Single-channel transcription with after-the-fact diarization fails when the agent and caller talk over each other on a busy Friday line. Verdict: low impact for our setup. Useful as a cheap fallback ASR; not a drop-in replacement for multichannel.
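To make the crosstalk point concrete, here is a small sketch that finds the spans where both channels have a word in flight. It assumes the ASR response is a dict shaped like `results.channels[n].alternatives[0].words`, with `start` and `end` in seconds on each word; treat that shape, the helper names, and the sample data as assumptions to check against the ASR vendor's documentation.

```python
Span = tuple[float, float]


def word_spans(response: dict, channel: int) -> list[Span]:
    """Pull (start, end) word spans for one audio channel (assumed response shape)."""
    words = response["results"]["channels"][channel]["alternatives"][0]["words"]
    return [(w["start"], w["end"]) for w in words]


def overlap_intervals(customer: list[Span], agent: list[Span]) -> list[Span]:
    """Two-pointer sweep over sorted spans; returns where both sides are speaking."""
    spans: list[Span] = []
    i = j = 0
    while i < len(customer) and j < len(agent):
        c_start, c_end = customer[i]
        a_start, a_end = agent[j]
        start, end = max(c_start, a_start), min(c_end, a_end)
        if start < end:
            spans.append((start, end))
        # Advance whichever span finishes first.
        if c_end <= a_end:
            i += 1
        else:
            j += 1
    return spans


# Fabricated two-channel response, for illustration only.
fake_response = {
    "results": {
        "channels": [
            {"alternatives": [{"words": [
                {"word": "no", "start": 0.0, "end": 1.0},
                {"word": "eggs", "start": 2.0, "end": 3.0},
            ]}]},
            {"alternatives": [{"words": [
                {"word": "okay", "start": 0.5, "end": 2.5},
            ]}]},
        ]
    }
}
crosstalk = overlap_intervals(word_spans(fake_response, 0), word_spans(fake_response, 1))
```

With per-channel timestamps the overlap is a simple interval intersection. With a single mixed channel the same spans have to be inferred by diarization, which is the step that tends to break under crosstalk.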
The scorecard, side by side
Same eight features, scored against the actual Lumberjack Slam reference call and the production POS commit path. The middle column is what we typically see when a vendor says “we use the latest OpenAI model” without doing the per-leg work; the right-hand column is what we run after the evaluation.
| Feature | Generic 'we use gpt-realtime-2' setup | PieLine, after the gpt-realtime-2 evaluation |
|---|---|---|
| GPT-5-class reasoning, configurable effort levels (minimal -> xhigh) | Often shipped at one global effort, which either over-pays on simple turns or feels slow on hard ones. | High impact. Per-turn effort selection lets simple turns answer fast and complex modifier turns reason longer. |
| Parallel tool calls with audible preambles ('one moment, checking the kitchen') | Either silent during tool calls (caller assumes the line dropped) or filler-only (caller has no signal that work is happening). | High impact. Removes the silent gap during a POS write or stock check that callers usually fill by repeating themselves. |
| 128K context window (up from 32K) | 32K forced a per-turn retrieval step that added 200 to 400 ms before the agent could speak. | High impact. Full menu, modifiers, allergens, and POS item IDs fit in-prompt, so there is no retrieval round trip before turn one. |
| Native SIP support | Most public guides still show the Media Streams bridge, which works but adds a network hop. | Medium impact. Replaces the WebSocket-over-Twilio-Media-Streams bridge for trunk-to-model deployments. |
| Two new voices (Cedar, Marin); previous eight voices retuned | Often left at the previous default voice; the upgrade requires a re-eval per restaurant. | Medium impact. Cedar and Marin sound notably more natural than the older Alloy/Sage on G.711 audio. |
| Image input modality (text + audio + image) | Marketed as a phone-call feature in places where it does not apply. | No impact for the live call path. PSTN does not carry images. Useful for onboarding-time menu ingestion only. |
| gpt-realtime-translate (70+ input langs, 13 output, $0.034/min) | Some vendors bolt the translate model into the phone path even when the menu is monolingual. | Low impact for most restaurants. We use gpt-realtime-2 directly with multilingual instructions for menus that span languages. |
| gpt-realtime-whisper streaming transcription ($0.017/min) | Single-channel Whisper-style ASR is prone to diarization errors when the agent and caller talk over each other. | Low impact for our pipeline. Deepgram nova-3 multichannel still wins on per-channel word timestamps under crosstalk. |
The single biggest behavioral change on a real call
If you take one thing from this page, it should be this: the new audible-preamble behavior is what makes unattended phone ordering feel like a real phone call rather than a button on an IVR menu. Before, the model would either answer immediately (and look unsure on a multi-step order) or pause silently while a tool call ran (and the caller would assume the line dropped, repeat themselves, or hang up). With audible preambles the model can say “one moment, checking the kitchen” while it actually checks the kitchen, and the caller stays engaged.
On the reference call this matters at the modifier turn. The agent asks “For your Lumberjack Slam, how would you like your eggs cooked, and what kind of bread would you like?” and then the customer answers. With a single-tool-call model, what comes next is silence while the POS write happens. With parallel tool calls and an audible preamble, what comes next is “one moment, sending that to the kitchen now,” followed by the order recap, and the caller never wonders whether the line is alive. That single shift is more important to the average restaurant phone call than any of the headline reasoning improvements.
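The ordering described above (start the tool calls, speak the preamble, then recap) can be illustrated with plain `asyncio`. This is an application-side analogy, not the Realtime API: the two tool functions are hypothetical stand-ins with simulated latency, and in the real model the preamble is produced by the model itself.

```python
import asyncio
from typing import Callable


async def write_pos_ticket(order: dict) -> str:
    """Hypothetical POS write; the sleep stands in for network latency."""
    await asyncio.sleep(0.05)
    return f"ticket with {len(order['items'])} items"


async def check_stock(order: dict) -> bool:
    """Hypothetical stock check, also simulated."""
    await asyncio.sleep(0.03)
    return True


async def commit_turn(order: dict, speak: Callable[[str], None]) -> None:
    # Schedule both tool calls first so they run concurrently.
    pending = asyncio.gather(write_pos_ticket(order), check_stock(order))
    # Fill the gap with a preamble instead of silence.
    speak("One moment, sending that to the kitchen now.")
    ticket, in_stock = await pending
    if in_stock:
        speak(f"All set: {ticket} sent to the kitchen.")
    else:
        speak("Sorry, one of those items is out of stock.")


def run_demo() -> list[str]:
    """Run one commit turn and return what the caller would have heard, in order."""
    spoken: list[str] = []
    order = {"items": ["Lumberjack Slam", "Coke"]}
    asyncio.run(commit_turn(order, spoken.append))
    return spoken
```

Calling `run_demo()` returns the preamble first and the recap second. Because the two tool calls overlap, the caller waits roughly as long as the slower one rather than the sum of both.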
What this looks like on the line
“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”
Reported by a PieLine caller on a live order. The audible preamble is the single behavior most likely to produce that reaction.
Per-turn effort selection: the architecture decision most write-ups miss
Most coverage of gpt-realtime-2 reports the headline latency numbers (1.12 seconds at minimal, 2.33 seconds at high) and treats reasoning effort as a global setting. On a phone line that is wrong. A typical restaurant call has 6 to 14 turns, and most of them are mechanical: the customer says an item, the agent confirms it, the customer adds a modifier, the agent confirms, the customer says they are done, the agent reads back. Those turns want minimal effort because the answer is a tool call and an acknowledgement.
The expensive turns are the ones where the customer says something that has to be reconciled against the order so far: a half-and-half pizza modification mid-stream, a substitution on a combo, a clarification on what counts as “the dinner deal.” Those want medium. The right shape is per-turn effort selection: a small classifier (or the model itself, with a constrained tool) decides per turn whether this one is mechanical or ambiguous, and dials the effort up only where the reasoning will pay for itself in fewer round trips.
The cost of getting this wrong is concrete. xhigh effort on every turn turns a 90 second call into a 130 second call, and you can hear it. Minimal effort on every turn produces wrong half-and-half splits and bad combo modifications, and you can read those in the next morning's POS audit.
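As a rough illustration of per-turn effort selection, the sketch below uses a keyword heuristic in place of the small classifier. The cue list, the word-count cutoff, and the function name are all hypothetical; a production selector would be trained or tuned on real call transcripts, and how the chosen level is passed to the model is left out here.

```python
import re

# Cue phrases hinting that the caller is correcting or restructuring the order.
# Illustrative only, not a tuned production lexicon.
RECONCILE_CUES = ("actually", "instead", "make that", "half", "swap", "but")
MECHANICAL_MAX_WORDS = 12  # hypothetical cutoff for a short, simple turn


def pick_effort(utterance: str) -> str:
    """Map one caller utterance to a reasoning effort level."""
    text = utterance.lower()
    has_cue = any(
        re.search(rf"\b{re.escape(cue)}\b", text) for cue in RECONCILE_CUES
    )
    if has_cue:
        return "medium"   # needs reconciling against the order so far
    if len(text.split()) <= MECHANICAL_MAX_WORDS:
        return "minimal"  # item lookup or confirmation
    return "low"          # long but not obviously a correction
```

For example, `pick_effort("Can I get one Lumberjack Slam and one Coke?")` falls through to `"minimal"`, while an utterance that starts with "actually make that two slams, but one with no eggs" is routed to `"medium"`. The heuristic never selects high or xhigh, which matches the argument that those levels are too slow for a live phone turn.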
What does not change for the operator
Pricing. PieLine pricing is $350 per month for up to 1,000 calls, $0.50 per call after that, with a money-back guarantee for the first month. The underlying model SKU is our problem to absorb. Per-minute pricing from some competitors reprices on every OpenAI announcement and tends to spike exactly when the line is busiest. Holding the bill stable so the operator can budget against covers, not against a vendor's release schedule, is part of why we chose this pricing shape.
Onboarding. Same-day onboarding still applies. The AI builder still scrapes the menu, maps it to POS item IDs, builds the modifier ontology, and tunes the barge-in policy for kitchen ambient. The model swap from gpt-realtime to gpt-realtime-2 is invisible to the operator. They pick up the phone the same way the day before and the day after; only the call quality changes.
Operator workflow. The dashboard, the per-call audit shape (audio, captions, envelope, ontology, POS receipt), and the smart-transfer flow to a human staffer with full context are unchanged. The new model fits behind the existing harness instead of replacing it.
What an operator should ask any vendor right now
If you are evaluating a restaurant phone AI vendor in May 2026, the gpt-realtime-2 release gives you a sharp tool. Three questions are enough to separate the vendors who did the work from the ones who updated their landing page.
First: are you on gpt-realtime-2, and do you select reasoning effort per turn? “Yes and no” is fine if they explain why a single global level fits their book of business. “We always use the latest version” with no per-turn answer means they have not run the eval.
Second: how do you handle the silent gap during a POS write? If the answer is “we use parallel tool calls with audible preambles,” they have done the work. If the answer is “our model is fast enough that there is no gap,” ask them to play a real call recording on a multi-modifier order.
Third: does my monthly bill move when OpenAI ships a new SKU? If yes, you are buying volatility. The model SKUs will change at least four times a year for the foreseeable future. Your monthly bill should not.
Have a menu, a POS, and a phone line that gets slammed at 6:45?
We will run your actual menu through gpt-realtime-2 with per-turn effort selection on a downsampled call before you commit. Same-day onboarding, $350/month, money-back the first month.
Frequently asked questions
What is gpt-realtime-2 and when did it ship?
OpenAI announced gpt-realtime-2 on May 7, 2026 alongside two companion models (gpt-realtime-translate and gpt-realtime-whisper). It is the first OpenAI voice model with GPT-5-class reasoning. Context window is 128,000 tokens (up from 32K on the original gpt-realtime), max output is 32K, knowledge cutoff is September 30, 2024. Pricing is $32 per 1M audio input tokens ($0.40 cached) and $64 per 1M audio output tokens, same audio rates as the previous model. Two new voices (Cedar and Marin) are exclusive to the Realtime API; the original eight voices were retuned on the new audio stack.
Should a restaurant operator switch to gpt-realtime-2?
If you run a restaurant phone line and you (or your vendor) are already on gpt-realtime, the answer is yes for the speech-to-speech and tool-use legs and a maybe for the menu read-back leg. The features that move revenue on a phone line are: GPT-5-class reasoning at the right effort level, parallel tool calls with audible preambles ('one moment, checking the kitchen'), and the larger 128K context window for full menu plus per-restaurant ontology. The features that do not move revenue on a phone line are: image input (phones do not send images), the translate model (callers do not switch language mid-order), and xhigh reasoning (high already takes about 2.33 seconds to first audio, and anything slower kills the conversational feel).
What does 'configurable reasoning effort' mean for a phone order?
Five levels: minimal, low, medium, high, xhigh, with low as the default. Time-to-first-audio is roughly 1.12 seconds at minimal and 2.33 seconds at high. For most restaurant turns ('one Lumberjack Slam and a Coke') minimal or low is correct, because the answer is a tool call and a confirmation. For ambiguous turns ('actually make that two slams, but one with no eggs and the second with eggs over easy and rye instead of sourdough, and put the Coke under my wife's name') medium is what you want, because the model has to reconcile a multi-step modification before responding. The right architecture is per-turn effort selection, not a global setting.
What is the actual time-to-first-audio for gpt-realtime-2 on a phone call?
OpenAI publishes 1.12 seconds at minimal effort and 2.33 seconds at high effort, measured at the model. Add the carrier leg on G.711 over the public switched telephone network (typically 80 to 180 ms each way), plus your TTS time-to-first-byte if you use a separate TTS model for read-back. Practical numbers for a US PSTN caller on minimal effort sit around 1.3 to 1.5 seconds end-to-end. That is fine for the first turn (the caller has just finished speaking and expects acknowledgement) and starts to feel slow above 1.6 seconds mid-conversation, which is why per-turn effort selection matters more than a one-time benchmark number.
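The end-to-end range follows from simple addition. The sketch below assumes the carrier delay is paid twice (caller audio in, agent audio out) and that no separate TTS model is in the path unless a value is passed; the function and constant names are illustrative.

```python
# Published model-side time-to-first-audio, in seconds, as quoted above.
MODEL_TTFA_S = {"minimal": 1.12, "high": 2.33}
# Typical one-way carrier delay on the PSTN, in seconds, as quoted above.
CARRIER_ONE_WAY_S = (0.080, 0.180)


def end_to_end_range(effort: str, tts_ttfb_s: float = 0.0) -> tuple[float, float]:
    """Best-case and worst-case seconds to first audio as heard by the caller."""
    base = MODEL_TTFA_S[effort] + tts_ttfb_s
    fast, slow = CARRIER_ONE_WAY_S
    # ASSUMPTION: the one-way delay applies in both directions.
    return round(base + 2 * fast, 2), round(base + 2 * slow, 2)


minimal_range = end_to_end_range("minimal")  # (1.28, 1.48)
high_range = end_to_end_range("high")        # (2.49, 2.69)
```

The minimal-effort result of 1.28 to 1.48 seconds lines up with the 1.3 to 1.5 second practical range, and the high-effort result sits well above the 1.6 second point where a mid-conversation turn starts to feel slow.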
Does gpt-realtime-2 support SIP and Twilio for direct phone integration?
Yes. The OpenAI Realtime API now supports SIP, which means you can connect Twilio Elastic SIP Trunking, a Twilio Programmable SIP application, a PBX, a carrier trunk, or a desk phone directly to a gpt-realtime-2 session without bridging through a separate VoIP server. Twilio published a tutorial on the SIP Connector pattern, and the OpenAI Agents SDK has a Twilio extension for the warm-transfer flow (caller-with-context handoff to a human). For PieLine the SIP path replaces the older 'Twilio Media Streams over WebSocket' bridge in deployments where the customer wants direct trunk-to-model routing.
What about image input on gpt-realtime-2 for restaurants?
Image input is real and well-built for screen-sharing voice copilots (a user shows the model what is on their screen and keeps talking). For a restaurant phone line it is not the right tool. PSTN does not carry images. A customer cannot send a photo of a menu page mid-call. The use cases that occasionally come up (an operator or manager training the agent by feeding it menu photos) are better handled offline during onboarding, not over the live audio session. We use Vision separately for menu ingestion when a restaurant onboards, but never inside the live call path.
What about gpt-realtime-translate? Can callers order in their own language?
The translate model handles 70+ input languages and 13 output languages at $0.034 per minute. It is genuinely useful in waiting rooms, hotel lobbies, and multilingual support lines. For restaurant phone ordering it is rarely the right answer, because the menu is in one language, at most two; the staff who confirm the order at pickup speak those same languages; and the POS schema does not have multilingual modifier fields. Where multilingual genuinely matters (a Bay Area South Indian chain whose customers split between English, Tamil, and Hindi), we either run gpt-realtime-2 directly with multilingual instructions (it handles code-switching) or pair it with ElevenLabs v3 for cuisine-specific pronunciation, rather than using the translate model.
What does the 128K context window unlock for restaurant phones?
Three things that mattered. First, the full menu plus all modifier rules plus allergen flags plus POS item IDs fit in the prompt, so we no longer need a retrieval step before the first turn (the lookup that adds 200 to 400 ms before the model can speak). Second, multi-restaurant operators can fit their full per-location overrides in one session, so a single agent serves all 11 Mylapore locations with location-aware pricing and menu deltas. Third, the call transcript-so-far stays in context across long calls (a customer who calls back at 7:40 to add a side to the order they placed at 7:25), which means the agent can reconcile against the earlier turn instead of starting from scratch.
What changes for PieLine's pricing if OpenAI keeps shipping new model SKUs?
Nothing. PieLine pricing is $350/month for up to 1,000 calls, $0.50 per call after that, with a money-back guarantee for the first month. The underlying model SKU is our problem to absorb. Per-minute pricing from some competitors reprices on every OpenAI announcement and tends to spike exactly when the line is busiest. We hold the bill stable so the operator can budget against covers, not against a vendor's release schedule.
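The pricing shape reduces to a flat fee plus overage. A minimal sketch, using only the figures quoted above; the function name is illustrative and the money-back guarantee is not modeled.

```python
BASE_MONTHLY_USD = 350.00
INCLUDED_CALLS = 1_000
OVERAGE_PER_CALL_USD = 0.50


def monthly_bill(calls: int) -> float:
    """Monthly charge in USD for a given call volume."""
    extra_calls = max(0, calls - INCLUDED_CALLS)
    return BASE_MONTHLY_USD + extra_calls * OVERAGE_PER_CALL_USD


quiet_month = monthly_bill(800)    # 350.0
busy_month = monthly_bill(1_400)   # 550.0
```

Nothing in the calculation depends on call length or on which model SKU handled the call, which is the point of the pricing shape.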
Is gpt-realtime-2 reliable enough to take phone orders unattended?
On the four axes that matter for restaurant revenue (multichannel ASR accuracy, barge-in latency, modifier accuracy, POS commit latency), gpt-realtime-2 clears the same harness our existing pipeline clears, with measurable improvements on multi-modifier turns. It is not a drop-in product. To ship it under a restaurant's phone number you still need: a menu ontology mapped to your POS item IDs, a barge-in policy tuned for kitchen ambient noise, a confirmation-phrase library for your cuisine, a transfer policy for edge cases, payment handling for delivery, structured per-call records for audit, and a same-day onboarding loop. PieLine builds those on top of the model so the operator does not have to.
Related guides
New AI Voice Models for Restaurant Phone Ordering, April 2026
The April 2026 wave: gpt-realtime GA, Cartesia Sonic-3, ElevenLabs v3. Per-leg evaluation against the same reference call.
24/7 Phone Monitoring for Restaurants
The per-call shape PieLine writes, traced from the same 102.36 second reference recording.
Restaurant Voice AI Landscape 2026
An honest map of the restaurant voice AI market and how to evaluate vendors.