Voice agent behavior

Mid-order clarification on a restaurant phone call: when the AI asks, and when it refuses

Every guide on restaurant voice AI describes mid-order clarification in one line: “the agent asks follow-up questions and reads the order back.” That sentence is true and it explains nothing. Below is the actual behavior, timestamped from a recorded 102-second PieLine demo, including the moments the agent deliberately does not clarify.

Matthew Diakonov, Written with AI

Published May 15, 2026. 7 min read.

Direct answer (verified 2026-05-15)

How does the AI clarify mid-order on a phone call?

PieLine’s agent only stops to clarify when the POS marks a modifier group required (min_selections ≥ 1). It batches every required question for one item into one breath, enumerates every legal answer aloud so the caller never has to guess, and resolves free-form requests against existing POS modifier IDs rather than inventing new ones. Verified against the file src/components/voice-activity-data.ts in the PieLine codebase, which holds the Deepgram captions of the demo audio at aiphoneordering.com.

The numbers that matter

Four measurements from the 102-second call. Every value can be reproduced from the transcription file named above.

6.6s Time to start clarifying after the caller stops

1 Breath used to ask both required modifiers for the Lumberjack Slam

4 Bread options enumerated aloud (no guessing required)

102s Full call from greet to total, with mid-order clarification

The 102-second call, with the clarification moments marked

This is the demo transcript with timestamps to the hundredth of a second. Some caller turns are not shown, including the cheesecake order and the confirmation at 85.78 seconds. The annotated turns are the ones doing clarification work: the bridge pause, the batched required-modifier ask, the brand-name remap, the free-form add-on, and the final read-back.

00:00.40Agent

Hi. This is Denny on a recorded line. What can we get for you?

00:05.84Caller

Can I get one Lumberjack Slam and one Coke?

00:10.96Agent

One moment, please.

Bridge turn. Agent has parsed the items, looked up the modifier groups, and queued the questions. It does not start talking back the millisecond the caller stops. The pause gives the caller room to add a second item.

00:15.98Agent

For your Lumberjack Slam, how would you like your eggs cooked, and what kind of bread would you like? White, brown, multigrain, or sourdough?

Two required modifier groups for one item, batched into one breath. Every legal option for the bread group enumerated aloud. The agent does not ask about the Coke because that item has zero required groups.

00:23.11Agent

For your Coke, I'll add a soft drink to your order.

Not a question. The caller said a brand name; the POS only has a generic 'soft drink' item. The agent surfaces the remap out loud so the receipt does not surprise anyone.

00:29.39Caller

I'll go with sourdough bread. And for the eggs, scrambled.

00:38.04Agent

So that's a Lumberjack Slam with scrambled eggs and sourdough bread, plus a soft drink. Anything else for your order?

Mid-call confirmation. Uses the POS's display names ("sourdough", "scrambled"), so the kitchen ticket will read the same words the caller heard.

00:65.98Caller

Can you add strawberries, if that's an option?

00:71.34Agent

You got it. One slice of New York style cheesecake with strawberry topping.

Free-form add-on. The agent does not invent strawberries; it resolves the word against the cheesecake item's modifier list, finds 'mod_topping_strawberry', and confirms back using the POS's group name ('topping'). If the modifier had not existed, the agent would have offered the closest real option instead.

00:75.42Agent

Can I confirm your order is a Lumberjack Slam with scrambled eggs and sourdough bread, a soft drink, and a New York style cheesecake with strawberry topping? Is that correct?

Final read-back. The caller has one last chance to catch a mishear before the cart posts to the POS.

00:89.12Agent

Placing your order now.

00:91.52Agent

Done. Your total is $34.11, and your order will be ready for pickup at 12:45 AM.

The $34.11 total comes from the POS, not the LLM. The 2.4-second gap between 'Placing' and 'Done' is the round-trip to the POS's order endpoint.

Timestamps from src/components/voice-activity-data.ts (Deepgram multichannel transcription of public/audio/dennys-order.mp3). Play the same audio inside the demo widget at aiphoneordering.com.

The five rules behind the choreography

The behavior above is not negotiated in prompt land. Each rule maps to a property of the POS catalog the agent ingested during onboarding, or to a constraint baked into the tool surface the agent is allowed to call.

1. Clarify only when the POS demands it.
For each item the caller orders, look at the modifier groups. If a group has min_selections ≥ 1, the cart cannot be valid until an answer is collected. That is the only thing that triggers a question. On the demo the Lumberjack Slam has two such groups (eggs prep, bread choice) and the Coke has zero. The agent asks two questions about the slam and zero about the Coke. Optional modifiers are not surfaced unless the caller brings them up.
2. Batch every required question for one item into one breath.
The agent does not ask eggs, wait, then ask bread. It asks both together at 15.98 seconds: “how would you like your eggs cooked, and what kind of bread would you like?” A real caller can hold two choices in their head and reply with both. The agent listens for either word in the caller’s response and slots whichever appears, in whatever order they say it. This roughly halves the clarification turns per item.
3. Enumerate every legal answer aloud.
For the bread group the agent lists every option: “white, brown, multigrain, or sourdough?” A caller who has never eaten at this restaurant should not have to guess what the kitchen carries. The option list comes verbatim from the POS modifier group, so the names the caller hears are the names the kitchen ticket will print.
4. Resolve free-form requests against the existing modifier list. Never invent.
At 65.98 seconds the caller asks “can you add strawberries, if that’s an option?” The agent does not guess a price or fabricate a topping ID. It checks whether the cheesecake item has a strawberry modifier in any of its groups, finds mod_topping_strawberry, and confirms at 71.34 seconds using the POS’s own display name. If strawberries had not existed in the catalog, the agent would have said so and offered the closest real option from the same group. The tool surface refuses unmapped strings.
5. Read the full order back before placing it.
At 75.42 seconds the agent reads every item and every modifier in the order they will appear on the receipt: “a Lumberjack Slam with scrambled eggs and sourdough bread, a soft drink, and a New York style cheesecake with strawberry topping. Is that correct?” The caller confirms at 85.78 seconds, and the cart posts at 89.12 seconds. The 2.4 seconds between “Placing your order now” and “Done. Your total is $34.11” is the POS round-trip. The total is the POS’s number, not the agent’s.
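
Rules 1 through 3 can be sketched in a few lines of TypeScript. This is a minimal illustration, not PieLine's code: the MenuItem and ModifierGroup shapes, the function names, the egg options, and the optional "extras" group are all assumptions made for the example. Only min_selections and the four bread options come from the demo.

```typescript
// Hypothetical catalog shapes. Only min_selections comes from the article.
interface ModifierGroup {
  name: string;           // what the agent calls the group aloud
  min_selections: number; // >= 1 means the cart is invalid until answered
  options: string[];      // POS display names, in POS order
}
interface MenuItem {
  displayName: string;
  groups: ModifierGroup[];
}

// Rule 1: only groups the POS marks as required trigger a question.
function requiredGroups(item: MenuItem): ModifierGroup[] {
  return item.groups.filter((g) => g.min_selections >= 1);
}

// Rule 3: say every legal option, as "a, b, or c".
function spokenList(names: string[]): string {
  if (names.length < 3) return names.join(" or ");
  return `${names.slice(0, -1).join(", ")}, or ${names[names.length - 1]}`;
}

// Rule 2: one turn per item that covers every required group.
// Returns null when nothing is required, so the agent stays quiet.
function buildClarification(item: MenuItem): string | null {
  const groups = requiredGroups(item);
  if (groups.length === 0) return null;
  const asks = groups.map((g) => `${g.name}: ${spokenList(g.options)}`);
  return `For your ${item.displayName}, please choose ${asks.join("; ")}.`;
}

const lumberjackSlam: MenuItem = {
  displayName: "Lumberjack Slam",
  groups: [
    // Egg options are invented for the example.
    { name: "eggs", min_selections: 1, options: ["scrambled", "over easy", "sunny side up"] },
    { name: "bread", min_selections: 1, options: ["white", "brown", "multigrain", "sourdough"] },
    // Optional group (invented): never asked about unprompted.
    { name: "extras", min_selections: 0, options: ["cheese"] },
  ],
};
const softDrink: MenuItem = { displayName: "soft drink", groups: [] };

const slamQuestion = buildClarification(lumberjackSlam); // one batched turn
const drinkQuestion = buildClarification(softDrink);     // null: nothing to ask
```

The sketch enumerates options for every required group. In the recorded demo the agent enumerates the bread options and asks the egg question open-ended, so the real phrasing logic is evidently more nuanced than this.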

The decision tree, in one diagram

Every turn the agent takes follows the same skeleton. The branches that matter are the required-group check (rule 1) and the modifier-list resolution (rule 4).

From caller's words to POS cart

1. Caller speaks.
Words land in the agent's transcript.
2. Parse items.
Match phrases against the menu's item IDs.
3. Required group?
Check POS modifier groups for min_selections ≥ 1.
4. Ask, batched.
All required questions per item, in one breath.
5. Read back.
Use the POS's display names so the kitchen ticket matches.
6. Post to POS.
Cart, total, and taxes come from the POS, not the LLM.
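
One way to read the skeleton above is as a function that looks at the cart and returns the next thing the agent should do. The sketch below assumes a simplified cart shape; CartLine, Action, and nextAction are hypothetical names, not PieLine's API.

```typescript
// Hypothetical cart line: which groups are required, which are answered.
interface CartLine {
  itemId: string;
  requiredGroups: string[];        // names of groups with min_selections >= 1
  answers: Record<string, string>; // group name -> chosen modifier ID
}

type Action =
  | { kind: "ask"; itemId: string; groups: string[] } // batched per item
  | { kind: "read_back" }
  | { kind: "post_to_pos" };

// Walk the cart in order. The first item with unanswered required groups
// produces one batched ask. If nothing is missing, read the order back,
// and only post once the caller has confirmed the read-back.
function nextAction(cart: CartLine[], callerConfirmed: boolean): Action {
  for (const line of cart) {
    const missing = line.requiredGroups.filter((g) => line.answers[g] === undefined);
    if (missing.length > 0) {
      return { kind: "ask", itemId: line.itemId, groups: missing };
    }
  }
  return callerConfirmed ? { kind: "post_to_pos" } : { kind: "read_back" };
}

// Item and modifier IDs below are invented for the example.
const openCart: CartLine[] = [
  { itemId: "item_lumberjack_slam", requiredGroups: ["eggs", "bread"], answers: {} },
  { itemId: "item_soft_drink", requiredGroups: [], answers: {} },
];
const answeredCart: CartLine[] = [
  {
    itemId: "item_lumberjack_slam",
    requiredGroups: ["eggs", "bread"],
    answers: { eggs: "mod_eggs_scrambled", bread: "mod_bread_sourdough" },
  },
  { itemId: "item_soft_drink", requiredGroups: [], answers: {} },
];

const first = nextAction(openCart, false);     // ask about eggs and bread
const second = nextAction(answeredCart, false); // read_back
const third = nextAction(answeredCart, true);   // post_to_pos
```

Keeping the decision in one pure function makes the "no question unless a required group is open" property easy to test without any audio in the loop.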

Why a human cashier’s clarifications get worse during rush

A line cook who steps to the phone with three tables waiting and a printer chattering behind them is under acute time pressure. Two predictable failure modes follow.

They skip the clarification entirely. The order goes in with a default (scrambled eggs, white bread) because asking would cost 15 seconds and the line at the counter would notice. The kitchen makes the default. The caller gets the wrong eggs at pickup, the restaurant comps the meal, and the caller may or may not come back. National Restaurant Association data suggests roughly a third of customers do not return after a wrong order.

They guess what the menu allows. Asked “can you add strawberries to the cheesecake?”, the cashier who is not standing at the prep station may say yes when the answer is no, or say no when the answer is yes. The customer either receives a bill for a topping the kitchen does not have, or misses out on something that was always available.

The agent does not have either failure mode because it does not feel time pressure and it can only call the modifier IDs the POS gave it. PieLine production deployments run at 95%+ order accuracy versus roughly 75% for staff handling phone orders during the dinner rush; the gap is almost entirely about clarification consistency.

Anchor fact

The first clarification batch fires at 15.98 seconds, about 6.6 seconds after the caller’s last word at 9.36 seconds, with a single bridge turn at 10.96 seconds: “One moment, please.”

Both required modifiers for the Lumberjack Slam (eggs prep, bread choice) are spoken in one breath, with all four bread options enumerated. The caller’s reply lands at 29.39 seconds. End-to-end per-item clarification turnaround: about 19.4 seconds. That is the budget every page on this topic has been quietly avoiding, because none of them have a public demo audio to count from.
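
The gaps quoted on this page are plain subtraction over caption timestamps. The sketch below recomputes three of them. The Caption shape and helper names are assumptions, and the caption texts are shortened; the timestamps are the ones quoted on this page.

```typescript
// Hypothetical caption shape, reduced to the two fields this example needs.
interface Caption {
  text: string;
  start: number; // seconds from the beginning of the audio
}

const agentCaptions: Caption[] = [
  { text: "One moment, please.", start: 10.96 },
  { text: "For your Lumberjack Slam, how would you like your eggs cooked", start: 15.98 },
  { text: "Placing your order now.", start: 89.12 },
  { text: "Done. Your total is $34.11", start: 91.52 },
];

// End of the caller's "Can I get one Lumberjack Slam and one Coke?"
const callerLastWord = 9.36;

// Difference in seconds, rounded to hundredths to hide floating-point noise.
function secondsBetween(from: number, to: number): number {
  return Math.round((to - from) * 100) / 100;
}

// Start time of the first agent caption that begins with the given phrase.
function startOf(phrase: string): number {
  const hit = agentCaptions.filter((c) => c.text.indexOf(phrase) === 0)[0];
  if (!hit) throw new Error(`no caption starts with "${phrase}"`);
  return hit.start;
}

const gapToBridge = secondsBetween(callerLastWord, startOf("One moment"));               // 1.6
const gapToClarification = secondsBetween(callerLastWord, startOf("For your Lumberjack")); // 6.62
const posRoundTrip = secondsBetween(startOf("Placing"), startOf("Done"));                // 2.4
```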

Where the agent deliberately stays quiet

Bad mid-order clarification often shows up not as a missed question but as a question asked when none was needed. Common false positives the agent is configured to skip:

Single-SKU drinks. A Coke does not need clarification; it is one item with no modifiers. The agent says “I’ll add a soft drink” to surface the brand-to-SKU remap and moves on.
Optional add-ons the caller did not mention. “Would you like cheese on that?” is dead time if the caller did not bring it up. Optional modifiers are surfaced only when the caller asks or when the upsell hook is enabled for that item.
Cross-item assumptions. Two of the same item ordered (“two Lumberjack Slams”) do not generate one question per slam if the caller has already answered both. The agent reuses the answer set across identical items and only re-asks if the caller signals they want them different.
Confirmation theater. The agent does not re-confirm every line as it is added. There is one mid-call check (“Anything else for your order?”) and one full read-back at the end. That is the budget. More than that and the call drags.
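
The single-SKU case can be sketched as an alias lookup that produces a statement instead of a question. Everything named here is an assumption for illustration: the CatalogItem shape, the item IDs, the alias table, and the "pepsi" entry are invented, and the real agent's matching is presumably richer than an exact-string lookup.

```typescript
// Hypothetical catalog entry; IDs are invented for the example.
interface CatalogItem {
  id: string;
  displayName: string;
  requiredGroupCount: number; // groups with min_selections >= 1
}

const catalog: CatalogItem[] = [
  { id: "item_soft_drink", displayName: "soft drink", requiredGroupCount: 0 },
  { id: "item_lumberjack_slam", displayName: "Lumberjack Slam", requiredGroupCount: 2 },
];

// Brand names callers say, mapped to the generic item the POS carries.
const brandAliases: Record<string, string> = {
  coke: "item_soft_drink",
  pepsi: "item_soft_drink",
};

// Try the POS display name first, then fall back to the alias table.
function findItem(spoken: string): { item: CatalogItem; remapped: boolean } | null {
  const key = spoken.trim().toLowerCase();
  const direct = catalog.filter((i) => i.displayName.toLowerCase() === key)[0];
  if (direct) return { item: direct, remapped: false };
  const aliasId = brandAliases[key];
  const aliased = catalog.filter((i) => i.id === aliasId)[0];
  return aliased ? { item: aliased, remapped: true } : null;
}

// A remapped item with no required groups gets a statement, not a question.
// Every other case returns null and is left to the normal flow.
function remapAnnouncement(spoken: string): string | null {
  const match = findItem(spoken);
  if (!match || !match.remapped || match.item.requiredGroupCount > 0) return null;
  return `For your ${spoken}, I'll add a ${match.item.displayName} to your order.`;
}

const cokeTurn = remapAnnouncement("Coke");
// "For your Coke, I'll add a soft drink to your order."
```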

See it clarify your menu

Bring a real menu item with five-plus modifier groups. We will run a live call against the demo agent and you can hear how it batches the questions for your dishes.

Frequently asked questions

What does "mid-order clarification" actually mean on a restaurant phone call?

It is any moment where the agent stops adding to the cart and asks the caller a question about an item already in flight. The most common cause is a required modifier group: the POS has marked 'how would you like your eggs' as min_selections=1, so the agent must collect an answer before the cart is valid. On PieLine's public demo at aiphoneordering.com the first clarification fires at 15.98 seconds (timestamp from src/components/voice-activity-data.ts), about 6.6 seconds after the caller finished the Lumberjack Slam request at 9.36 seconds.

Does the agent interrupt the caller to clarify mid-sentence?

No. On the Denny's demo the agent waits 6.6 seconds after the caller's last word before it begins the clarification batch, and it inserts a single 'One moment, please' at 10.96 seconds as a bridge. The pause exists for two reasons: a real customer often adds a second item ("...and one Coke") after their first thought, and silence detection in the streaming transcriber needs a beat to confirm the caller is done. The agent does not start mid-syllable.

How does PieLine choose which questions to ask?

It does not improvise. The POS (Toast, Square, Clover, NCR Aloha, Revel, and 50+ others) defines each menu item with a list of modifier groups, and each group carries a 'required' flag plus min/max selections. The agent enumerates only the required groups for items the caller actually ordered. Optional groups (extra cheese, side of ranch, drink upgrade) are surfaced only if the caller brings them up first. This is why the demo agent asks two questions about the Lumberjack Slam (eggs and bread, both required) and zero questions about the Coke (zero required groups; it is a single SKU).

Why batch multiple questions for one item into one breath?

Two reasons. First, call length: asking eggs then waiting for an answer then asking bread doubles the back-and-forth. The Denny's demo bundles 'how would you like your eggs cooked, and what kind of bread would you like? White, brown, multigrain, or sourdough?' into a single ~7-second turn at 15.98 to 23.11 seconds. Second, cognitive load: the caller can think about both choices at the same time and answer them in any order. The agent listens for both answers in the caller's reply and slots whichever appears, in whatever order. On the demo the caller said 'sourdough... scrambled' at 29.39 to 35.15 seconds and the agent slotted both correctly.
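
Slotting answers "in whatever order" can be approximated by scanning the reply for each group's option names. This is a deliberately naive sketch under stated assumptions: the Group shape and function names are hypothetical, the egg options are invented, and it ignores negation ("not white") and mishears that a production system would have to handle.

```typescript
// Hypothetical required group: a name plus the POS display names of its options.
interface Group {
  name: string;
  options: string[];
}

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// For each group, take the first option that appears as a whole word or
// phrase in the reply. The order in which the caller says things is irrelevant.
function slotAnswers(reply: string, groups: Group[]): Record<string, string> {
  const filled: Record<string, string> = {};
  for (const group of groups) {
    for (const option of group.options) {
      const pattern = new RegExp(`\\b${escapeRegExp(option)}\\b`, "i");
      if (pattern.test(reply)) {
        filled[group.name] = option;
        break;
      }
    }
  }
  return filled;
}

// Groups the reply did not cover, which would need a follow-up question.
function unanswered(groups: Group[], filled: Record<string, string>): string[] {
  return groups.filter((g) => filled[g.name] === undefined).map((g) => g.name);
}

const slamGroups: Group[] = [
  { name: "eggs", options: ["scrambled", "over easy", "sunny side up"] }, // invented options
  { name: "bread", options: ["white", "brown", "multigrain", "sourdough"] },
];

const demoReply = "I'll go with sourdough bread. And for the eggs, scrambled.";
const filledFromDemo = slotAnswers(demoReply, slamGroups);
// { eggs: "scrambled", bread: "sourdough" }
const partial = slotAnswers("Scrambled, please.", slamGroups);
const stillNeeded = unanswered(slamGroups, partial); // ["bread"]
```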

What happens when the caller asks for something not on the menu?

The agent never invents a modifier ID. At 65.98 seconds the demo caller asks 'can you add strawberries, if that's an option?'. The agent checks whether the cheesecake item has a strawberry modifier in any of its groups, finds 'mod_topping_strawberry', and confirms back at 71.34 seconds: 'You got it. One slice of New York style cheesecake with strawberry topping.' If strawberries had not existed in the POS, the agent would have said so and offered the closest real option from the same modifier group rather than promising a topping the kitchen does not carry. This is enforced architecturally: the agent's tool surface only accepts modifier IDs the POS returned during onboarding.
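
The "never invent" constraint amounts to a lookup that can only return IDs already in the catalog. The sketch below assumes hypothetical Item, Group, and Modifier shapes and a crude plural-stripping matcher. Only mod_topping_strawberry comes from the demo; the caramel entry is invented. The article says the agent offers the closest real option from the same group; how "closest" is decided is not specified, so this sketch simply returns every real option on the item.

```typescript
interface Modifier { id: string; displayName: string; }
interface Group { name: string; modifiers: Modifier[]; }
interface Item { displayName: string; groups: Group[]; }

type Resolution =
  | { kind: "found"; modifierId: string; group: string }
  | { kind: "not_found"; alternatives: string[] };

// Crude singular form: "strawberries" -> "strawberry", "walnuts" -> "walnut".
function singular(word: string): string {
  const w = word.trim().toLowerCase();
  if (w.length > 3 && w.slice(-3) === "ies") return w.slice(0, -3) + "y";
  if (w.length > 1 && w.slice(-1) === "s") return w.slice(0, -1);
  return w;
}

// Resolve a spoken word against the item's existing modifiers.
// The result is either an ID the catalog already holds or a list of real options.
function resolveModifier(item: Item, phrase: string): Resolution {
  const wanted = singular(phrase);
  const alternatives: string[] = [];
  for (const group of item.groups) {
    for (const mod of group.modifiers) {
      if (singular(mod.displayName) === wanted) {
        return { kind: "found", modifierId: mod.id, group: group.name };
      }
      alternatives.push(mod.displayName);
    }
  }
  return { kind: "not_found", alternatives };
}

const cheesecake: Item = {
  displayName: "New York style cheesecake",
  groups: [
    {
      name: "topping",
      modifiers: [
        { id: "mod_topping_strawberry", displayName: "strawberry" },
        { id: "mod_topping_caramel", displayName: "caramel" }, // invented
      ],
    },
  ],
};

const strawberries = resolveModifier(cheesecake, "strawberries"); // found
const blueberries = resolveModifier(cheesecake, "blueberries");   // not_found
```

Because Resolution has no variant that carries a free-form string as an ID, a caller of this function cannot post an unmapped topping even by accident.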

How is the agent's clarification accuracy measured?

Three signals. (1) End-of-call read-back match: did the caller confirm 'yes' to the final order summary? On the demo this happens at 75.42 seconds and the caller confirms at 85.78 seconds. (2) POS acceptance: did the cart post without a validation error? (3) Post-pickup ticket reconciliation: did the customer report a wrong item? PieLine ships at 95%+ order accuracy in production, compared to roughly 75% for staff handling phone orders during peak rush, where humans frequently skip clarification because the line is backed up.

Why does the agent confirm the order back before posting?

Because the cheapest moment to fix a mishear is before the kitchen prints. The agent's final read-back at 75.42 seconds in the demo lists every item and every modifier in the order they will appear on the receipt: 'a Lumberjack Slam with scrambled eggs and sourdough bread, a soft drink, and a New York style cheesecake with strawberry topping'. A caller who hears one wrong word ('sunny side up' instead of 'scrambled') can correct it in one turn. Once the order has fired to the kitchen, fixing it costs a remake, food waste, and a comp.
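
The read-back is mechanical once the cart holds display names. A minimal sketch, assuming a hypothetical ReadBackLine shape in which each modifier already carries its spoken phrase (for example "scrambled eggs"); the function names are illustrative.

```typescript
// Hypothetical cart line for read-back: names exactly as the receipt prints them.
interface ReadBackLine {
  displayName: string;
  modifiers: string[]; // spoken phrases, e.g. "scrambled eggs"
}

// "x", "x and y", or "x, y, and z".
function joinAnd(parts: string[]): string {
  if (parts.length < 3) return parts.join(" and ");
  return `${parts.slice(0, -1).join(", ")}, and ${parts[parts.length - 1]}`;
}

// "a Lumberjack Slam with scrambled eggs and sourdough bread"
function describeLine(line: ReadBackLine): string {
  const article = /^[aeiou]/i.test(line.displayName) ? "an" : "a";
  const base = `${article} ${line.displayName}`;
  return line.modifiers.length > 0 ? `${base} with ${joinAnd(line.modifiers)}` : base;
}

// Items are read in cart order, which is the order the receipt will use.
function readBack(cart: ReadBackLine[]): string {
  return `Can I confirm your order is ${joinAnd(cart.map(describeLine))}? Is that correct?`;
}

const demoCart: ReadBackLine[] = [
  { displayName: "Lumberjack Slam", modifiers: ["scrambled eggs", "sourdough bread"] },
  { displayName: "soft drink", modifiers: [] },
  { displayName: "New York style cheesecake", modifiers: ["strawberry topping"] },
];

const finalReadBack = readBack(demoCart);
```

With the demo cart, finalReadBack reproduces the agent's turn at 75.42 seconds word for word.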

Where can I verify the timing and behavior described on this page?

Two places. The audio file is at public/audio/dennys-order.mp3 in the PieLine codebase, and the same file plays inside the demo widget at https://aiphoneordering.com, so the timings are reproducible without booking a call. The Deepgram multichannel transcription lives at src/components/voice-activity-data.ts with start and end fields per caption. Search for 'how would you like your eggs cooked' (15.98s), 'white, brown, multigrain, or sourdough' (19.18s), 'can you add strawberries' (65.98s), and 'placing your order now' (89.12s).
