Forensic walkthrough / one call / 102.36 seconds

An AI phone calling agent, walked second by second through one real restaurant order

Every other guide on this topic ranks horizontal platforms by latency, naturalness, and per-minute price. None of them publish a call you can listen to. So we did. The audio is at /public/audio/dennys-order.mp3. The captions, with millisecond timestamps, are at /src/components/voice-activity-data.ts. What follows is the four moments in those 102.36 seconds that decide whether the system actually works inside a restaurant.

M
Matthew Diakonov
9 min read

Moment 1 / t=5.84 to t=15.985

The agent silently corrects a misspoken dish name

The caller opens with: “Can I get one lumberjack slimand one Coke?” That is what the audio says. Run the file through Deepgram and that is what comes back as text. Open the captions array and you can see it on row five.

The agent's first response, ten seconds later, opens with “For your lumberjack slam, how would you like your eggs cooked?” The caller never said “slam.” The agent committed the correction silently inside a 7.025 second think pause, then proceeded straight into modifier intake as if the caller had ordered the dish by its real name from the start.

That moment, and not the LLM, and not the voice quality, and not the latency budget, is the part that determines whether an AI phone calling agent works inside a restaurant. Callers do not speak menu strings. They speak approximations of what they half-read on the takeout site three nights ago. The agent has to translate.

src/components/voice-activity-data.ts
verify it yourself

Moment 2 / t=65.985 to t=72.145

A free-form modifier request gets normalized to a real POS modifier

At t=65.985 the caller says “Can you add strawberries, if that's an option?” Note the hedge. The caller does not know the menu well enough to know whether strawberries are a thing on the cheesecake. They are asking, not ordering.

At t=71.345 the agent answers “You got it.” By t=72.145 it has restated “One slice of New York style cheesecake with strawberry topping.” Two quiet things happened in those six seconds. The agent looked up whether the cheesecake item has a strawberry-topping modifier in its modifier tree (it does, that is why the menu has it). Then it normalized the caller's plural “strawberries” to the modifier's canonical singular name. The kitchen printer reads strawberry topping. The cashier reads it. The cook reads it. They all see the menu's word, not the caller's word.

A horizontal AI calling agent platform will gladly transcribe “strawberries” and pass it on. What it cannot do, on its own, is decide whether the modifier exists, attach it to the right line item, and rewrite it to the canonical POS string. That decision lives in the dish ontology, which is restaurant-specific and lives in our onboarding pipeline, not in the speech stack.

Moment 3 / t=10.96 and again at t=37.56

The pause length is dynamic, and that is on purpose

The default benchmark for “natural” conversation in the horizontal voice agent guides is sub-600ms response latency. That benchmark is correct as a floor and wrong as a ceiling. On this call the agent's gaps are 1.6, 5.0, and 2.4 seconds, in that order. None of them feel awkward, because the long ones are filled.

At t=10.96, after the caller stops speaking, the agent says “One moment, please.” That is a verbal placeholder that buys the system the time it needs to do the menu lookup and fill the modifier slots before talking again. Without it, the caller would be staring at five seconds of silence. With it, they are not.

At t=37.56, after the caller's modifier choices land, the agent says “Great” in 480ms and rolls straight into the partial readback. Short turn, short wait. Long turn, verbal hold. The threshold is dynamic and the floor is set by what the agent has to look up, not by a fixed 600ms target.

Call trace, t=0 to t=102.36

CallerAgentMenu / POSt=0.00 Hi. This is Denny on a recorded line.t=5.84 Can I get one lumberjack slim and one Coke?t=10.96 hold the line, look up 'lumberjack slim't=15.94 match: Lumberjack Slam (eggs, bread)t=15.985 How would you like your eggs and bread?t=29.39 Sourdough bread. Scrambled eggs.t=37.56 Read back: Slam, scrambled, sourdough, soft drink. More?t=45.96 No, that's it. Name is Rob.t=52.52 Add a slice of NY-style cheesecake?t=63.10 Sure. Add strawberries if possible.t=68.30 modifier lookup: strawberry topping?t=71.30 yes, modifier exists on cheesecake itemt=72.145 Cheesecake with strawberry topping, confirmed.t=75.42 Final readback, full order. Is that correct?t=85.78 Yeah. That's right.t=89.12 commit ticket to POSt=92.00 $34.11. Pickup 12:45 AM.t=101.10 Thanks. Bye.

Moment 4 / t=75.42 to t=85.78

The full readback closes the loop on the whole compound order

The agent waits until the order is fully assembled (item, all modifiers, name, optional upsell, optional free-form modifier graft), then reads the entire string back in one breath at t=75.42. The caller has 8.64 seconds of agent voice to interrupt with a correction. They do not. They say “Yeah. That's right” at t=85.78.

That is the difference between the order being right at the kitchen printer and the order being right after a refund. A human cashier might or might not read the order back; a reliable phone agent does it every call, regardless of queue depth, and waits for the affirmative before committing. Only after the affirmative does the agent fire “Placing your order now” at t=89.12. The commit step is gated on the readback, not on the original intake.

Why the menu layer is where this stands or falls

Caller says 'lumberjack slim'. Speech-to-text returns 'lumberjack slim'. The dialogue manager has nothing to compare it to. The agent either asks the caller to repeat (slowing the call, frustrating the caller) or commits 'lumberjack slim' verbatim. The kitchen printer prints a ticket for an item that does not exist. A staff member walks the floor to figure out what the caller meant. If they can't, they call back, and the order is lost.

  • no menu mapping at the agent layer
  • callers must speak the exact menu string
  • tickets for non-existent items hit the kitchen
  • modifier slots ('eggs cooked, what kind of bread') are not auto-populated

What this means if you are evaluating an AI phone calling agent for a restaurant

Most of what is written about AI phone calling agents is written about horizontal platforms (Bland, Retell, Vapi, Synthflow, Lindy, Cognigy, PolyAI). Those platforms are real and they are good. They sell a primitive: a clean speech stack, a fast LLM, a telephony layer, a webhook back to whatever you want. If your team has the engineering bandwidth to build the menu layer on top of one of those primitives, that is a valid path. Bland in particular has a strong developer story for outbound at scale.

The work this page describes, the dish ontology, the alias and phonetic table, the modifier tree, the POS item mapping, the normalization of caller words to canonical modifier names, the gating of commit on a full readback, none of that ships with a horizontal platform. It is built per restaurant, by hand, the first time. After that it is maintained week to week as the menu changes.

That work is what we sell. PieLine ships the call you just heard, for your menu, on your POS, in under 24 hours of onboarding work (mostly on our side; for the owner it is hands-off). It handles up to 20 simultaneous calls during peak. The price is $350 per month for the first 1,000 calls, then $0.50 per call. The first month carries a money-back guarantee.

The experience was better than speaking to a human. No hold time, no confusion, no rushing.
C
Caller
recorded customer feedback, Idly Express

What an AI phone calling agent will not do, and that is fine

The agent will not invent a price for a menu item it has never seen. It will not invent an allergen fact. It will not pretend to know whether the kitchen is out of an item; that requires a live POS inventory feed and the absence of one is the right reason to escalate. It will not handle a complaint about last night's order; that is also a named escalation trigger.

When any of those conditions hits, the agent transfers the call with the full transcript attached, so the manager who picks up the line is not starting from zero. The transferred-call rate on live deployments runs under 10%. The other 90%-plus runs end to end on the agent.

Want to hear this on your menu, on your POS?

Book a 15 minute call. We will run a recorded demo with three of your menu's hardest-to-spell items and show you the same trace, second by second.

Frequently asked questions

What is an AI phone calling agent, in plain terms?

A software agent that picks up a phone line, holds a back-and-forth conversation in natural speech, extracts what the caller wants, takes an action against a backend system, and either commits or escalates. In a restaurant the backend system is a POS (Clover, Square, Toast, NCR Aloha, Revel) and the action is firing a line-itemized ticket to the kitchen. The reference call this page walks through is 102.36 seconds long and ends with a $34.11 order on the kitchen printer.

Where can I see this call as data, not just hear it?

Open /src/components/voice-activity-data.ts in this repo. It is a TypeScript file with 47 captions, each one a {speaker, start, end, text} record produced by Deepgram multichannel transcription. The audio it was generated from is at /public/audio/dennys-order.mp3. The build script is /scripts/build-voice-activity-data.py. You can grep for any caller utterance and get back the exact second the agent responded, plus what the agent said.

What actually happens at the 'lumberjack slim' moment, technically?

The caller's audio frames between t=5.84 and t=8.96 hit the speech-to-text layer and come back as the literal token sequence 'one lumberjack slim and one Coke'. The agent does not pass that string to the dialogue manager unmodified. It runs it against the dish ontology built during onboarding (a list of POS item IDs, display names, ingredients, modifier trees, and aliases for the restaurant's actual menu). 'Lumberjack slim' has a high phonetic-edit-distance match to 'Lumberjack Slam', which is on the menu, while 'Lumberjack Slim' is not. The agent commits the match silently, fills the modifier slots that 'Lumberjack Slam' demands (eggs cooked, bread choice), and emits the response at t=15.985. The 7.025 second gap between the caller's last word and the agent's first response is the round trip plus the ontology lookup.

Why is the dish-name mapping the part that decides whether this works in a restaurant?

Because callers do not speak menu strings. They speak approximations, abbreviations, mishearings of menus they read on a takeout site three days ago, and free-form modifier requests. A horizontal AI calling agent platform (Bland, Retell, Vapi, Synthflow) gives you a great speech stack and an LLM, but the LLM has no idea what is on the menu unless you bring that ontology yourself. Without the mapping, 'lumberjack slim' becomes a clarifying question ('I'm sorry, can you repeat that?') or, worse, an unrecoverable order. With the mapping, it becomes a kitchen ticket.

What does the call do when the caller asks for something free-form, like 'can you add strawberries'?

Watch t=65.985 in the file. The caller says 'Can you add strawberries, if that's an option?' At t=71.345 to t=72.145 the agent says 'You got it.' At t=72.145 the agent confirms 'One slice of New York style cheesecake with strawberry topping.' Two things happened in between. First, the agent looked up whether the cheesecake item has a strawberry topping modifier in its modifier tree (it does). Second, it normalized 'strawberries' (caller's plural) to 'strawberry topping' (the modifier's canonical name on the POS). Both of those happen because the dish ontology carries the modifier tree, not just the item.

How long are the agent's pauses, and why do they vary?

The agent's first response after the order arrives is at t=10.96 ('One moment, please'), about 1.6 seconds after the caller stops speaking. The actual modifier question lands at t=15.985, another 5 seconds later. After the caller's modifier choices end at t=35.15, the agent confirms at t=37.56, a 2.41 second gap. The variation is intentional: short turns get short waits, longer think-required turns get a verbal placeholder ('one moment, please') to hold the line. A horizontal voice agent guide will tell you that latency under 600ms is the threshold for 'natural' conversation; in practice, the threshold is dynamic and the floor is set by what the agent has to look up.

What if the caller asks for something not on the menu at all?

The agent escalates. Novel menu items not yet uploaded are a named edge case in the escalation contract. The call is transferred to a manager with the full transcript attached. The agent does not invent a price for a dish it has never seen, and it does not claim an allergen fact it cannot verify. A no-readback cashier will sometimes guess the price; the agent will not.

How does this compare to a horizontal AI calling agent platform like Bland, Retell, or Vapi?

Bland, Retell, Vapi, and the rest sell the speech and dialogue primitive. They give you a great voice, a fast LLM, telephony, and a webhook to your backend. They do not give you the dish ontology, the menu scrape, the modifier tree, or the POS item ID mapping for your restaurant. PieLine is the layer above: a working AI phone calling agent for restaurants, with the menu mapping built in. If you have the engineering team to build the menu layer yourself on top of one of those primitives, that is a valid path. If you run a restaurant and just want the phone to work tonight, that is what we sell.

What does pricing look like, and what is the catch?

$350 per month for up to 1,000 calls, $0.50 per call after that. Onboarding (menu scrape, POS mapping, dish ontology build) is hands-off for the owner. There is a money-back guarantee for the first month. The catch is honest: setup work happens on our side before the first call, so we do not offer a free trial; we offer the refund instead. Active call monitoring runs during month one to refine the menu prompts.

What if my menu is more complex than a Denny's-style breakfast menu?

The reference call is more complex than it looks. The Lumberjack Slam line item alone forces three modifier choices (eggs cooked, bread choice, plus a free-form addition). The cheesecake line item carries a modifier tree (strawberry topping). Half-and-half pizzas, protein substitutions, custom sushi rolls, and spice levels are named cases in the menu mapping. The pattern scales: every modifier the menu can express becomes a slot the agent can fill on the call. Pizza shops, Indian, Chinese, Mexican, and sushi menus are explicitly in scope.

📞PieLineAI Phone Ordering for Restaurants
© 2026 PieLine. All rights reserved.

How did this page land for you?

React to reveal totals

Comments ()

Leave a comment to see what others are saying.

Public and anonymous. No signup.