Voice agents · edge cases

On edge cases, the honest target is not a higher accuracy number. It is whether the agent knows it is in over its head.

Every voice agent vendor leads with one figure. That figure is measured on the easy 80 to 90% of calls. The accuracy that decides whether the system is actually shippable lives in the other 10 to 20%, and it is a different problem: not how often the agent answers a hard call, but whether it recognizes a hard call is happening and hands off cleanly. Here is the restaurant phone-order taxonomy of edge cases, why optimizing recognition harder makes them worse, and what a clean handoff actually carries.

Matthew Diakonov, Written with AI

Published May 22, 2026 · 7 min read

Direct answer (verified 2026-05-22)

How accurate are voice agents on edge cases?

Lower than the headline number, and the honest target on edge cases is not a higher recognition score. It is triage correctness: does the agent recognize it is in an edge case and hand off to a human with full context, or does it keep guessing. Even well-tuned systems plan for 5 to 15% recognition failure on real audio (Hamming AI). PieLine handles ~90% of calls end to end and routes the rest to a human with the full conversation, the partial cart, and the moment confidence dropped.

The edge case is the call no testing dataset has

The pages that currently cover this topic all describe edge cases the same way: a generic list of seven or ten things a model can stumble on. Silence. Background voices. Accents. Off-topic queries. The fix is always a longer evaluation set and a tighter confidence threshold. That is correct as far as it goes, and incomplete in the place that actually matters.

What gets missed is that edge cases are not random noise around a stable mean. They are a distribution shift. Every real phone line carries a long tail of calls that were never going to be in any training set, because the long tail is by definition the calls that do not look like the other calls. A restaurant phone line has its own long tail and it is not the same as a support-center long tail. The catering caller, the allergen panic, the four-person family relaying an order across the kitchen, the regular who wants "the usual" without ever giving an item name. None of these fail because the model heard the wrong words. They fail because the right action is not in the agent’s reachable set at all.

That changes the goal. If the agent cannot complete an edge case no matter how cleanly it hears the audio, the question stops being “how do we hear better” and becomes “how do we know we cannot complete this, fast, and route it cleanly.” Recognition is necessary. Triage is sufficient.

Why optimizing recognition harder on edge cases makes them worse

The failure cost is asymmetric. On the happy 80 to 90% of calls, an extra point of recognition accuracy is real money: more orders firing right, faster. On the long-tail 10 to 20%, the same extra point of confidence usually translates to a wrong action, taken confidently, instead of a clean stop. A confident wrong ticket is worse than dead air. An invented allergen answer is worse than “let me put you on with a manager.” A fabricated address is worse than a refused delivery.
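The asymmetry can be written as a small expected-cost comparison. This is a minimal sketch, not a measured model: the function name `should_act` and every cost figure below are illustrative assumptions, and a correct action is assumed to cost nothing.

```python
def should_act(p_correct: float, cost_wrong: float, cost_handoff: float) -> bool:
    """Act only if the expected cost of acting is below the cost of handing off.

    All costs are hypothetical units chosen for illustration.
    """
    expected_cost_of_acting = (1.0 - p_correct) * cost_wrong
    return expected_cost_of_acting < cost_handoff


# Routine order: a wrong ticket is a cheap fix, so 90% confidence is enough.
print(should_act(p_correct=0.90, cost_wrong=5.0, cost_handoff=1.0))    # True

# Allergen question: a wrong answer is a safety event, so the same 90% is not.
print(should_act(p_correct=0.90, cost_wrong=500.0, cost_handoff=1.0))  # False
```

The confidence is identical in both calls. Only the cost of being wrong changes, and that alone flips the decision from act to hand off.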

~90%

handled end to end by PieLine, where recognition is the right lever

~10%

routed to a human with full context, where triage is the right lever

The interesting work is in the second column. Building an agent that can read a confidence drop, decide that the next thing about to happen is a wrong action, and stop, is harder than nudging the recognition number up by another point. It is also where the user experience actually lives, because every caller who hits the long tail will tell at least one other person what happened.

Two responses to the same hard call

Same caller, same audio, same line. The difference is what the agent is permitted to do when it knows it is uncertain.

A caller asks: I need 50 pieces of tandoori chicken for a party Saturday at six, and the last order was missing the naan.

The agent tries to treat the request as a large order. It posts 50 line items into the cart, quotes a price the kitchen has not approved, attempts to read a missing-item refund from a transcript with no order ID, and offers a Saturday slot it has no calendar visibility into. The ticket fires. Someone calls the customer back to fix it.

Confident wrong answer beats clean stop
Refund path invented from a transcript, not a record
Saturday capacity quoted with no calendar source
Caller hangs up satisfied; manager finds out tomorrow

The restaurant phone-order edge case taxonomy

Generic voice-AI edge case lists (silence, accents, interruptions, off-scope queries) are real, and a restaurant line sees all of them plus a category that nobody else has. These are the calls that look like normal orders until they are not.

Off-menu request

“Can you add bacon to the pad thai, or do you do a half-and-half noodle bowl?”

The words resolve cleanly. The combination is not a real modifier group in the POS. No amount of better hearing fixes a request the menu cannot post.
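One way to picture this is a lookup against the modifiers the POS can actually post. The sketch below is hypothetical: `POS_MODIFIERS`, its contents, and `resolve_modifier` are invented for illustration and do not describe any real POS API.

```python
# Hypothetical POS snapshot: item -> modifiers that can actually be posted.
POS_MODIFIERS = {
    "pad thai": {"add tofu", "extra peanuts", "no egg"},
}


def resolve_modifier(item: str, requested: str) -> dict:
    """Return a postable modifier, or surface the request instead of guessing."""
    allowed = POS_MODIFIERS.get(item, set())
    if requested in allowed:
        return {"status": "ok", "item": item, "modifier": requested}
    # No fuzzy match and no free-text note: the request is reported as
    # unmappable, along with the options the menu really supports.
    return {
        "status": "unmappable",
        "item": item,
        "requested": requested,
        "options": sorted(allowed),
    }


print(resolve_modifier("pad thai", "add bacon")["status"])  # unmappable
```

The transcription of "add bacon" can be perfect and the result is still `unmappable`, which is the point: the failure is in what the menu can post, not in what the agent heard.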

The usual

“Same as last Tuesday. Whatever I got for my daughter the time before.”

The caller is asking the agent to query history, not a menu. Without a tied phone-to-customer record and explicit confirmation, a confident guess is a wrong ticket.

Group order relayed across the room

“Hold on, what do you want? She wants the green curry. No, the red one. Mom, do you want rice?”

Multiple speakers, side conversations, mid-call revisions. Voice activity detection treats the secondary speakers as the caller. The agent that stays quiet and waits for one speaker beats the agent that tries to keep up.

Allergen and dietary panic

“Wait, does anything in the kofta have peanuts? My son has a peanut allergy.”

A wrong answer here is a real safety event, not a refund. The right behavior is read-from-source, not infer-from-name, and escalate when the source does not cover it.
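Read-from-source can be sketched as a lookup that distinguishes "the source says no" from "the source says nothing." The allergen sheet, item names, and `answer_allergen` below are assumptions made up for this example, not a real menu or a real API.

```python
from typing import Optional

# Hypothetical allergen sheet. None means the source is silent on that allergen.
ALLERGEN_SOURCE: dict[str, dict[str, Optional[bool]]] = {
    "malai kofta": {"dairy": True, "peanuts": None},
    "mango lassi": {"dairy": True, "peanuts": False},
}


def answer_allergen(item: str, allergen: str) -> dict:
    """Answer only from the source; escalate when the source does not cover it."""
    record = ALLERGEN_SOURCE.get(item, {})
    contains = record.get(allergen)
    if contains is None:
        # Unknown item, unknown allergen, or an explicit gap: never infer from the name.
        return {"action": "escalate", "reason": f"no source for {allergen} in {item}"}
    return {"action": "answer", "contains": contains}


print(answer_allergen("malai kofta", "peanuts")["action"])  # escalate
print(answer_allergen("mango lassi", "peanuts"))  # {'action': 'answer', 'contains': False}
```

The design choice worth noting is that a missing entry and an explicit gap take the same path. The agent has no branch in which it answers without a record.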

Code-switching mid-call

“Mujhe ek butter chicken chahiye, and one mango lassi, plus naan for the kids.”

Models trained mostly on standard American English drop accuracy on code-switched speech. The slot-bearing words (the items) often survive but the connective tissue around them does not.

Address or zone failure

“I am at the corner of, hold on, the new apartments next to the old Walgreens.”

Geocoding fails on landmarks and apartment-complex names that do not exist on a map. Confidently fabricating a deliverable address is worse than refusing it.

Mid-call cancel or restart

“Actually, forget that. Can I just do two paneer tikkas instead? And take the rice off.”

Cart state has to roll back, not append. The failure mode is a doubled order that fires anyway because the agent treated the revision as an add.
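A minimal sketch of roll-back-not-append, assuming a cart that checkpoints its state before every change. The `Cart` class and its methods are hypothetical and only illustrate the idea; the items are taken from the example call.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    lines: dict[str, int] = field(default_factory=dict)
    history: list[dict[str, int]] = field(default_factory=list)

    def _checkpoint(self) -> None:
        # Save a copy of the cart before each change so it can be undone.
        self.history.append(dict(self.lines))

    def add(self, item: str, qty: int = 1) -> None:
        self._checkpoint()
        self.lines[item] = self.lines.get(item, 0) + qty

    def remove(self, item: str) -> None:
        self._checkpoint()
        self.lines.pop(item, None)

    def undo(self) -> None:
        """Roll back to the state before the most recent change."""
        if self.history:
            self.lines = self.history.pop()


cart = Cart()
cart.add("rice")
cart.add("butter chicken")
cart.undo()                   # "Actually, forget that."
cart.add("paneer tikka", 2)   # "two paneer tikkas instead"
cart.remove("rice")           # "And take the rice off."
print(cart.lines)  # {'paneer tikka': 2}
```

An agent that treated the revision as an add would end this call with butter chicken, rice, and two paneer tikkas in the cart, which is the doubled order described above.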

Catering, complaint, or refund

“I need 50 pieces of tandoori chicken for a party Saturday at six. Last week my order was missing the naan and nobody picked up when I called.”

These are explicitly out-of-scope for an order agent and explicitly in-scope for a manager. The right number to optimize is how cleanly they get there, not how often the agent simulates a manager.

Tuning recognition harder addresses maybe two of these. The rest are not recognition failures; they are scope failures. The agent heard the words. It cannot act on them inside its reachable set without inventing something that does not exist. The right number to grade on this category is whether the call ended with a clean human pickup, not whether the agent posted a ticket.

What a clean handoff actually carries

A cold transfer that drops a caller into a generic “hello, how can I help you” is a worse experience than no AI at all, because the caller already explained themselves once. The unit of measurement on the long tail is how much of that work is preserved across the handoff. Here is the payload PieLine actually passes to the human picking up.

handoff payload (illustrative)

caller_phone: +1 415 555 0142
caller_known_to_pos: true (returning, 14 prior orders)
intent_detected: catering inquiry + prior-order complaint
partial_cart: (empty, agent did not post a ticket)
transcript_window: “...need 50 pieces tandoori chicken Saturday six... and last week the naan was missing, nobody called back...”
confidence_drop_at: 00:38
drop_reason: out-of-flow (catering quantity exceeds standard order flow), prior-order complaint (refund flow not in agent scope)
recommended_action: warm-transfer to manager line, open catering form, pull order #4471 from 2026-05-15

The anchor of this whole topic is in that payload. The agent did not try to be a manager. It noticed that the next thing about to happen was a wrong action, stopped, and passed the work, fully assembled, to the human who actually has the authority and the calendar visibility to handle it. Across an account, this is the shape the long tail takes: not a tail of wrong answers, but a tail of clean stops.
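For readers who think in types, the illustrative payload above could be modeled roughly as follows. This is a sketch under assumptions: the field names follow the illustrative payload, but the class, the field types, and the `is_warm` check are invented here and are not PieLine's actual schema.

```python
from dataclasses import asdict, dataclass, field


# Hypothetical schema, for illustration only.
@dataclass
class HandoffPayload:
    caller_phone: str
    caller_known_to_pos: bool
    intent_detected: list[str]
    transcript_window: str
    confidence_drop_at: str
    drop_reason: list[str]
    recommended_action: list[str]
    partial_cart: list[dict] = field(default_factory=list)

    def is_warm(self) -> bool:
        """Assumed rule: the human should not have to re-ask what or why."""
        return bool(self.transcript_window and self.drop_reason and self.recommended_action)


payload = HandoffPayload(
    caller_phone="+1 415 555 0142",
    caller_known_to_pos=True,
    intent_detected=["catering inquiry", "prior-order complaint"],
    transcript_window="...need 50 pieces tandoori chicken Saturday six...",
    confidence_drop_at="00:38",
    drop_reason=["out-of-flow", "refund flow not in agent scope"],
    recommended_action=["warm-transfer to manager line", "open catering form"],
)
print(payload.is_warm())                # True
print(asdict(payload)["partial_cart"])  # []
```

An empty `partial_cart` is a valid value here. It records that the agent deliberately did not post a ticket, which is different from the field being missing.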

“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”

PieLine customer

Reported on a live restaurant line, Bay Area

The public lesson nobody wants to write up

Taco Bell rolled AI ordering to roughly 500 U.S. locations starting in 2023, then publicly slowed the expansion in 2025 after a string of viral edge-case failures: a customer “crashing” the bot by asking for 18,000 cups of water, soda-order loops where the agent kept asking what to drink with the drink that had just been ordered, and a steady supply of trolling clips. The company shifted to a model where employees monitor nearly every order and step in when the system stumbles, and the chief digital and technology officer said publicly that the busier locations may benefit from a human handling orders (RetailWire).

The takeaway people keep writing is “voice AI is not ready.” That is not the lesson. The lesson is that the edge case escalation path is not optional. The brands that ship voice ordering without a clean handoff pay for the gap publicly. The ones that ship it with a clean handoff get the 90% upside and absorb the other 10% as a routine operational handoff that nobody films.

What to ask a vendor about edge cases

Skip “what is your accuracy.” It is the cheapest thing to quote and the hardest thing to verify. Ask the questions that force a vendor to describe their triage, not their recognition:

What percentage of calls do you hand off, and what gets handed off with them? A vendor that handles 100% of calls is either lying or hurting you on the long tail.
How does the agent decide to stop? A confidence threshold on a single token is not a stop rule. Repeated retries on the same slot, out-of-scope intent detection, and explicit no-source refusals are.
What happens to an allergen question your menu source does not answer? The correct answer is escalate. A confident inference is the wrong answer.
Show me a real handoff transcript on my own menu. Not a demo, the actual payload that lands with the human. If the human is starting cold, the agent did not really hand off.
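The three stop rules named in the second question (repeated retries on the same slot, out-of-scope intent, and no-source refusal) can be sketched as one check that runs every turn. Everything here is an assumption for illustration: the intent labels, the retry limit of 2, and the `TurnState` shape are not taken from any vendor's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

OUT_OF_SCOPE_INTENTS = {"catering", "complaint", "refund"}  # assumed labels
MAX_RETRIES_PER_SLOT = 2  # assumed limit


@dataclass
class TurnState:
    intent: str
    slot_retries: dict[str, int] = field(default_factory=dict)  # slot -> failed attempts
    source_missing: bool = False  # e.g. an allergen question the menu source does not cover


def stop_reason(state: TurnState) -> Optional[str]:
    """Return why the agent should hand off, or None to keep going."""
    if state.intent in OUT_OF_SCOPE_INTENTS:
        return f"out-of-scope intent: {state.intent}"
    if state.source_missing:
        return "no source for requested fact"
    for slot, retries in state.slot_retries.items():
        if retries >= MAX_RETRIES_PER_SLOT:
            return f"repeated retries on slot: {slot}"
    return None


print(stop_reason(TurnState(intent="order", slot_retries={"address": 2})))
# repeated retries on slot: address
print(stop_reason(TurnState(intent="order", slot_retries={"address": 1})))
# None
```

None of the three rules looks at a per-token confidence score. A vendor's real answer will differ in detail, but it should be describable at about this level of concreteness.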

Bring your three hardest edge-case calls

We will run them on a live PieLine line and show you exactly what gets handed off to the manager when the agent recognizes it is in over its head.

Frequently asked questions

How accurate are voice agents on edge cases?

Lower than the headline number, and the honest answer is that the number itself is the wrong target. On the long tail (off-menu requests, group orders relayed across the room, code-switching, allergen questions, callers giving an address as a landmark), recognition collapses no matter which model you pick. The metric that survives production is triage correctness: does the agent recognize it is in an edge case and hand off to a human with full context, or does it keep guessing. Hamming AI's edge-case guide tells builders to plan for 5 to 15% STT failure even on well-tuned systems, and to put validation layers between recognition and action. The vendor question that matters is not 'what is your accuracy' but 'what happens to the calls you cannot complete.'

What counts as an edge case on a restaurant phone line, specifically?

The restaurant edge-case taxonomy is different from a generic support-agent one. It includes off-menu combinations that the words resolve to but the POS modifier groups cannot, 'the usual' and other history-based requests, group orders with multiple speakers and side conversations, allergen and dietary panic where the cost of a wrong answer is a safety event, code-switching mid-order, address failures on landmarks or new apartment names, mid-call cancels that require rolling cart state back rather than appending, and catering or complaint calls that belong with a manager, not an order agent.

Why is a higher accuracy number not the right goal on edge cases?

Because the failure cost is asymmetric. On the happy 80 to 90% of calls, an extra point of recognition accuracy is real money. On the long-tail 10 to 20%, a confidently wrong answer is worse than a clean stop in almost every direction: a remake beats a wrong ticket fired silently, a refused delivery beats a fabricated address, and a handoff to a human beats an invented allergen answer. Optimizing recognition harder on the long tail trades a clean failure for a confident wrong one. The right design lever is calibration: a system that knows when it does not know, and stops.

What should a vendor hand off to a human on an edge case?

Everything that lets the human pick up exactly where the agent stopped, with no need for the caller to repeat anything. At minimum: the caller phone, the partial cart already built, the last full transcript window, the moment confidence dropped and why (a modifier that did not resolve, a name that geocoded nowhere, an allergen question outside the menu source), and the recommended next action. A cold transfer that drops the caller into a 'hello, how can I help you' is a worse experience than no AI at all. A warm transfer with the cart and transcript is a better experience than a single human juggling phones and counter.

How does PieLine actually handle the long tail of calls?

Two design choices. First, the agent is constrained to act on things that exist in the source of truth: modifier IDs that exist in the POS, addresses inside the configured delivery zone, items on the live menu. Anything that does not resolve gets surfaced on the call instead of saved as a hopeful free-text note. Second, when the agent recognizes it is outside its scope (catering, complaints, allergen questions the menu source does not answer, repeated low-confidence retries on the same slot), it transfers to a human with the full conversation context: phone, partial cart, transcript window, confidence-drop reason. About 90% of calls are handled end to end. The other ~10% become a clean handoff, not a confidently wrong ticket.

What did Taco Bell's AI drive-thru rollback actually teach the industry?

That the calls on the long tail are not random noise, they are the actual product. Taco Bell pushed AI ordering to roughly 500 U.S. locations starting in 2023, then publicly slowed the rollout in 2025 after viral edge-case failures: a customer 'crashing' the bot by asking for 18,000 cups of water, soda-order loops, and a steady volume of trolling clips. The company shifted to a model where employees monitor nearly every order and jump in when the system stumbles. The lesson is not 'voice AI is not ready,' it is 'the edge-case escalation path is the part you cannot skip,' and a brand without it pays for that gap in public.

How should I audit a voice agent on edge cases before I sign?

Skip the headline accuracy figure. Place the three calls you actually dread: the off-menu combination you know is not in the POS, the allergen question that touches a dish the agent did not write, and the catering or complaint that does not look like a normal order. Watch what happens. Did the agent surface the unmappable request and offer the closest real option, or hide it in a free-text comment? Did it refuse to fabricate an allergen answer it had no source for? Did it route the catering call to a human with the conversation already loaded, or cold-transfer the caller into a generic queue? Those three outcomes tell you the real edge-case number, the one no demo highlights.
