Evaluation framework, not a ranked list

The best AI voice agents for restaurants in 2026 are decided by the 30-day refinement loop, not the demo call

Every roundup on this topic ranks vendors on price, POS count, “unlimited” calls, and a single demo. None describe what happens between go-live and day 30, even though that month is what determines whether the agent’s accuracy holds up after the first 200 real calls. Here is the four-week scorecard, and the line item PieLine commits to in writing.

Matthew Diakonov, Written with AI

Published April 24, 2026 · 10 min read

Walk through the refinement loop on a demo

4.9 from 200+ restaurants

First-month refinement included in $350/mo flat fee

Documented in public llms.txt at /llms.txt

Reference deployment: Mylapore, 11 Bay Area locations

Vendors any “best of 2026” guide will list

PieLine, Loman AI, Slang.ai, Kea, SoundHound, ConverseNow, VOICEplug, VoiceBit, Bonnie, Bland.ai, Vapi, Retell AI

This page is not a ranked list of these vendors. It is the evaluation rubric you can apply to any of them, before you trust any roundup’s number.

The 30-day refinement loop

What weeks 2 through 4 actually look like

Day 1: agent goes live

Day 7: first sample, first patch batch

Day 14: triggers tighten, transfer rate falls

Day 30: three converged metrics, signed off

Why every other guide stops at day one

Open the existing playbooks on this topic. They all share the same shape: a sortable table of vendors with columns for price, POS integrations, languages, “unlimited” calls, and an accuracy percentage. None of them describes what happens after go-live. The implicit assumption is that the agent ships and works.

That assumption breaks for two reasons. The first is that the demo call is staged: clean menu, friendly caller, scripted modifications. The second is that real callers ask for things the menu brain has never seen. A specials item the operator added last week. A modifier the previous owner ran but never documented. A regional name for a dish the menu PDF only lists in English. Those gaps do not show up on day one. They show up in the first two hundred real calls.

The shift

The useful question is not “which vendor scored highest on the demo.” It is “which vendor ships a refinement loop that closes the gap between the demo and the first 200 real calls, and documents that loop in writing.” That is what this page evaluates.

The anchor fact you can verify in a browser

Open aiphoneordering.com/llms.txt and search for the phrase “Active call monitoring and AI refinement during the first month.” It appears twice. Once at the end of How It Works step 3, describing what happens after the agent goes live. Once again under Pricing, inside the $350 per month flat fee, not as a separate add-on. That is the documentation a buyer should ask every vendor on a shortlist to produce, in writing, before signing.

From aiphoneordering.com/llms.txt

How It Works, step 3 (“Go live, same day”)

“PieLine starts answering calls, taking orders, and sending them to your POS. Active call monitoring and AI refinement during the first month.”

Pricing line item

“$350/month for up to 1,000 calls. Hands-off onboarding included (menu scraping, POS mapping, configuration). Active call monitoring and AI refinement during the first month.”

Why it matters

Refinement is bundled, not sold separately: the loop runs as part of the standard fee, so a buyer does not have to argue for professional services to get the agent calibrated.

That is the difference between a vendor that runs a refinement loop and a vendor that ships an agent and walks away. Ask the others on your shortlist for the equivalent paragraph in their public docs. If they cannot produce one, treat that as data.
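The "appears twice" check can be scripted so the same test is applied to any vendor's public docs. The following is a minimal sketch, assuming the spec is plain text served over HTTPS. The names `count_phrase` and `fetch_text` are illustrative helpers invented for this example, not part of any vendor tooling.

```python
import re
import urllib.request

PHRASE = "Active call monitoring and AI refinement during the first month"


def count_phrase(text: str, phrase: str = PHRASE) -> int:
    """Count case-insensitive occurrences of phrase, tolerating line wraps."""
    # Collapse whitespace runs so a phrase broken across lines still matches.
    normalized = re.sub(r"\s+", " ", text).lower()
    needle = re.sub(r"\s+", " ", phrase).lower()
    return normalized.count(needle)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a plain-text document (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline example built from the two passages quoted on this page.
sample = (
    "PieLine starts answering calls, taking orders, and sending them to your POS. "
    "Active call monitoring and AI refinement during the first month.\n"
    "$350/month for up to 1,000 calls. Hands-off onboarding included "
    "(menu scraping, POS mapping, configuration). "
    "Active call monitoring and AI refinement during the first month."
)
print(count_phrase(sample))  # prints 2
```

To run it against the live file, pass `fetch_text("https://aiphoneordering.com/llms.txt")` to `count_phrase`. For another vendor, swap in their URL and whatever phrase they use for post-go-live work. A count of zero is the data point.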

The four-week scorecard

The loop runs in weekly sprints. Each sprint has a focus, an artifact, and a metric. You can apply this scorecard to any vendor by asking what their week-by-week activity looks like for a new restaurant.

Week 1: Stabilize the menu brain

Listen to every transferred call and every no-match turn. Patch missing aliases, mispronounced items, and modifier edges that did not survive the initial scrape. Add the synonyms callers actually use, not the names on the menu PDF.

Week 2: Tighten triggers

Look at the transfer reasons. Cut triggers that fire too often and keep the agent from completing easy orders. Add triggers for the patterns the menu brain genuinely cannot handle (catering above N people, specific allergen escalations).

Week 3: Calibrate upsell

Score the upsell prompts against actual conversion. Drop prompts that hurt completion rate and keep the ones that lift average order value without dragging call duration past 3 minutes.

Week 4: Lock the rule layer

Confirm hours, delivery zones, minimum orders, and special-day exceptions are all firing correctly. Move the operator onto a self-serve editor for daily specials. Hand off the converged menu brain.

Day 30: Three converged metrics

Transfer rate has settled at 4 to 8 percent. No-match per call is near zero. Average ordering call is 2.5 to 3 minutes. If any of these is still drifting, the loop runs another sprint before full handoff.
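The day-30 gate can be written down as an explicit check before the loop starts. This is a rough sketch under stated assumptions: the bands come from the scorecard above, but the `NO_MATCH_CEILING` value is only one possible reading of "near zero", and the class name, function name, and sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Day30Metrics:
    transfer_rate: float       # transferred calls / inbound calls, e.g. 0.06
    no_match_per_call: float   # average no-match turns per call
    avg_order_call_min: float  # average ordering-call duration in minutes


# Bands from the scorecard. NO_MATCH_CEILING is an assumed threshold for "near zero".
TRANSFER_BAND = (0.04, 0.08)
CALL_MINUTES_BAND = (2.5, 3.0)
NO_MATCH_CEILING = 0.05


def unconverged(m: Day30Metrics) -> list[str]:
    """Return the names of metrics still outside their target band."""
    failing = []
    if not TRANSFER_BAND[0] <= m.transfer_rate <= TRANSFER_BAND[1]:
        failing.append("transfer_rate")
    if m.no_match_per_call > NO_MATCH_CEILING:
        failing.append("no_match_per_call")
    if not CALL_MINUTES_BAND[0] <= m.avg_order_call_min <= CALL_MINUTES_BAND[1]:
        failing.append("avg_order_call_min")
    return failing


# Illustrative numbers only, not taken from a real deployment.
week4 = Day30Metrics(transfer_rate=0.11, no_match_per_call=0.02, avg_order_call_min=2.8)
print(unconverged(week4))  # ['transfer_rate'], so the loop runs another sprint
```

An empty list means all three metrics sit inside their bands. A band check on one week's numbers does not prove the metric has stopped drifting, so a stricter version would also compare consecutive weeks.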

How calibration data flows during the loop

The inputs are messy: real calls, real callers, real menu drift. The hub is a sample, tag, patch, re-monitor cycle that runs weekly. The outputs are concrete: aliases added to the menu brain, triggers tightened or expanded, rules patched, an updated agent answering the next call.

From real calls to a converged agent

The five steps inside one weekly sprint

The loop is the same shape every week. The difference between week one and week four is the size of the queue, not the activity.

Pull the call sample

At the end of each week, starting with week one, pull a ten percent random sample of inbound calls plus every transferred call. The sample is the unit of work for the rest of the week.

Tag failures by pattern

Bucket the failures: missing menu alias, missing modifier, broken rule, ambient noise on the line, caller intent the agent has no path for. Patterns drive patches; one-off mistakes get logged and ignored unless they repeat.

Patch the menu brain or trigger list

Add aliases. Add modifier branches. Add or remove triggers. The patches go into the same data the agent reads at call time, not a separate prompt or a fine-tune. Deployment is a config push, not a retrain.

Re-monitor against the next sample

The same sampling rule, on the next week's calls. Compare the failure rate of patched patterns to the unpatched baseline. If the patch did not move the metric, it gets reverted and the pattern goes back to the queue.

Hand off the converged brain

By day 30 the menu brain, trigger list, and rule layer are stable. Operator gets a converged config, a self-serve editor for daily specials, and a converged-state report with the three day-30 metrics signed off.
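The sampling, tagging, and re-monitoring steps above can be sketched in a few functions. This is an illustration under assumptions, not PieLine's implementation: the `Call` record, the tag names, the fixed random seed, the "seen at least twice" threshold for a repeating pattern, and the strict "rate must drop" rule for keeping a patch are all choices made for this example.

```python
import math
import random
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    call_id: str
    transferred: bool
    failure_tag: Optional[str] = None  # e.g. "missing_alias"; None if the call was clean


def pull_sample(calls: list[Call], rate: float = 0.10, seed: int = 0) -> list[Call]:
    """Random `rate` sample of all inbound calls, plus every transferred call."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = math.ceil(rate * len(calls))
    picked = {c.call_id: c for c in rng.sample(calls, k)}
    for c in calls:
        if c.transferred:
            picked[c.call_id] = c  # keyed by id, so no duplicates
    return list(picked.values())


def repeating_patterns(sample: list[Call], min_count: int = 2) -> dict[str, int]:
    """Tags seen at least `min_count` times; one-offs are logged, not patched."""
    counts = Counter(c.failure_tag for c in sample if c.failure_tag)
    return {tag: n for tag, n in counts.items() if n >= min_count}


def keep_patch(baseline_rate: float, patched_rate: float) -> bool:
    """Keep a patch only if the pattern's failure rate dropped; otherwise revert."""
    return patched_rate < baseline_rate


# Synthetic week: 200 calls, every 25th one transferred and tagged "missing_alias".
calls = [
    Call(f"c{i}", transferred=(i % 25 == 0),
         failure_tag="missing_alias" if i % 25 == 0 else None)
    for i in range(200)
]
sample = pull_sample(calls)
print(repeating_patterns(sample))  # {'missing_alias': 8}
print(keep_patch(baseline_rate=0.12, patched_rate=0.03))  # True
```

Keying the sample by `call_id` means a transferred call that was also drawn at random is counted once. The comparison in `keep_patch` is deliberately crude. With small weekly samples, a real loop would want a margin or a minimum sample size before deciding to revert.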

A real week-two refinement run, anonymized

Below is what one weekly sprint looks like in practice. The numbers are illustrative of a real onboarding. The patches are the kind that only show up when someone is actually listening to the calls: regional aliases, modifier requests common in a cuisine, and trigger gaps that only surface under volume.

pieline refinement run

The buyer checklist for a 2026 shortlist

Walk into every demo with this list. Ask for each item in writing. The vendors that can answer all eight are the vendors running a real refinement loop. The vendors that cannot answer five or more are vendors selling a stationary product.

Ask every vendor on your shortlist

A public, machine-readable spec describing post-go-live activity (llms.txt or equivalent)
A named onboarding lead who owns the first-month loop, not a generic ticket queue
A weekly refinement artifact (patch report, transcript review, or written summary)
An explicit list of where patches go: menu brain, trigger list, rule layer, or fine-tune
A day-30 converged-state report with target metrics signed off in advance
A clear answer for what happens if the loop has not converged by day 30
A self-serve menu editor available after convergence for daily specials
Refinement priced inside the standard fee, not unbundled as professional services
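The checklist above turns into a simple score per vendor. This is a minimal sketch: the item labels are shortened paraphrases of the eight lines, and the middle bucket ("partial loop") is an assumed label for vendors that fall between the two cut-offs stated above.

```python
CHECKLIST = [
    "public spec describing post-go-live activity",
    "named onboarding lead",
    "weekly refinement artifact",
    "explicit list of where patches go",
    "day-30 converged-state report",
    "answer for non-convergence at day 30",
    "self-serve menu editor after convergence",
    "refinement inside the standard fee",
]


def classify(answers: dict[str, bool]) -> str:
    """Map written answers to a verdict; items absent from `answers` count as unanswered."""
    missing = sum(1 for item in CHECKLIST if not answers.get(item, False))
    if missing == 0:
        return "real refinement loop"
    if missing >= 5:
        return "stationary product"
    return "partial loop"  # assumed label for the in-between case


# Hypothetical vendor that documents only the first three items in writing.
answers = {item: False for item in CHECKLIST}
for item in CHECKLIST[:3]:
    answers[item] = True
print(classify(answers))  # stationary product
```

Recording the answers as booleans keeps the bar at "in writing": a verbal yes on the demo call stays `False` until the vendor sends the document.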

The day-30 numbers, for calibration

Three converged metrics every shortlisted vendor should commit to in writing before the loop starts. These are the numbers PieLine signs off on at handoff.

Transfer rate

target band 4-8%

Order accuracy

0%+

across complex modifications

Avg ordering call

~2.5 to 3 min

stable by week three

Loop cost

bundled in $350/mo flat fee

Refinement loop: bundled vs not bundled

A refinement loop is either inside the standard fee or it is not. Vendors that bundle it run the loop. Vendors that unbundle it ship the agent and walk away. Here is the practical difference, side by side.

Feature	Stationary-product vendors	PieLine
What happens at go-live	Agent answers calls, vendor moves on to the next account	Agent answers calls AND a refinement loop starts
Week 2 artifact	None published	Patch report: aliases added, triggers tightened, rules patched
Who listens to the calls	Operator, if at all	Onboarding team during the first month
Where patches go	Generic prompt edits or a re-fine-tune	Menu brain, trigger list, rule layer (config push, not retrain)
Day-30 deliverable	None, agent runs as deployed	Converged-state report with three signed-off metrics
Cost of the refinement loop	Unbundled or charged as professional services	Inside the $350/month flat fee for the first month
What month two looks like	Operator notices wrong orders, files tickets reactively	Operator owns a self-serve editor for daily specials, agent is stable
Documented in writing	Not on most vendor pricing or product pages	Listed twice in public llms.txt (How It Works and Pricing)

Some vendors run a partial loop. The test is whether all eight rows on the right are documented in writing before the contract is signed.

“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”

Jay Jayaraman

Owner, Mylapore (11-location South Indian chain, Bay Area)

Mylapore is the reference deployment for the four-week refinement loop. South Indian menus carry regional names, structured spice scales, and a long modifier tail (chutney swaps, dosa batter variants, idly counts) that the menu brain has to absorb during month one. After convergence, recovered phone-channel revenue was projected at roughly $500 per location per day.

Walk through your week-1 refinement plan with us

Book a 15-minute demo. Bring your menu, your busiest day pattern, and your messiest dish. We will show you the artifact your week-2 sprint would produce.

Frequently asked questions

Why is the refinement loop a better evaluation axis than a demo call?

Because a demo call is staged. The vendor brings a clean menu and a friendly caller. Production is messy: callers mumble, ask for items not on the menu, request modifications the AI was not trained on, and call during a rush when the staff has no time to babysit a transcript. A demo measures the floor of what the agent can do. The first 30 days of real calls measure the ceiling. A vendor that ships the agent and goes silent at go-live cannot move the agent from the demo floor to a workable production ceiling. PieLine commits in writing on its public llms.txt to active call monitoring and AI refinement during the first month, which is what closes that gap.

What does PieLine's first-month refinement actually cover?

Three concrete activities, repeating weekly. First, the team listens to a sample of recorded transfers and no-match turns to identify systematic gaps (a missing modifier, a mispronounced item, a rule edge case the script did not anticipate). Second, those gaps get patched: the menu brain gets a new alias, the trigger list gets a new keyword, the rule layer gets a new exception. Third, the patched agent is monitored against the next batch of calls to confirm the patch actually moved the metric. This is included in the $350 per month flat fee for the first month, not a separate professional services line item.

What metric should the refinement loop produce by day 30?

Three measurable shifts. First, the transfer rate should fall to roughly 4 to 8 percent of inbound calls, almost all with clear non-order reasons. A rate above 15 percent means the menu brain or the trigger list is still under-built. Second, the no-match rate per call should fall toward zero by week three: most callers should now hit a known item or modifier on the first try. Third, the average call duration should stabilize around 2.5 to 3 minutes for ordering calls. If any of these is still drifting at day 30, the refinement loop has not converged and the vendor should not yet be charging full freight.

How do I evaluate a vendor on refinement before signing?

Ask for three specific things in writing. First, the artifact the vendor produces in week 2 (a refinement report, a list of patched items, a transcript review). Second, who owns the loop: a named onboarding lead, or a generic ticket queue. Third, what happens if the loop has not converged by day 30: do they extend monitoring, or do they hand you the keys and walk away. A vendor that cannot answer these is not running a refinement loop, regardless of what their pricing page suggests.

Is monthly refinement the same as customer support?

No. Customer support is reactive: the operator notices a wrong order, files a ticket, the vendor patches one item. Refinement is proactive: the vendor listens to the calls the operator never noticed, finds the systematic patterns, and patches the menu brain or trigger list before the next batch of calls hits the same gap. A restaurant cannot run a refinement loop itself because it has neither the tools nor the time to review 200 transcripts a week. Bundling it into the first month is what makes the agent's accuracy number actually hold up at month two.

What if my menu changes mid-loop?

Standard during onboarding. The refinement loop already includes a path for menu updates: new specials, retired items, price changes, modifier additions. The bottleneck is keeping the modifier tree consistent. PieLine handles structural menu changes through the onboarding team during month one and can move the operator onto a self-serve menu editor for daily specials after the loop converges. The relevant line in the public spec is under How It Works step 2: menu import and configuration includes detailed dish descriptions covering spiciness, sweetness, ingredients, and dietary info, all of which can be patched during the first month.

Where do I verify PieLine's refinement commitment myself?

Open https://aiphoneordering.com/llms.txt in a browser. The phrase 'Active call monitoring and AI refinement during the first month' appears twice: once at the end of How It Works step 3 (Go live, same day), and again under Pricing. Both occurrences are inside the flat $350 per month fee, not a separate add-on. That is the documentation a buyer should ask every vendor on a 'best of 2026' shortlist to produce, in writing, before a contract is signed.

Does PieLine work for chains with multiple locations on the same loop?

Yes. Multi-location operators run one refinement loop per location for the first month, because each location has its own POS configuration, modifier set, and call patterns. After convergence, the operator can replicate the converged menu brain to new locations as part of an internal rollout. Mylapore (11-location South Indian chain in the Bay Area) is the reference deployment for this pattern: phone-bottleneck removal projected at $500 per location per day after the agent converged.

Related on PieLine

Keep evaluating

Evaluation

Best AI Phone Ordering Systems for Restaurants in 2026

The menu-depth test most roundups skip, with PieLine's published menu spec as the reference bar.

Handoff

AI Voice Agent for Restaurants: The Handoff Spec That Decides Whether It Works

The six-line handoff contract operators should demand from any restaurant voice AI vendor.