Reference / per-call data shape / 102.36 s

24/7 phone monitoring for restaurants is a per-call record, not a tally

Most products that sell “24/7 phone monitoring” give a restaurant operator three numbers: calls in, calls answered, calls missed. That is a tally, not monitoring. A tally cannot tell you what a caller actually asked for at 19:42 on a Friday, whether the agent on the line heard them correctly, or which line items hit the kitchen printer afterward. Monitoring means a structured record written per call. The reference recording behind this page lives at /public/audio/dennys-order.mp3. The structured record it produced lives at /src/components/voice-activity-data.ts. What follows is what is in that record, and why nothing weaker than this counts.

Matthew Diakonov, Written with AI

Published April 28, 2026 · 9 min read

The instrumentation gap

A typical “phone monitoring” dashboard for a restaurant shows three counters and one chart. Calls in. Calls answered. Calls missed. A 24-hour timeline with little spikes at lunch and dinner. The owner glances at it once a day, sees the missed-call counter, and forms a vague impression that things are okay or not okay. Almost nothing about a real shift can be recovered from that screen.

The gap is not bandwidth, it is fidelity. The dashboard is counting events. To run a phone line, an operator needs to look at events. Per call, they need to know who called, what they asked for, in what words, whether the system understood, what modifier slots were filled, and what landed on the kitchen printer. None of that is visible from a counter, because none of that is in a counter.

The honest framing: counting calls is to monitoring what counting receipts is to running a restaurant. It tells you something happened. It does not tell you what.

The smallest useful per-call record

Five fields. None of them is optional, because each one compensates for what is missing in the others.

Channel separated audio. The caller on one channel, the agent on the other, recorded as stereo. Mono recordings drop the moment of overlap, which is the most common single point of failure on a restaurant call.
A per-channel amplitude envelope. Sampled at 60 Hz, smoothed with separate attack and release constants so a phrase ends visibly. Lets the operator see silence, speech, and overlap without listening to the audio.
Word level timestamps. Multichannel speech-to-text, grouped into captions on pause and punctuation boundaries (the current thresholds are 3.2 s max segment, 0.55 s pause gap, 75 character max line).
Ontology-mapped intent. The matched POS item ID, modifier slots filled, and free-form additions normalized to canonical modifier names. This is the single field that turns a transcript into something a restaurant can act on.
POS commit or transfer reason. Either the kitchen ticket fired (with total and prep time) or the call was handed to a manager with a reason attached (catering, complaint, novel item, payment failure).

Skip any one of those and the record stops being auditable. Drop channel separation and overlap is invisible. Drop the envelope and you have to listen to find a 9 second silence. Drop word timestamps and you cannot reconcile the captions with the kitchen ticket. Drop ontology mapping and you have a transcript that does not say which dish the caller asked for. Drop the POS commit and you cannot tell whether the order actually shipped.
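As a rough illustration of the five fields, here is a minimal Python sketch of what such a record could look like. The real record is a TypeScript file and its excerpt is not reproduced here, so every class and field name below is hypothetical, chosen only to mirror the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Word:
    text: str
    start: float   # seconds from call start
    end: float
    speaker: int   # channel index: 0 = customer, 1 = agent


@dataclass
class Intent:
    pos_item_id: str
    modifiers: Dict[str, str]  # slot name -> canonical modifier name


@dataclass
class Outcome:
    kind: str                              # "pos_commit" or "transfer"
    total: Optional[float] = None          # set when the ticket fired
    prep_minutes: Optional[int] = None
    transfer_reason: Optional[str] = None  # set when handed to a manager


@dataclass
class CallRecord:
    audio_path: str               # stereo recording, one voice per channel
    envelope_hz: int              # envelope sample rate
    envelope: List[List[float]]   # one amplitude series per channel
    words: List[Word]
    intents: List[Intent]
    outcome: Optional[Outcome]


def missing_fields(rec: CallRecord) -> List[str]:
    """Name the fields whose absence would make the record unauditable."""
    missing = []
    if not rec.audio_path:
        missing.append("audio")
    if not rec.envelope or any(len(ch) == 0 for ch in rec.envelope):
        missing.append("envelope")
    if not rec.words:
        missing.append("words")
    if not rec.intents:
        missing.append("intent")
    if rec.outcome is None:
        missing.append("outcome")
    return missing
```

A check like missing_fields is one way to enforce the "none of them is optional" rule at write time rather than discovering the hole during an audit.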

The data shape, as it exists in this repo

The reference call shipped with this site, used to render the live waveform on the home page, is the same record shape PieLine writes for production calls. It is a single TypeScript file. The excerpt below is real, with the envelope arrays elided for length.

src/components/voice-activity-data.ts

Every field on that record is observable from outside the call. The 60 Hz sample rate is a deliberate choice: high enough to show pauses and overlap, low enough that 102 seconds of stereo produces about 12 KB of envelope data per call. The speaker field is not a guess; it is the channel index from the original stereo WAV: customer is channel 0, agent is channel 1.
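The 12 KB figure can be sanity-checked with a few lines of arithmetic. The duration, sample rate, and channel count come from this page; the one-byte-per-sample storage cost is an assumption made here to show how the number could come out, not a statement about the actual encoding.

```python
DURATION_S = 102.36      # length of the reference call
ENVELOPE_HZ = 60         # envelope samples per second, per channel
CHANNELS = 2             # customer and agent
BYTES_PER_SAMPLE = 1     # assumption: roughly one byte per stored sample

samples_per_channel = round(DURATION_S * ENVELOPE_HZ)
total_samples = samples_per_channel * CHANNELS
size_kb = total_samples * BYTES_PER_SAMPLE / 1000

print(samples_per_channel, total_samples, size_kb)
```

Under that assumption the envelope lands a little over 12 KB. A wider encoding (for example, multi-digit decimal text) would scale the figure proportionally.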

The pipeline, end to end

Audio comes in on two channels. A single hub on PieLine's side does all of the work that follows: recording the stereo WAV, calling Deepgram multichannel for word level timestamps, walking the raw samples for the 60 Hz envelope, mapping the extracted intent against the restaurant's menu ontology, and committing the resulting ticket to the POS or routing to a human. The resulting artifacts are written as the call is happening, not after.
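The envelope step is the least familiar part of that pipeline, so here is a minimal sketch of one way to compute it: windowed RMS at 60 frames per second, followed by a one-pole smoother with separate attack and release constants. The function name and the 10 ms / 150 ms time constants are assumptions for illustration; the page does not state the values the build script uses.

```python
import math


def rms_envelope(samples, sample_rate, env_hz=60, attack_s=0.01, release_s=0.15):
    """Smoothed RMS envelope for one channel of float samples in [-1, 1].

    attack_s and release_s are illustrative values, not taken from the source.
    """
    hop = sample_rate // env_hz  # raw samples per envelope frame
    # One-pole coefficients: closer to 1.0 means slower movement.
    a_attack = math.exp(-1.0 / (attack_s * env_hz))
    a_release = math.exp(-1.0 / (release_s * env_hz))

    envelope = []
    level = 0.0
    for i in range(0, len(samples) - hop + 1, hop):
        window = samples[i:i + hop]
        rms = math.sqrt(sum(s * s for s in window) / hop)
        # Rise quickly when speech starts, decay slowly when it stops.
        coef = a_attack if rms > level else a_release
        level = coef * level + (1.0 - coef) * rms
        envelope.append(level)
    return envelope
```

Run once per channel, this yields the two series the record stores. The slower release is what makes the end of a phrase taper visibly instead of dropping to zero in a single frame.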

One call, one stream, four artifacts

There is no second “monitoring agent” on the line. The audio is already there because the agent needs it to hear the caller. The transcription is already there because the dialogue manager needs the words. The ontology lookup is already there because the POS commit needs an item ID. Keeping those artifacts and indexing them is what turns the call into a monitorable event.

What it looks like when the script runs

The build script that produced the reference record is committed to the repo at scripts/build-voice-activity-data.py. It is 118 lines of Python: HTTP POST to Deepgram, two passes over the wav for the envelope, group the words into captions on punctuation boundaries, write the TypeScript file. Running it on the Denny's reference call produces the log below. The same call into a production restaurant produces the same artifacts.
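The caption grouping step can be sketched as follows. The three thresholds are the ones quoted on this page; the exact rule order, the set of punctuation marks that close a caption, and the word dictionary keys are assumptions, since the script itself is not reproduced here.

```python
MAX_SEGMENT_S = 3.2   # longest caption, in seconds
PAUSE_GAP_S = 0.55    # silence that forces a new caption
MAX_CHARS = 75        # longest caption line, in characters


def group_captions(words):
    """Group one channel's time-ordered words into captions.

    Each word is a dict with 'word', 'start', and 'end' (hypothetical keys).
    """
    captions = []
    current = []

    def flush():
        if current:
            captions.append({
                "text": " ".join(w["word"] for w in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current.clear()

    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            too_long_pause = w["start"] - current[-1]["end"] > PAUSE_GAP_S
            too_long_segment = w["end"] - current[0]["start"] > MAX_SEGMENT_S
            too_many_chars = text_len > MAX_CHARS
            if too_long_pause or too_long_segment or too_many_chars:
                flush()
        current.append(w)
        # Assumption: sentence-final punctuation closes the caption.
        if w["word"].endswith((".", "?", "!")):
            flush()

    flush()
    return captions
```

Grouping per channel keeps the caller's captions and the agent's captions on separate tracks, which is what lets overlap show up as two captions with intersecting time ranges.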

scripts/build-voice-activity-data.py — reference call

90%+

“The experience was better than speaking to a human. No hold time, no confusion, no rushing.”

Customer feedback, calls handled end-to-end by AI at Idly Express, Almaden

Generic call tracking versus an answering-system record

A restaurant operator evaluating a “phone monitoring” product is usually being shown one of two things: a call tracking dashboard (CallRail, CallScaler, Imagicle, et al.) or an answering service that publishes the same dashboard with answer-rate added. Both leave most of the per-call record unwritten. The table below compares what each layer can see.

Feature	Generic call tracking layer	PieLine per-call record
What was said	Not captured. Carrier metadata only. Some products bolt on a transcript, decoupled from intent.	Word level transcript per channel, with caller and agent on separate tracks. Punctuated, smart formatted.
Whether it was understood	No signal. No way to tell if the menu mapping resolved or the call ended in 'sorry, can you repeat that?'	Ontology lookup result is in the same record: matched item, modifier slots filled, free-form additions normalized to canonical names.
What the kitchen got	Out of scope. The POS is a separate system and call tracking has no read into it.	POS commit (or transfer reason) on the same record. The captions and the kitchen ticket reconcile by timestamp.
Detecting silence and overlap	Total call duration only. A 60 second call with 40 seconds of dead air looks the same as a clean order.	Per channel amplitude envelope at 60 Hz. Dead air, talk-over, and abrupt drops are visible without playing the audio.
Cost model	Per number, per minute, or per call recording, with separate seat licensing for the analytics dashboard.	Same flat $350 per location per month covers 1,000 calls plus the full per call record. No analytics tier.
Coverage of the line itself	None. A monitoring product does not answer the phone; if the line goes to voicemail, the call is missed and the analytics tally a 'missed call.'	The same system answers the call. There is no missed-call bucket because there is no rollover.

The two columns are not graded on the same axis. Call tracking is a strong product for marketing attribution. It is not the right product for a restaurant phone line, because the marketing source of an inbound order is roughly irrelevant compared to what the caller asked for and whether the kitchen got it. The record that matters lives one layer below the carrier, in the system the caller is talking to.

The 24/7 part is non-trivial because the leak hides at 03:14

For an independent restaurant taking phone orders, the bucket that quietly bleeds revenue is not lunch. Lunch is loud and visible: missed calls during the rush get noticed. The bucket that bleeds is everything outside business hours and everything during shift changes. The 21:30 caller on a Sunday who hits voicemail and never calls back. The 14:45 caller during the between-shifts gap when one cashier left and the next has not clocked in. The 03:14 caller, which sounds absurd until you build a 24-hour phone product and find out how many of those there are on a delivery menu.

Continuous coverage means there is no “outside business hours” in the data. Every ring at every hour produces the same record, written to the same file, with the same captions and the same envelope. After-hours calls either become orders (the kitchen sees them in the queue when it opens) or transfers with a reason attached (caller asked for catering, novel item, etc.). The point is that they stop disappearing. The bucket stops leaking because every call that lands in it leaves a record.

What the operator does with it on a Monday morning

Three concrete uses, in order of frequency.

Audit a single call. A customer says they ordered a half-and-half and got a plain. The operator opens the record by phone number and timestamp, reads the captions, sees the modifier slots that were filled (and not filled), and knows within thirty seconds whether the caller said the wrong thing, the agent heard the wrong thing, or the kitchen made the wrong thing. All three are recoverable outcomes; no record means none of them are.

Find anomalous calls without listening. The envelope is a 60 Hz time series; long runs of near-zero on both channels mean dead air, which usually means an unhappy caller. The operator filters the day's calls by “mean dead-air over five seconds” and gets a short list to review. They never had to scrub through a 102 second recording to find the silent six seconds in the middle.
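A dead-air filter over the envelope can be very small. The sketch below finds stretches where both channels stay under a threshold for at least five seconds; the function name and the 0.02 amplitude threshold are assumptions, and a real threshold would need tuning against line noise.

```python
def dead_air_runs(customer_env, agent_env, env_hz=60, threshold=0.02, min_seconds=5.0):
    """Return (start_s, end_s) spans where both channels are near zero.

    threshold is an illustrative value, not one taken from the source.
    """
    runs = []
    run_start = None
    n = min(len(customer_env), len(agent_env))
    # Iterate one step past the end so a trailing silent run gets closed.
    for i in range(n + 1):
        silent = i < n and customer_env[i] < threshold and agent_env[i] < threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            if (i - run_start) / env_hz >= min_seconds:
                runs.append((run_start / env_hz, i / env_hz))
            run_start = None
    return runs
```

Filtering a day's calls then reduces to keeping the records for which this returns a non-empty list, and the returned spans tell the operator where in the recording to jump.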

Reconcile captions with kitchen tickets. Captions show the spoken item names. The POS commit shows the line items. They line up by timestamp. When they disagree (caller said one thing, ticket says another) the disagreement is the bug, and the bug is now a thing you can hand to the menu configurator instead of a vague “our AI sometimes mishears.”
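One hedged way to picture that reconciliation: for each ticket line item, look for a caption that mentions the item shortly before the item was added. The field names (text, end, name, added_at), the ten-second window, and the plain substring match are all assumptions made for this sketch; a production version would compare canonical item IDs rather than raw text.

```python
def unmatched_line_items(captions, line_items, window_s=10.0):
    """List ticket items that no caption mentioned within window_s beforehand.

    captions: dicts with 'text' and 'end' (seconds from call start).
    line_items: dicts with 'name' and 'added_at' (seconds from call start).
    """
    unmatched = []
    for item in line_items:
        heard = any(
            item["name"].lower() in cap["text"].lower()
            and 0 <= item["added_at"] - cap["end"] <= window_s
            for cap in captions
        )
        if not heard:
            unmatched.append(item["name"])
    return unmatched
```

Anything this returns is a candidate disagreement between what was said and what was ticketed, which is the list worth handing to whoever maintains the menu mapping.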

See the per-call record on your menu

Walk through one of your own restaurant's calls. We map your menu, run the call, and show the captions, envelope, and POS commit on the same screen.

Frequently asked questions

What does '24/7 phone monitoring' mean for a restaurant, in concrete terms?

Continuous coverage of the phone line, with a structured record written for every inbound call. Coverage means somebody (or something) answers every ring, every hour the line is open, with no rollovers to voicemail and no busy signals. Structured record means each call leaves behind data the operator can later read, query, and audit. The two things go together. Coverage without records is just a phone being answered. Records without coverage is just a tally of what got missed.

What is the smallest useful per-call record?

Five fields, observable on a real PieLine call. (1) Channel separated audio: the caller on one channel, the AI agent on the other, written as stereo so neither voice can drown the other out. (2) A per-channel amplitude envelope sampled at 60 Hz, so silence, speech, and overlap are each visible without playing the audio. (3) Word level timestamps from a multichannel speech-to-text pass. (4) The intent extracted from those words, mapped against the restaurant's actual menu (POS item ID, modifier slots, normalized modifier names). (5) The POS outcome: either a ticket fired with a total, or a transfer to a manager with a reason attached. The reference recording at /public/audio/dennys-order.mp3 carries every one of those.

Why is the answering system the right place to do the monitoring?

Because that is where the conversation actually exists. A bolt-on call analytics layer can read call duration, caller ID, and a missed-call signal from the phone carrier. It cannot tell you what the caller asked for, whether the AI heard it correctly, or which kitchen ticket the call produced. The answering system has all of that natively, because it is the thing the caller is talking to. PieLine emits the record as a byproduct, not a separate integration.

How does PieLine actually capture and structure each call?

Inbound audio is recorded as 16 bit stereo with the caller on the left channel and the AI on the right. The recording is sent to Deepgram at /v1/listen?model=nova-3&multichannel=true&punctuate=true&smart_format=true, which returns word level timestamps per channel. A grouping pass turns those words into captions on pause/punctuation/length boundaries (current thresholds: 3.2 s max segment, 0.55 s pause gap, 75 char max). A second pass walks the raw samples and computes a smoothed RMS envelope at 60 samples per second, per channel. The whole thing is written as a TypeScript record, the same shape used to render the live waveform on the home page. The script that does this is /scripts/build-voice-activity-data.py and the output is /src/components/voice-activity-data.ts.
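The transcription request can be sketched with the standard library alone. The path and query parameters are the ones quoted above. The host name and the Token authorization scheme are not stated on this page; they follow Deepgram's public documentation as understood here and should be verified before use. The function name is hypothetical, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Deepgram's public API host; not stated in this article.
DEEPGRAM_BASE = "https://api.deepgram.com"


def build_listen_request(wav_bytes, api_key):
    """Build (but do not send) the multichannel transcription request."""
    params = {
        "model": "nova-3",
        "multichannel": "true",   # transcribe each stereo channel separately
        "punctuate": "true",
        "smart_format": "true",
    }
    url = f"{DEEPGRAM_BASE}/v1/listen?{urlencode(params)}"
    headers = {
        # Assumption: token-style auth header, per Deepgram's docs.
        "Authorization": f"Token {api_key}",
        "Content-Type": "audio/wav",
    }
    return Request(url, data=wav_bytes, headers=headers, method="POST")
```

Passing the result to urllib.request.urlopen would perform the call. Keeping request construction separate from sending makes the parameters easy to inspect and test without a network connection or an API key.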

What does the operator actually do with this record?

Three concrete things. First, audit any single call by reading the captions: who said what, in what order, with how much pause. Second, find calls that look anomalous on the envelope (long silences, overlap, abrupt drops) and re-listen to a 5 second window without scrubbing through the whole recording. Third, reconcile what the kitchen got with what the caller said: the captions show the ordered item names, the POS commit shows the line items, and the timestamps line them up.

Is the monitoring really 24/7, or just 'business hours plus voicemail'?

24/7. The line answers at 03:14 the same way it answers at 19:00. The phrase 'after hours' has no special meaning in the data shape. After hours calls produce the same record as rush hour calls, with the same captions, the same envelope, the same POS hand-off (or transfer reason). For most independent restaurants, the after-hours bucket is where the leak hides: callers who tried at 21:30 on a Sunday, hit voicemail, and never called back. Continuous coverage closes that bucket and the record proves what was in it.

How does this compare to call tracking products like CallRail or CallScaler?

Call tracking is a layer on top of carrier metadata. It tells you the call existed, how long it lasted, and which marketing source it came from. It is excellent at marketing attribution and indifferent to what was actually said. Restaurant phone ordering is the opposite: the marketing source is irrelevant, the words and the resulting kitchen ticket are everything. PieLine's per call record is built around the words and the ticket. The two products solve different problems.

Does monitoring slow down or change the call?

No. The capture happens in line with the call the AI is already running. There is no separate monitoring agent, no human listening, no taps. The audio is already on the wire because the AI needs it to talk back. The transcription, ontology mapping, and POS commit are already happening because that is how the order gets to the kitchen. The 'monitoring system' is the same code path; the difference is that the artifacts (audio, captions, envelope, POS receipt) are kept and indexed.

What about privacy and call recording disclosure?

The reference call opens with the agent saying 'Hi, this is Denny on a recorded line.' Disclosure is part of the greeting and is a configurable line per restaurant during onboarding. Local recording laws vary; PieLine's onboarding sets the disclosure phrasing to match the restaurant's jurisdiction. Records are scoped to one restaurant's tenant and not commingled across customers.

How does pricing work for the monitoring side?

There is no separate price for monitoring. $350/month covers up to 1,000 calls per location, $0.50 per call after that. The per call record is part of the answering service, not an upsell. There is no 'analytics tier' to upgrade to in order to read transcripts.
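The pricing rule works out as below. The figures are the ones stated in this answer; the function name is made up for the example.

```python
def monthly_cost(calls, base=350.0, included=1000, overage=0.50):
    """Monthly cost per location: flat base plus per-call overage."""
    extra_calls = max(0, calls - included)
    return base + extra_calls * overage


for calls in (800, 1000, 1400):
    print(calls, monthly_cost(calls))
```

A location taking 1,400 calls in a month pays for 400 overage calls at $0.50 each on top of the $350 base.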

Walkthrough

AI Phone Calling Agent: A Forensic Walk Through One Recorded Restaurant Call

The same 102.36 second reference call, walked second by second, with the four moments that decide whether the order ships or comes back wrong.

Architecture

AI Phone Handles 20 Simultaneous Calls: The Throughput Math

Twenty concurrent slots produce 480 calls per hour. Why a tested ceiling beats an 'unlimited' claim with no published numbers.

Integration

AI Phone Ordering and POS Integration

How the agent's parsed order ends up as a line itemized ticket on the kitchen printer for Clover, Square, Toast, NCR Aloha, or Revel.

