Most “technology restaurant” lists tell you what to buy. The real line is whether you can grep last night's phone orders.

A restaurant runs on channels. The POS, the kitchen display, the drive-thru speaker, the phone. For twenty years the phone has been the last analog channel, a handset and a human, no matter how much software wrapped it. A restaurant becomes a technology restaurant the minute that channel starts shipping as a file you can open. This guide walks through exactly such a file: 76,438 bytes of TypeScript, 6,157 per-channel amplitude samples at 60Hz, 46 speaker-diarized captions for one 102.36-second customer order, all checked into the pieline-phones repo and imported by the production hero.

Matthew Diakonov
11 min read
4.9 from 200+ restaurants
One phone call, 76,438 bytes of typed TypeScript in public source
60Hz per-speaker amplitude, 16ms resolution on every word
Captions, envelopes, and regen script all openable without a login

The usual taxonomy is a shopping list. The real question is whether anything on the list is auditable.

Open any 2026 roundup about restaurant technology and you get the same categories in the same order: POS, kitchen display, online ordering, kiosks, QR menus and contactless pay, loyalty, reservations, delivery middleware, labor and scheduling, inventory, analytics. Each category gets a paragraph that names four vendors and claims a percentage improvement. The implicit thesis is that a restaurant with ten categories checked is more technology-forward than a restaurant with six.

That thesis leaks. A restaurant with a polished dashboard from a vendor that ships no public artifact is running a SaaS subscription, not a technology stack. A restaurant with one channel where every transaction lands in version-controlled source is running engineering infrastructure for the first time, even if the other nine categories are still paper.

The asymmetric breakthrough, for most restaurants, is the phone line. Phone orders still account for a large slice of revenue, yet PieLine's public copy says 30 to 40% of rush-hour calls go unanswered, and those that are answered are single-threaded on a cashier with a receiver. The phone is the channel where “technology restaurant” either materially changes the shape of the business or stays a marketing slogan. The rest of this guide is about what it looks like when the change has actually shipped.

The anchor file: one phone call, four fields, 76,438 bytes

Everything in this guide refers back to a single file: src/components/voice-activity-data.ts. Here is the shape of that file. The captions array has 46 entries across 102.36 seconds; the envelopes arrays have 6,157 entries each. The TypeScript types are the boundary between “we recorded a call” and “we have a typed operational artifact about a call.”

src/components/voice-activity-data.ts (excerpt, trimmed for reading)

What you can do with a file you cannot do with a phone call

A flat recording is a tape. A structured file is a surface other tooling can hook into. These are the operations a technology restaurant runs routinely, and that a phone-call-with-extra-steps restaurant cannot run at all.

Grep any word the customer said

Search 'strawberries' across every call this week to find off-menu modifier requests. Search 'allergy' to pull every call that needed an allergen answer. A PDF does not do this. A hosted dashboard charges you for it. A 76KB TypeScript file is one ripgrep away.
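
As a rough illustration of that search, here is a minimal Python sketch. It assumes each call's captions have already been loaded into a list of dicts with the speaker, start, end, and text fields described in this guide; the call IDs, the sample captions, and the "customer" / "ai" speaker labels are hypothetical.

```python
import re

def grep_captions(calls, pattern):
    """Return (call_id, start_seconds, speaker, text) for every caption matching pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for call_id, captions in calls.items():
        for cap in captions:
            if rx.search(cap["text"]):
                hits.append((call_id, cap["start"], cap["speaker"], cap["text"]))
    return hits

# Hypothetical sample data, not taken from the real file.
calls = {
    "call-001": [
        {"speaker": "ai", "start": 0.4, "end": 3.1, "text": "Thanks for calling. This call may be recorded."},
        {"speaker": "customer", "start": 3.6, "end": 6.2, "text": "Can I add strawberries to the pancakes?"},
    ],
    "call-002": [
        {"speaker": "customer", "start": 1.2, "end": 4.0, "text": "Is there a peanut allergy risk in that?"},
    ],
}

for hit in grep_captions(calls, r"strawberries|allergy"):
    print(hit)
```

On committed files, plain ripgrep does the same job; the function form is only useful once the hits need to feed a report or an eval.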

Diff speaker envelopes frame by frame

6,157 per-channel amplitude samples at 60Hz mean you can see overlap. If the AI started speaking at sample 3,850 and the customer was still above 0.1 amplitude, that is a barge-in. A waveform viewer can be pointed at the same array.
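
One way to turn that rule into a check is sketched below. It assumes the two envelopes are plain lists of floats in the 0..1 range; the function name, the onset rule (the AI channel crossing a threshold from below), and the default thresholds are illustrative assumptions, not the repo's code.

```python
def find_barge_ins(customer, ai, customer_floor=0.1, ai_onset=0.1):
    """Return sample indices where the AI channel starts up while the customer is still audible.

    An AI onset is a sample where the AI envelope crosses ai_onset from below.
    It counts as a barge-in if the customer envelope is above customer_floor at that sample.
    """
    hits = []
    for i in range(1, min(len(customer), len(ai))):
        ai_started = ai[i - 1] < ai_onset <= ai[i]
        if ai_started and customer[i] > customer_floor:
            hits.append(i)
    return hits

def sample_to_seconds(index, sample_rate=60):
    """Convert an envelope index to seconds at the given envelope sample rate."""
    return index / sample_rate

# Hypothetical envelopes: the AI starts at index 2 while the customer is still at 0.4.
customer = [0.5, 0.5, 0.4, 0.0, 0.0]
ai = [0.0, 0.0, 0.3, 0.3, 0.0]
print(find_barge_ins(customer, ai))
```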

Replay one turn at a time

The captions array is sorted by start time and tagged by speaker. You can render just the AI's 30 turns, or just the 16 customer turns, or only turns in the 60s to 80s window. The hero player on aiphoneordering.com does exactly this.
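
A sketch of that kind of filtering, assuming captions are dicts with speaker, start, end, and text. The function name and the speaker labels are hypothetical, and this is not the hero player's actual code.

```python
def select_turns(captions, speaker=None, window=None):
    """Filter captions by speaker and/or by overlap with a (start, end) window in seconds."""
    selected = []
    for cap in captions:
        if speaker is not None and cap["speaker"] != speaker:
            continue
        if window is not None:
            lo, hi = window
            # Keep the caption only if it overlaps the window at all.
            if cap["end"] < lo or cap["start"] > hi:
                continue
        selected.append(cap)
    return selected

# Hypothetical captions for illustration.
sample_captions = [
    {"speaker": "ai", "start": 0.4, "end": 3.0, "text": "Thanks for calling."},
    {"speaker": "customer", "start": 61.0, "end": 64.5, "text": "Two pancake combos, please."},
    {"speaker": "ai", "start": 65.0, "end": 68.2, "text": "Two pancake combos, got it."},
    {"speaker": "customer", "start": 85.0, "end": 86.0, "text": "That's all."},
]

print(len(select_turns(sample_captions, speaker="ai")))
print(len(select_turns(sample_captions, window=(60.0, 80.0))))
```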

Run regression evals against calls

Assert: captions[0].text starts with a recording disclosure within 2.5 seconds. Assert: the last AI turn contains the caller's first name if one was given. Every new model deploy runs these evals against a suite of real calls.
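
Those two assertions could be sketched as follows. The keyword match for "recording disclosure" and the "ai" speaker label are assumptions made for illustration; a real eval suite would likely use a stricter disclosure check.

```python
def check_disclosure(captions, deadline=2.5, keywords=("recorded", "recording")):
    """True if the first caption starts within the deadline and mentions recording."""
    first = captions[0]
    text = first["text"].lower()
    return first["start"] <= deadline and any(k in text for k in keywords)

def check_name_in_last_ai_turn(captions, caller_name):
    """True if the final AI turn contains the caller's first name."""
    ai_turns = [c for c in captions if c["speaker"] == "ai"]
    return bool(ai_turns) and caller_name.lower() in ai_turns[-1]["text"].lower()

# Hypothetical call used as an eval fixture.
fixture = [
    {"speaker": "ai", "start": 0.3, "end": 2.9, "text": "Thank you for calling. This call may be recorded."},
    {"speaker": "customer", "start": 3.4, "end": 6.0, "text": "Hi, it's Dana, I'd like to place an order."},
    {"speaker": "ai", "start": 95.0, "end": 98.0, "text": "You're all set, Dana. See you soon."},
]

assert check_disclosure(fixture)
assert check_name_in_last_ai_turn(fixture, "Dana")
```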

Type-check the contract

Caption and VoiceData are exported TypeScript types. If the transcription pipeline ever emits a caption without a speaker tag, the build breaks before the website ships. The phone channel has a typed contract.

Regenerate from a plain-text script

scripts/build-voice-activity-data.py is 117 lines of Python against three dependencies. Any engineer can read the pipeline end-to-end in five minutes and re-run it on a different WAV. No black-box vendor blob in the middle.

How the file actually gets made

The whole pipeline is four named steps. An owner does not run it. An engineer evaluating the vendor reads it once and is done. This is what “auditable” looks like in source form.

From a phone line to a grep-able artifact

Pipeline stages: Stereo WAV, Deepgram nova-3, RMS envelope, Caption grouper
Output: voice-activity-data.ts
Consumers: Hero visualization, Type-checked contract, Grep surface, Eval suite input

The four steps inside scripts/build-voice-activity-data.py

1. POST stereo WAV to Deepgram multichannel

The script reads the WAV file as raw bytes and posts to https://api.deepgram.com/v1/listen with model=nova-3, multichannel=true, punctuate=true, smart_format=true. Deepgram returns two 'channels' in its JSON response, each with its own word-level timestamps.
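
A minimal sketch of how such a request could be assembled with the standard library. The endpoint and query parameters come from the description above; the function name and the placeholder API key are hypothetical, and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def build_listen_request(wav_bytes, api_key):
    """Build (but do not send) a POST of raw WAV bytes to the multichannel endpoint."""
    params = urllib.parse.urlencode({
        "model": "nova-3",
        "multichannel": "true",
        "punctuate": "true",
        "smart_format": "true",
    })
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{params}",
        data=wav_bytes,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",  # placeholder key, supply your own
            "Content-Type": "audio/wav",
        },
    )

req = build_listen_request(b"RIFF....WAVE", api_key="test-key")
print(req.full_url)
# Sending it would be: json.load(urllib.request.urlopen(req))
```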

2. Group per-channel words into caption segments

group_into_captions walks the word list and closes the current caption on any of three boundaries: a pause gap greater than 0.55 seconds, the current text exceeding 75 characters, or the previous word ending in a sentence-terminator. The two per-channel caption lists are then merged and sorted by start time.
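
The three boundaries can be sketched like this. It is a reconstruction from the description above, not the script's actual source; the word dicts (word, start, end), the speaker labels, and the merge_channels helper are assumptions.

```python
SENTENCE_END = (".", "?", "!")

def group_into_captions(words, speaker, max_gap=0.55, max_chars=75):
    """Group word-level timestamps into captions, closing on a pause, a length cap, or end punctuation."""
    captions = []
    current = None
    for w in words:
        if current is not None:
            gap = w["start"] - current["end"]
            if (gap > max_gap
                    or len(current["text"]) > max_chars
                    or current["text"].endswith(SENTENCE_END)):
                captions.append(current)
                current = None
        if current is None:
            current = {"speaker": speaker, "start": w["start"], "end": w["end"], "text": w["word"]}
        else:
            current["text"] += " " + w["word"]
            current["end"] = w["end"]
    if current is not None:
        captions.append(current)
    return captions

def merge_channels(*caption_lists):
    """Merge per-channel caption lists into one list sorted by start time."""
    return sorted((c for lst in caption_lists for c in lst), key=lambda c: c["start"])

# Hypothetical word timestamps for one channel.
words = [
    {"word": "Hi", "start": 0.0, "end": 0.2},
    {"word": "there.", "start": 0.25, "end": 0.5},
    {"word": "Two", "start": 0.6, "end": 0.8},
    {"word": "pancakes", "start": 1.6, "end": 2.0},
]
grouped = group_into_captions(words, speaker="customer")
print([c["text"] for c in grouped])
```

In this sample the first caption closes on the sentence terminator and the second on the 0.8-second pause, so three captions come out.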

3. Compute per-channel RMS envelopes at 60Hz

The WAV is read into two int16 arrays (left and right). Each is binned into windows of sr/60 samples; the RMS of each window becomes one envelope value. The output is normalized to 0..1 and smoothed with a fast-attack (0.5) slow-release (0.12) lowpass so the bar animation does not chatter.
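
A sketch of that envelope computation for one channel, in pure Python. The one-pole form of the attack/release smoother, the integer window size, and the normalize-then-smooth order are assumptions; the source only gives the coefficients and the 60Hz binning.

```python
import math

def rms_envelope(samples, sample_rate, fps=60, attack=0.5, release=0.12):
    """Bin samples into sample_rate // fps windows, take RMS, normalize to 0..1, then smooth."""
    window = sample_rate // fps
    raw = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        raw.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    peak = max(raw, default=0.0) or 1.0  # avoid dividing by zero on silence
    envelope = []
    prev = 0.0
    for value in raw:
        value /= peak
        # Rise quickly toward louder values, fall slowly toward quieter ones.
        coeff = attack if value > prev else release
        prev += coeff * (value - prev)
        envelope.append(prev)
    return envelope

# Hypothetical 48 kHz channel: one silent window, one loud window, one silent window.
channel = [0] * 800 + [1000] * 800 + [0] * 800
print(rms_envelope(channel, sample_rate=48000))
```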

4. Emit one TypeScript literal

The script writes src/components/voice-activity-data.ts as a single json.dumps-serialized literal assigned to const voiceData: VoiceData. Four fields, four types. The file is committed alongside the MP3 so the whole call, audio plus annotations, is reproducible from source.
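
The emit step might look roughly like the sketch below. The export keyword, the rounding precision, and the "customer" / "ai" keys inside envelopes are assumptions; the Caption and VoiceData type declarations are taken to live elsewhere in the module and are not reproduced here.

```python
import json

def render_ts_literal(voice_data, precision=3):
    """Serialize the data with json.dumps and wrap it in a typed TypeScript const."""
    def _round(value):
        if isinstance(value, float):
            return round(value, precision)
        if isinstance(value, list):
            return [_round(v) for v in value]
        if isinstance(value, dict):
            return {k: _round(v) for k, v in value.items()}
        return value

    literal = json.dumps(_round(voice_data))
    return f"export const voiceData: VoiceData = {literal};\n"

# Hypothetical miniature payload with the four fields named in this guide.
sample = {
    "duration": 102.36,
    "sampleRate": 60,
    "envelopes": {"customer": [0.0, 0.125], "ai": [0.5, 0.25]},
    "captions": [{"speaker": "ai", "start": 0.4, "end": 3.0, "text": "Thanks for calling."}],
}
print(render_ts_literal(sample))
```

A JSON object literal is also a valid TypeScript expression, which is why a plain json.dumps is enough to produce a type-checkable module body.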

The actual Python, 20 lines of it

The pipeline is short enough to print on one screen. If a vendor cannot show you something this short for the phone channel they sell, they are not selling a technology restaurant. They are selling a call service with better margins.

scripts/build-voice-activity-data.py (core pipeline)

The numbers inside the file

Every number here is from parsing the actual file on disk. Run `wc -c src/components/voice-activity-data.ts` to check the byte count. Run `python -c "import json, re; ..."` to count captions and envelope samples.

Call duration: 102.36 s
Captions, speaker-tagged: 46
Amplitude samples per channel: 6,157
File size on disk: 76,438 B

Resolution of the amplitude data

16.67 ms

One sample every 16.67 milliseconds, per speaker. That is roughly the length of a consonant. It is the resolution needed to see a barge-in or a short “uh-huh” without flattening it into noise. Most call-center visualizers run at 5 to 10Hz, which is 6 to 12 times coarser.

Floats it takes to represent both sides of the call

12,314

6,157 for the customer channel plus 6,157 for the AI channel. A full phone call, both speakers, fully visible, in roughly the space a small PNG image takes. Add the MP3 and one call is about 1.3 MB; at 500 calls a week that is around 34 GB per year, one cheap S3 bucket.

What it looks like on disk

Here is the shell session any engineer would run the first time they want to verify that a vendor's phone channel ships as an auditable artifact. Two commands. Two numbers that match the marketing.

verify the artifact, inside a clone of mediar-ai/pieline-phones

The scorecard: four yes/no checks per vendor per channel

This is the checklist to put in front of every restaurant technology vendor. It is channel-specific (apply it once for the phone line, once for the POS webhook, once for the KDS event feed, once for the delivery integration). A four out of four is what a technology restaurant owns. A zero is a hosted SaaS with a phone number.

Score a technology restaurant vendor, per channel

  • Can you, as a customer, obtain a per-transaction artifact (per-call file, per-ticket JSON, per-order payload) after the event, without calling support?
  • Is the artifact typed or schema-ed, so a build breaks if the vendor changes the shape?
  • Can you search every artifact across a time window with plain tools (grep, jq, a SQL query), rather than the vendor's dashboard UI?
  • Is the code that produces the artifact either open, documented, or reconstructible from a short script a competent engineer can read end-to-end in 10 minutes?
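
For anyone who wants to keep the results in a spreadsheet-like form, the checklist reduces to four booleans per vendor per channel. The sketch below is only an illustration; the class and field names are made up for this example.

```python
from dataclasses import dataclass, fields

@dataclass
class ChannelScore:
    """One vendor, one channel, four yes/no answers from the checklist above."""
    artifact_obtainable: bool          # per-transaction artifact without calling support
    artifact_typed: bool               # typed or schema-ed, build breaks on shape change
    searchable_with_plain_tools: bool  # grep, jq, or SQL across a time window
    pipeline_readable: bool            # open, documented, or reconstructible from a short script

    def total(self) -> int:
        return sum(bool(getattr(self, f.name)) for f in fields(self))

# Hypothetical answers for a phone channel.
phone = ChannelScore(True, True, True, True)
print(phone.total())
```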

What a technology restaurant phone channel looks like, versus what a SaaS phone channel looks like

Both record calls. Both claim analytics. Only one produces an artifact you can open.

Feature | Typical phone-service vendor | PieLine, per the public repo
Per-call audio | Downloadable on request, expires after 30 days | Committed at public/audio/dennys-order.mp3
Speaker-diarized transcript | Merged single transcript in a UI modal | 46-turn captions array, speaker-tagged, in source control
Per-speaker amplitude | Not exposed | Two float arrays, 6,157 samples each, 60Hz
TypeScript types on the artifact | None, it is a rendered dashboard | Caption and VoiceData exported from the module
Pipeline that produces the artifact | Proprietary, not documented, not reproducible | 117-line Python script at scripts/build-voice-activity-data.py
Grep across calls | Dashboard search, UI-paginated | Any shell command over committed files
Regression evals on new model deploys | Not feasible, no artifact to assert against | Run against a suite of past caption files
Regenerability | Whatever the vendor rolls out next | Re-run build script on any WAV, get the same shape

The comparison is against the common 'phone answering service' pattern, not against any specific named competitor. The point is not that every phone vendor is opaque. It is that a restaurant evaluating 'technology restaurant' vendors should require the right-hand column for every channel, not just sometimes and not just on request.

Why this matters more for the phone than for any other channel

The POS already has a webhook spec. The KDS already has an order object. Online ordering already has a cart payload. The phone is the one channel where most restaurants still have no artifact at all, which is the reason the bar for “technology restaurant” is set by whether that gap has been closed.

A restaurant that has a best-in-class POS but a cashier on a receiver has a technology back-office and an analog customer surface. A restaurant that has a mediocre POS and a phone line whose every call lives in source control has the opposite. A caller whose order was wrong can be looked up by the exact word they said; a cashier's memory cannot.

The phone channel is also the channel where the evidence-to-cost ratio is best. A POS deployment can already run to six figures, so replacing it to earn a little more observability is not reasonable. A phone line costs a salary today, and PieLine's published rate is $350 a month for up to 1,000 calls. The artifact comes bundled. The payback period is weeks, not quarters.

This is the axis on which the label “technology restaurant” actually does work. A restaurant owner who wants to deserve the label by the end of a quarter should spend the quarter on the phone channel and leave the other nine categories alone. The phone is where the biggest artifact gap lives. It is also the one where closing the gap is cheap.

11 locations

Mylapore, an 11-location South Indian chain in the Bay Area, is rolling out PieLine across every restaurant. More than 90% of calls are handled end-to-end by the AI, with tickets posted directly to the POS. The caption files that back those calls are the same shape as the one this page is built around.

aiphoneordering.com/llms.txt, April 2026

voice-activity-data.ts · 46 captions · 6,157 samples per channel · 60Hz · Deepgram multichannel · 117-line pipeline · stereo diarization · Clover · Square · Toast · NCR Aloha · Revel · type-safe Caption[] · grep over calls

Run the four-point scorecard against a live PieLine phone line

We will forward a test number to your menu, take a real order against your POS, and hand you both the MP3 and the voice-activity-data.ts style caption file for the call. You will have four numbers to audit before the demo ends.

Book a 15-minute demo

Turn your phone line into an artifact you can open

15 minutes, a real order against your menu, and the typed caption file plus MP3 for the call so you can audit the bar for 'technology restaurant' yourself.

Frequently asked questions

What makes a restaurant a 'technology restaurant' in 2026?

The useful test is not the number of vendor logos in the back office. It is whether the channels the restaurant actually runs are stored as observable, diffable, version-controlled artifacts. A POS with a hosted dashboard is not the same as a POS with a public webhook schema. A kitchen display is not the same as a kitchen display that emits a JSON event per ticket. A phone line is not the same as a phone line where every call produces a speaker-diarized caption file checked into source control. A restaurant becomes a technology restaurant when the answer to 'can you pull up what the customer said at 1:14am on Tuesday' is yes, grep it.

Where is the 76KB file this guide keeps pointing at?

It lives at src/components/voice-activity-data.ts in the public pieline-phones repository. It exports a typed VoiceData object with four fields: duration (102.36 seconds), sampleRate (60 samples per second), envelopes (two float arrays, one per speaker, 6,157 values each), and captions (46 speaker-tagged entries with start, end, and text). The file is 76,438 bytes on disk. It is auto-generated by scripts/build-voice-activity-data.py, which posts the stereo WAV to Deepgram's multichannel endpoint, bins the audio into 60Hz RMS envelopes, groups Deepgram's word timestamps into captions on pause and punctuation boundaries, and writes a single TypeScript literal. The hero visualization on aiphoneordering.com imports this file directly.

How does 6,157 amplitude samples per channel at 60Hz translate to a phone call?

60Hz means a sample every 16.67 milliseconds. That is roughly the length of a consonant. For 102.36 seconds of audio that gives ceil(102.36 * 60) = 6,142 theoretical samples; the actual file ships 6,157 per channel (the build script pads the last window). Multiplied by two channels (customer on the left, AI on the right), the whole call is 12,314 floating-point numbers. At 16ms resolution you can see a caller's 'uh-huh' that a 10Hz visualizer would flatten into noise, and you can tell whether either speaker started talking while the other was still finishing a word (overlap). This is the resolution needed to audit politeness on a phone call rather than just transcribe it.
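
The arithmetic in this answer can be checked in a few lines. The constants are the figures quoted in this guide; the function names are only for illustration.

```python
import math

SAMPLE_RATE_HZ = 60
DURATION_S = 102.36
SHIPPED_SAMPLES_PER_CHANNEL = 6157

def sample_period_ms(rate_hz):
    """Milliseconds between consecutive envelope samples."""
    return 1000 / rate_hz

def expected_samples(duration_s, rate_hz):
    """Number of windows needed to cover the duration, rounding the last one up."""
    return math.ceil(duration_s * rate_hz)

print(round(sample_period_ms(SAMPLE_RATE_HZ), 2))       # period in ms
print(expected_samples(DURATION_S, SAMPLE_RATE_HZ))     # theoretical samples per channel
print(2 * SHIPPED_SAMPLES_PER_CHANNEL)                  # floats across both channels
print(SAMPLE_RATE_HZ / 10, SAMPLE_RATE_HZ / 5)          # how much coarser 10Hz and 5Hz are
```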

Why does diarization matter for a restaurant phone order?

Because the two interesting questions about a phone order are asymmetric. 'What did the customer ask for' is a customer-channel question. 'What did the AI promise' is an AI-channel question. Merging them into a single transcript loses the handoff structure. PieLine's recording is stereo by construction: the customer line is channel 0, the AI voice is channel 1. Deepgram's multichannel mode transcribes them independently and emits word timestamps per channel. The build script keeps them apart all the way to the TypeScript file, so the 46-caption array is sorted by time but still tagged with speaker. That is what lets the landing-page visualization render two independent bar graphs that pulse in sync with the actual person talking.

What can a restaurant operator do with the captions array that they cannot do with a call recording?

Three things. First, grep. You can search 'strawberries' across a thousand calls and find every caller who added an off-menu modifier, without listening to anything. Second, diff. When you change the menu, you can diff the captions from before and after and see whether callers stop saying 'wait, do you still have X.' Third, regression. You can run an eval suite against the caption file, asserting that the AI gave a legal recording disclosure in the first 2.5 seconds and addressed the caller by name in its last turn. None of that is possible with a flat MP3. All of it is possible with a 76KB typed file.

How is this any different from restaurant analytics dashboards that already exist?

Analytics dashboards aggregate. They tell you calls-per-hour, peak-minute counts, average handle time. They do not tell you what one customer said. A technology restaurant has both. The dashboards belong to the operator and live behind a login. The per-call artifacts belong to the engineering surface and live in a repo. The caption file in pieline-phones is the engineering surface for one call. A restaurant running at scale would have one such file per call, stored as a blob with a repo-style manifest. A restaurant running a hosted phone line at a conventional answering service has no such file and no surface it could live on. The presence of this file, in any form, is what moves a restaurant across the line.

What is the build script and why does it matter that it is public?

scripts/build-voice-activity-data.py is 117 lines of Python. It takes a stereo WAV as an argument, posts it to Deepgram nova-3 with multichannel, punctuate, and smart-format enabled, reads the two resulting channels, computes per-channel RMS envelopes at 60Hz with a 0.5 attack / 0.12 release smoother, groups the word timestamps into captions on pause boundaries (>0.55s gap), sentence boundaries (ending punctuation), or a 75-character cap, and writes the result to src/components/voice-activity-data.ts as a single TypeScript literal. The fact that it is 117 lines and uses three dependencies (wave, struct, urllib) is the whole point. The artifact is not a proprietary binary from a vendor. It is a plain-text pipeline a competent engineer could re-implement in an afternoon against a different transcription service. That is what makes the output auditable instead of marketed.

Does this mean a restaurant needs to hire engineers to become a 'technology restaurant'?

No. It means the vendor the restaurant hires has to ship artifacts the restaurant's engineers (or auditors, or a curious owner) can open. PieLine's onboarding team handles menu scraping, POS mapping, and configuration. The restaurant owner never runs the build script. But the fact that the build script exists, produces a checked-in file, and imports into the production site means the restaurant's phone channel behaves like infrastructure, not like a call center. The operator still spends zero time in the source tree. The difference is that the system is auditable if they ever want to look.

What does this scorecard deliberately not cover?

Four axes. Uptime, because it is a floor, not a differentiator. Category breadth, because a vendor that ships one well-observable channel beats a vendor that ships four opaque ones. Price per call, because cost lives on the P&L and this scorecard lives on the engineering review. UI polish, because a polished dashboard over an unauditable backend is the exact failure mode this scorecard is designed to expose. A restaurant buyer should still care about all four, but on a different page.

How would I evaluate other restaurant vendors against the same bar?

Ask each vendor three questions. One, where does your recording of a phone call live after the call ends? If the answer is a hosted dashboard with no export, downgrade. Two, do you ship a speaker-diarized transcript with per-turn timestamps and can I get it as a file? If the answer is 'we can export a PDF,' downgrade. Three, is the code that generates that transcript readable anywhere, either open-source, documented, or regenerable from a script I could run myself? If the answer is 'it's proprietary,' downgrade. The vendor that answers yes, yes, yes is the vendor shipping a technology restaurant. The one that answers no, no, no is shipping a phone service.

What is the actual footprint of one phone call in source control?

76,438 bytes of TypeScript plus 1,229,949 bytes of MP3 comes to roughly 1.3 megabytes for one 102-second call. The MP3 is the committed audio of the call; the caption file is built from the stereo WAV of the same call. The TypeScript file is the observable form. Together they take about the same space as a photo your phone took last weekend. Scaled up to a busy restaurant doing 500 phone orders a week, one year of calls is around 34 GB. That is a single cheap S3 bucket. The cost of being grep-able is trivial; the cost of not being grep-able shows up the first time a manager is asked what the customer who complained yesterday actually said.
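
The storage estimate works out as follows, using the byte counts quoted above, 52 weeks per year, and decimal gigabytes (both of the latter are assumptions about how the 34 GB figure was rounded).

```python
TS_BYTES = 76_438
MP3_BYTES = 1_229_949
CALLS_PER_WEEK = 500
WEEKS_PER_YEAR = 52

def per_call_bytes():
    """Footprint of one call: annotations plus audio."""
    return TS_BYTES + MP3_BYTES

def yearly_gigabytes(calls_per_week=CALLS_PER_WEEK):
    """Yearly footprint in decimal gigabytes (1 GB = 10**9 bytes)."""
    return per_call_bytes() * calls_per_week * WEEKS_PER_YEAR / 10**9

print(per_call_bytes())              # bytes per call
print(round(yearly_gigabytes(), 1))  # GB per year at 500 calls a week
```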

A restaurant earns the label when the phone channel becomes a file

76,438 bytes of TypeScript. 6,157 samples per channel at 60Hz. 46 speaker-tagged captions. One 102.36 second order. Clone the repo and grep it yourself, or have us stand up the same artifact on your own number.

Book a demo
PieLine: AI Phone Ordering for Restaurants
© 2026 PieLine. All rights reserved.
