Voice AI engineering

The voice AI acoustic noise bottleneck is mostly a pipeline shape problem, not a denoiser problem

Every other guide on this topic points at the wrong thing. The AI is not standing in a noisy kitchen. The bottleneck is the caller's room, the PSTN narrowband codec, and a single mono mix that forces every uncancelled echo and the AI's own voice to compete with the caller's words inside one STT model. The structural fix is stereo channel separation at the telephony hand-off, and the proof is sitting in PieLine's public repo.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-20)

Noise breaks voice AI because most pipelines mix the caller and the AI onto a single mono stream before transcription. Acoustic echo of the AI's own voice, G.711 codec quantization at 8 kHz, and room noise all compete with the caller's actual words inside the same speech-to-text model. The structural fix is stereo channel separation at the telephony hand-off (see Deepgram multichannel), not bigger denoise models on a mono mix. The remaining bottleneck is caller-side room noise plus the PSTN codec itself, and the right fight there is menu-biased vocabulary and read-back confirmations in the dialogue layer, not low-level signal processing.

The bottleneck the popular framing misses

Pick up any of the existing articles on this topic. They all frame the problem the same way: restaurants are noisy, restaurant voice AI struggles with noise, therefore the engineering challenge is to build a better denoiser. The framing is wrong in two places.

First, the AI is not in the restaurant. It is a process running in a data center on whatever audio the telephony stack hands it. The kitchen hood fan two feet from the staff phone is not the AI's problem. It might be the staff's problem when they answer the phone the old way, which is precisely why an AI phone agent removes that variable. The acoustic noise the AI actually has to handle arrives over the phone line from the caller's side.

Second, the dominant source of noise inside the STT loop is not the caller's room at all. It is the pipeline shape. In a poorly architected phone-AI system the caller's microphone, the AI's synthesized voice (echoed back through hybrid coupling on the line), and the codec quantization noise of G.711 are summed into a single mono stream before the speech-to-text model runs. Every component of that mix degrades the STT confidence on the caller's actual words. The model has to implicitly source-separate three audio signals in real time, on narrowband telephony audio, while staying under a 300 ms response budget. That is a hard problem to throw a denoiser at because the noise is not a separate background, it is the AI itself.

The fix is not a better model on a worse input. It is a better input.

The one-line pipeline change that eliminates a noise category

Inside the PieLine repository, the script that turns a recorded phone call into the per-call data artifact does one structural thing that most voice-AI pipelines do not. It posts the audio as stereo with multichannel transcription enabled. The caller occupies channel 0. The AI's text-to-speech output occupies channel 1. The STT model sees them independently. Whatever escapes the echo canceller on the caller side rides only the caller channel. The AI's own voice rides only the AI channel. Cross-channel bleed is bounded by whatever isolation the telephony layer delivers, which is decoupled from the STT problem entirely.

The script is scripts/build-voice-activity-data.py, 117 lines in full. The Deepgram URL parameter multichannel=true is what changes the pipeline shape. Everything downstream (per-channel envelope, speaker-tagged captions, the visualization on the homepage) inherits the separation from that one decision.
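For readers who want to see the shape of that request, here is a minimal sketch, not the repository's script. It assumes the endpoint and parameters named on this page (api.deepgram.com/v1/listen, model=nova-3, multichannel=true) and Deepgram's documented per-channel response layout. The function names, the speaker labels, and the use of the standard library instead of an SDK are illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"

# Assumed mapping, following the article: caller on channel 0, AI TTS on channel 1.
SPEAKER_BY_CHANNEL = {0: "customer", 1: "ai"}


def build_listen_url(model="nova-3"):
    """Build the request URL with multichannel transcription enabled."""
    query = urllib.parse.urlencode({"model": model, "multichannel": "true"})
    return f"{DEEPGRAM_LISTEN_URL}?{query}"


def transcribe_stereo(wav_path, api_key):
    """POST a stereo WAV file and return the decoded JSON response."""
    with open(wav_path, "rb") as wav_file:
        body = wav_file.read()
    request = urllib.request.Request(
        build_listen_url(),
        data=body,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def speaker_tagged_words(response):
    """Flatten per-channel words into one timeline, tagged by channel."""
    tagged = []
    for index, channel in enumerate(response["results"]["channels"]):
        speaker = SPEAKER_BY_CHANNEL.get(index, f"channel-{index}")
        for word in channel["alternatives"][0]["words"]:
            tagged.append(
                {
                    "speaker": speaker,
                    "start": word["start"],
                    "end": word["end"],
                    "word": word["word"],
                }
            )
    return sorted(tagged, key=lambda item: item["start"])
```

The speaker tag here comes from the channel index alone. No diarization heuristic is involved, which is the point of the structural choice.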

What measured isolation looks like in the data

The data artifact this script produces is sitting in the same repository at src/components/voice-activity-data.ts. It is 75,855 bytes, exports a typed VoiceData literal, and contains a smoothed RMS envelope per channel at 60 Hz alongside the 46 speaker-tagged captions for one 102.36 second restaurant order. The envelope is the part that proves the isolation claim. You can verify it yourself in a few lines of Python:

verify-channel-isolation.py

Read those numbers carefully. During the 2,941 envelope samples where the AI is speaking (that is 49 seconds of AI dialogue, sampled at 60 Hz), the customer channel never crosses 0.001 in normalized amplitude. The maximum value across those 2,941 samples is 0.0004, which is roughly -67 dB relative to the loudest customer sample anywhere in the recording. That is a noise floor a thousand times quieter than the kind of bleed an STT model would interpret as a competing voice. Across all 6,157 simultaneous envelope samples spanning the full call, the two channels never exceed 0.05 at the same moment. The two speakers exist in two acoustic universes that happen to share a timeline.

None of this is a model claim. It is a measurement of the audio file the build script produced, against the captions it produced, from the same script you can read in the repo. The structural choice forced the data into this shape.
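A check of this kind can be sketched as follows. This is an illustrative reconstruction, not the page's original verification script. It assumes you have already extracted the two envelopes as lists of floats and the AI caption intervals as (start, end) pairs in seconds. The function name, argument names, and return keys are hypothetical; the 60 Hz rate and the 0.05 threshold come from the text above.

```python
import math


def isolation_report(customer, ai, ai_intervals, rate_hz=60, overlap_threshold=0.05):
    """Measure customer-channel bleed while the AI speaks, plus simultaneous activity."""
    n = min(len(customer), len(ai))

    # Collect envelope indices that fall inside any AI caption interval.
    speaking = set()
    for start, end in ai_intervals:
        first = max(0, int(start * rate_hz))
        last = min(n, math.ceil(end * rate_hz))
        speaking.update(range(first, last))

    max_bleed = max((customer[i] for i in speaking), default=0.0)
    peak = max(customer[:n], default=0.0)

    # Bleed expressed in dB relative to the loudest customer sample.
    if max_bleed > 0 and peak > 0:
        bleed_db = 20 * math.log10(max_bleed / peak)
    else:
        bleed_db = float("-inf")

    # Samples where both channels are above the threshold at the same time.
    overlap = sum(
        1
        for i in range(n)
        if customer[i] > overlap_threshold and ai[i] > overlap_threshold
    )

    return {
        "samples_ai_speaking": len(speaking),
        "max_customer_while_ai_speaks": max_bleed,
        "bleed_db": bleed_db,
        "simultaneous_samples": overlap,
    }
```

With a customer peak normalized to 1.0, a bleed of 0.0004 works out to 20 · log10(0.0004), which is about -68 dB. The exact figure depends on the unrounded values in the file.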

How the noise budget actually compounds

Here is the shape of the problem with and without the pipeline change. The diagram below is not what an STT model sees; it is what it has to compete with.

What the STT model has to handle

  1. Mono single-channel pipeline. Caller voice + AI voice + room noise + codec artifacts + echo all sum into one stream. STT model must separate them implicitly. Failures are confidently wrong.
  2. Stereo at the telephony hand-off. Caller voice routed to channel 0. AI TTS output routed to channel 1. STT runs per-channel. Caller channel never sees the AI's voice as competing signal.
  3. Caller-side noise remaining. Room noise on the caller's end still rides their channel, the same way it would on any phone call. This is the residual bottleneck. PSTN codec narrowbanding is structurally fixed.
  4. Dialogue layer catches the rest. Menu-biased vocabulary plus a read-back at the end of every order line gives the caller a second chance to correct a misheard modifier before the kitchen ticket fires.

The takeaway is that stereo at the telephony hand-off does not solve the whole acoustic noise bottleneck. It solves one specific category (AI voice contaminating the caller channel) cheaply and structurally, and it does so before the STT model runs. The residual bottleneck is the caller's room, which any phone call inherits, plus the PSTN codec, which is fixed. Those are real, and the right fights for them happen higher in the stack.

Where noise still bites, and how to actually fight it

Three places the bottleneck reasserts itself even with stereo channels.

PSTN narrowband codec. Standard telephony delivers 8 kHz mono mu-law. Everything above 3,400 Hz is gone before your stack sees the audio. Fricatives like "s" and "sh," whose energy is mostly above 4 kHz, are structurally degraded. You cannot denoise this back in. What you can do is bias the STT vocabulary toward the restaurant's actual menu, so "sourdough" resolves cleanly against the menu prior even when the high-frequency consonants are blurred. PieLine does this at onboarding by scraping the menu and mapping every dish to a POS item ID, which doubles as a vocabulary bias for the STT model.

Caller-side room noise. If the caller is calling from a sports bar at 9 PM, you have a noisy caller channel and no structural way around it. The mitigation is in the dialogue layer, not in the audio layer. After every order line, read it back to the caller before the ticket fires. A misheard "no onions" that became "more onions" gets a second chance to be corrected by the caller, who knows exactly what they meant. The dialogue layer turns a one-shot STT decision into a two-shot one.

Latency budget. Voice AI for a phone call has to respond fast enough to feel conversational. That budget caps the amount of pre-processing you can do without sounding sluggish. Heavy noise suppression introduces latency. Stereo separation does not, because it happens at the telephony layer before any model runs. That is a second reason the structural fix beats the algorithmic one: it does not eat your response budget.
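The menu-prior idea from the codec paragraph can be illustrated with a small sketch. To be clear about the swap: this is not how PieLine biases the STT model, which this page does not detail. It is a post-transcription stand-in that snaps a blurred phrase to the closest menu item using the standard library's difflib. The function name, the example menu, and the 0.6 cutoff are assumptions.

```python
import difflib


def snap_to_menu(heard, menu_items, cutoff=0.6):
    """Return the menu item closest to the transcribed phrase, or None."""
    # Compare case-insensitively but return the menu's own spelling.
    lookup = {item.lower(): item for item in menu_items}
    matches = difflib.get_close_matches(
        heard.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


menu = ["Sourdough Toast", "Paneer Tikka", "Bruschetta"]

# A dropped leading "s" is the kind of damage narrowband audio tends to cause.
print(snap_to_menu("our dough toast", menu))
```

Returning None when nothing is close enough matters as much as the match itself, because it gives the dialogue layer a reason to ask again rather than guess.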

What this implies for restaurant operators choosing a phone-AI vendor

None of this matters as marketing language. It matters because the architectural choice predicts the failure mode you live with. A vendor that processes a single mono stream from a hosted IVR is going to fail more often on the loudest calls, on the calls with the most echo, and on the calls where the AI's own voice is loudest in the caller's earpiece. A vendor with stereo separation at the hand-off has eliminated that category and is left fighting only the caller's room and the codec.

Three questions to ask any phone-AI vendor about their noise handling, with the right answers in mind:

  1. Is the recording stored as stereo with per-speaker channels? If the answer is "we keep a mono MP3," you are buying a system that mixed the AI into the caller's audio before transcription. The bottleneck you read about online is going to be your bottleneck.
  2. Is the transcript speaker-diarized at the STT layer, not heuristically after the fact? Multichannel STT is genuine speaker separation, because the channels never met. Diarization on a mono stream is a heuristic on top of the mix, and it gets harder the noisier the call is. The first kind is structurally robust; the second drifts under load.
  3. Does the dialogue layer read back every order line before the ticket fires? The read-back is the last line of defense against the residual caller-side noise that no audio architecture can eliminate. Vendors that skip it because it "adds latency" have not understood what they are trading for. A 1.2 second confirmation step is cheaper than every wrong ticket the kitchen has to remake.
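The read-back in question 3 can be sketched as a small confirmation loop. This is a toy model under stated assumptions, not PieLine's dialogue logic. The ask_caller callback stands in for the TTS prompt plus STT reply round trip, the set of affirmative phrases is deliberately tiny, and any other reply is treated as the caller restating the line.

```python
AFFIRMATIVE = {"yes", "yeah", "yep", "correct", "that's right"}


def confirm_order_line(heard_line, ask_caller, max_attempts=2):
    """Read an order line back until the caller confirms it.

    ask_caller(prompt) is a hypothetical callback that speaks the prompt
    and returns the caller's transcribed reply as text.
    Returns the confirmed line, or None if no confirmation was reached.
    """
    line = heard_line
    for _ in range(max_attempts):
        reply = ask_caller(f"I have {line}. Is that right?").strip()
        if reply.lower() in AFFIRMATIVE:
            return line
        # Anything else is taken as the corrected line and read back again.
        line = reply
    # Out of attempts: the caller never confirmed, so nothing is committed.
    return None
```

A scripted run shows the two-shot behavior: if the system heard "more onions" and the caller replies "no onions" and then "yes", the function returns "no onions", and the ticket fires with the corrected modifier.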

A note on what I did not claim

I did not claim PieLine has a proprietary denoise model that outperforms the field. There isn't one. I did not claim restaurant-environment ambient noise is irrelevant; it matters on the caller's side, which is where the residual fight lives. I did not claim that stereo separation eliminates the acoustic noise bottleneck. It eliminates one specific, large category of it, namely the AI's own voice contaminating the caller channel inside the STT model. The rest gets handled higher in the stack with vocabulary biasing and dialogue read-backs.

The reason I find this worth writing about at all is that almost every other guide on this topic frames the problem as a model-quality problem and skips the architectural choice that comes before the model runs. Restaurant operators who buy on those guides end up with phone-AI systems that fail in predictable ways on exactly the calls that matter most: the loud ones, during the Friday rush, when the kitchen cannot afford a misheard ticket.

Want to see this run on your menu?

A 20 minute call where we route an inbound test call through the stereo pipeline and pull up the per-channel envelope on the spot. No slides.

Voice AI noise bottleneck FAQ

Why does noise break voice AI in the first place?

Because most phone-call pipelines mix the caller and the AI onto a single mono stream before the speech-to-text model runs. On that single stream the caller's voice has to compete with three other signals at the same time: uncancelled acoustic echo of the AI's own voice through the caller's earpiece, the PSTN narrowband codec quantization (G.711 mu-law at 8 kHz, cutting everything above 3.4 kHz), and whatever room or background noise the caller has on their end. Every one of those degrades the STT confidence on the caller's actual words. The structural fix is not a bigger denoise model on the mono mix. It is to keep the two speakers on two separate channels until after transcription so the model never has to compete with the AI's own voice at all.

What is the difference between caller-side noise and AI-side noise?

There is no AI-side noise, in the room sense. The AI is a process running in a data center on whatever audio the telephony stack hands it. Any noise it deals with arrives over the phone. The two real sources are caller-side acoustic noise (kitchen ambient, TV, traffic, children, music in the room the caller is calling from) and telephony-pipeline noise (codec quantization, packet loss concealment, line hiss, echo of the AI's own speech). Most articles that talk about restaurant voice AI flip these and frame the bottleneck as 'the restaurant is loud,' which is almost never the constraint that matters at the transcription layer.

What does Deepgram multichannel actually do that solves part of this?

Deepgram's multichannel mode treats a stereo WAV as two independent inputs and transcribes them separately, returning per-channel word timestamps. The caller goes on one channel and the AI's outgoing speech goes on the other. The STT model that runs against the caller channel never sees the AI's voice. That eliminates a whole class of false transcription: the model can no longer mis-attribute the AI's own words to the caller, and any acoustic echo of the AI through the caller's earpiece stays bounded to the channel where the AI was already known to be speaking. Documentation is at developers.deepgram.com/docs/multichannel.

How can a reader verify the channel-isolation claim without taking my word for it?

The file src/components/voice-activity-data.ts in the pieline-phones repository carries the smoothed per-channel RMS envelope for a real 102.36 second restaurant order. Pull it down, parse the exported VoiceData literal, and run the verification script captioned verify-channel-isolation.py on this page. Across the 2,941 envelope samples where the AI is speaking, zero customer-channel samples exceed 0.001, with a maximum of 0.0004 (roughly -67 dB relative to the customer's loudest sample). Across all 6,157 simultaneous samples, the two channels never exceed 0.05 at the same moment. That is what channel isolation at the telephony layer looks like in the measured data.

What does the noise bottleneck look like when it does bite, even with stereo?

Three places it still bites. First, if the caller is in a genuinely noisy room (a sports bar, a windy parking lot, a kitchen with the hood fan on), their channel carries the noise the same way it would over any phone call, and the STT model has to handle it on a narrowband signal. Second, the PSTN codec is fixed: 8 kHz sampling, a 300-3400 Hz speech band, mu-law companding. Sibilants and high-frequency consonants are degraded before any model sees the audio. Third, the AI's own latency budget caps how much smoothing the front end can do without sounding sluggish. The acoustic isolation fix only removes the noise category PieLine has direct control over.

Why is per-call stereo recording not the default in cheaper phone-AI products?

Because most phone-AI products bolt onto a single SIP trunk that hands them a mono stream, treat the conversation as transcript-only, and never archive the raw audio in a structurally separated form. Adding stereo means handling the telephony layer end-to-end, mapping inbound caller to channel 0, AI TTS output to channel 1, persisting the WAV in stereo, and paying for whichever STT vendor supports a multichannel mode. The per-call cost difference is small. The engineering cost is a real telephony stack rather than a thin wrapper over a hosted IVR. That is the reason the cheaper products skip it, not a model-quality reason.
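The channel-mapping and persistence step described above can be sketched with the standard library alone. This is a simplified illustration, not PieLine's telephony code. It assumes both legs have already been decoded to 16-bit little-endian linear PCM at the same sample rate and are time-aligned from the first sample. The function names are hypothetical.

```python
import struct
import wave


def _decode(pcm):
    """Unpack 16-bit little-endian PCM bytes into a list of samples."""
    count = len(pcm) // 2
    return list(struct.unpack(f"<{count}h", pcm[: count * 2]))


def write_stereo_wav(path, caller_pcm, ai_pcm, sample_rate=8000):
    """Persist the caller on channel 0 and the AI on channel 1 of one WAV."""
    caller = _decode(caller_pcm)
    ai = _decode(ai_pcm)

    # Pad the shorter leg with silence so both channels share a timeline.
    length = max(len(caller), len(ai))
    caller += [0] * (length - len(caller))
    ai += [0] * (length - len(ai))

    # WAV stores stereo frames interleaved: left sample, then right sample.
    interleaved = []
    for caller_sample, ai_sample in zip(caller, ai):
        interleaved.append(caller_sample)
        interleaved.append(ai_sample)

    with wave.open(path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)  # bytes per sample, i.e. 16-bit audio
        out.setframerate(sample_rate)
        out.writeframes(struct.pack(f"<{len(interleaved)}h", *interleaved))
```

The file write is the easy part. The engineering cost the answer refers to is getting two separate, time-aligned legs out of the telephony layer in the first place.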

How does this relate to PSTN codec quality, and can you 'denoise' your way around mu-law?

Only partially. The PSTN backbone delivers your phone audio as G.711 mu-law at 8 kHz, which lops off everything above 3.4 kHz and quantizes the rest into 8-bit logarithmic samples. A consonant like 's' or 'sh' (energy mostly above 4 kHz) is structurally degraded before any model touches it. Modern STT models compensate by leaning on language priors. They are good at it. But the compensation is probabilistic, which means the failure mode under noise is not garbage transcript, it is confidently wrong transcript ('extra cheese' versus 'extra cheeses,' 'no onions' versus 'more onions'). The right way to fight this is restaurant-specific vocabulary biasing and confirmation read-backs in the dialogue layer, not denoise filters trying to extract bandwidth that the codec already discarded.
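To make the "8-bit logarithmic samples" point concrete, here is a small sketch of mu-law companding. It uses the continuous mu-law formula with μ = 255. Real G.711 encoders use a segmented, piecewise-linear approximation of this curve, so treat the numbers as illustrative rather than bit-exact. The function names and the 127-steps-per-polarity quantizer are assumptions made for the example.

```python
import math

MU = 255  # companding constant used by mu-law telephony

# An 8 kHz sample rate cannot represent anything above 4 kHz at all.
NYQUIST_HZ = 8000 / 2


def mulaw_compress(x):
    """Map a linear sample in [-1, 1] onto the logarithmic mu-law scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Invert mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def mulaw_roundtrip(x, bits=8):
    """Compress, quantize to the given bit depth, then expand again."""
    levels = 2 ** (bits - 1) - 1  # 127 steps per polarity for 8 bits
    quantized = round(mulaw_compress(x) * levels) / levels
    return mulaw_expand(quantized)


for sample in (0.5, 0.005):
    decoded = mulaw_roundtrip(sample)
    relative_error = abs(decoded - sample) / sample
    print(f"{sample}: decoded {decoded:.6f}, relative error {relative_error:.2%}")
```

The logarithmic curve keeps relative error roughly even for quiet and loud samples, which is why speech survives 8 bits. It does nothing for the spectrum above the band limit, which never reaches the encoder as usable signal.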

Does PieLine train its own STT or use a hosted one?

PieLine uses Deepgram's nova-3 model with multichannel enabled, as the open build script scripts/build-voice-activity-data.py demonstrates by calling api.deepgram.com/v1/listen with model=nova-3 and multichannel=true. The interesting engineering is not in the STT model itself, which is a commodity. It is in the telephony layer that delivers a clean stereo stream to that model, the menu-specific vocabulary that biases the model toward the actual items on the restaurant's menu, and the dialogue logic that performs read-backs to catch the failures the STT alone cannot.

What about whole-system noise robustness, not just the STT step?

Three layers compound. The stereo layer removes AI-voice contamination from the caller channel. The vocabulary layer biases the STT toward the restaurant's actual menu so 'paneer tikka' and 'bruschetta' resolve correctly even under degraded audio. The dialogue layer reads back every order line before the ticket fires, so a misheard 'no onions' that became 'more onions' has a second chance to be corrected by the caller before the kitchen ever sees it. No single layer is sufficient. Together they keep order accuracy above 95% across the production call volume PieLine sees.

© 2026 PieLine. All rights reserved.
