Voice AI · Engineering · Behind the Build

Your AI Has a Voice. That Was the Easy Part.

A good AI model isn't a voice agent, any more than a brilliant new hire is a functioning employee. Here are 60 things we had to build around the model to make a phone call actually sound human, answer in real time, and get real work done. The model is the easy part.

MoltBot Ninja
12 min read

There is a comfortable myth going around right now: that a good enough AI model is a voice agent. Plug a speech model into a phone number and you have a receptionist, a scheduler, a sales rep who never sleeps.

You don't. You have a brain in a jar.

The gap between "a brilliant AI model" and "a phone agent that actually sounds human and gets things done" is the same gap as between a genius new hire and a functioning employee. The hire is smart. But on day one they don't know where anything is, they can't reach the tools, they have no idea when it's their turn to speak, and they will happily talk straight over a customer mid-sentence. The intelligence was never the hard part. The harness around it is.

We build voice agents for a living. And we can tell you that the model, the part everyone fixates on, is maybe 10% of the work. Here's a tour of the other 90%, and why it's the difference between a demo and something you'd actually let answer your phone.


The model doesn't know you stopped talking

Here's the thing nobody tells you. A model doesn't hear a phone call. It hears a raw stream of audio, and something else entirely has to decide the exact instant a human finished a sentence so the model can respond. That decision has a name in our world: endpointing. And it is brutal.

Wait a beat too long and the agent feels slow, dead, robotic. Cut in a fraction of a second too early and you've just interrupted your own customer mid-thought, which is precisely how a call goes off the rails. With off-the-shelf settings, this decision can take well over a second of dead silence after every single sentence. That alone is the difference between "wow, that felt human" and "ugh, this is a robot."

So the first thing we had to build was our own listening system, tuned to shave that delay down to a fraction of what it was. Then it got subtle in ways that took months to get right:

  • A quick "yes" and a long, thoughtful sentence with a pause in the middle look identical to a naive system. Both are short bursts followed by silence. One is a complete answer; the other is a person who isn't done. So the agent has to apply different patience depending on what it's hearing.
  • A cough, a click, a stray "um" should not start a turn and make the agent respond to nothing.
  • The first soft syllable of a word is quiet. By the time a system notices speech, that syllable is already gone, so the agent hears a clipped, confusing fragment. We had to learn to keep the audio from just before the system decided you were talking.
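
To make those three ideas concrete, here is a minimal sketch of an endpointer. It is not our production system. It assumes audio arrives as fixed 20 ms frames with a precomputed energy value, and every name and threshold in it (`Endpointer`, `patience_ms`, the word lists, the millisecond values) is a hypothetical chosen for illustration. It shows adaptive patience, rejection of bursts too short to be speech, and a pre-roll buffer that keeps the audio from just before speech was detected.

```python
from collections import deque

# Hypothetical word lists; a real system would use richer signals than these.
SHORT_REPLIES = {"yes", "no", "ok", "okay", "sure", "yeah", "nope"}
TRAILING_HINTS = {"and", "but", "so", "or", "because", "um", "uh"}


def patience_ms(partial_transcript: str) -> int:
    """How long to tolerate silence before deciding the caller is done."""
    words = partial_transcript.lower().strip(" .,!?").split()
    if not words:
        return 700                      # nothing recognized yet: middle ground
    if len(words) <= 2 and words[-1] in SHORT_REPLIES:
        return 250                      # a quick "yes" is probably complete
    if words[-1] in TRAILING_HINTS:
        return 1200                     # "...and" means the thought isn't over
    return 700


class Endpointer:
    def __init__(self, frame_ms=20, energy_threshold=0.02,
                 min_speech_ms=100, preroll_ms=200):
        self.frame_ms = frame_ms
        self.energy_threshold = energy_threshold
        self.min_speech_ms = min_speech_ms
        self.preroll = deque(maxlen=preroll_ms // frame_ms)  # audio before speech
        self.candidate = []   # loud frames not yet long enough to count as speech
        self.turn = []        # frames of the confirmed turn
        self.in_turn = False
        self.silence_ms = 0

    def push(self, frame, energy, partial_transcript=""):
        """Feed one frame. Returns the turn's frames when it ends, else None."""
        loud = energy >= self.energy_threshold
        if not self.in_turn:
            if loud:
                self.candidate.append(frame)
                if len(self.candidate) * self.frame_ms >= self.min_speech_ms:
                    # Confirmed speech: prepend the pre-roll so the soft first
                    # syllable is not lost.
                    self.turn = list(self.preroll) + self.candidate
                    self.preroll.clear()
                    self.candidate = []
                    self.in_turn = True
                    self.silence_ms = 0
            else:
                # A burst that ended too soon (cough, click) never starts a
                # turn; it just becomes part of the rolling pre-roll.
                self.preroll.extend(self.candidate)
                self.candidate = []
                self.preroll.append(frame)
            return None
        self.turn.append(frame)
        if loud:
            self.silence_ms = 0
            return None
        self.silence_ms += self.frame_ms
        if self.silence_ms < patience_ms(partial_transcript):
            return None
        finished, self.turn = self.turn, []
        self.in_turn = False
        self.silence_ms = 0
        return finished
```

In use, the caller of `push` passes whatever partial transcript is available at that moment, so the same stretch of silence ends a turn quickly after "yes" and slowly after "I was thinking and".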

And the one that haunted us: early on, an agent hung up on a real person. It cut them off mid-sentence, heard the garbled half-word, read it as "that's everything, goodbye," and ended the call. That bug does not live in the model. It lives in the seam between the audio and the model, and closing it took a dedicated mechanism whose entire job is to never, ever mistake an interruption for a farewell.
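
One way to picture that kind of guard, as a sketch only: the model's wish to hang up is treated as a request, and a separate check vetoes it whenever the caller's last utterance was cut off or poorly transcribed. The `Utterance` fields and the 0.85 confidence floor are assumptions made for this example, not a description of our actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    was_cut_off: bool       # audio was truncated, e.g. the agent talked over it
    asr_confidence: float   # 0..1 score from a hypothetical transcriber


def may_end_call(last: Utterance, model_wants_hangup: bool,
                 min_confidence: float = 0.85) -> bool:
    """Allow a hang-up only when the farewell was heard whole and clearly."""
    if not model_wants_hangup:
        return False
    if last.was_cut_off:
        return False        # an interruption is never a goodbye
    if last.asr_confidence < min_confidence:
        return False        # a garbled half-word is not a goodbye either
    return True
```

The design choice worth noting is that the check sits outside the model: the model can be wrong about what it heard, and the call still survives.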

The line is never actually silent

A phone line is a filthy place. There's background noise on the caller's end, there's the little pop the phone network makes when a call connects, and, the truly maddening one, the model itself makes noise when it's thinking.

We learned this the hard way. During a pause, some models don't go quiet. They emit half-formed sounds, breaths, a faint electrical hum that builds from a whisper to a drone. Callers described one stall as "a strange buzzing." Another sounded like 30 seconds of static. None of that is fixable by being smarter or writing a better prompt. We had to build actual audio processing that can tell a sustained electrical hum apart from real speech, and silence the hum without ever touching a real word.
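
As a toy illustration of the idea, and not the processing we actually run: a hum tends to be steady from frame to frame, while speech swells and dips with every syllable. The sketch below assumes 8 kHz mono samples as floats in [-1, 1] and flags a clip as hum when both its loudness and its zero-crossing rate barely vary. The function names and thresholds are hypothetical, and a real detector would need to cope with cases this one gets wrong, such as a long sustained vowel.

```python
import math
from statistics import mean, pstdev


def frame_features(samples, frame_len):
    """Per-frame (RMS loudness, zero-crossing rate) pairs."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a < 0) != (b < 0))
        feats.append((rms, crossings / frame_len))
    return feats


def looks_like_hum(samples, frame_len=160, min_rms=0.01, max_variation=0.2):
    """True when the clip is audible but suspiciously steady."""
    feats = frame_features(samples, frame_len)
    if len(feats) < 10:
        return False                      # too short to call it "sustained"
    rms = [f[0] for f in feats]
    zcr = [f[1] for f in feats]
    if mean(rms) < min_rms or mean(zcr) == 0:
        return False                      # silence or DC offset, not a hum
    rms_cv = pstdev(rms) / mean(rms)      # coefficient of variation
    zcr_cv = pstdev(zcr) / mean(zcr)
    return rms_cv < max_variation and zcr_cv < max_variation


def suppress_hum(samples):
    """Mute the clip only if it looks like hum; otherwise pass it through."""
    return [0.0] * len(samples) if looks_like_hum(samples) else samples


RATE = 8000
# One second of a steady 120 Hz tone, standing in for a hum.
hum = [0.1 * math.sin(2 * math.pi * 120 * n / RATE) for n in range(RATE)]
# A 200 Hz tone whose loudness pulses four times a second, standing in
# very loosely for the rhythm of syllables.
speechy = [0.1 * (0.5 + 0.5 * math.sin(2 * math.pi * 4 * n / RATE))
           * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE)]
```

Here `looks_like_hum(hum)` is true and `looks_like_hum(speechy)` is false, so `suppress_hum` mutes the first clip and returns the second untouched.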

We also had to teach the agent not to react to the wrong sounds: to ignore its own voice echoing back through the network (so it doesn't think the caller is talking and apologize to itself), and to let the phone carrier strip out the lawnmower or the cafe chatter before it ever reaches the brain.

"How fast is it" is a lie you can tell yourself

Early on we measured how fast our agent responded by reading our own internal logs, and felt great about the numbers.

They were wrong, by more than a second. The logs start counting after the system has already finished listening, so they quietly skip the part the caller actually feels. We had to build an entire, separate analysis pipeline whose only job is to measure the truth: from the real-world moment you stop making sound to the moment the agent starts talking back. And then a second, independent check that re-derives every number a different way, because "it feels faster to me" is not an engineering standard, and the convenient measurement was a fairy tale.
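
A simplified sketch of what measuring from the audio itself can look like, assuming a recording with the caller and the agent on separate channels, each already reduced to one energy value per 20 ms frame. The function name, the threshold, and the frame size are assumptions for this example. Latency is counted from the caller's last audible frame to the agent's first audible frame, with no reference to internal logs.

```python
def response_latencies_ms(caller_energy, agent_energy,
                          frame_ms=20, threshold=0.02):
    """Gap between the caller falling silent and the agent starting to speak."""
    latencies = []
    last_caller_voice = None      # index of the caller's most recent loud frame
    agent_was_speaking = False
    for i, (c, a) in enumerate(zip(caller_energy, agent_energy)):
        agent_speaking = a >= threshold
        if c >= threshold:
            last_caller_voice = i
        if agent_speaking and not agent_was_speaking:
            # The agent just started a reply. Count the silent frames between
            # the caller's last sound and this frame. Overlapping speech
            # (caller still talking) is skipped rather than scored as zero.
            if last_caller_voice is not None and last_caller_voice < i:
                latencies.append((i - last_caller_voice - 1) * frame_ms)
            last_caller_voice = None
        agent_was_speaking = agent_speaking
    return latencies


# Two turns: the caller speaks in frames 0-9 and 70-79,
# the agent answers in frames 40-59 and 100-119.
caller = [0.1 if i < 10 or 70 <= i < 80 else 0.0 for i in range(120)]
agent = [0.1 if 40 <= i < 60 or 100 <= i < 120 else 0.0 for i in range(120)]
latencies = response_latencies_ms(caller, agent)   # [600, 400]
```

A log that only starts its clock once listening has finished would report a smaller number for both turns, because the time spent waiting to decide the caller was done never appears in it.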

You cannot improve what you refuse to measure honestly.

A tool that takes 20 seconds isn't a feature. It's a hang-up.

This is where the second, deeper harness lives, and where most voice "agents" quietly fall apart.

The whole point of an agent, versus a chatbot that just talks, is that it does things. Checks your calendar. Looks up an order. Books the appointment. But the naive way to give a model your tools makes a simple "what was my last message?" take 20, 40, even 90 seconds, because the agent reasons its way through every step. On a chat screen, fine. On a live phone call, that is a human being listening to silence, wondering if you died.

A live call needs answers in about a second. So we built a whole bridge layer over our existing capabilities, with a crucial rule: the right tool for the job has three completely different shapes.

  • For a known lookup, reach the real system directly, in well under a second, with no reasoning in the loop. Bonus: because the agent is handed real data, it has nothing left to make up.
  • For an open-ended request where a short pause is fine, let the full reasoning agent think it through.
  • For genuinely slow work ("research this and email me"), acknowledge instantly and do it after the call, delivering the result to a text channel. The call never waits.
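
The three shapes can be sketched as a small router. This is an illustration under assumptions: the intent names, the `expected_seconds` estimate, and the five-second pause budget are all invented for the example, and the real decision involves more than two inputs.

```python
from enum import Enum


class Path(Enum):
    DIRECT = "direct"            # known lookup, no reasoning in the loop
    REASON = "reason"            # open-ended, a short pause is acceptable
    BACKGROUND = "background"    # slow work, finished after the call


# Hypothetical set of lookups that map straight onto an existing system.
DIRECT_LOOKUPS = {"last_message", "next_appointment", "order_status"}


def choose_path(intent: str, expected_seconds: float,
                pause_budget_seconds: float = 5.0) -> Path:
    if intent in DIRECT_LOOKUPS:
        return Path.DIRECT
    if expected_seconds <= pause_budget_seconds:
        return Path.REASON
    return Path.BACKGROUND


def spoken_ack(path: Path) -> str:
    """What the caller hears while the chosen path runs."""
    if path is Path.DIRECT:
        return ""                                  # fast enough to just answer
    if path is Path.REASON:
        return "Let me check."
    return "I'll look into that and send you the result after the call."
```

For example, `choose_path("last_message", 20.0)` returns `Path.DIRECT` even though the estimate is slow, because a known lookup should never go through the reasoning loop in the first place.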

Picking the right one of those three is most of the battle. And it came with constraints that have nothing to do with intelligence:

  • The result has to fit the ear. A wall of data that looks fine on a screen is unspeakable on a call, and if you hand a model a giant blob it silently sees half of it and invents the rest. So results have to be reshaped for listening: a summary first, then detail on request.
  • It has to be honest. This is a real danger with voice. When a model can't reach a tool in time, it will, with total confidence and in a voice that sounds exactly like you'd expect, make up a perfectly plausible answer. Early on we had calls where the agent described an email so convincingly that we went and checked, only to find it never existed. So writes and lookups have guardrails: read the answer back, never claim something is "booked" or "sent" without it actually happening, and never fabricate a confirmation number.
  • Reads are safe to make instant. Writes are not. The fast path runs with full permission and skips every "are you sure?" That's perfect for a lookup and catastrophic for "send the email." So anything that changes the world goes through a separate, deliberate path: the agent proposes the action, reads it back to you, and only does the exact thing it described once you say yes. A caller can't trick it into doing something else, and a glitch can't double-send.
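
The propose, read back, confirm pattern for writes can be sketched like this. It is a minimal example, not our implementation: `WriteGate`, its return strings, and the wording of the read-back are assumptions. The proposal is frozen as JSON at the moment it is read back, so later changes to the parameters cannot alter what gets executed, and a token that has already been used cannot fire a second time.

```python
import json
import uuid


class WriteGate:
    def __init__(self):
        self._pending = {}    # token -> frozen JSON of the proposed action
        self._done = set()    # tokens that have already been executed

    def propose(self, action, params):
        """Freeze the action and return (token, sentence to read back)."""
        token = uuid.uuid4().hex
        self._pending[token] = json.dumps(
            {"action": action, "params": params}, sort_keys=True)
        details = ", ".join(f"{k}: {v}" for k, v in sorted(params.items()))
        readback = (f"I'm about to {action.replace('_', ' ')} "
                    f"({details}). Shall I go ahead?")
        return token, readback

    def confirm(self, token, caller_said_yes, execute):
        """Run exactly what was proposed, at most once."""
        if token in self._done:
            return "already_done"         # a retry or glitch cannot double-send
        if token not in self._pending:
            return "unknown"
        if not caller_said_yes:
            del self._pending[token]
            return "cancelled"
        frozen = json.loads(self._pending[token])
        execute(frozen["action"], frozen["params"])
        del self._pending[token]
        self._done.add(token)
        return "done"
```

The agent should only say "sent" or "booked" after `confirm` returns `"done"`, which ties the spoken claim to the action having actually run.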

And then it has to survive the real world

The last stretch of "done" is the part that only shows up at 2am on a real customer call:

  • The connection drops mid-sentence. The agent has to reconnect, and then continue the thought rather than restarting the whole answer, while knowing not to repeat itself if it had actually just finished.
  • Silence deserves grace, not a guillotine. A caller who goes quiet gets a gentle "still there?", then a softer warning, then a warm goodbye that is never cut off mid-word, and a single word from the caller at any point cancels the whole thing and picks the conversation right back up.
  • It has to know it's really you before sharing anything personal, and it has to shrug off the people who try to talk it into leaking data or going off-script.
  • It should greet you with today's date in your timezone, not the server's. It should match your language. In languages with grammatical gender, it should address you correctly. It should keep its answers short, partly because that's nicer and partly because shorter is faster.
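
The silence handling, for instance, is essentially a small escalation ladder that any speech resets. The sketch below assumes a timer that calls `tick` with the seconds elapsed since the last call. The timings and phrases are placeholders, not the ones we ship.

```python
# (seconds of silence, what to say) in escalating order. Placeholder values.
LADDER = [
    (8.0, "Still there?"),
    (16.0, "I'll hang up shortly if I don't hear from you."),
    (24.0, "Thanks for calling. Goodbye!"),
]


class SilenceMonitor:
    def __init__(self):
        self.silent_for = 0.0
        self.step = 0

    def on_caller_speech(self):
        """A single word from the caller cancels the whole escalation."""
        self.silent_for = 0.0
        self.step = 0

    def tick(self, dt):
        """Advance by dt seconds of silence.

        Returns (prompt, hang_up_after_playback) when a rung is reached,
        otherwise None. The hang-up flag is only a request: the caller of
        this method should end the call after the goodbye has finished
        playing, so it is never cut off mid-word.
        """
        self.silent_for += dt
        if self.step < len(LADDER) and self.silent_for >= LADDER[self.step][0]:
            prompt = LADDER[self.step][1]
            self.step += 1
            return prompt, self.step == len(LADDER)
        return None
```

If the caller speaks at any rung, `on_caller_speech` puts the monitor back at the start, and the conversation carries on as if the pause never happened.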

None of this is glamorous. All of it is the difference between a toy and a tool.


The short version: 60 things, none of which are "the model"

Every one of these is something we had to build around the brain to make a real conversation work. The model does none of them. Swap in a smarter model tomorrow and every single one of these still has to exist.

Knowing when it's your turn

  1. Detecting the instant you stopped talking, fast enough to feel human.
  2. Different patience for a quick "yes" versus a long, considered thought.
  3. Adapting to people who pause in unusual places.
  4. Never mistaking a cut-off word for "I'm finished." (So it doesn't hang up on you.)
  5. Ignoring coughs, clicks, and stray noises as false starts.
  6. Capturing the soft first syllable before it even decided you were speaking.
  7. Telling real speech apart from line hum.

Hearing clearly

  8. Not reacting to its own voice echoing back through the network.
  9. Keeping that protection alive through the opening greeting.
  10. Letting background noise get filtered out before it reaches the brain.
  11. Reordering audio that arrives scrambled.

Letting you interrupt

  12. Stopping instantly the moment you cut in.
  13. Never making you talk into stale, buffered audio.
  14. Smoothing bursty audio so the line never starves.
  15. Letting a filler word finish naturally if you jump in over it.

Sounding human, not synthetic

  16. Silencing the electrical hum a model emits when it stalls.
  17. Fading the start of each turn so it doesn't click.
  18. Smoothing the seams between audio fragments.
  19. Catching the jump from a pause back into speech.
  20. Trimming the volume so peaks don't distort on a phone line.
  21. Padding the opening so your first word doesn't pop.
  22. Pacing rushed audio so it doesn't garble into noise.
  23. Predicting when audio actually finishes playing, so the agent doesn't trip over itself.
  24. Converting between audio formats without introducing clicks.
  25. Measuring every call's audio quality objectively, not by vibes.

Hiding the wait

  26. Saying "let me check" the instant it starts looking something up.
  27. Bridging the tiny silence between that and the real answer so it doesn't sound like a dropped call.
  28. Greeting you the moment you pick up, by warming up during the ring.

Surviving the network

  29. Reconnecting mid-sentence when the connection drops.
  30. Continuing the thought instead of starting the answer over.
  31. Knowing when it had actually finished, so it doesn't repeat its goodbye.
  32. Recovering when the model forgets to signal it's done speaking.
  33. Gently nudging a stalled model back to life.

Ending a call like a person

  34. Escalating gently on silence ("still there?") instead of dropping cold.
  35. Reassuring you while it's genuinely working, instead of threatening to hang up.
  36. Never clipping the goodbye off mid-word.
  37. Letting "wait!" cancel a hang-up, without letting background noise loop the call forever.
  38. Catching and dropping a duplicated goodbye.
  39. Telling "that's it, thanks" apart from "wait, one more thing."
  40. Ending the call when the model says bye but forgets to actually do it.
  41. Hanging up promptly when you ask it to.

Doing real work, fast

  42. Reaching your actual systems in about a second, not half a minute.
  43. Reusing the capabilities you already have instead of rebuilding them.
  44. Confirming before anything irreversible. No rogue sends, ever.
  45. Reshaping data for the ear instead of the screen.
  46. Making sure it can't accidentally read internal scaffolding out loud.
  47. Knowing who you are before you say a word, without slowing down the hello.
  48. Searching knowledge fast, and going deep only when it has to.
  49. Handing long jobs off to the background so the call never stalls.
  50. Reading a real available time back to you, never inventing one.
  51. Never claiming something is "booked" or "sent" unless it truly happened.
  52. Verifying who you are before revealing anything private.

Behaving, and staying safe

  53. Keeping answers short, which is both more pleasant and faster.
  54. Opening with a statement, not a question that makes you stall.
  55. Using your local date and time, not a server's.
  56. Matching your language, and addressing you correctly where grammar demands it.
  57. Confirming it's really you before sharing anything personal.
  58. Shrugging off attempts to jailbreak it through what a caller says.
  59. Blocking abuse and robocall-style fraud.
  60. Automatically expiring transcripts so your conversations don't linger forever.


The point

A great model makes a great voice agent possible. It does not make one exist.

The model is the brain. Everything above is the nervous system, the ears, the reflexes, the hands, and the judgment to use them. That is not a thin wrapper around an AI. It is the actual product, and it is years of unglamorous, deeply specific engineering that you only notice when it's missing, because when it's missing, the call just feels wrong.

This is the part you cannot buy off a shelf or bolt on in a weekend. It's the part we've spent a very long time getting right, one buzzing hum and one almost-clipped goodbye at a time.

If you want a voice agent that sounds human, answers in real time, and actually does the work, without building all of this yourself, that's exactly what we make. It's called Talk To My Agent.

Give it a call. We think you'll forget you're talking to software. That's the whole point.
