Designing for the spoken word
Trust, timing and first impressions: what we've learned from building voice AI
Recently, Des wrote about AI as a deeply convergent force, reversing decades of unbundling in software. Millions of web products and apps look set to be bundled into unified AI tools. As we think about the shapes those AI tools might take, we’re also considering where voice AI fits in — after all, conversation is the great human unifier.
But first, we have to overcome some of the baggage “automated voice” brings with it. It’s a phrase that conjures up years of awful phone jail experiences — rigid IVR menus, endlessly repeating options, and the frustration of trying to “break through” to a human by shouting “representative!” into the phone.
This collective trauma has created a powerful negative bias, leading people to approach an AI voice agent with a specific set of low expectations. They just don’t expect it to understand them when they speak naturally, to let them interrupt, or to hold a fluid conversation.
As a result, issues a voice AI could resolve often end up with human agents instead, simply because users disengage or don’t trust the AI to help. When this happens, it’s clear we need to shift some deeply ingrained mental models.
Because people engage willingly with general-purpose LLMs and interact naturally with them, we know it’s possible to have a trusted interaction between humans and AI. But designing a voice tool for customer service requires additional legal and ethical considerations, because the caller might not be expecting an AI to answer the call.
Users sometimes trust AI less when it explicitly identifies itself as AI, even if a human later gives them the exact same answer. But hiding that a system is AI backfires in the long run — and there are legal requirements that prohibit it. So to design voice AI tools the right way, we have to grapple head-on with how people first engage with voice AI, and study how their expectations change and become more flexible over time. Our ultimate goal is to deliver resolution and a great experience.
The voice design challenge
The distinct design challenges of spoken interactions are less about technology and more about timing, rhythm, and trust — the mechanics of conversation we usually take for granted. Designers have to think about four things at once.
First, conveying the right-sized “chunk” of information. In chat, users can scan and skim; on the phone, every extra word can feel like a monologue. Agents need to size up a problem in real time and deliver responses that are complete yet digestible.
Next, knowing when to speak. Humans develop turn-taking intuition over a lifetime. Teaching it to an AI means encoding timing, tone, and intent. Get it wrong and the result is either frustrating silence or chaotic overlap — both signal a loss of flow.
Then there’s timing itself. Milliseconds matter. The difference between a “thoughtful pause” and a “frustrating delay” is vanishingly small, but it shapes conversational rhythm, perceived intelligence, and trust.
Finally, user trust and expectations. Decades of rigid phone menus have trained people to expect the worst. Those old systems are the baseline we’re designing against.
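To make the turn-taking and timing challenges above concrete, here is a minimal endpointing sketch. Everything in it — the thresholds, state fields, and action names — is an illustrative assumption, not Fin’s actual implementation. The core idea: a short silence after a complete-sounding utterance signals a turn boundary, while the same silence mid-sentence means the caller is still thinking.

```python
from dataclasses import dataclass

# Illustrative thresholds (assumptions, not real tuning values):
# a brief pause after a finished thought means "my turn";
# the same pause mid-sentence means "keep waiting".
COMPLETE_UTTERANCE_SILENCE_MS = 700    # pause after a complete thought
INCOMPLETE_UTTERANCE_SILENCE_MS = 1500 # longer grace period mid-thought

@dataclass
class TurnState:
    silence_ms: int           # milliseconds since the caller last spoke
    utterance_complete: bool  # did the ASR mark the last phrase as final?
    agent_speaking: bool      # is the agent currently talking?
    caller_speaking: bool     # is the caller currently talking?

def next_action(state: TurnState) -> str:
    """Decide whether the agent should speak, wait, or yield the floor."""
    # Caller barges in: stop talking rather than talking over them.
    if state.agent_speaking and state.caller_speaking:
        return "yield_floor"
    # Caller finished a complete thought and has gone quiet: take the turn.
    if state.utterance_complete and state.silence_ms >= COMPLETE_UTTERANCE_SILENCE_MS:
        return "speak"
    # Caller paused mid-sentence: allow a longer grace period, then nudge.
    if not state.utterance_complete and state.silence_ms >= INCOMPLETE_UTTERANCE_SILENCE_MS:
        return "prompt_gently"
    return "wait"
```

Even in a toy model like this, the “milliseconds matter” point is visible: shifting either threshold by a few hundred milliseconds changes whether a pause reads as thoughtful or as a dropped call.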
Here’s how we’re approaching this challenge in the lab:
First impressions, social graces, and the chameleon effect
As soon as a voice call starts, users instinctively react to the emotional tone and delivery of the voice they hear. It’s called the chameleon effect: the more natural and conversational the AI sounds, the more naturally the user responds. When Fin Voice sounds warm and emotionally aware, users engage more deeply, offer detailed context, and use complex phrasing. But voice AI with a robotic or flat voice leads users to disengage, simplify their speech, and withhold useful information.
First impressions count for a lot. Get the first exchange right, and you have an opportunity to gather good up-front context from your caller. From there, a positive feedback loop builds further trust if the bot can actually handle a complex, nuanced query — as Fin Voice can. But flub that first exchange, and you miss out on that nuanced context, leaving the AI no chance to prove its capabilities and recover from that first impression.
But even with a strong initial impression, if voice AI later exhibits limitations comparable to old-school bots — trouble conversing naturally, difficulty tracking drifts in the conversation, or an inability to truly understand the caller — then all those underlying bad bot expectations swing back into action. Callers’ behavior can change dramatically. They become impatient, speak in shorter, simpler sentences, and lose trust.
Social graces also matter when closing a conversation. When testing Fin, we confirmed that ending a call without a proper sign-off or confirmation can confuse users, even in otherwise positive experiences that led to a successful resolution. We heard things like: “It just hung up. No ‘goodbye,’ no ‘is there anything else?’ That’s jarring.” And: “Is the call over? Did something go wrong?” A clear closing message confirms resolution and preserves the trust built during the conversation.
So a great first impression, a natural tone, and an elegant ending to the conversation all help to drive resolution and satisfaction — which leads to better, more detailed inputs from customers — which in turn leads to more effective support.
Getting that positive feedback loop going is a delicate design challenge. It asks us to aim for more than just the successful handling of complex questions. Because it’s not just what the bot says, but also how it says it, that matters. You can do all the right things on paper, but if the pitch and tonality and delivery are off, you’ve still lost.
With Fin, we’re addressing this by:
Allowing teams to fine-tune the AI’s tone, pacing, and delivery, setting everything from voice and tone to speed and pitch.
Integrating Fin Voice directly with Fin’s Procedures capability, so it can follow exact rules, handle complex queries with multiple steps, business logic, or third-party systems, and execute actions like retrieving order status or fetching a customer’s recent orders from Shopify.
Designing social graces directly into the system by optimizing the flow of conversation, ensuring that clarifications, follow-ups, and response length and tone are all appropriate.
Making the conversation multi-modal. A voice call doesn’t have to just end. By connecting Fin Voice to Workflows, the call can lead to a follow-up over email, all while the support person retains full context. This is a friendlier way to sign off, keeping the experience useful, reliable, and trust-building.
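The first of those bullets — controls for voice, tone, speed, and pitch — might look something like this as configuration. The field names, defaults, and ranges here are hypothetical sketches, not Fin’s actual settings API; the point is that delivery is a small set of tunable parameters with guardrails that keep the result sounding natural.

```python
from dataclasses import dataclass

@dataclass
class VoiceConfig:
    """Hypothetical voice-delivery settings; names and ranges are illustrative."""
    voice: str = "warm_neutral_1"  # which synthetic voice to use
    tone: str = "friendly"         # overall register: friendly, formal, ...
    speed: float = 1.0             # playback-rate multiplier
    pitch: float = 0.0             # semitone offset from the voice's default

    def validate(self) -> None:
        # Guardrails: keep speed and pitch in ranges that still sound human.
        if not 0.5 <= self.speed <= 1.5:
            raise ValueError("speed must stay between 0.5x and 1.5x")
        if not -4.0 <= self.pitch <= 4.0:
            raise ValueError("pitch offset must stay within 4 semitones")
```

Bounding the ranges is a deliberate design choice: giving teams unbounded knobs makes it easy to produce exactly the robotic delivery that drives callers to disengage.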
Addressing tone, conversational continuity, and complexity is how we move from a “bad bot” that fails on complexity and trust to a reliable agent that can actually get things done.
Escalation as a safety valve
Of course, even an AI agent that can get things done will need some kind of escape hatch in case of a roadblock or a disgruntled caller. If a user feels trapped in an AI loop, frustration builds fast and trust evaporates.
With Fin Voice, we’ve found that users want to be able to ask for a human themselves — but they also expect Fin to be able to recognize when a query is beyond its capabilities and proactively offer a seamless escalation. It’s an interesting intersection of users’ doubt and high expectations — they want a system that knows its own limits.
That’s why we’ve built in escalation as a core feature of Fin, whether in voice or chat. It’s not merely a failure state — it can be a primary path should a business need it. Fin Voice can be set up with escalation guidance, and workflows can include explicit escalation triggers.
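As a rough sketch of what explicit escalation triggers can look like in code — the phrases, thresholds, and parameter names below are hypothetical illustrations, not Fin’s actual configuration — the three paths mirror what users told us they expect: honoring a direct request, breaking out of a failure loop, and recognizing out-of-scope queries proactively.

```python
# Hypothetical escalation rules; phrases and thresholds are illustrative.
EXPLICIT_REQUESTS = {"human", "agent", "representative", "real person"}
MAX_FAILED_ATTEMPTS = 2  # proactively escalate after repeated misses

def should_escalate(caller_utterance: str,
                    failed_attempts: int,
                    topic_in_scope: bool) -> bool:
    """Escalate when the caller asks, when the AI keeps missing,
    or when the query is known to be beyond the agent's capabilities."""
    text = caller_utterance.lower()
    # 1. The caller explicitly asks for a human: honor it immediately.
    if any(phrase in text for phrase in EXPLICIT_REQUESTS):
        return True
    # 2. The AI has failed repeatedly: don't trap the caller in a loop.
    if failed_attempts >= MAX_FAILED_ATTEMPTS:
        return True
    # 3. The query is outside configured scope: offer a seamless handoff.
    if not topic_in_scope:
        return True
    return False
```

A real system would use intent classification rather than substring matching, but the shape is the same: escalation is a first-class decision evaluated on every turn, not an error state.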
Most importantly, we’ve invested in the experience between Fin Voice and human teammates that makes this handoff seamless. Teams need full visibility and control, so we let them monitor calls in real time, review recordings and transcripts, and give them an instant, AI-generated summary of the call. So when a human joins, they have all the context, and the customer never has to repeat themselves.
Where the conversation’s headed
Great voice AI is all about designing an experience so seamless and effective that users stop thinking about the “construct” they’re interacting with, and simply and naturally get to the answer they need. A voice AI experience has to earn their trust, making the “AI” label secondary to the experience itself.
This means empowering Fin with a voice that is accurate and empathetic, with natural pacing, and clear controls and off-ramps.
We think Fin Voice has strong potential to transform phone-based support. Not just because of quick connection times, understanding, and adaptability, but because of user comments like this:
“It felt like a real conversation with someone who got it.”
Getting voice right means solving one of the hardest problems in interaction design — making machines sound natural to people. It’s the kind of challenge that makes the work feel deeply meaningful when it finally clicks.


