The Future is Spoken

17 Feb '25 · 10 min read

How Voice AI Is Revolutionizing Human-Computer Interaction
Albert
This article is part of our series on the evolution of human-computer interaction in the era of Generative AI.

The Evolution of Voice Interfaces: From Star Trek to Modern AI

Voice interfaces and voice-based control of computers have been around for decades. We've seen aspirational versions of this in Star Trek, we've seen early voice inputs via microphones since the 1950s, and we've seen early modern versions on our phones and home devices. Early on, voice interfaces were a fun, gimmicky way of controlling computers by selecting from a fixed set of commands: close window, open notepad, page up, page down. There were useful applications in accessibility and hands-free input, but in most cases, if you were able to use a mouse and keyboard, you would be better off using a mouse and keyboard.
Modern voice interfaces such as Alexa or Siri still feel incomplete. Part of the problem is latency and accuracy, but there is also a lingering distrust that the voice interface will translate user intent into the correct actions. This feeling is compounded for anyone who has been forced to navigate a phone-based voice input menu.
At Santori, we have spent a lot of time experimenting with AI interfaces, and we repeatedly come to the same conclusion: voice interfaces are quickly becoming a necessary part of interacting with AIs and will be an important part of the future of AI agent user interfaces. There are a number of reasons for this gap between past frustration and present conviction. Much of it comes from how much more advanced language technologies are now than they were 10 years ago, but the environment and the ways people use computers and mobile devices have also changed significantly over the last five years or so.

Why Now is the Time for Voice

The Rise of Accurate Voice Models

The first obvious change in the software world is that the accessibility and quality of automatic speech recognition (ASR) models have improved dramatically over the last few years. It has been remarkably easy to integrate cloud or edge models like Whisper, and there is now a great deal of competition and optionality from AI startups in the voice model space.
The accuracy alone has been a mindset change for me. I used to have a "voice input" accent that I would put on when speaking into a microphone or device: an overarticulated, mildly annoyed, drawn-out voice to ensure my words were transcribed properly. With modern models, I don't. I just speak in my normal conversational tone of voice, and trust that the model can get most of the way there. My personal reticence is fading.

Breaking Free from Command-Based Interfaces

The improvements go beyond the accuracy of the voice models. In the past, voice interfaces were typically restricted to one of two main usage patterns: voice was used either to select one of a fixed set of commands, or purely for transcription, as with Dragon Dictation. There were limited ways to combine the two. Saying "start playing a playlist of contemporary Japanese jazz", for example, would translate into a specific command plus a transcribed parameter. But our expectations around what the system could actually do were still anchored to a library of commands.
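To make the "command plus transcribed parameter" pattern concrete, here is a minimal sketch in Python. The command names and patterns are hypothetical, invented for illustration rather than taken from any real product. It shows how an utterance either matches the fixed library or is dropped.

```python
import re
from typing import Optional

# Hypothetical fixed command library: each spoken pattern maps to one command.
# The optional named group "arg" captures a free-form transcribed parameter.
COMMANDS = [
    (re.compile(r"^close window$"), "close_window"),
    (re.compile(r"^open notepad$"), "open_notepad"),
    (re.compile(r"^page (?P<arg>up|down)$"), "scroll_page"),
    (re.compile(r"^start playing a playlist of (?P<arg>.+)$"), "play_playlist"),
]


def parse_command(transcript: str) -> Optional[tuple[str, Optional[str]]]:
    """Return (command, parameter) for a recognized utterance, else None."""
    # Normalize case and trailing punctuation before matching.
    text = transcript.strip().lower().rstrip(".!?")
    for pattern, command in COMMANDS:
        match = pattern.match(text)
        if match:
            # Commands without a parameter yield None for the second item.
            return command, match.groupdict().get("arg")
    return None


print(parse_command("Start playing a playlist of contemporary Japanese jazz"))
print(parse_command("Could you tidy up my desktop?"))
```

Anything outside the library, like the second utterance, falls through to `None`, which is the ceiling this style of interface runs into.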
Models like GPT and Claude break free from this limitation. We are still in the early stages of developing their control capabilities, but the ability of AI agents to use tools, or to write and execute sandboxed code against an API, allows for a vastly larger action space.
We also believe that current interfaces just scratch the surface of what can potentially be built, and that user education and user interfaces have not caught up yet. In the early days of the iPhone, skeuomorphic design dominated the application landscape. As familiarity with the mechanics and interfaces of the smartphone solidified, we saw mobile design quickly transition into the modern era. We expect an analogous shift in AI interactions over the next couple of years, as familiarity with their mechanics builds up in the same way.
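As a rough illustration of tool use, the sketch below dispatches structured tool calls to ordinary functions. Everything here is an assumption made for the example: the tool names, the JSON shape, and the hard-coded `example_output` that stands in for what a model might emit. It is not any vendor's actual API.

```python
import json
from typing import Callable


# Hypothetical tools that an application might expose to a model.
def play_playlist(query: str) -> str:
    return f"Playing a playlist matching '{query}'"


def set_volume(level: int) -> str:
    return f"Volume set to {level}"


TOOLS: dict[str, Callable[..., str]] = {
    "play_playlist": play_playlist,
    "set_volume": set_volume,
}


def run_tool_calls(model_output: str) -> list[str]:
    """Execute each call in a JSON list of {"tool": name, "args": {...}}."""
    results = []
    for call in json.loads(model_output):
        tool = TOOLS.get(call["tool"])
        if tool is None:
            # Report unknown tools instead of raising, so later calls still run.
            results.append(f"Unknown tool: {call['tool']}")
            continue
        results.append(tool(**call.get("args", {})))
    return results


# Stand-in for model output given "put on some quiet Japanese jazz" (assumed format).
example_output = json.dumps([
    {"tool": "set_volume", "args": {"level": 20}},
    {"tool": "play_playlist", "args": {"query": "contemporary Japanese jazz"}},
])
print(run_tool_calls(example_output))
```

The difference from a command library is that the model, not a pattern table, decides which tools to combine and with what arguments, so one loosely phrased request can fan out into several actions.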

Building Better Voice Interfaces: Our Core Principles

This is the future that we are building towards at Santori, and to get there, we stick to a few principles for our voice interfaces.

1. Natural Conversation Flow

Interacting with our voice agents should be like hopping on a call with your AI. In the post-pandemic world, familiarity with digital remote-working software has accelerated. Interfaces built around quick, serendipitous, and synchronous collaboration have shown themselves to be fairly durable, and our products lean into this shared expertise.
There are many examples of this in popular collaboration tools:
  • Google Docs is a rich text editor that is often used as a shared thinking and notetaking space for groups of people on the same call
  • FigJam can act as a shared whiteboard to record ad-hoc ideas, spatially group them together, and draw relationships between different elements on the board
  • Video conferencing software like Zoom has affordances for non-vocal communication, such as the chat window or gestures

2. Augmentation Over Replacement

Augment but don't replace existing input methods. It can be frustrating to be given a pure voice interface when it would be faster to express yourself in a classical GUI. We don't want to go from keyboard + mouse to voice, but instead move to a world of keyboard + mouse + voice. This combination has been well-explored in both professional and gaming environments, and can capture the best of both worlds.
Our own internal tools are built around this incremental approach to adding voice input. For example, our LLM prompt builder has a text box for input, but the initial drafts are produced by hitting a button for "voice input" and describing in natural language the prompt you want to build. This gets filtered through an LLM and translated into a proper prompt, and the text box can then be used to make fine-grained adjustments.
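The flow described above (speak a draft, let an LLM turn it into a prompt, then refine it by hand) can be sketched roughly as follows. This is not Santori's implementation. `PromptDraft`, `draft_from_voice`, and `apply_text_edit` are hypothetical names, and the transcription and rewriting steps are passed in as plain functions so the example runs without any speech or LLM service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptDraft:
    transcript: str  # raw words captured from the voice input
    prompt: str      # text shown in the editable text box


def draft_from_voice(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    rewrite: Callable[[str], str],
) -> PromptDraft:
    """Voice step: transcribe the audio, then rewrite it into a prompt."""
    transcript = transcribe(audio)
    return PromptDraft(transcript=transcript, prompt=rewrite(transcript))


def apply_text_edit(draft: PromptDraft, old: str, new: str) -> PromptDraft:
    """Keyboard step: a small manual adjustment made in the text box."""
    return PromptDraft(draft.transcript, draft.prompt.replace(old, new))


# Stubs standing in for a real ASR model and a real LLM call (assumptions).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_rewrite(transcript: str) -> str:
    return "Task: " + transcript.strip()


draft = draft_from_voice(
    b"summarize my meeting notes as bullets", fake_transcribe, fake_rewrite
)
edited = apply_text_edit(draft, "bullets", "three bullets")
print(edited.prompt)
```

Keeping the transcript alongside the prompt is a design choice in this sketch: the voice pass produces the bulk of the draft, and the text box handles the small corrections that are awkward to speak.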
The future isn't voice-only - it's voice-and.
Voice and keyboard, voice and mouse, voice and touch.

3. Context-Aware Communication

Avoid treating voice input as a drop-in replacement for text input. The kinds of messages that we type and send over Slack are different in tone and composition from the ways we communicate over Zoom or Google Meet. I've had plenty of experiences where we had to take a conversation off of Slack and onto a live call, and this isn't just because the bandwidth of communication is higher over voice.
The medium affects the message. Voice communication lends itself to higher rates of back-and-forth participation, looser requirements around how refined our ideas must be, and even different grammatical expectations around run-on sentences, asides, and the boundaries of complete thoughts.
hot take
The Voice-Only Trap
Every AI voice agent we've seen to date is stuck with "voice in, voice out". While this symmetric interface makes sense for mobile scenarios where screen real estate is constrained, it fundamentally misses the point for desktop computing and how we work.

Voice is an excellent input mechanism for human expression and intent - but it's a maddeningly slow way to consume information.

Making Our Way to the Future

Our insistence on voice has been met with both excitement and skepticism. Some people get it right away, while some people have the same reservations about voice that we had before we committed to our UI explorations. The right interfaces with the right technology can lock into something that feels like a natural and fluid form of expression.
We are excited about what opening up these tools can do for human expression. As language models improve and our understanding of human-AI interaction deepens, we're confident in a future where voice is a fundamental part of our relationship with AI-enabled software.