
Amazon has introduced Nova Sonic, a new foundation model built to make voice-based AI apps more natural, responsive, and cost-effective. The company says Nova Sonic rivals, and in some cases beats, OpenAI’s GPT-4o and Google’s Gemini Flash.
Key Points:
- Nova Sonic unifies speech recognition, generation, and understanding in one model
- Outperforms OpenAI’s GPT-4o and Google’s Gemini Flash on accuracy and latency benchmarks cited by Amazon
- Supports tool use, accents, real-time barge-ins, and nuanced turn-taking
- Powers the upgraded Alexa+ and is nearly 80% cheaper than GPT-4o (Realtime) for real-time speech, according to Amazon
Like those other multimodal voice models, Nova Sonic understands what you’re saying and how you’re saying it—capturing nuances like tone, rhythm, hesitation, and even interruptions. This lets it respond in a way that feels more like talking to a human and less like issuing commands to a robot.
"When it comes to conversation, words have meaning, but words alone can fall flat without acoustic context that give them depth," Amazon explained in its announcement. This approach allows the model to adapt its responses based on acoustic context, including handling natural pauses and interruptions—a feature Amazon calls "barge-ins."
The company is positioning Nova Sonic directly against OpenAI's GPT-4o (Realtime) and Google's Gemini Flash 2.0. Amazon says Nova Sonic achieves a 51% win rate against OpenAI's model and nearly 70% against Google's in conversational quality tests.
On the Multilingual LibriSpeech benchmark, Nova Sonic reportedly achieved a 4.2% word error rate across five languages, which Amazon says is 36.4% better than OpenAI's GPT-4o Transcribe model. For noisy environments with multiple speakers—the kind that typically confound voice systems—Amazon claims a 46.7% relative improvement.
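For readers unfamiliar with the metric, here is a minimal sketch of how word error rate is usually computed and what a relative improvement implies. It assumes "36.4% better" means a 36.4% relative reduction in word error rate; under that assumption the comparison model's rate works out to roughly 6.6%, a figure derived here by arithmetic and not one Amazon reported. The function names and the sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, improvement: float) -> float:
    """Baseline WER implied by a new WER and a relative improvement (assumption-based)."""
    return new_wer / (1.0 - improvement)


# One dropped word plus one wrong word out of five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4

# 4.2% WER described as 36.4% better implies a baseline near 6.6%.
print(round(implied_baseline(4.2, 0.364), 1))  # 6.6
```

Read this way, the gap is about 2.4 percentage points of word error rate, which is a useful sanity check when a headline number is quoted in relative terms.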
Finally, the company reports an average perceived latency of 1.09 seconds from when a user stops speaking to when Nova Sonic starts responding—slightly faster than OpenAI's 1.18 seconds and Google's 1.41 seconds, according to benchmarking by Artificial Analysis.
Amazon also points out that Nova Sonic is "nearly 80% less expensive than OpenAI's GPT-4o (Realtime)," which could give Amazon a competitive edge as businesses look to deploy these technologies at scale.
Early adopters include education company EF, which is using Nova Sonic to help students practice new vocabulary and improve pronunciation. "The model is capable of accurately understanding non-native English speakers with a variety of accents," said Tim Hesse, VP of AI and Data at EF.
Currently, Nova Sonic offers both masculine and feminine voices in American and British English accents, with Amazon promising additional languages and accents soon. The model is available through Amazon Bedrock, the company's generative AI service on AWS, via a new bi-directional streaming API.
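To make the barge-in idea concrete, here is a toy sketch of how a client on a two-way stream might stop speaking when the user interrupts. It does not use the Bedrock API: `play_response`, `microphone`, and `converse` are hypothetical names, the audio chunks are plain strings, and the "microphone" simply flags speech after a set number of chunks, standing in for real voice activity detection.

```python
import asyncio


async def play_response(chunks, user_speaking, played):
    """Play response audio chunk by chunk, stopping if the user starts talking."""
    for chunk in chunks:
        if user_speaking.is_set():
            return False  # barge-in: abandon the rest of the response
        played.append(chunk)
        await asyncio.sleep(0)  # yield so the microphone task can run
    return True


async def microphone(user_speaking, played, speak_after):
    """Hypothetical stand-in for voice activity detection.

    Signals user speech once `speak_after` chunks have been played;
    with speak_after=None the user never interrupts.
    """
    while speak_after is None or len(played) < speak_after:
        await asyncio.sleep(0)
    user_speaking.set()


async def converse(chunks, speak_after):
    """Run playback and listening concurrently, as a two-way stream would."""
    user_speaking = asyncio.Event()
    played = []
    mic = asyncio.create_task(microphone(user_speaking, played, speak_after))
    finished = await play_response(chunks, user_speaking, played)
    mic.cancel()  # playback is over either way; stop listening in this toy
    try:
        await mic
    except asyncio.CancelledError:
        pass
    return played, finished


played, finished = asyncio.run(converse(["Sure,", "the", "weather", "is"], speak_after=2))
print(played, finished)  # ['Sure,', 'the'] False
```

The point of the design is that listening never pauses while the assistant talks, so an interruption can cut playback short between chunks instead of waiting for the full response to finish.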