OpenAI Previews Voice Engine, And Shares Perspective on Synthetic Voice Technology

Last week, internet sleuths discovered that OpenAI had trademarked the term "Voice Engine," and rumors began swirling that the company was about to release a Siri and Alexa competitor. Today, OpenAI released a preview of Voice Engine, a model that takes text input and a single 15-second audio sample and generates natural-sounding speech closely resembling the original speaker.

The technology, which has been under development since late 2022, has already been used to power preset voices in OpenAI's text-to-speech API, ChatGPT Voice, and Read Aloud. Despite these promising applications, the company has not yet announced a date for public availability, and is taking a cautious approach to its broader release due to potential misuse.

To better understand the potential uses and implications of Voice Engine, OpenAI has been privately testing the technology with a small group of trusted partners. These early adopters have developed impressive applications, such as providing reading assistance with natural-sounding voices, translating content to reach global audiences, improving essential service delivery in remote settings, and supporting individuals with speech impairments or disabilities.

Examples of Voice Engine in action, as shared by OpenAI:

Providing reading assistance to non-readers and children through natural-sounding, emotive voices representing a wider range of speakers than is possible with preset voices. Age of Learning, an education technology company dedicated to the academic success of children, has been using Voice Engine to generate pre-scripted voice-over content.

Audio samples: Age of Learning (reference), Age of Learning (generated)

Translating content, like videos and podcasts, so creators and businesses can reach more people around the world, fluently and in their own voices. One early adopter is HeyGen, an AI visual storytelling platform that works with its enterprise customers to create custom, human-like avatars for a variety of content, from product marketing to sales demos.

Audio samples: HeyGen (reference), HeyGen Spanish, HeyGen Japanese, HeyGen Mandarin

Reaching global communities by improving essential service delivery in remote settings. Dimagi is building tools for community health workers to provide a variety of essential services, such as counseling for breastfeeding mothers. To help these workers develop their skills, Dimagi uses Voice Engine and GPT-4 to give interactive feedback in each worker's primary language, including Swahili and more informal languages like Sheng, a code-mixed language popular in Kenya.

Audio samples: Dimagi Swahili (reference), Dimagi Swahili nutrition

Check out the OpenAI blog for even more examples.

While these use cases demonstrate the positive potential of Voice Engine, OpenAI recognizes the serious risks associated with generating speech that closely resembles people's voices, particularly in the context of an election year. The company is actively engaging with partners from various sectors to incorporate their feedback and ensure responsible development and deployment of the technology.

OpenAI's approach to building Voice Engine safely includes requiring partners to adhere to usage policies prohibiting impersonation without consent, obtaining explicit and informed consent from the original speaker, and clearly disclosing the use of AI-generated voices. The company has also implemented safety measures such as watermarking to trace the origin of generated audio and proactive monitoring of how the technology is being used.

OpenAI is not alone in advancing synthetic voice technology. Other players in the space, such as ElevenLabs, provide state-of-the-art AI voice solutions for various products and services, including professional voice cloning, dubbing, and translations.

ElevenLabs’ Professional Voice Cloning AI Now Publicly Available
ElevenLabs promises PVC can produce a “perfect digital copy” that is “virtually indistinguishable from the original.”

This week also saw the debut of Hume AI's Empathic Voice Interface, which uses an empathic large language model to adjust its language and tone of voice based on context and the user's emotional expressions. These developments underscore the rapid progress and growing interest in AI-powered voice technologies across industries.

Hume AI Raises $50M Series B, Unveils Empathic Voice Interface
This influx of capital brings the company’s valuation to $219 million, signaling strong investor confidence in Hume’s innovative approach to AI.

Looking ahead, OpenAI encourages societal resilience against the challenges posed by increasingly convincing generative models. This includes phasing out voice-based authentication for sensitive information, exploring policies to protect individuals' voices in AI, educating the public about AI capabilities and limitations, and accelerating the development and adoption of techniques for tracking the origin of audiovisual content.

As the debate surrounding synthetic voice technology continues, OpenAI's preview of Voice Engine underscores both the potential benefits and the need for responsible deployment. The company's cautious approach and ongoing engagement with stakeholders are critical to understanding and mitigating the risks associated with this powerful technology.
