Meta AI today announced Voicebox, the company’s latest artificial intelligence model focused on generative speech. This new model is a significant leap forward in AI speech synthesis, demonstrating its versatility and efficiency by outperforming existing models in various tasks, even those it was not specifically trained for.
Right now, the model can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. It can also remove noise from recordings, edit content, convert speaking styles, and generate diverse speech samples.
Prior AI systems for speech generation required training for each specific task using carefully curated data. Voicebox, however, learned directly from raw audio and transcripts spanning more than 50,000 hours of audiobooks, allowing it to generalize across speech generation tasks.
Voicebox is impressive. It achieves a 1.9% word error rate on English text-to-speech, significantly outperforming Microsoft's state-of-the-art model, VALL-E, which scored 5.9%. It is also up to 20x faster and sets a new standard on audio style similarity metrics for both English and multilingual benchmarks. Rather than trying to describe it further, you should just listen to it yourself.
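For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. Here is a minimal sketch of that computation; the function name and example sentences are illustrative, not from Meta's evaluation pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6, about 16.7%.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A 1.9% WER means roughly one word-level error per 50 reference words, which is why it is treated as near-human intelligibility for synthesized speech.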
While Meta has made many of its other AI models and code freely available (for non-commercial use), the company has decided against open-sourcing Voicebox due to risks of misuse. However, in an effort to advance the field, they have shared examples of Voicebox's capabilities on their demo website and described their approach in a research paper. Additionally, in an early attempt to get ahead of what will undoubtedly be a challenging problem, Meta noted that they have also built a classifier to help distinguish speech generated by Voicebox from authentic human speech.
Voicebox represents a breakthrough for AI speech generation and could unlock many useful applications. The model’s ability to perform well across languages and tasks could make speech translation and audio editing much more seamless. Voicebox could also generate more natural-sounding speech for virtual assistants and characters in games or films.
When you first hear Voicebox in action, it is easy to be impressed by the seeming magic of its technological capabilities. But that excitement soon gives way to an unsettling feeling: that our society is not currently equipped to handle a technology of this potency responsibly.
The risks of manipulated audio are real and concerning. Meta's decision not to release Voicebox's code or model right now is prudent. Generative speech models may transform communication, but they also introduce risks that demand proactive management if we are to stay ahead of them. Here are five critical questions I would encourage readers to consider:
- Ethical use: How can we ensure that powerful tools like Voicebox are used ethically? What guidelines should be put in place to prevent misuse?
- Accessibility: How can we leverage the potential of Voicebox to make communication more accessible to people with speech impairments or language barriers?
- Authenticity: With the advent of AI that can mimic human speech so convincingly, how do we define and preserve the authenticity of human communication?
- Privacy: Given that Voicebox can replicate speech styles, what are the implications for privacy and consent? How can we ensure the rights of individuals are respected when their voice or speech style could potentially be replicated?
- Regulation: Should technology like this be regulated? If so, how? What role should governments, corporations, and individuals play in this process?