Meta Unveils Audiobox, a New Foundation Model for Custom Audio Generation

Meta AI has unveiled Audiobox, its new foundation research model for audio generation, which accepts both voice and text prompts to create customized speech, sound effects, and soundscapes.

Building on Meta's previous Voicebox model for speech generation, Audiobox significantly advances controllability and quality for audio AI. The model outperforms prior systems in evaluations for generating voices and sounds that accurately match desired styles and environments described in text prompts.

Describe-and-generate sound: Users can provide a short description of the desired sound and ask the model to generate it.

What sets Audiobox apart is its ability to accept both voice recordings and natural language text as inputs. This dual input mechanism grants more granular control over the generated audio.

For example, users can supply a voice sample and then add a text prompt like "speaks slowly in a large cave" to give that voice a new cadence or acoustic environment. The voice input preserves the distinct vocal timbre, while the text prompt alters other attributes.

Vocal restylization: Audiobox can restyle a voice to make it sound as though it is in a different environment, such as a large cathedral.

Meta designed Audiobox to make audio production more accessible. The model lowers barriers for creating custom sounds, speech, and soundscapes needed for podcasts, videos, games, and more. Novices can easily generate quality audio elements to enrich media projects without extensive expertise.

However, as with all impactful AI innovations, responsible development is crucial. Meta is selectively granting Audiobox access to researchers with a track record in speech and responsibility research. The company has also built audio watermarking and voice authentication safeguards into the model to deter misuse.

Earlier today, Alibaba Cloud fully open-sourced its own Qwen-Audio model. Similar to Audiobox, this multimodal foundation model processes diverse audio data alongside text, and it produces strong results across a range of sound understanding benchmarks.

Between Meta's more control-focused Audiobox and Alibaba's versatility-driven Qwen-Audio, rapid open innovation in responsible and equitable audio AI appears well underway. As researchers gain wider access to these powerful technologies, we will likely see the field continue to push limits in capability, versatility, and quality.
