Meta Unveils MAGNeT: A Breakthrough Model That Generates Studio-Quality Audio 7x Faster

February 14, 2024 • 2 min read

Researchers at Meta have open-sourced MAGNeT (Masked Audio Generation using Non-autoregressive Transformers), a new AI model capable of generating studio-grade text-to-music and text-to-sound results – at speeds up to 7 times faster than current state-of-the-art models.

Meta says MAGNeT was trained on 16K hours of licensed music. Specifically, the researchers used an internal dataset of 10K high-quality music tracks, along with ShutterStock and Pond5 music data.

Unlike leading models, which rely either on slower autoregressive decoding that generates audio signals sequentially or on diffusion-based architectures that require lengthier sampling procedures, MAGNeT uses parallel masked predictive coding. This allows the model to generate 30-second musical compositions and soundscapes in a fraction of a second, with quality rivaling conditional language models and diffusion techniques.

This impressive speed stems from masked generative sequence modeling performed directly on streams of discrete audio tokens. The model is composed of a single transformer that predicts spans of masked input conditioned on visible context over a series of non-autoregressive decoding iterations. A novel rescoring component then refines audio fidelity by leveraging an external pre-trained model to rescore and rank predictions, which are then used in subsequent decoding steps.
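To make the decoding loop more concrete, here is a minimal toy sketch of iterative masked decoding with rescoring. It is not Meta's implementation: `toy_predict` and `toy_rescore` are hypothetical stand-ins that return random values in place of the transformer and the external pre-trained model, the cosine masking schedule and the `rescore_weight` blend are assumptions, and the sketch masks individual tokens rather than spans to stay short.

```python
import math
import random

MASK = None  # marker for a position that has not been decoded yet


def toy_predict(tokens, rng, vocab_size=16):
    """Hypothetical stand-in for the transformer: proposes a token and a
    confidence in [0, 1) for every masked position, all in one pass."""
    return {
        pos: (rng.randrange(vocab_size), rng.random())
        for pos, tok in enumerate(tokens)
        if tok is MASK
    }


def toy_rescore(pos, token, rng):
    """Hypothetical stand-in for the external pre-trained rescoring model."""
    return rng.random()


def masked_decode(seq_len=32, n_iters=8, rescore_weight=0.5, seed=0):
    """Decode a fully masked sequence in n_iters parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, n_iters + 1):
        proposals = toy_predict(tokens, rng)
        # Blend the model's own confidence with the external score (assumed form).
        scored = []
        for pos, (tok, conf) in proposals.items():
            ext = toy_rescore(pos, tok, rng)
            score = (1 - rescore_weight) * conf + rescore_weight * ext
            scored.append((score, pos, tok))
        # Assumed cosine schedule: how many positions stay masked after this step.
        n_masked_next = max(0, int(seq_len * math.cos(math.pi / 2 * step / n_iters)))
        n_keep = len(scored) - n_masked_next
        # Commit the highest-scoring proposals; the rest remain masked.
        scored.sort(reverse=True)
        for _, pos, tok in scored[:n_keep]:
            tokens[pos] = tok
    return tokens


decoded = masked_decode()
print(decoded)
```

Each pass fills in only the predictions the two scorers jointly trust most, so the number of model calls is fixed by `n_iters` rather than by the length of the sequence.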

The researchers also introduced a hybrid version of MAGNeT that generates the first few seconds of audio autoregressively before switching to high-speed parallel generation, combining the strengths of both approaches.
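A rough way to see the trade-off is to count sequential model calls under each strategy. The sketch below uses a deliberately simplified cost model; the function name, the 50 frames per second rate, the 3-second prefix, and the 20 decoding iterations are all illustrative assumptions rather than figures from the article.

```python
def forward_passes(n_frames, mode, prefix_frames=0, n_iters=20):
    """Count sequential model calls needed to produce n_frames token frames.

    Simplified cost model (an assumption): autoregressive decoding needs one
    call per frame, while parallel masked decoding needs a fixed number of
    iterations regardless of length.
    """
    if mode == "autoregressive":
        return n_frames
    if mode == "parallel":
        return n_iters
    if mode == "hybrid":
        # Generate a short prefix frame by frame, then finish in parallel.
        prefix = min(prefix_frames, n_frames)
        remaining = n_frames - prefix
        return prefix + (n_iters if remaining > 0 else 0)
    raise ValueError(f"unknown mode: {mode}")


FRAMES_PER_SECOND = 50               # illustrative assumption
n_frames = 30 * FRAMES_PER_SECOND    # a 30-second clip

print(forward_passes(n_frames, "autoregressive"))                               # 1500
print(forward_passes(n_frames, "parallel"))                                     # 20
print(forward_passes(n_frames, "hybrid", prefix_frames=3 * FRAMES_PER_SECOND))  # 170
```

Call counts are not wall-clock latency. Each parallel pass processes the whole sequence, and autoregressive decoders can typically reuse cached computation between steps, so the real-world speedup is smaller than the ratio of these counts suggests.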

Evaluations found that MAGNeT matched or only marginally trailed baselines on metrics like Fréchet Audio Distance while running up to 7 times faster. For applications that demand real-time sound synthesis, such as interactive music creation, MAGNeT promises a transformative advance.
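For readers unfamiliar with the metric, Fréchet Audio Distance compares the statistics of embeddings extracted from reference audio and from generated audio. In its standard form, with each embedding set summarized by a mean and covariance (the symbols below are notation introduced here, not taken from the article), it is:

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ describe the reference embeddings and $\mu_g, \Sigma_g$ the generated ones. Lower values indicate that generated audio is statistically closer to the reference set.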

MAGNeT's development sheds light on the trade-offs between autoregressive and non-autoregressive modeling, particularly in terms of latency, throughput, and generation quality. By offering a detailed analysis of these aspects, the researchers behind MAGNeT provide valuable insights into the potential directions for future research in audio generation technology.

Meta has open-sourced MAGNeT as part of AudioCraft, its single-stop code base for generative audio.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.