Mistral Unveils Mixtral 8x7B: A Leading Open SMoE Model

The mixture of experts design scales model capacity and performance substantially while keeping memory and compute costs in check, because no single token uses all of the weights.


On Friday, Mistral released Mixtral 8x7B, a high-quality sparse mixture of experts (SMoE) model with open weights, by simply sharing a magnet link to the torrent. Mixtral 8x7B performs strongly at language generation, code generation, and instruction following while keeping inference costs low.

A mixture of experts architecture uses multiple specialized submodels, or "experts," to handle different aspects of a task. Each input token is processed by a "router" network that selects only a few relevant experts (in Mixtral, two of the eight experts at every layer) and combines their outputs, rather than using all the weights.
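The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: the function names (`top_k_route`, `moe_layer`), the toy scaling experts, and the hand-written router weights are all hypothetical, and the choice of a softmax over the selected experts' scores is an assumption about how the outputs are weighted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: weights are a softmax over the chosen experts' logits only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router, experts, k=2):
    """Run one token through a sparse mixture-of-experts layer.

    token:   list of floats (the token's hidden vector)
    router:  one weight vector per expert; its dot product with the token
             gives that expert's routing score
    experts: list of callables mapping a vector to a vector of the same size
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    output = [0.0] * len(token)
    for index, weight in top_k_route(logits, k):
        expert_output = experts[index](token)  # only the chosen experts run
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output


def make_scaling_expert(scale):
    """Toy stand-in for an expert network: multiplies the input by a constant."""
    return lambda token: [scale * x for x in token]


# Eight toy experts and a hand-written router, purely for illustration.
experts = [make_scaling_expert(s) for s in range(1, 9)]
router = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
          [0.5, 0.0], [0.0, -1.0], [0.2, 0.2], [-0.5, -0.5]]
token = [1.0, 2.0]

print(moe_layer(token, router, experts, k=2))
```

However many experts exist, only `k` of them are evaluated for a given token, which is where the compute savings come from.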

The result is a model that combines the capacity of a large neural network with the speed and cost of a much smaller one. Mixtral 8x7B has about 47 billion total parameters (fewer than the 56 billion its name suggests, because the experts share the attention layers), but it uses only about 13 billion per token thanks to its mixture of experts design. This allows inference at roughly the speed and cost of a standard 13B model.
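The gap between total and per-token parameters can be checked with a back-of-the-envelope count. The sketch below assumes hyperparameters from the publicly released model configuration (hidden size 4096, feed-forward size 14336, 32 layers, 8 experts with 2 active per token, grouped-query attention with a key/value width of 1024, and a 32,000-token vocabulary); none of these figures appear in this article, so treat them as assumptions. Under them, the count comes to roughly 46.7 billion total and 12.9 billion active parameters. The total falls short of a naive 8 × 7B = 56B because only the feed-forward experts are replicated, while attention and embedding weights are shared.

```python
# Assumed hyperparameters (not stated in the article).
HIDDEN = 4096            # model (hidden) dimension
FFN_HIDDEN = 14336       # inner dimension of each expert's feed-forward block
LAYERS = 32
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2
KV_DIM = 1024            # key/value projection width (grouped-query attention)
VOCAB = 32000


def parameter_count(experts_counted):
    """Count weights when `experts_counted` experts per layer are included."""
    expert_ffn = 3 * HIDDEN * FFN_HIDDEN                     # three projections per expert
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM    # q, o and k, v
    router = HIDDEN * NUM_EXPERTS                            # one score per expert
    norms = 2 * HIDDEN                                       # two norm layers per block
    per_layer = experts_counted * expert_ffn + attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN                          # input embedding + output head
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = parameter_count(NUM_EXPERTS)
active = parameter_count(EXPERTS_PER_TOKEN)

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Nearly all of the weights sit in the expert feed-forward blocks, which is why activating two experts instead of eight cuts the per-token figure so sharply.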

Mixtral 8x7B achieves state-of-the-art results among open models, matching or exceeding Llama 2 70B and GPT-3.5 on most benchmarks. Specific strengths include:

  • Language generation over long contexts (32k tokens)
  • Code generation
  • Instruction following, with the top MT-Bench score among open models (8.30)

Mixtral matches or outperforms Llama 2 70B, as well as GPT-3.5, on most benchmarks.

The Mistral team reports that Mixtral is more truthful (74% on TruthfulQA) and shows less bias on bias benchmarks such as BBQ than Llama 2. This gives it a strong foundation, although fine-tuning can further improve safety.

Compared to Mistral 7B and Llama models, Mixtral is significantly stronger in science, in particular in mathematics and code generation.

Complementing Mixtral 8x7B is Mixtral 8x7B Instruct, optimized through supervised fine-tuning and Direct Preference Optimization (DPO) for precise instruction following. This variant scores 8.30 on MT-Bench, rivaling GPT-3.5 and making it the leading open-weights model in its class.

The model was trained on open web data and handles English plus French, Italian, German and Spanish. It is licensed under the permissive Apache 2.0 license.

Mixtral 8x7B masters French, German, Spanish, Italian, and English.

By efficiently scaling up parameters while controlling costs, SMoEs represent a promising path to move beyond standard training paradigms. Mixtral highlights the potential for open models to match leading proprietary systems while promoting transparency. The release of Mixtral 8x7B furthers Mistral’s goal of empowering the community with open models that can drive new innovations in AI.
