How to Get Started with the New Mixtral-8x7B Mixture of Experts Model

How to Get Started with the New Mixtral-8x7B Mixture of Experts Model
Image Credit: Maginative

Last week, French startup, Mistral AI not only raised $415 million, but dropped a fancy new sparse mixture of experts model, Mixtral-8x7B, and announced the beta launch of it's new platform services.

Mixtral-8x7B outperforms the Llama 2 70B model in most benchmarks while delivering six times faster inference. Plus, it's an open-weights model released with an Apache 2.0 license, meaning anyone can access and use it for their own projects.

In this article, I'll cover the best ways to get started with Mixtral-8x7B. Let's dive in!

Directly From the Source

If you are an experienced researcher/developer, it goes without saying that you can download the torrent directly using the maginet link that Mistral has provided. This download includes Mixtral 8x7B and Mixtral 8x7B Instruct.

magnet:?xt=urn:btih:5546272da9065eddeb6fcd7ffddeef5b75be79a7&dn=mixtral-8x7b-32kseqlen&tr=udp%3A%2F%http://2Fopentracker.i2p.rocks%3A6969%2Fannounce&tr=http%3A%2F%http://2Ftracker.openbittorrent.com%3A80%2Fannounce

RELEASE a6bbd9affe0c2725c1b7410d66833e24

Use Mistral AI Platform

If you haven't already done so, now is probable a good time to sign up for beta access to Mistral's AI platform where you will be able to easily access the models via API. If you have a business use-case, just reach out to the company, the Mistral team will work with you and accelerate access. N.B. The Mixtral-8x7B model is available behind the mistral-small endpoint. As a bonus, you'll also get access to mistral-medium, their most powerful model yet.

Use Perplexity Labs

If you want to simply use the instruction tuned version via a chat interface, you can use the Perplexity Labs playground. Just choose it from the model selection dropdown in the bottom right corner.

Use Hugging Face

Hugging Face has both the base model and the instruction fine-tuned model which can be used for chat-based inference.

There are ready-to-use checkpoints that can be downloaded and used via the HuggingFace Hub or you can convert the raw checkpoints to the HuggingFace format. Read this for detailed instructions including loading and running the model using Flash Attention 2.

The Mixtral-8x7B-Instruct-v0.1 model is also available on Hugging Face Chat. Simply select it as your model of choice.

Use Together.AI

Together provides one of the fastest inference stacks via API. These are the folks behind FlashAttention after all.

Access the instruction fine-tuned model using Together's chat interface

They have optimized the Together Inference Engine for Mixtral and it is available at up to 100 token/s for $0.0006/1K tokens. You can access DiscoLM-mixtral-8x7b-v2 and Mixtral-8x7b via the playground or via their API. See instructions here.

Use Fireworks Generative AI Platform

Fireworks is another handy option. The platform aims to bring fast, affordable, and customizable open LLMs to developers. Once you log in, just search for the mixtral-8x7b-instruct model to get started.

N.B., you can also access models hosted by Fireworks via other platforms like Vercel.

Run it Locally with LM Studio

LM Studio is one of the easiest ways to run models offline locally on your Mac, Windows or Linux computers. LM Studio is made possible thanks to the llama.cpp project and supports any ggml Llama, MPT, and StarCoder model on Hugging Face.

LM Studio v0.2.9 supports Mixtral 8x7B. Use models through the in-app Chat UI or an OpenAI compatible local server

Simply download and install LM Studio then search for Mixtral to find compatible versions.

Run it Locally with Ollama

Ollama is an easy way for you to run large language models locally on macOS or Linux. Simply download Ollama and run one of the following command in your CLI of choice. 

ollama run mixtral

N.B. Mixtral requires 48GB of RAM to run smoothly. By default, this will be allow you to chat with the model. If you'd like to go further (like using your own data), Ollama has provided detailed instructions on how to use LlamaIndex to create a completely local, open-source retrieval-augmented generation (RAG) app complete with an API.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.

Let’s stay in touch. Get the latest AI news from Maginative in your inbox.

Subscribe