French AI startup Mistral has released a magnet link for Pixtral 12B, its first model capable of processing both images and text. We still don't know a lot about the model, so expect more details to emerge as more developers get their hands on it.
Pixtral 12B is based on Mistral’s Nemo 12B, a previously released text model, with the addition of a 400-million-parameter vision adapter. The new model allows users to input images either through URLs or encoded via base64, alongside text, enabling tasks such as image captioning and object counting.
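Since the exact request format Pixtral expects hasn't been documented yet, here is a minimal sketch of what passing a base64-encoded image alongside text might look like. The `build_message` structure and its field names are hypothetical, illustrative placeholders, not a confirmed API.

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its contents base64-encoded as a string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_message(image_b64: str, question: str) -> dict:
    # Hypothetical chat-style payload: the actual field names Pixtral
    # expects are not yet known, so treat this structure as a guess.
    return {
        "role": "user",
        "content": [
            {"type": "image", "data": image_b64},
            {"type": "text", "text": question},
        ],
    }
```

The same payload could carry an image URL instead of the base64 string, per the two input paths described above.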
The model’s vision encoder can handle images with a resolution of 1024 x 1024 pixels, broken down into patches of 16 x 16 pixels, giving it flexibility when processing high-resolution images. The combination of text and visual data processing extends Pixtral 12B’s range of potential use cases, including tasks like image classification or responding to questions based on visual input.
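The arithmetic behind those numbers is straightforward: a 1024 x 1024 image divided into 16 x 16 patches gives a 64 x 64 grid, so the encoder sees 4,096 patches per full-resolution image.

```python
# Patch-count arithmetic for the vision encoder described above:
# a 1024 x 1024 image split into 16 x 16 patches.
image_size = 1024
patch_size = 16

patches_per_side = image_size // patch_size  # 1024 / 16 = 64
num_patches = patches_per_side ** 2          # 64 * 64 = 4096

print(patches_per_side, num_patches)  # → 64 4096
```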
Key Technical Features
Pixtral 12B employs 2D RoPE (Rotary Position Embeddings) for the vision encoder, improving the model’s ability to understand the spatial relationships in images. Here are some additional details:
- Parameters: 12 billion across 40 layers.
- Vision adapter: 400 million parameters, using GeLU activation for image data.
- Image input: Images can be passed via URL or base64 encoding.
- Vocabulary size: Expanded to 131,072 tokens.
- Special tokens: three new tokens, `img`, `img_break`, and `img_end`, for processing images.
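To make the 2D RoPE bullet point concrete, here is a sketch of one common formulation: standard 1D rotary embeddings applied separately to a patch's row and column indices, each acting on half of the channels. This is an illustrative reconstruction, not necessarily Mistral's exact implementation.

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Standard 1D rotary embedding: rotate channel pairs (x[2i], x[2i+1])
    by an angle pos * base**(-2i/d). x has shape (..., d) with d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # (d/2,)
    angles = pos[..., None] * inv_freq            # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: np.ndarray, row: np.ndarray, col: np.ndarray) -> np.ndarray:
    """2D RoPE sketch: first half of channels rotated by the patch's row
    index, second half by its column index. Assumes d divisible by 4."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], row), rope_1d(x[..., half:], col)], axis=-1
    )
```

Because each rotation depends only on position, attention scores between two patches end up depending on their *relative* row and column offsets, which is what gives the encoder its spatial awareness.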
So far, Mistral has released only a magnet link to the torrent of Pixtral 12B's model weights (posted on GitHub); most of the technical details and the licensing terms remain unclear. While some of Mistral's models have been released under Apache 2.0, it's not yet confirmed whether Pixtral 12B falls under the same license. For now, it is presumed to be free for research and academic use, with a paid license required for commercial applications.
As the AI community begins to download and examine Pixtral 12B, we'll get more concrete information about its capabilities and performance. We'll update the article accordingly.