Mistral's new Small 4 model punches above its weight with 128 expert modules

Mistral AI has unveiled Small 4, a compact yet potent open-weight language model that leverages a sophisticated Mixture of Experts (MoE) architecture to deliver performance rivaling much larger counterparts. With 128 expert modules, Small 4 exemplifies how targeted architectural innovations can let smaller models punch above their weight class, making advanced AI capabilities more accessible for deployment in resource-constrained environments.

At its core, Small 4 employs an MoE design in which only a subset of experts activates for each input token, optimizing computational efficiency. This approach lets the model maintain high performance while keeping the active parameter count low, at roughly 24 billion per inference pass. Trained on a massive dataset exceeding 11 trillion tokens, the model excels at multilingual tasks, coding, and reasoning, positioning it as a versatile tool for developers and researchers.
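To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. This is a sketch of the general technique, not Mistral's actual implementation; the model dimension, expert count, and number of active experts are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts layer: each token is routed to k of n_experts FFNs."""

    def __init__(self, d_model: int = 256, n_experts: int = 128, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); flatten batch and sequence dims before calling
        gate_logits = self.router(x)                                   # (tokens, n_experts)
        weights, expert_idx = torch.topk(gate_logits, self.k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                           # renormalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in expert_idx[:, slot].unique().tolist():            # run each chosen expert once
                mask = expert_idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 256)
print(SparseMoELayer()(tokens).shape)  # torch.Size([8, 256])
```

Note that all 128 experts exist in memory, but each token only pays the compute cost of its k chosen experts, which is exactly why active parameters stay far below the total count.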

Benchmark results underscore Small 4’s strengths. On the MMLU-Pro evaluation, a challenging benchmark for multitask language understanding, it achieves 68.5 percent accuracy, surpassing models like Gemma 2 27B (65.0 percent) and matching or exceeding Phi-3 Medium 128K in several categories. In coding assessments such as HumanEval, Small 4 scores 78.5 percent, demonstrating robust code generation. For multilingual prowess, it attains 72.3 percent on MGSM, a math benchmark spanning 10 languages, and it reaches 58.8 percent on GPQA Diamond, a graduate-level science question-answering test.

Small 4 also shines in speed and efficiency metrics. It processes up to 150 tokens per second on a single NVIDIA H100 GPU, making it suitable for real-time applications. Compared to dense models of similar total parameter counts, its sparse MoE structure reduces memory footprint and inference latency, enabling deployment on consumer-grade hardware like laptops with decent GPUs.
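Throughput figures like these depend heavily on backend, batch size, and context length, so they are worth verifying on your own hardware. A quick, backend-agnostic sanity check for tokens per second (the lambda below is a dummy stand-in for a real generation call):

```python
import time

def tokens_per_second(generate_fn, prompt: str, n_new_tokens: int = 256) -> float:
    """Rough decode-throughput estimate: fixed-length generation over wall-clock time."""
    start = time.perf_counter()
    generate_fn(prompt, n_new_tokens)
    return n_new_tokens / (time.perf_counter() - start)

# Plug in any backend (llama.cpp bindings, transformers, an API client).
# Dummy stand-in that simulates ~150 tok/s so the snippet runs as-is:
print(tokens_per_second(lambda p, n: time.sleep(n / 150), "hello"))
```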

The model’s training regimen contributes significantly to its edge. Mistral AI refined Small 4 through a multi-stage process: initial pretraining on diverse web data, followed by supervised fine-tuning (SFT) on high-quality instruction datasets and reinforcement learning from human feedback (RLHF) using pairwise comparisons. This pipeline aligns the model with user preferences, reducing hallucinations and improving factual accuracy. Post-training quantization support further enhances its practicality, with 4-bit versions retaining over 90 percent of full-precision performance.
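Loading a 4-bit build locally could look like the following sketch, which uses the standard transformers/bitsandbytes NF4 path. The repository id is a placeholder, not a confirmed name; check Mistral's Hugging Face page for the actual one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-Small-4-Instruct"  # hypothetical repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Write a haiku about sparse experts.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```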

In head-to-head comparisons, Small 4 outperforms Mistral’s prior Small 3.1 (62.2 percent on MMLU-Pro) and rivals proprietary models like GPT-4o Mini in select domains. Against open alternatives, it edges out Qwen 2.5 32B Instruct (67.1 percent MMLU-Pro) and Llama 3.3 70B (65.6 percent), highlighting the efficacy of its 128-expert configuration. The experts specialize in distinct domains, such as mathematics, coding, or natural language, allowing dynamic routing to the most relevant modules for optimal results.

Availability is a key selling point. Small 4 is released under the permissive Apache 2.0 license and can be downloaded from Hugging Face. Mistral provides GGUF-quantized variants for seamless integration with tools like llama.cpp and Ollama, facilitating local inference. Integration with Mistral’s La Plateforme API offers serverless access, priced at $0.10 per million input tokens and $0.30 per million output tokens, competitive for production use.
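A serverless call through La Plateforme's chat completions endpoint might look like the sketch below. The model identifier is a guess on my part, so check Mistral's model list for the real id.

```python
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small-4",  # hypothetical model id
        "messages": [{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```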

For fine-tuning enthusiasts, Mistral supplies detailed training recipes and unaligned base models, enabling customization. The release coincides with Mistral’s broader ecosystem push, including Le Chat enhancements and enterprise-grade deployments.
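As a sketch of what customization could look like, here is a generic LoRA setup with the PEFT library. Mistral's published recipes may well differ, and the base-model id is again a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-4-Base",  # hypothetical repo id
    device_map="auto",
)

lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of the full model trains
```

Low-rank adapters keep the frozen base weights intact, which pairs well with a sparse MoE: the expensive expert FFNs never need full-rank gradient updates.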

Small 4’s architecture draws from proven MoE precedents like Mixtral 8x22B but scales to 128 experts for finer granularity. Router networks intelligently dispatch tokens, balancing load across experts to minimize bottlenecks. This sparsity not only boosts throughput but also enhances interpretability, as individual experts can be probed for specialized behaviors.
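The load balancing alluded to above is commonly implemented as an auxiliary loss that penalizes skewed routing. Here is a Switch-Transformer-style version as a general sketch, not necessarily Mistral's exact formulation:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Auxiliary loss that pushes the router toward a uniform token split across experts.

    router_logits: (tokens, n_experts)
    """
    probs = F.softmax(router_logits, dim=-1)                    # router probabilities per token
    top1 = probs.argmax(dim=-1)                                 # hard assignment per token
    dispatch_frac = F.one_hot(top1, n_experts).float().mean(0)  # f_e: share of tokens sent to expert e
    prob_mass = probs.mean(dim=0)                               # P_e: mean router probability on expert e
    # n * sum(f_e * P_e) equals 1.0 at perfect balance and grows as routing skews
    return n_experts * torch.sum(dispatch_frac * prob_mass)

logits = torch.randn(1024, 128)
print(load_balancing_loss(logits, 128))  # close to 1.0 for near-uniform routing
```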

Challenges remain, such as potential routing instabilities in edge cases, though Mistral’s load-balancing techniques mitigate these. Multilingual support covers more than 80 languages, with particular strength in European languages and those of emerging markets.

In summary, Small 4 redefines small-model expectations, blending efficiency, capability, and openness. It empowers edge AI, from mobile apps to on-device assistants, democratizing frontier performance.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.