DeepSeek-V3.2 rivals GPT-5 and Gemini 3 Pro, reaches IMO gold level as open source

DeepSeek-V3.2 Emerges as an Open-Source Powerhouse, Matching GPT-5 and Gemini 3 Pro Benchmarks While Achieving IMO Gold Status

DeepSeek AI has unveiled DeepSeek-V3.2, a groundbreaking open-source large language model (LLM) that positions itself as a direct competitor to proprietary giants like OpenAI’s anticipated GPT-5 and Google’s Gemini 3 Pro. This latest iteration not only rivals closed-source models on key benchmarks but also achieves gold-medal performance on the International Mathematical Olympiad (IMO) qualification exam, marking a significant milestone for open-source AI development.

At its core, DeepSeek-V3.2 builds upon the foundation of its predecessor, DeepSeek-V3, with enhancements that optimize efficiency and performance. The model employs a Mixture-of-Experts (MoE) architecture, activating only about 37 billion of its 671 billion total parameters per token. This sparse activation mechanism allows it to deliver inference speeds comparable to much smaller dense models while maintaining the reasoning capabilities of larger systems. Trained on over 14.8 trillion tokens spanning multilingual text, code, and mathematical content, DeepSeek-V3.2 demonstrates robust generalization across domains.
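To make the sparse-activation idea concrete, here is a minimal, hypothetical top-k routing layer in PyTorch. The expert count, layer sizes, and routing scheme are toy assumptions for illustration only and do not reproduce DeepSeek’s actual implementation; the point is simply that each token touches only a small fraction of the layer’s parameters.

```python
# Toy sketch of top-k MoE routing (illustrative only, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)               # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)           # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(4, 512)         # a batch of 4 token embeddings
print(ToyMoELayer()(x).shape)   # torch.Size([4, 512]); only 2 of 8 experts run per token
```

In a full-scale MoE model the same principle applies per layer, which is why the per-token compute tracks the activated parameters rather than the total parameter count.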

Benchmark results underscore its prowess. On the Artificial Analysis Intelligence Index, DeepSeek-V3.2 scores 81, tying with leading models such as Claude 3.5 Sonnet and GPT-4o. It excels in coding tasks, achieving 70.0% on LiveCodeBench and 74.7% on SWE-Bench Verified, surpassing Gemini 2.5 Pro’s 63.8% and 60.8%, respectively. In mathematics, it posts 78.8% on AIME 2024, edging out competitors, and a remarkable 62.0% on the IMO 2025 qualification problem set, equivalent to gold medal level for human participants. This IMO performance is particularly noteworthy: it represents the highest score among open-source models and approaches the capabilities of top closed-source systems.

Multilingual capabilities are another strength, with DeepSeek-V3.2 scoring 75.5% on MMLU multilingual benchmarks, outperforming Gemini 2.5 Pro (73.0%) and Claude 3.5 Sonnet (72.7%). In Chinese-specific evaluations, it reaches 89.2% on C-Eval and 87.4% on CMMLU. These results reflect its training on a balanced multilingual corpus and make it suitable for global applications.

What sets DeepSeek-V3.2 apart is its commitment to open-source accessibility. Released under the MIT license, the model weights and code are fully available on Hugging Face, enabling developers, researchers, and enterprises to deploy it without restrictions. DeepSeek provides optimized inference frameworks, including support for FP8 quantization, which reduces the memory footprint of the full model to approximately 380 GB, feasible on clusters of high-end consumer hardware. Deployment options include vLLM for high-throughput serving and SGLang for advanced agentic workflows. For local inference, quantized versions (e.g., Q4_K_M) run efficiently on setups with 80 GB of VRAM, delivering 20 to 30 tokens per second.
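As a rough sketch of what local serving could look like, the snippet below uses vLLM’s offline Python API. The repository id, tensor-parallel degree, and sampling settings are assumptions for illustration; consult the model card on Hugging Face for the recommended configuration.

```python
# Hedged sketch: serving a large MoE checkpoint with vLLM's offline Python API.
# The model id and parallelism below are illustrative assumptions, not official values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2",  # placeholder repo id; confirm the exact name on Hugging Face
    tensor_parallel_size=8,             # split the weights across 8 GPUs (assumption)
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Prove that the sum of two even integers is even."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be exposed as an OpenAI-compatible HTTP service through vLLM’s server mode, which is the usual route for high-throughput deployments.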

The model’s efficiency stems from innovative training techniques. DeepSeek-V3.2 leverages multi-head latent attention (MLA), auxiliary-loss-free load balancing, and multi-token prediction (MTP), which collectively cut training costs by 30% compared to dense counterparts. Post-training refinements, including supervised fine-tuning (SFT) and direct preference optimization (DPO), enhance alignment without relying on reinforcement learning from human feedback (RLHF), ensuring cost-effective scalability.
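For readers unfamiliar with DPO, the toy PyTorch function below sketches the standard DPO objective over preference pairs. It is a generic illustration of the technique, not DeepSeek’s post-training code, and the batch of random log-probabilities exists only to make the snippet runnable.

```python
# Generic DPO loss sketch (illustrative; not DeepSeek's training code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a tensor of summed log-probs for a batch of responses, computed
    under the trained policy or the frozen reference model; beta scales the implicit KL penalty."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # how much the policy upweights the preferred answer
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # ...and the rejected answer
    # Push the policy to widen the gap between preferred and rejected responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random log-probabilities for 4 preference pairs.
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch).item())
```

Because the loss only needs log-probabilities from the policy and a frozen reference model, DPO avoids the separate reward model and on-policy sampling loop that RLHF requires, which is where the cost savings come from.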

Comparisons to frontier models reveal DeepSeek-V3.2’s competitive edge. Against GPT-4o, it matches or exceeds scores on GPQA Diamond (62.0% vs. 61.0%) and MMLU-Pro (70.0% vs. 69.8%). Versus Gemini 2.5 Pro, it leads on MATH-500 (94.5% vs. 92.0%) and HumanEval (92.0% vs. 91.5%). Even speculative reports about GPT-5 suggest that DeepSeek-V3.2’s current standing could hold up against future releases, especially since its open nature allows community-driven improvements.

For practical use cases, DeepSeek-V3.2 shines in technical domains. Its long-context handling supports up to 128K tokens, with the context window extended via YaRN, making it well suited to RAG applications. In agent benchmarks such as BFCL (the Berkeley Function-Calling Leaderboard), it scores 41.2%, competitive with closed models, and developers praise its instruction-following and tool-use capabilities.
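To illustrate the kind of tool use that BFCL measures, the sketch below sends a function schema to an OpenAI-compatible chat endpoint, such as one exposed by a local vLLM server. The base URL, model name, and get_weather tool are hypothetical placeholders, not values from DeepSeek’s documentation.

```python
# Hedged sketch of function calling against an OpenAI-compatible endpoint.
# base_url, the model id, and the get_weather tool are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v3.2",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model's requested function call, if any
```

In an agent loop, the application would execute the requested function, append the result as a tool message, and call the model again so it can compose the final answer.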

DeepSeek AI emphasizes ethical considerations, implementing robust safety alignments during post-training. Toxicity evaluations via RealToxicityPrompts show low generation rates (4.5%), and the model refuses harmful queries effectively. The open-source release includes detailed technical reports, reproduction kits, and ablation studies, fostering transparency and reproducibility.

As open-source LLMs close the gap with proprietary systems, DeepSeek-V3.2 exemplifies how efficient architectures and massive-scale training democratize advanced AI. Its IMO gold achievement signals a new era in which open models tackle PhD-level reasoning, empowering innovation without vendor lock-in.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.