Alibaba's open Qwen2.5 takes aim at GPT-4o mini and Claude 3.5 Sonnet at a fraction of the cost

Alibaba Unveils Qwen2.5: An Open-Source Challenger to GPT-4o mini and Claude 3.5 Sonnet at Significantly Lower Costs

Alibaba Cloud’s Qwen team has launched Qwen2.5, a suite of open-source large language models positioned to rival proprietary giants like OpenAI’s GPT-4o mini and Anthropic’s Claude 3.5 Sonnet. This release emphasizes competitive performance across key benchmarks while offering dramatically reduced inference costs, making advanced AI more accessible to developers and enterprises worldwide.

The Qwen2.5 family includes dense models ranging from 0.5 billion to 72 billion parameters, alongside a Mixture-of-Experts (MoE) variant that activates 14 billion of its 20 billion total parameters per token. All models are released under the Apache 2.0 license, with weights available on Hugging Face, Tongyi Qianwen (Alibaba’s platform), and ModelScope. This open approach contrasts sharply with the closed-source nature of GPT-4o mini and Claude 3.5 Sonnet, enabling widespread customization, fine-tuning, and deployment without vendor lock-in.

Performance stands out as a core strength. On the Arena-Hard leaderboard, a crowdsourced evaluation of user preferences, Qwen2.5-72B-Instruct achieves 89.4 percent, edging out GPT-4o at 87.6 percent and Claude 3.5 Sonnet at 85.8 percent. This positions it as the top open model and a leader overall. In coding tasks, measured by LiveCodeBench, Qwen2.5-72B-Instruct scores 70.7, surpassing GPT-4o mini’s 68.8 and closely trailing Claude 3.5 Sonnet’s 70.9. Mathematics benchmarks like AIME 2024 yield 85.7 for the 72B model, competitive with closed counterparts.

Agentic capabilities, evaluated on the Berkeley Function-Calling Leaderboard (BFCL), show Qwen2.5-72B-Instruct at 77.6 percent, outperforming GPT-4o mini’s 74.1 percent and matching Claude 3.5 Sonnet’s 77.4 percent. On broad knowledge and reasoning benchmarks it leads open models, scoring 75.5 on MMLU-Pro and 61.6 on GPQA Diamond. Instruction-following on IFEval reaches 90.3 percent, and Needle-in-a-Haystack tests show effective long-context retrieval across the full 128K-token window.

What truly differentiates Qwen2.5 is its cost efficiency. Alibaba reports inference costs as low as 40 percent of GPT-4o mini’s for equivalent performance. Using vLLM on an H200 GPU cluster, Qwen2.5-72B-Instruct delivers 570 tokens per second at a cost of $0.07 per million tokens output, compared to GPT-4o mini’s $0.15 and Claude 3.5 Sonnet’s $3.00. Input costs are similarly advantageous at $0.28 per million tokens. This fractional pricing stems from optimized architecture, including Grouped-Query Attention (GQA) and tied embeddings, alongside efficient training on over 20 trillion tokens with a 128K context window.
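Using the per-million-token prices quoted above, the gap compounds quickly at volume. A back-of-the-envelope sketch (prices as reported in this article, not official rate cards):

```python
# Back-of-the-envelope cost comparison using the per-million-token
# output prices quoted above (USD).
OUTPUT_PRICE_PER_MTOK = {
    "Qwen2.5-72B-Instruct": 0.07,
    "GPT-4o mini": 0.15,
    "Claude 3.5 Sonnet": 3.00,
}

def output_cost(model: str, tokens: int) -> float:
    """Cost in USD for generating `tokens` output tokens."""
    return OUTPUT_PRICE_PER_MTOK[model] / 1_000_000 * tokens

# Example: 10 million output tokens per month.
for model in OUTPUT_PRICE_PER_MTOK:
    print(f"{model}: ${output_cost(model, 10_000_000):.2f}")
# Qwen2.5-72B-Instruct: $0.70, GPT-4o mini: $1.50, Claude 3.5 Sonnet: $30.00
```

At these figures Qwen2.5 output runs at roughly 47 percent of GPT-4o mini’s price and about 2 percent of Claude 3.5 Sonnet’s.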

Qwen2.5 builds on its predecessor, Qwen2, with enhancements in post-training alignment via supervised fine-tuning and direct preference optimization. The coding-focused Qwen2.5-Coder-14B-Instruct excels in code generation, achieving state-of-the-art results among open models under 100 billion parameters on benchmarks like EvalPlus, LiveCodeBench, and BigCodeBench.
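The direct preference optimization step mentioned above trains on pairs of preferred and rejected responses with a simple pairwise objective. A minimal sketch of the standard DPO loss (illustrative only, not the Qwen training code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin
    is how much more the policy prefers the chosen response over the
    rejected one, relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no preference signal the loss sits at -log(0.5) ≈ 0.693; as the
# policy shifts probability toward the chosen response, the loss falls.
```

Minimizing this pushes the model toward human-preferred outputs without training a separate reward model, which is one reason it has become a common post-training recipe for open models.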

Safety evaluations underscore robustness. Qwen2.5 scores 1.27 on the XSTest toxicity metric, below GPT-4o mini’s 1.49 and on par with Claude 3.5 Sonnet. Helpfulness-harmfulness ratings align with industry leaders, ensuring reliable deployment.

Availability is immediate and straightforward. Developers can access models via the Hugging Face Transformers library after a standard pip install transformers; the separate qwen-vl-utils helper applies only to the vision-language variants. Alibaba provides API access through DashScope at competitive rates. Quantized versions (FP8, AWQ, GPTQ) support deployment on consumer hardware, from RTX 4090 GPUs to MacBooks.
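Whether a given quantized checkpoint fits a given card comes down to simple arithmetic: parameter count times bytes per parameter, plus headroom for the KV cache and activations. A rough sizing helper (the 20 percent overhead figure is an assumption for illustration, not vendor guidance):

```python
# Rough VRAM estimate: parameters × bytes-per-parameter, plus ~20%
# headroom for KV cache and activations (an illustrative assumption).
BITS = {"fp16": 16, "fp8": 8, "awq-int4": 4, "gptq-int4": 4}

def est_vram_gb(params_billion: float, quant: str, overhead: float = 0.20) -> float:
    bytes_per_param = BITS[quant] / 8
    return params_billion * bytes_per_param * (1 + overhead)

for size in (7, 14, 72):
    print(f"{size}B @ int4 ≈ {est_vram_gb(size, 'awq-int4'):.1f} GB")
# 7B @ int4 ≈ 4.2 GB, 14B @ int4 ≈ 8.4 GB, 72B @ int4 ≈ 43.2 GB
```

By this estimate the 7B and 14B models at 4-bit fit comfortably on an RTX 4090’s 24 GB, while the 72B model even at 4-bit still wants a high-memory workstation or multiple GPUs.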

This release signals Alibaba’s aggressive push in the open-source AI arena, democratizing high-end capabilities. By matching or exceeding proprietary models in benchmarks while slashing costs, Qwen2.5 empowers startups, researchers, and cost-conscious enterprises to innovate without prohibitive expenses.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.