NVIDIA sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles

NVIDIA Sets New MLPerf Training Records with 288-GPU Clusters as AMD and Intel Target Alternative Categories

In the latest MLPerf Training v4.0 results, NVIDIA has once again demonstrated its dominance in large-scale AI training workloads, shattering previous records across multiple benchmarks using configurations with up to 288 H100 GPUs. These submissions, powered by NVIDIA’s DGX H100 SuperPOD systems, highlight the company’s continued leadership in accelerating the training of massive foundation models, including GPT-3 at 175 billion parameters and Llama 2 at 70 billion parameters.

MLPerf, the industry-standard benchmark suite developed by the MLCommons consortium, whose members include NVIDIA, AMD, Intel, and Google, evaluates hardware and software performance in training deep learning models from scratch. The v4.0 round introduced expanded workloads, such as GPT-3 175B, Llama 2 70B, and Stable Diffusion XL, alongside classics like BERT and ResNet-50. NVIDIA’s results set 11 new records, the standout being the fastest training time for GPT-3 175B: just over three minutes using 288 H100 GPUs. This represents a significant leap and underscores the efficiency of NVIDIA’s full-stack approach, spanning hardware, the CUDA software ecosystem, and optimized libraries such as cuDNN and NeMo.
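To put headline time-to-train numbers in perspective: strong-scaling efficiency is simply the achieved speedup divided by the ideal (linear) speedup from adding GPUs. Here is a minimal Python sketch of that arithmetic, using purely illustrative GPU counts and timings rather than audited MLPerf figures:

```python
# Back-of-envelope strong-scaling check: given a measured time-to-train
# at a baseline GPU count, how close does a larger run come to the
# ideal (linear) speedup?

def scaling_efficiency(base_gpus: int, base_minutes: float,
                       big_gpus: int, big_minutes: float) -> float:
    """Return achieved speedup divided by ideal speedup (1.0 = linear)."""
    ideal_speedup = big_gpus / base_gpus
    achieved_speedup = base_minutes / big_minutes
    return achieved_speedup / ideal_speedup

# Hypothetical example: 320 minutes on 8 GPUs versus 10 minutes on
# 288 GPUs (illustrative numbers only, not MLPerf results).
eff = scaling_efficiency(base_gpus=8, base_minutes=320,
                         big_gpus=288, big_minutes=10)
print(f"Scaling efficiency: {eff:.1%}")  # -> 88.9%
```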

The DGX H100 SuperPOD configuration pairs the H100’s high-bandwidth HBM3 memory with NVLink and NVSwitch interconnects within each node, and InfiniBand networking across nodes, for seamless scaling to hundreds of GPUs. For the Llama 2 70B benchmark, NVIDIA achieved a record time of approximately 96 minutes with 256 H100 GPUs, while the 288-GPU setup further optimized larger-scale runs. These results were validated in MLPerf’s peer-reviewed Closed division, ensuring rigorous, reproducible numbers under standardized conditions. NVIDIA’s submissions also excelled in smaller-scale benchmarks, reinforcing versatility from edge to cloud deployments.
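NVIDIA’s actual submissions run on its NeMo and Megatron training stack, which is not reproduced here. As a generic illustration of the data-parallel pattern that large multi-GPU training runs rest on, here is a minimal PyTorch DistributedDataParallel sketch, assuming the NCCL backend that NVLink-connected GPUs typically use:

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=8 ddp_sketch.py
# Generic illustration only, not NVIDIA's MLPerf submission code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL uses NVLink/InfiniBand when present
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                              # stand-in training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()      # gradients are all-reduced across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```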

While NVIDIA focused on pushing the boundaries of large-scale training, competitors AMD and Intel pursued different strategic emphases within the same MLPerf round. AMD, leveraging its Instinct MI300X GPUs, submitted results emphasizing inference performance and cost-efficiency rather than chasing training records on the largest models. In the MLPerf Inference v4.0 results released concurrently, AMD set a scale record in the GPT-J offline scenario with eight MI300X GPUs, achieving high throughput while prioritizing power efficiency. AMD’s approach aligns with its growing traction in data center inference workloads, where the MI300X’s chiplet design and CDNA 3 architecture deliver competitive performance per watt.
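For context, MLPerf Inference’s offline scenario measures raw throughput in samples per second. Real submissions go through the official LoadGen harness, but the underlying arithmetic looks roughly like the sketch below, which also runs unchanged on MI300X because AMD’s ROCm builds of PyTorch expose the familiar torch.cuda API via HIP; the model and batch here are stand-ins:

```python
# Rough offline-style throughput measurement (samples per second) on a
# single accelerator. Real MLPerf Inference submissions use the official
# LoadGen harness; this shows only the basic arithmetic.
import time
import torch

def offline_throughput(model, batch, n_iters: int = 50) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds also report "cuda"
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(5):                       # warm-up iterations
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return n_iters * batch.shape[0] / elapsed    # samples per second

# Stand-in model and input, for illustration only.
print(offline_throughput(torch.nn.Linear(512, 512), torch.randn(64, 512)))
```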

Intel, meanwhile, concentrated on open-ecosystem submissions using its Gaudi 3 AI accelerators. Intel’s results targeted mid-range training and inference benchmarks, such as BERT and Llama 2 70B, with configurations scaling to 256 Gaudi 3 chips. Notably, Intel claimed a top system-performance result for Llama 2 70B training in the Open division and performed well in edge inference categories. Gaudi 3 scales over standard Ethernet (RDMA over Converged Ethernet) rather than a proprietary interconnect, appealing to customers seeking vendor-neutral architectures. Intel also submitted Xeon 6-based systems for CPU-only training, demonstrating that general-purpose CPUs remain viable for smaller workloads and for the host side of hybrid CPU-accelerator deployments.
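As a concrete illustration of that open-ecosystem pitch, Gaudi devices plug into stock PyTorch through Intel’s habana_frameworks package under the "hpu" device name. A minimal single-device training step, assuming the Gaudi software stack is installed (the model and data are stand-ins):

```python
# Minimal single-Gaudi training step via Intel's PyTorch integration.
# Requires the Gaudi software stack (habana_frameworks); illustrative only.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)  # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=device)        # stand-in batch
loss = model(x).square().mean()
opt.zero_grad()
loss.backward()
htcore.mark_step()  # flush the lazy-mode graph to the accelerator
opt.step()
htcore.mark_step()
```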

Other participants, including Google Cloud with custom TPU v5p pods and IBM with Power10 systems, contributed to a diverse results landscape. Google’s TPUs set records in select closed-division benchmarks, while IBM focused on enterprise-grade resilience. The submissions reveal a bifurcated competition: NVIDIA’s GPU-centric dominance in hyperscale training versus AMD and Intel’s bids for inference leadership and cost-optimized niches.

These MLPerf outcomes carry implications for AI infrastructure procurement. Enterprises training frontier models at the largest scales will likely gravitate toward NVIDIA’s ecosystem for raw speed, while those optimizing inference fleets for total cost of ownership may evaluate AMD’s MI300X or Intel’s Gaudi 3. The benchmarks also spotlight software maturity: NVIDIA’s NVLink and Magnum IO stack enables near-linear scaling, whereas AMD’s ROCm and Intel’s oneAPI are still maturing toward broader model compatibility.
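Those total-cost-of-ownership evaluations ultimately reduce to arithmetic over purchase price, power draw, and sustained throughput. A toy sketch with entirely hypothetical figures (no vendor’s real pricing, wattage, or throughput is implied):

```python
# Toy total-cost-of-ownership comparison for an inference fleet.
# Every number below is a hypothetical placeholder, not vendor data.
def cost_per_million_queries(price_usd: float, lifetime_years: float,
                             watts: float, usd_per_kwh: float,
                             queries_per_sec: float) -> float:
    seconds = lifetime_years * 365 * 24 * 3600
    total_queries = queries_per_sec * seconds
    energy_cost = (watts / 1000) * (seconds / 3600) * usd_per_kwh
    return (price_usd + energy_cost) / total_queries * 1e6

# Two hypothetical accelerators differing in price and efficiency.
a = cost_per_million_queries(30_000, 4, 700, 0.10, 1200)
b = cost_per_million_queries(18_000, 4, 750, 0.10, 900)
print(f"A: ${a:.2f} per million queries, B: ${b:.2f} per million queries")
```

Depending on the inputs, a cheaper but slower part can win or lose on this metric once energy and throughput are folded in, which is why per-watt and per-dollar framings matter as much as raw speed.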

As AI models grow exponentially, MLPerf’s evolving benchmarks will continue testing scalability limits. The v4.0 results affirm NVIDIA’s entrenched position in training marathons, yet AMD and Intel’s targeted advances signal intensifying rivalry in inference and efficiency frontiers, potentially reshaping data center economics.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since integrating AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI runs entirely offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix ships with numerous privacy- and anonymity-focused services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.