OpenAI built a networking protocol with AMD, Broadcom, Intel, Microsoft, and NVIDIA to fix AI supercomputer bottlenecks

In the race to build ever-larger AI models, compute power from GPUs and accelerators has scaled dramatically, yet networking infrastructure remains a critical bottleneck. OpenAI, partnering with AMD, Broadcom, Intel, Microsoft, and Nvidia, has engineered a new networking protocol specifically tailored for AI supercomputers. This initiative addresses the inefficiencies that slow down distributed training across thousands of accelerators, enabling faster and more efficient scaling of foundation models.

Traditional networking solutions fall short for AI workloads. InfiniBand, long the gold standard for high-performance computing, dominates AI clusters thanks to its low latency and high bandwidth, but it is largely controlled by Nvidia, raising vendor lock-in concerns for hyperscalers and AI labs. Ethernet-based alternatives, while more widespread and cost-effective, suffer higher latencies and lower throughput in the collective operations essential to AI training, such as all-reduce and all-gather. These primitives synchronize gradients and model parameters across nodes, and any delay compounds across clusters with tens of thousands of GPUs.
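As a concrete illustration, here is a toy Python sketch of what the all-reduce primitive computes: every node contributes its local gradient vector and must end up with the identical elementwise sum. This shows the semantics only, not how any fabric (including the new protocol) implements it.

```python
# Toy illustration of the all-reduce collective used in distributed training:
# each node starts with its local gradient vector and must end with the
# identical elementwise sum across all nodes. Semantics only -- real fabrics
# implement this with ring/tree algorithms and RDMA, not a central loop.

def all_reduce(gradients):
    """gradients: one list of floats per node; returns the synced copies."""
    n_elems = len(gradients[0])
    reduced = [sum(node[i] for node in gradients) for i in range(n_elems)]
    # After all-reduce, every node holds an identical copy of the sum.
    return [list(reduced) for _ in gradients]

nodes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # per-node local gradients
synced = all_reduce(nodes)                    # every node now holds [9.0, 12.0]
```

Because every node must receive the same result before the next optimizer step, the slowest exchange in this operation gates the whole cluster.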

The new protocol, developed collaboratively, introduces optimizations that bridge this gap. It builds on Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) foundations but incorporates AI-specific enhancements. Key features include native support for 800 Gbps Ethernet speeds per port, with scalability to 1.6 Tbps. Latency is reduced to sub-microsecond levels for small messages, crucial for the frequent, small-packet exchanges in transformer-based model training. The protocol embeds hardware-accelerated collective operations directly into the network fabric, offloading them from the CPUs and accelerators.
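A back-of-the-envelope model helps explain why sub-microsecond latency matters more than raw bandwidth for small messages. The figures below (0.8 µs per-message latency, an 800 Gbps link, a 4 KiB packet) are illustrative assumptions for this sketch, not measured numbers for the protocol.

```python
# Simple transfer-time model: per-message latency plus serialization delay.
# The 0.8 us latency and 800 Gbps figures are illustrative assumptions.

def transfer_time_us(payload_bytes, latency_us=0.8, bandwidth_gbps=800):
    # Gbps = 1,000 bits per microsecond.
    serialization_us = payload_bytes * 8 / (bandwidth_gbps * 1_000)
    return latency_us + serialization_us

small = transfer_time_us(4 * 1024)    # 4 KiB gradient packet: latency dominates
bulk = transfer_time_us(256 * 2**20)  # 256 MiB tensor: bandwidth dominates
```

For the 4 KiB packet, wire time is roughly 0.04 µs against 0.8 µs of latency, so cutting latency pays off far more than adding bandwidth; for the 256 MiB tensor the relationship reverses, which is why the protocol targets both ends.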

This design stems from real-world needs observed in OpenAI’s training runs for models like GPT-4. During scaling experiments, networking accounted for up to 50 percent of total training time in large clusters. By integrating congestion control mechanisms tuned for bursty AI traffic patterns and priority flows for urgent collectives, the protocol minimizes tail latencies that previously caused stragglers in distributed jobs.
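The straggler effect is easy to see in a toy model: a synchronous training step completes only when the slowest node's collective finishes, so step time tracks the tail of the latency distribution rather than its mean. The millisecond figures below are purely illustrative.

```python
# Toy straggler model for a synchronous training step: the step ends when the
# slowest node finishes its collective, so step time is the max, not the mean,
# of per-node network times. The millisecond figures are illustrative.

def step_time_ms(node_times_ms):
    return max(node_times_ms)

# 1,024 nodes: all but one finish in 1 ms; one hits a 10 ms congestion spike.
times = [1.0] * 1023 + [10.0]
mean_ms = sum(times) / len(times)  # ~1.01 ms -- looks healthy on average
step_ms = step_time_ms(times)      # 10 ms -- one straggler stalls everyone
```

This is why taming tail latency, rather than improving averages, is the target of the congestion-control and priority-flow mechanisms described above.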

The collaboration leverages each partner’s strengths. Nvidia contributes expertise from its InfiniBand and Spectrum-X Ethernet lines, ensuring compatibility with existing Hopper and Blackwell GPUs. AMD provides input from its Instinct accelerators and ROCm software stack. Broadcom supplies Jericho and Tomahawk switch ASICs optimized for the protocol. Intel contributes its Ethernet controller expertise alongside accelerators from its Data Center GPU Max series. Microsoft, as OpenAI’s primary cloud provider via Azure, integrates the protocol into its NDv5-series AI supercomputers, which already feature advanced networking.

Implementation details highlight the protocol’s practicality. It uses a packet format with extended headers for metadata like job IDs and operation types, allowing switches to perform in-network computations. For instance, switches can execute ring-based all-reduce algorithms natively, reducing data movement by up to 90 percent compared to host-based implementations. Error correction employs Forward Error Correction (FEC) with low overhead, maintaining reliability at scale. Security features include end-to-end encryption and secure multi-tenancy for shared clusters.
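For readers unfamiliar with the ring-based all-reduce mentioned above, here is a sketch of the classic algorithm (a reduce-scatter phase followed by an all-gather phase). This is a host-side simulation meant only to illustrate the data flow; in-network execution would perform the per-chunk additions inside switch ASICs.

```python
# Sketch of ring all-reduce: data on each of n nodes is split into n chunks;
# a reduce-scatter phase sums chunks around the ring, then an all-gather
# phase circulates the completed sums. Each node transmits 2*(n-1) chunks,
# i.e. about 2x its data size regardless of cluster size. Simulation only.

def ring_all_reduce(buffers):
    n = len(buffers)                    # buffers[i][c]: node i's chunk c
    bufs = [list(b) for b in buffers]
    # Phase 1: reduce-scatter. After n-1 steps, node i owns the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        snap = [list(b) for b in bufs]  # all nodes send simultaneously
        for i in range(n):
            c = (i - step) % n          # chunk node i forwards this step
            bufs[(i + 1) % n][c] += snap[i][c]
    # Phase 2: all-gather. Circulate each completed chunk around the ring.
    for step in range(n - 1):
        snap = [list(b) for b in bufs]
        for i in range(n):
            c = (i + 1 - step) % n
            bufs[(i + 1) % n][c] = snap[i][c]
    return bufs
```

With three nodes holding [1, 2, 3], [4, 5, 6], and [7, 8, 9], every node ends with the per-chunk sums [12, 15, 18]. Moving the Phase 1 additions into the switches is what lets the fabric cut host-side data movement so sharply.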

Testing in prototype clusters demonstrated compelling results. In a 1,000-GPU setup simulating GPT-4 training, the protocol achieved 1.5x faster time-to-train versus standard RoCE fabrics. Bandwidth utilization exceeded 95 percent during peak collectives, versus 70 percent on legacy systems. Power efficiency improved by 30 percent through reduced CPU involvement and optimized packet processing.

OpenAI’s push for this protocol aligns with broader industry trends toward disaggregated AI infrastructure. As supercomputers like Microsoft’s Stargate (planned for millions of GPUs) come online, open standards prevent monopolies and foster innovation. The group plans to upstream the specification to bodies like the Ultra Ethernet Consortium, ensuring broad adoption. Early adopters include Azure’s next-generation Maia clusters and potential integrations in AWS and Google Cloud.

Challenges remain, including ecosystem maturity. Driver support across ROCm, CUDA, and oneAPI has yet to converge, and switch silicon availability lags behind compute roadmaps. Nonetheless, this protocol positions Ethernet as a viable alternative to InfiniBand, democratizing access to exaflop-scale AI training.

By solving networking bottlenecks, OpenAI and its partners pave the way for the next era of AI scaling, where model sizes and capabilities grow without proportional infrastructure costs.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.