Meta unveils four generations of custom AI chips to cut inference costs for billions of users

In a significant push toward AI efficiency, Meta has revealed details on four generations of its custom Meta Training and Inference Accelerator (MTIA) chips. These hardware innovations are designed specifically to optimize inference workloads for Meta’s Llama family of large language models, ultimately aiming to reduce costs dramatically for the billions of users interacting with Meta’s AI services daily.

The journey began with the first-generation MTIA, which Meta deployed in production in 2023. This initial chip focused primarily on accelerating inference for the Llama 2 model. Built on TSMC’s 7nm process, it featured 16 compute units per die, each containing 16 tensor compute cores, 16 vector compute cores, and 8 scalar compute cores. The architecture emphasized high-bandwidth memory access, with each compute unit paired with 16GB of HBM3 memory, enabling efficient handling of large model parameters. In deployment, the first-gen MTIA demonstrated substantial improvements in inference throughput over off-the-shelf GPUs, marking Meta’s entry into custom silicon for AI.
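To put those per-compute-unit figures in perspective, the short sketch below simply totals them at the die level. The constants restate the numbers quoted above; nothing here is an official Meta specification.

```python
# Illustrative only: aggregate die-level totals from the per-compute-unit
# figures quoted in the article (not an official Meta spec sheet).
COMPUTE_UNITS_PER_DIE = 16
TENSOR_CORES_PER_UNIT = 16
VECTOR_CORES_PER_UNIT = 16
SCALAR_CORES_PER_UNIT = 8
MEMORY_GB_PER_UNIT = 16

totals = {
    "tensor_cores": COMPUTE_UNITS_PER_DIE * TENSOR_CORES_PER_UNIT,
    "vector_cores": COMPUTE_UNITS_PER_DIE * VECTOR_CORES_PER_UNIT,
    "scalar_cores": COMPUTE_UNITS_PER_DIE * SCALAR_CORES_PER_UNIT,
    "memory_gb":    COMPUTE_UNITS_PER_DIE * MEMORY_GB_PER_UNIT,
}

for name, value in totals.items():
    print(f"{name}: {value}")
# tensor_cores: 256, vector_cores: 256, scalar_cores: 128, memory_gb: 256
```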

Building on this foundation, the second-generation MTIA v2 entered production earlier this year, delivering approximately three times the inference performance of its predecessor within the same power envelope. Fabricated on TSMC’s 5nm process, MTIA v2 incorporates 24 compute units per die, with enhancements to the tensor cores for better support of newer model architectures. Key upgrades include improved sparsity handling, which allows more efficient computation on pruned neural networks, and optimizations for the Llama 3 models. Meta reports that MTIA v2 systems now power a significant portion of Llama 3 inference across its data centers, contributing to lower latency and higher throughput for user-facing applications like chatbots and content generation tools.
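To illustrate what improved sparsity handling buys, the sketch below compares a dense matrix-vector product against one that touches only the non-zero weights of a pruned matrix. It is a generic illustration of the idea, not MTIA’s actual sparsity format or data layout.

```python
import numpy as np

# Generic illustration of why sparsity handling helps inference:
# a pruned weight matrix stores only its non-zero entries, so the
# matrix-vector product performs far fewer multiplications.
# This is NOT MTIA's actual sparsity format, just the general idea.

rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024))

# Prune: zero out the 75% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(weights), 0.75)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

# Dense path: multiply everything, zeros included.
x = rng.standard_normal(1024)
dense_out = pruned @ x

# "Sparse" path: keep only non-zero entries and their column indices,
# mimicking what a sparsity-aware accelerator would actually compute.
sparse_out = np.zeros(1024)
for row in range(1024):
    cols = np.nonzero(pruned[row])[0]
    sparse_out[row] = pruned[row, cols] @ x[cols]

print(np.allclose(dense_out, sparse_out))                      # True
print(f"multiplications skipped: {np.mean(pruned == 0):.0%}")  # ~75%
```

The two paths produce the same result, but the second skips roughly three quarters of the work, which is the saving hardware support for sparsity tries to capture.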

Looking ahead, Meta has outlined plans for MTIA v3, slated for deployment in 2025. This third generation promises another threefold performance leap over MTIA v2, achieved through a shift to TSMC’s 3nm process node. The chip will feature 48 compute units per die, doubling the density while introducing advanced features such as enhanced support for grouped-query attention, a technique critical for scaling transformer-based models. Additionally, MTIA v3 will incorporate improvements in memory bandwidth and efficiency, with each compute unit accessing higher-capacity HBM3e memory stacks. These advancements are tailored to handle the growing demands of Llama 4 and future iterations, where model sizes are expected to exceed hundreds of billions of parameters.
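Grouped-query attention is worth a brief illustration, since it is the feature MTIA v3 is said to accelerate: several query heads share a single key/value head, which shrinks the KV cache the accelerator must stream from memory. The sketch below uses toy shapes, not Llama’s real configuration.

```python
import numpy as np

# Minimal sketch of grouped-query attention (GQA). Shapes are illustrative.
seq_len, head_dim = 8, 64
n_q_heads, n_kv_heads = 8, 2            # 4 query heads share each KV head
group_size = n_q_heads // n_kv_heads

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, head_dim))
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

outputs = []
for h in range(n_q_heads):
    kv = h // group_size                          # shared KV head for this query head
    scores = q[h] @ k[kv].T / np.sqrt(head_dim)   # scaled dot-product attention
    outputs.append(softmax(scores) @ v[kv])

out = np.stack(outputs)                           # (n_q_heads, seq_len, head_dim)
print(out.shape, "KV heads cached:", n_kv_heads, "instead of", n_q_heads)
```

Only two key/value heads are cached instead of eight, which is exactly the memory-bandwidth saving that makes the technique attractive for inference hardware.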

Meta’s roadmap extends even further with a next-generation MTIA, internally referred to as the fourth generation, targeted for production post-2025. While specific details remain under wraps, Meta indicates it will leverage cutting-edge process technologies beyond 3nm, potentially incorporating chiplet designs for modular scalability. This chip aims to further reduce inference costs per token, aligning with Meta’s goal of making high-performance AI accessible at scale without prohibitive expenses.
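As a rough feel for what “cost per token” means, the sketch below folds energy cost and amortized hardware cost into a per-million-token figure. Every number in it is a hypothetical placeholder; the article quotes no such figures for MTIA.

```python
# Back-of-the-envelope cost-per-token model. All inputs are hypothetical
# placeholders for illustration; none are reported MTIA figures.

def cost_per_million_tokens(power_kw, electricity_usd_per_kwh,
                            hourly_capex_usd, tokens_per_second):
    """Rough operating cost per million generated tokens for one serving node."""
    energy_cost_per_hour = power_kw * electricity_usd_per_kwh
    total_cost_per_hour = energy_cost_per_hour + hourly_capex_usd
    tokens_per_hour = tokens_per_second * 3600
    return total_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical node: 5 kW draw, $0.08/kWh, $1.50/hour amortized hardware, 2,000 tok/s.
print(f"${cost_per_million_tokens(5, 0.08, 1.50, 2000):.2f} per million tokens")
```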

Central to this progression is Meta’s focus on inference optimization, distinct from the training workloads that dominate GPU-centric infrastructures. Inference, the phase where trained models generate outputs for real-world queries, accounts for the bulk of operational costs in production environments. By customizing silicon for Llama models, Meta addresses inefficiencies in general-purpose hardware like Nvidia GPUs, which, while versatile, carry high acquisition and power costs. MTIA chips integrate tightly with Meta’s software stack, including the Executive Jump Instruction Set (EJIS) and custom compilers that map Llama operations directly onto the hardware accelerators.
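For readers less familiar with the distinction, the decode loop below sketches why inference dominates serving costs: a trained model runs one full forward pass for every token it generates, so cost scales directly with usage. The stand-in forward function is purely illustrative and is not Llama or any part of Meta’s stack.

```python
import numpy as np

# Generic sketch of the inference (decode) phase. The "model" is a stand-in.
VOCAB_SIZE = 32000

def forward(token_ids):
    """Stand-in for a trained model: returns logits over the vocabulary."""
    rng = np.random.default_rng(hash(tuple(token_ids)) % (2**32))
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt_ids, max_new_tokens=8):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):          # one forward pass per generated token
        logits = forward(ids)
        ids.append(int(np.argmax(logits)))   # greedy decoding for simplicity
    return ids

print(generate([1, 42, 7]))
```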

Performance metrics underscore the impact. For instance, MTIA v2 achieves up to 42 tokens per second per user for Llama 3 70B models in multi-user scenarios, a marked improvement over GPU baselines. Power efficiency is another cornerstone: MTIA systems consume less energy per inference operation than those baselines, which is crucial for Meta’s hyperscale data centers serving more than three billion monthly active users across platforms like Facebook, Instagram, and WhatsApp.
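To make the quoted figure concrete, the arithmetic below converts 42 tokens per second per user into per-token latency and into aggregate node throughput for an assumed number of concurrent users; the per-user rate comes from the article, while the user count is a hypothetical placeholder.

```python
# The per-user rate is the figure quoted above; the concurrent-user count is assumed.
tokens_per_second_per_user = 42
per_token_latency_ms = 1000 / tokens_per_second_per_user
print(f"per-token latency seen by one user: {per_token_latency_ms:.1f} ms")  # ~23.8 ms

concurrent_users = 64  # hypothetical batch size on one serving node
aggregate_tps = tokens_per_second_per_user * concurrent_users
print(f"aggregate throughput at {concurrent_users} users: {aggregate_tps:,} tokens/s")
```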

Meta’s custom chip strategy also mitigates supply chain risks and vendor dependencies. By controlling the full stack from architecture to deployment, the company can iterate rapidly on Llama-specific needs, such as rotary positional embeddings and sliding window attention. This vertical integration has already yielded systems comprising thousands of MTIA chips, clustered for distributed inference.
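Sliding-window attention, one of the Llama-specific features mentioned above, can be illustrated with a simple mask: each position attends only to the most recent window of tokens, which bounds how much key/value state must be kept around. The sketch below is generic, not Meta’s implementation.

```python
import numpy as np

# Sliding-window attention mask: position i may attend only to the last
# `window` positions (and never to the future). Toy sizes for illustration.
seq_len, window = 10, 4
i = np.arange(seq_len)[:, None]   # query positions
j = np.arange(seq_len)[None, :]   # key positions

mask = (j <= i) & (j > i - window)
print(mask.astype(int))           # 1 where attention is allowed
```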

The broader implications are profound. As open-source Llama models gain traction globally, reduced inference costs democratize access to powerful AI. Developers and enterprises can deploy Llama at the edge or in the cloud with lower barriers, fostering innovation in areas like personalized recommendations and real-time translation.

Meta’s commitment to this trajectory signals a maturing AI hardware ecosystem, where hyperscalers increasingly design purpose-built accelerators to sustain exponential growth in model capabilities and user demands.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.