Google opens its infrastructure for AI models via MCP

Google Unveils Model Control Plane to Democratize Access to AI Infrastructure

In a significant move to broaden the accessibility of its cutting-edge AI hardware, Google has introduced the Model Control Plane (MCP), a new platform that enables developers and organizations to deploy and scale open-source AI models directly on Google’s high-performance infrastructure. Announced recently, MCP represents a strategic pivot, allowing third-party large language models (LLMs) and other AI workloads to leverage Google’s Tensor Processing Units (TPUs) without the complexities of underlying hardware management.

Breaking Down the Model Control Plane

At its core, MCP serves as an abstraction layer over Google’s vast AI compute resources, primarily its TPU pods. These specialized accelerators, renowned for their efficiency in training and inference tasks, have historically been reserved for Google’s internal models like Gemini or PaLM. With MCP, Google extends this capability to external models, supporting popular open-weight architectures such as Meta’s Llama 2, Mistral AI’s Mixtral, and Stability AI’s Stable Diffusion variants.

The platform integrates seamlessly with Vertex AI, Google’s managed machine learning service. Developers can select pre-approved models from the Vertex AI Model Garden, customize configurations, and deploy them across multi-slice TPU v5p pods. Each v5p pod scales to 8,960 chips, delivering exaFLOP-class performance optimized for transformer-based models. MCP handles orchestration, scaling, monitoring, and even cost optimization, abstracting away the intricacies of sharding, data parallelism, and fault tolerance.
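Conceptually, the data parallelism MCP automates amounts to splitting incoming work evenly across replicas. A toy sketch (replica counts and batch contents here are illustrative, not MCP defaults):

```python
# Toy illustration of the data-parallel sharding MCP handles automatically:
# an incoming batch is split into near-equal contiguous shards, one per replica.
# Replica count and batch contents are illustrative, not MCP defaults.

def shard_batch(batch, num_replicas):
    """Split a batch into num_replicas contiguous shards of near-equal size."""
    base, extra = divmod(len(batch), num_replicas)
    shards, start = [], 0
    for i in range(num_replicas):
        size = base + (1 if i < extra else 0)
        shards.append(batch[start:start + size])
        start += size
    return shards

requests = [f"prompt-{i}" for i in range(10)]
shards = shard_batch(requests, 4)
# Four shards of sizes 3, 3, 2, 2 — every request lands on exactly one replica.
```

The point of the abstraction is that developers never write this loop themselves; the control plane decides shard boundaries from the declared throughput requirements.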

Key technical features include:

  • Serverless Deployment: Users specify model parameters, input shapes, and throughput requirements via a simple API or UI. MCP automatically provisions TPUs, manages quantization (e.g., INT8 or BF16), and enables dynamic scaling based on demand.

  • Multi-Model Support: Beyond LLMs, MCP accommodates diffusion models for image generation and multimodal systems. It enforces strict isolation between workloads to prevent cross-contamination.

  • Performance Guarantees: Leveraging internal Google optimizations such as Pathways and the XLA compiler stack, MCP promises up to 2x faster inference than GPU-based alternatives, with lower latency and energy consumption.

  • Security and Compliance: All deployments run in isolated environments compliant with SOC 2, HIPAA, and GDPR standards. Data remains within Google’s perimeter unless explicitly exported.
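The INT8 quantization mentioned in the serverless-deployment feature has a simple core: weights are mapped to 8-bit integers with a per-tensor scale, trading a bounded amount of precision for memory and bandwidth savings. A generic symmetric-quantization sketch (not MCP’s actual scheme):

```python
# Generic symmetric INT8 quantization sketch — not MCP's actual implementation.
# Floats are mapped into [-127, 127] using a single per-tensor scale factor.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0, -0.98]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Quantized values stay within the int8 range; the reconstruction error per
# weight is bounded by half the scale step.
```

Real serving stacks add per-channel scales and calibration, but the memory argument is the same: each weight shrinks from 16 or 32 bits to 8.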

Strategic Implications for the AI Ecosystem

This initiative positions Google as a neutral player in the intensifying AI infrastructure race against AWS (with Trainium/Inferentia) and Microsoft Azure (with ND-series VMs powered by NVIDIA H100s). By opening TPUs to open models, Google aims to foster an ecosystem where developers prioritize model innovation over infrastructure lock-in.

Early adopters report substantial benefits. For instance, deploying a 70B-parameter Llama model on TPU v5p achieves 1,000+ tokens per second per user at a fraction of GPU costs. MCP is currently in a preview phase accessible via the Google Cloud console; access requires approval, with general availability promised soon.

However, limitations persist. Currently, only select models are supported, and custom fine-tuning workflows are still maturing. Integration with non-Google tools like Hugging Face Transformers is supported but not yet fully seamless. Pricing follows Vertex AI’s pay-as-you-go model, billed per TPU slice-hour, which may favor large-scale users.
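Slice-hour billing is easy to reason about with back-of-envelope arithmetic. A minimal sketch — the hourly rate below is a placeholder, not Google’s actual pricing:

```python
# Back-of-envelope cost model for per-slice-hour billing.
# HOURLY_RATE_PER_SLICE is a hypothetical placeholder, not Google's pricing.
HOURLY_RATE_PER_SLICE = 10.0  # assumed USD per TPU slice-hour

def monthly_cost(num_slices, hours_per_day, days=30):
    """Total cost = slices x hours/day x days x rate."""
    return num_slices * hours_per_day * days * HOURLY_RATE_PER_SLICE

cost = monthly_cost(4, 8)  # 4 slices x 8 h/day x 30 days x 10.0 = 9600.0
```

The multiplicative structure is why the model favors large users: sustained, high-utilization workloads amortize the per-slice-hour rate, while bursty small deployments pay for idle headroom.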

Technical Deep Dive: Under the Hood

MCP builds on Google’s XLA (Accelerated Linear Algebra) compiler, which optimizes model graphs for TPU hardware. During deployment, the control plane compiles the model into a serialized executable, partitions it across slices using techniques like tensor parallelism and pipeline parallelism, and deploys via Kubernetes-based orchestration.
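Tensor parallelism, one of the partitioning techniques named above, splits a single weight matrix column-wise across devices so each device computes a slice of the output. A pure-Python toy of the idea (real deployments do this through XLA’s SPMD partitioner, not hand-written loops):

```python
# Toy tensor-parallelism sketch: a weight matrix is split column-wise across
# "devices"; each device computes its slice of y = x @ W, and the per-device
# outputs are concatenated. Real systems delegate this to XLA's partitioner.

def matvec(x, W):
    """y[j] = sum_i x[i] * W[i][j], with W given as a list of rows."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(W, num_devices):
    """Assumes the column count divides evenly by num_devices."""
    per = len(W[0]) // num_devices
    return [[row[d * per:(d + 1) * per] for row in W] for d in range(num_devices)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, 2)            # two devices, two columns each
partials = [matvec(x, shard) for shard in shards]
y = partials[0] + partials[1]           # concatenate per-device outputs
# y matches matvec(x, W) computed on a single device: [11.0, 14.0, 17.0, 20.0]
```

Pipeline parallelism is the orthogonal cut: instead of splitting one matrix across devices, consecutive layers are placed on different devices and activations flow between them.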

For inference, MCP employs just-in-time (JIT) compilation and speculative decoding to minimize cold-start times. Monitoring integrates with Cloud Logging and Profiler, providing granular metrics on utilization, memory bandwidth, and systolic array efficiency.
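The core idea behind speculative decoding is that a cheap draft model proposes several tokens at once and the expensive target model verifies them, keeping the longest agreeing prefix. A toy greedy-verification sketch (production systems accept proposals probabilistically; the draft and target here are stand-in functions, not real models):

```python
# Toy speculative-decoding sketch with greedy verification: the draft proposes
# k tokens; the target keeps the longest prefix it agrees with, then always
# appends one token of its own. Draft/target are stand-ins, not real models.

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target verifies: accept the longest agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3. Target always contributes one token, guaranteeing progress.
    accepted.append(target_next(ctx))
    return accepted

# Stand-in "models" over integer tokens: the target counts up; the draft
# usually agrees but jumps ahead when the next token would be a multiple of 4.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if (ctx[-1] + 1) % 4 else ctx[-1] + 2

out = speculative_step([0], draft, target, k=4)  # → [1, 2, 3, 4]
```

When the draft agrees often, each target pass yields several tokens instead of one, which is where the latency win comes from.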

Developers interact via gcloud CLI commands or REST APIs:

gcloud ai custom-jobs create \
    --region=us-central1 \
    --display-name=mcp-llama-deployment \
    --config=mcp_config.yaml

The YAML config specifies replica count, machine type (e.g., tpu-v5p-256), and accelerator configurations.
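A config along those lines might look like the following — the field names follow the spirit of Vertex AI custom-job specs, but the values and image path are illustrative; consult the current schema rather than copying this verbatim:

```yaml
# Illustrative only — values and image path are placeholders; check the
# current Vertex AI custom-job schema before use.
workerPoolSpecs:
  - replicaCount: 1
    machineSpec:
      machineType: tpu-v5p-256   # example machine type from the text above
    containerSpec:
      imageUri: us-docker.pkg.dev/my-project/mcp/llama-serving:latest  # hypothetical image
```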

Future Roadmap and Accessibility

Google envisions MCP evolving to support emerging modalities like video generation and agentic workflows. Partnerships with model providers will expand the catalog, while hybrid cloud options via Google Distributed Cloud will extend reach to on-premises setups.

To get started, eligible Google Cloud users can request access through the Vertex AI console. Documentation, including sample notebooks and benchmarks, is available in the Google Cloud AI docs.

This launch underscores Google’s commitment to an open AI landscape, where infrastructure prowess meets collaborative model development.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.