Google Advances Multimodal AI with Gemini Embedding 2: Unifying Text, Images, Videos, and Audio in a Single Vector Space
Google has introduced Gemini Embedding 2, a groundbreaking development in multimodal AI that embeds text, images, videos, and audio into a unified vector space. This innovation, announced through Google’s Vertex AI platform, enables seamless cross-modal retrieval and search capabilities, marking a significant step forward in handling diverse data types within a single embedding framework.
At the core of Gemini Embedding 2 are two primary models: text-embedding-004 for text-only processing and a multimodal embedding model that supports text alongside images, videos, and audio. The multimodal model processes these inputs by converting videos into frames and audio into spectrograms, allowing for consistent representation across modalities. All embeddings are projected into a 1024-dimensional vector space, facilitating direct similarity comparisons between different data types without the need for modality-specific handling.
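Because everything lands in the same 1024-dimensional space, a single cosine-similarity function can compare any pair of embeddings, whatever their source modality. The sketch below illustrates that idea with synthetic stand-in vectors (the real vectors would come from the embedding API; the "semantically close" pair is simulated by adding small noise):

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 1024-dimensional embeddings; in practice these would be
# returned by the embedding API, not generated randomly.
random.seed(0)
text_vec  = [random.gauss(0, 1) for _ in range(1024)]
image_vec = [t + 0.1 * random.gauss(0, 1) for t in text_vec]  # simulated: close to text_vec
audio_vec = [random.gauss(0, 1) for _ in range(1024)]         # simulated: unrelated

print(round(cosine_similarity(text_vec, image_vec), 3))  # high: same region of the space
print(round(cosine_similarity(text_vec, audio_vec), 3))  # low: independent direction
```

The same scoring function works for text-image, text-audio, or any other pairing, which is exactly what the shared vector space buys you.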
This unification addresses a longstanding challenge in AI: bridging disparate data formats into a cohesive semantic space. Previously, systems often required separate embeddings for each modality, complicating retrieval tasks. With Gemini Embedding 2, developers can now perform queries like searching for images using text descriptions, retrieving videos based on audio cues, or finding relevant text passages from visual content. This is particularly powerful for applications such as enterprise search, recommendation systems, content moderation, and e-commerce, where multimodal inputs are commonplace.
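The retrieval pattern this enables can be sketched as a single index holding embeddings of mixed modalities, queried with one vector. The snippet below is a toy illustration (4-dimensional vectors standing in for real 1024-dimensional embeddings, and all file names invented); the point is that ranking is modality-agnostic:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# One index for every modality. Vectors here are tiny hand-made stand-ins
# for real embeddings; each entry records its modality only for display.
index = [
    ("sunset.jpg",  "image", [0.9, 0.1, 0.0, 0.1]),
    ("podcast.mp3", "audio", [0.1, 0.8, 0.2, 0.0]),
    ("beach.mp4",   "video", [0.8, 0.2, 0.1, 0.1]),
    ("report.txt",  "text",  [0.0, 0.1, 0.9, 0.2]),
]

def search(query_vec, k=2):
    """Rank all items, regardless of modality, by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(name, modality) for name, modality, _ in ranked[:k]]

# A text query such as "sunset over the ocean" would embed near the
# visual items, so images and videos surface together.
query = [0.85, 0.15, 0.05, 0.1]
print(search(query))  # → [('sunset.jpg', 'image'), ('beach.mp4', 'video')]
```

No per-modality index or scoring rule is needed; the embedding model has already done the cross-modal alignment.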
Performance benchmarks underscore the model’s prowess. On the Multilingual Retrieval (MIR) benchmark spanning 41 languages, text-embedding-004 achieves an average recall-at-10 (R@10) of 70.0%, surpassing competitors like OpenAI’s text-embedding-3-large (64.6% in English) and Voyage AI’s voyage-large-2 (64.1%). In long-context retrieval on the MR-Bench dataset, it excels with a 62.2% R@10 score on 8K-token contexts, demonstrating robust handling of extended inputs.
The multimodal embedding model shines on the VideoMME benchmark, attaining 72.5% accuracy, which outperforms Meta’s ImageBind by 20.7 percentage points. On the AudioMME benchmark for audio understanding, it scores 68.5%, edging out ImageBind’s 66.9%. These results highlight Gemini Embedding 2’s superior cross-modal retrieval, where, for instance, text queries can effectively retrieve matching videos or audio clips.
Implementation is streamlined via the Vertex AI Embeddings API. Developers can access the models through familiar SDKs in Python, Node.js, Go, Java, and REST endpoints. The API supports dynamic batching for efficient processing of variable-sized inputs, with automatic frame sampling for videos (defaulting to 4 frames, adjustable from 1 to 50) and spectrogram generation for audio (8 seconds by default, up to 60 seconds). Input limits include 32K tokens for text, 3072x3072 pixels per image (with intelligent resizing), and specific durations for video and audio.
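As a rough sketch of what a REST request to the multimodal model might look like, the helper below assembles a request body exposing the frame-sampling knob described above. The field names (`instances`, `bytesBase64Encoded`, `videoSegmentConfig`, `frameCount`) are illustrative assumptions, not the documented schema; check the Vertex AI Embeddings API reference for the actual payload shape:

```python
import json

def build_multimodal_request(text=None, image_b64=None, video_uri=None,
                             frame_count=4):
    """Assemble a REST-style request body for a multimodal embedding call.

    NOTE: field names are assumptions for illustration; consult the
    Vertex AI API reference for the real schema.
    """
    if not 1 <= frame_count <= 50:  # adjustable sampling range cited above
        raise ValueError("frame_count must be between 1 and 50")
    instance = {}
    if text is not None:
        instance["text"] = text
    if image_b64 is not None:
        instance["image"] = {"bytesBase64Encoded": image_b64}
    if video_uri is not None:
        instance["video"] = {"gcsUri": video_uri,
                             "videoSegmentConfig": {"frameCount": frame_count}}
    return {"instances": [instance]}

body = build_multimodal_request(text="sunset over the ocean",
                                video_uri="gs://my-bucket/clip.mp4")
print(json.dumps(body, indent=2))
```

The same shape would then be POSTed to the regional Vertex AI endpoint with standard OAuth credentials.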
Safety and reliability features are integrated throughout. The models undergo rigorous evaluation for safety attributes, including harmlessness, using benchmarks like RealToxicityPrompts, and for factual accuracy via SimpleQA. Vertex AI provides content filtering with categories for harmful content, hate speech, and more, configurable via safety settings. Rate limits are enforced—1,000 requests per minute per project for text-embedding-004 and 10 queries per minute per project for the multimodal model—to ensure stable usage.
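With only 10 queries per minute allowed for the multimodal model, batch jobs benefit from a client-side throttle. Below is a minimal sliding-window sketch (my own illustration, not part of any SDK); production code would additionally back off on HTTP 429 responses:

```python
import time

class MinuteRateLimiter:
    """Allow at most `limit` calls per rolling `window` seconds.

    A simple client-side sketch for staying under per-project quotas,
    e.g. MinuteRateLimiter(limit=10) for the multimodal model.
    """
    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = []  # timestamps of recent calls

    def acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            # Sleep until the oldest call leaves the window, then retry.
            time.sleep(self.window - (now - self.calls[0]))
            return self.acquire()
        self.calls.append(self.clock())

limiter = MinuteRateLimiter(limit=10)
limiter.acquire()  # call once before each multimodal embedding request
```

The injectable `clock` makes the limiter easy to unit-test without real sleeps.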
Pricing is competitive and usage-based. Text embeddings cost $0.000025 per 1,000 characters for the first 1 million characters per month, dropping to $0.00001 thereafter. Multimodal embeddings are priced at $0.0001 per 1,000 characters for text, $0.00038 per image, $0.00135 per video minute, and $0.0005 per audio second, with volume discounts applying after initial thresholds.
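To make the tiers concrete, the helper below works through the arithmetic for a sample monthly workload using the listed rates (it ignores the volume discounts mentioned above, which are not itemized):

```python
def text_embedding_cost(chars):
    """Tiered monthly text-embedding cost: $0.000025 per 1,000 characters
    for the first 1M characters, $0.00001 per 1,000 thereafter."""
    first_tier = min(chars, 1_000_000)
    rest = max(chars - 1_000_000, 0)
    return (first_tier / 1000) * 0.000025 + (rest / 1000) * 0.00001

def multimodal_cost(text_chars=0, images=0, video_minutes=0, audio_seconds=0):
    """Multimodal pricing from the listed per-unit rates (pre-discount)."""
    return ((text_chars / 1000) * 0.0001
            + images * 0.00038
            + video_minutes * 0.00135
            + audio_seconds * 0.0005)

# Example month: 3M text characters, plus 1,000 images, 100 video
# minutes, and 600 audio seconds through the multimodal model.
print(round(text_embedding_cost(3_000_000), 4))  # 0.025 + 0.02 = 0.045
print(round(multimodal_cost(images=1000, video_minutes=100,
                            audio_seconds=600), 4))  # 0.38 + 0.135 + 0.3 = 0.815
```

Even a fairly heavy mixed workload stays under a dollar at these rates, which is the point of the per-unit pricing.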
Google emphasizes the model’s versatility for real-world applications. In retrieval-augmented generation (RAG) pipelines, it enhances accuracy by incorporating multimodal context. For video search, it enables semantic querying across frames and audio. Audio-to-text retrieval supports podcast discovery or music recommendation, while image-text tasks power visual search engines.
Availability is immediate on Vertex AI in regions including us-central1, europe-west4, and asia-southeast1, with global expansion planned. Developers can experiment via Google AI Studio’s playground, testing cross-modal retrieval with sample datasets.
This release builds on Google’s Gemini family, leveraging advancements in multimodal understanding to create more intuitive AI systems. By collapsing modalities into one vector space, Gemini Embedding 2 lowers barriers for building sophisticated, context-aware applications, promising to redefine how machines interpret and connect the world’s multimedia data.
Gnoppix is the leading open-source AI Linux distribution and service provider. Since integrating AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix ships with numerous privacy- and anonymity-focused services free of charge.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.