New AI model generates 45-minute lip-synced video from one photo and runs in real time

A Breakthrough in AI-Driven Video Generation: Real-Time Lip-Synced Videos Up to 45 Minutes from a Single Photo

Researchers from Peking University, in collaboration with teams from Shanghai AI Laboratory and other institutions, have unveiled an AI model capable of producing highly realistic, lip-synced talking-head videos up to 45 minutes long. The system, detailed in a recent paper, takes just a single static photograph and an audio track and turns them into fluid, expressive video whose facial movements stay synchronized with the spoken words. Remarkably, the model runs in real time, making it suitable for interactive applications on standard consumer-grade hardware.

The core technology behind this achievement is a novel framework called Long-period Lip Synchronization, or HiLM for short. HiLM addresses longstanding challenges in audio-driven facial animation, particularly the generation of long-sequence videos without accumulating errors over time. Traditional methods often falter beyond a few seconds, exhibiting artifacts like unnatural blinks, drifting expressions, or desynchronized lip movements. HiLM overcomes these limitations through a hierarchical, multi-stage approach that ensures temporal consistency across extended durations.

At its foundation, HiLM employs a cascaded architecture comprising three key modules: a motion representation network, a hierarchical sampling mechanism, and a diffusion-based refinement stage. The process begins with the motion representation network, which encodes both the input image and audio into compact latent representations. This network leverages a pre-trained audio encoder to extract fine-grained phonetic and prosodic features from the audio waveform, capturing nuances such as intonation, emphasis, and rhythm. Simultaneously, it processes the single reference photo to establish a personalized facial identity, preserving unique traits like skin texture, eye shape, and head pose.
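The paper's actual encoders are trained networks, but the two-stream idea can be illustrated with a toy sketch: one function collapses the reference photo into a fixed-size identity vector, and another turns the waveform into one feature vector per frame. The function names and the hand-crafted features (energy and zero-crossing rate as crude loudness/voicing proxies) are illustrative assumptions, not the model's real components.

```python
import numpy as np

def encode_identity(image: np.ndarray) -> np.ndarray:
    """Stand-in for the identity encoder: reduce the reference photo
    to a fixed-size vector (a real encoder would be a trained CNN)."""
    flat = image.astype(np.float64).reshape(-1)
    chunks = np.array_split(flat, 64)          # toy pooling into 64 bins
    return np.array([c.mean() for c in chunks])

def encode_audio(waveform: np.ndarray, frame_len: int = 640) -> np.ndarray:
    """Stand-in for the audio encoder: one feature vector per audio frame."""
    n = len(waveform) // frame_len
    frames = waveform[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)                          # loudness proxy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)   # voicing proxy
    return np.stack([energy, zcr], axis=1)
```

The key property this preserves is the split the paper describes: identity is computed once from the photo, while audio features are computed per frame and drive the motion over time.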

The hierarchical sampling module is where HiLM truly excels in handling long sequences. It divides video generation into discrete temporal segments, predicting motion for short clips (around 2-5 seconds) before chaining them together. This segmentation prevents error propagation, a common pitfall in autoregressive models. Within each segment, the system uses a variance-preserving diffusion process to generate diverse yet coherent facial dynamics. Coarse-to-fine sampling refines head pose, eye gaze, and subtle expressions like micro-movements of the cheeks and brows, all while maintaining strict lip synchronization. The model incorporates a specially designed loss function that penalizes temporal inconsistencies, ensuring smooth transitions between segments.
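The anti-drift trick described above, generating short segments and anchoring each one to the previous segment's final state, can be sketched in a few lines. The random-walk "motion" here is a deliberately simplistic placeholder for the paper's diffusion-based motion predictor; only the chaining structure is the point.

```python
import numpy as np

def generate_segment(audio_feats: np.ndarray, start: float, rng) -> np.ndarray:
    """Toy per-segment motion: a random walk nudged by audio features,
    anchored at `start` (stand-in for the real diffusion sampler)."""
    steps = 0.1 * audio_feats + rng.normal(scale=0.01, size=len(audio_feats))
    return start + np.cumsum(steps)

def chain_segments(audio_feats: np.ndarray, seg_len: int = 90, seed: int = 0) -> np.ndarray:
    """Generate motion segment by segment, anchoring each segment to the
    previous one's final state so no jump appears at the boundary."""
    rng = np.random.default_rng(seed)
    segments, tail = [], 0.0
    for s in range(0, len(audio_feats), seg_len):
        seg = generate_segment(audio_feats[s:s + seg_len], tail, rng)
        segments.append(seg)
        tail = seg[-1]  # next segment starts exactly where this one ended
    return np.concatenate(segments)
```

Because each segment is conditioned only on a compact summary of the previous one (here, a single value) rather than on the whole generated history, errors cannot compound autoregressively across the full 45-minute sequence.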

Following motion prediction, the diffusion-based refinement stage renders the final video frames. This decoder upsamples latent motion codes into high-resolution pixels, using a U-Net architecture conditioned on both identity and audio latents. Techniques such as classifier-free guidance and temporal attention layers enhance realism, producing 512×512 videos at 30 frames per second. HiLM handles audio inputs up to 45 minutes long, far surpassing prior models such as EMO or SadTalker, which typically max out at 10–20 seconds.

Performance evaluations underscore HiLM’s superiority. On benchmarks such as MEAD, HDTF, and custom long-sequence datasets, it achieves state-of-the-art results in lip synchronization accuracy (measured by SyncNet scores exceeding 0.95), facial identity preservation (via cosine similarity metrics above 0.98), and perceptual quality (Fréchet Video Distance scores lower than competitors). Human subjective tests rate HiLM-generated videos as more natural and engaging, with participants unable to distinguish them from real footage in blind evaluations over 80% of the time.
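Of the metrics above, the identity-preservation score is the simplest to make concrete: it is the cosine similarity between face embeddings of the reference photo and of a generated frame. A minimal sketch (the function name is mine; any face-embedding network could supply the vectors):

```python
import numpy as np

def identity_similarity(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Cosine similarity between two face embeddings; values near 1.0
    mean the generated frame preserves the reference identity."""
    num = float(emb_ref @ emb_gen)
    denom = float(np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen))
    return num / denom
```

Cosine similarity ignores the embeddings' magnitudes and compares only their direction, which is why it is the conventional choice for identity checks: the reported 0.98+ scores mean generated frames sit almost exactly along the reference photo's embedding direction.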

Real-time inference is a standout feature, enabled by optimizations like knowledge distillation and efficient diffusion sampling. On an NVIDIA RTX 4090 GPU, HiLM generates a 10-second clip in under 300 milliseconds, scaling linearly for longer content. This efficiency stems from pre-computing static identity embeddings and using lightweight motion decoders. CPU-only variants run at interactive speeds on mid-range laptops, broadening accessibility for developers and creators.
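The "pre-computing static identity embeddings" optimization amounts to hoisting the expensive per-photo work out of the streaming loop. A hedged structural sketch (class and method names are illustrative, not HiLM's API):

```python
class TalkingHeadSession:
    """Sketch of the real-time loop: pay the identity-encoding cost once
    per photo, then reuse the cached embedding for every audio chunk."""

    def __init__(self, photo, encode_identity):
        # Expensive step, executed exactly once per session.
        self.identity = encode_identity(photo)

    def render_chunk(self, audio_chunk, decode_motion):
        # Cheap step, executed per streamed audio chunk; only the
        # lightweight motion decoder runs inside the real-time loop.
        return decode_motion(self.identity, audio_chunk)
```

Since the identity embedding never changes during a session, the per-chunk cost is just the lightweight decoder, which is what makes sub-frame-time latency on a single GPU plausible.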

The model’s open-source implementation, available on GitHub, includes pre-trained weights and inference code. Users can generate videos via a simple command-line interface: provide a photo, audio file, and optional parameters for resolution or style. Demos showcase applications from virtual avatars and dubbing to educational content and accessibility tools for the hearing impaired. Ethical considerations are addressed in the paper, with recommendations for watermarking outputs to mitigate deepfake misuse.

HiLM represents a pivotal advancement in expressive AI video synthesis, bridging the gap between short-form clips and production-grade long-form content. By combining hierarchical control, diffusion modeling, and real-time optimization, it sets a new benchmark for single-image audio-driven animation, paving the way for immersive digital humans in multimedia, gaming, and beyond.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.