Marcus Edel
September 14, 2022
Over the past few years, video codecs such as H.265 and VP9 have been developed to meet the needs of a wide range of applications, from video conferencing platforms like Zoom to streaming services like YouTube and broadcasting software like OBS.
The quality of the video reconstructed by these codecs is excellent at medium-to-low bitrates, but it degrades at very low bitrates. While these codecs leverage expert knowledge of human perception and carefully engineered signal processing pipelines, there has been massive interest in replacing these handcrafted methods with machine learning approaches that learn to encode video data.
Using open source software, Collabora has developed an efficient compression pipeline that enables a face video broadcasting system that achieves the same visual quality as the H.264 standard while only using one-tenth of the bandwidth. In a nutshell, the face video compression algorithms rely on a source frame of the face, a pipeline to extract the important features from a face image, and a generator to reconstruct the face using the extracted and compressed features on the receiving side.
Animating expressive talking heads is essential for filmmaking, virtual avatars, video streaming, computer games, and mixed realities. Despite recent advances, generating realistic facial animation with little or no manual labor remains an open challenge in computer graphics. Several key factors contribute to this challenge. Traditionally, the generation process needs a lot of compute, making it nontrivial to run in real time in a video conference setting. In addition, facial dynamics are difficult to reconstruct from only a few images.
We present a method that generates expressive talking-head videos from a single facial image and a driving video. The key component of our method is the prediction of the facial landmarks reflecting the facial dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework and generalizes well for faces that were not observed during training.
A neural network extracts and encodes the locations of key facial features of the user for each frame, which is much more efficient than compressing pixel and color data. The encoded data is then passed on to a generative adversarial network along with a reference video frame captured at the beginning of the session. The GAN is trained to reconstruct the new image by projecting the facial features onto the reference frame.
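To give a rough sense of why sending landmark locations is so much cheaper than sending pixels, here is a minimal sketch that packs a set of normalized landmark coordinates into a compact binary payload and compares its size with an uncompressed frame. The landmark count (68), the 16-bit quantization, and the 256×256 RGB frame size are illustrative assumptions and not values taken from our pipeline; the comparison is against raw pixels, not against H.264 output.

```python
import struct

SCALE = (1 << 16) - 1  # assumed 16-bit quantization per coordinate


def encode_landmarks(landmarks):
    """Quantize normalized (x, y) pairs in [0, 1] to little-endian uint16 values."""
    flat = []
    for x, y in landmarks:
        flat.append(round(min(max(x, 0.0), 1.0) * SCALE))
        flat.append(round(min(max(y, 0.0), 1.0) * SCALE))
    return struct.pack(f"<{len(flat)}H", *flat)


def decode_landmarks(payload):
    """Inverse of encode_landmarks: recover normalized (x, y) pairs."""
    values = struct.unpack(f"<{len(payload) // 2}H", payload)
    return [(values[i] / SCALE, values[i + 1] / SCALE)
            for i in range(0, len(values), 2)]


# Hypothetical example: 68 synthetic landmarks (the count is an assumption).
NUM_LANDMARKS = 68
landmarks = [(i / (NUM_LANDMARKS - 1), 1.0 - i / (NUM_LANDMARKS - 1))
             for i in range(NUM_LANDMARKS)]

payload = encode_landmarks(landmarks)
decoded = decode_landmarks(payload)

raw_frame_bytes = 256 * 256 * 3  # assumed uncompressed 256x256 RGB frame
print(f"landmark payload: {len(payload)} bytes per frame")
print(f"raw frame:        {raw_frame_bytes} bytes per frame")
```

In practice the payload would be compressed further (for example, by sending only the differences between consecutive frames), but even this naive encoding is orders of magnitude smaller than the pixel data it stands in for.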
We base our generator network on the image-to-image translation architecture proposed by Johnson et al., but replace the downsampling and upsampling layers with residual blocks. For the discriminator, we use a similar network, which consists of residual downsampling blocks without normalization layers. We also use self-attention blocks, which are inserted at 32×32 spatial resolution in all downsampling parts of the networks and at 64×64 resolution in the upsampling part of the generator.
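To illustrate where the self-attention blocks would sit, the following sketch walks through the feature-map resolutions of a generator and marks the stages that match the 32×32 (downsampling) and 64×64 (upsampling) resolutions mentioned above. The 256×256 input size, the four downsampling stages, and the halving of resolution at each stage are assumptions made for illustration; the function name `attention_stages` is hypothetical.

```python
def attention_stages(input_size, num_down, down_attn_res=32, up_attn_res=64):
    """List (stage name, output resolution, has self-attention) for each stage.

    Assumes every downsampling block halves the spatial resolution and every
    upsampling block doubles it.
    """
    stages = []
    res = input_size
    for i in range(num_down):
        res //= 2
        stages.append((f"down{i + 1}", res, res == down_attn_res))
    for i in range(num_down):
        res *= 2
        stages.append((f"up{i + 1}", res, res == up_attn_res))
    return stages


# Assumed configuration: 256x256 input, four down and four up stages.
stages = attention_stages(input_size=256, num_down=4)
for name, res, has_attention in stages:
    marker = " + self-attention" if has_attention else ""
    print(f"{name}: {res}x{res}{marker}")
```

Under these assumptions, attention ends up after the third downsampling stage and the second upsampling stage, where the feature maps are small enough for attention to be affordable.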
We also integrated our Super-Resolution model on top of the reconstructed output to enhance the overall image quality without increasing the necessary bandwidth.
The video shows the compression model in action: the first clip is the H.264 compression, and the second is the video reconstructed from a single source image and the landmarks predicted for the driving video. The last clip applies Super-Resolution on top of the reconstruction to improve the overall video quality.
The compression pipeline can be used as a standalone tool, but it can also be embedded directly into existing video conferencing tools. Thanks to that, the model can tap into all the metadata you have about your video stream and dynamically adjust the number of landmarks to improve facial reconstruction.
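As one hypothetical example of such a dynamic adjustment, the sketch below picks a landmark count from the bitrate currently available to the stream. The function `landmarks_for_budget`, its parameters, the 4 bytes per landmark, and the lower and upper bounds are all assumptions made for illustration and are not part of the released code.

```python
def landmarks_for_budget(bitrate_bps, fps, bytes_per_landmark=4,
                         min_landmarks=16, max_landmarks=128):
    """Pick how many landmarks fit into the per-frame byte budget.

    bitrate_bps: bits per second available for landmark data (assumed metadata).
    fps: frames per second of the stream.
    The result is clamped to [min_landmarks, max_landmarks].
    """
    bytes_per_frame = bitrate_bps // (8 * fps)
    count = bytes_per_frame // bytes_per_landmark
    return max(min_landmarks, min(max_landmarks, count))


# Hypothetical budgets for a 25 fps stream.
for budget in (1_000, 20_000, 1_000_000):
    print(budget, "bps ->", landmarks_for_budget(budget, fps=25), "landmarks")
```

The clamping keeps the reconstruction usable when the budget is very small and avoids sending more landmarks than the generator was trained to consume.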
Currently, the key limitation of our method is that using landmarks from a different person leads to a noticeable mismatch. In addition, our reconstruction network takes a lot of compute, hindering wider adoption for resource-constrained devices.
Our work would not have been possible without the help of countless open source resources. We hope our contributions will help others in the video compression and web conferencing community build the next generation of innovative technology. We released the code to reproduce the results.
If you have questions or ideas on how to compress your data, join us on our Gitter #lounge channel or leave a comment in the comment section.