Marcus Edel
September 21, 2020
Despite their great upscaling performance, deep learning backed Super-Resolution methods cannot be easily applied to real-world applications due to their heavy computational requirements. At Collabora we have addressed this issue by introducing an accurate and light-weight deep network for video super-resolution, running on a completely open source software stack using Panfrost, the free and open-source graphics driver for Mali GPUs. Here's an overview of Super Resolution, its purpose for image and video upscaling, and how our model came about.
Internet streaming has experienced tremendous growth in the past few years, and continues to advance at a rapid pace. Streaming now accounts for over 60% of internet traffic and is expected to quadruple over the next five years.
Video delivery quality depends critically on available network bandwidth. Due to bandwidth limitations, most video sources are compressed, resulting in image artifacts, noise, and blur. Quality is also degraded by routine image upscaling, which is required to match the very high pixel density of newer mobile devices.
The upscaling community has provided us with many fundamental advances in video and image upscaling through classic methods such as Nearest-Neighbor, Linear, and Lanczos resampling. However, no fundamentally new methods have been introduced in over 20 years, and these traditional algorithm-based methods lack fine detail and cannot remove defects or compression artifacts.
All of this is changing thanks to the Deep Learning revolution. We now have a whole new class of techniques for state-of-the-art upscaling, called Deep Learning Super Resolution (DLSR).
Deep Learning Super Resolution (DLSR)
An image's resolution may be reduced due to lower spatial resolution (for example to reduce bandwidth) or due to image quality degradation such as blurring.
Super-resolution (SR) is a technique for constructing a high-resolution (HR) image from a collection of observed low-resolution (LR) images. SR increases high frequency components and removes compression artifacts.
The HR and LR images are related via the equation:
LR = degradation(HR).
By applying the degradation function, we obtain the LR image from the HR image. If we know the degradation function in advance, we can apply its inverse to the LR image to recover the HR image. Unfortunately we usually do not know the degradation function beforehand. The problem is thus ill-posed, and the quality of the SR result is limited.
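As a toy illustration, here is a minimal Python sketch of one possible degradation function (bicubic downsampling by a factor of 4). Real degradations also involve noise, blur, and compression, which is part of why the inverse is generally unknown; the file name frame.png is a placeholder:

```python
from PIL import Image

def degradation(hr: Image.Image, scale: int = 4) -> Image.Image:
    """Toy degradation: bicubic downsampling by `scale`.

    Real-world degradation also includes noise, blur, and compression,
    which is why its inverse is generally unknown in advance.
    """
    w, h = hr.size
    return hr.resize((w // scale, h // scale), Image.BICUBIC)

hr = Image.open("frame.png")    # hypothetical HR input frame
lr = degradation(hr, scale=4)   # LR = degradation(HR)
```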
DLSR solves this problem by learning image prior information from HR and/or LR example images, thereby improving the quality of the LR to HR transformation.
The key to DLSR success is the recent rapid development of deep convolutional neural networks (CNNs). Recent years have witnessed dramatic improvements in the design and training of the CNN models used for Super-Resolution.
Upscaling can be achieved using different techniques, such as the aforementioned Nearest-Neighbor, Linear, and Lanczos resampling methods. The group of images below demonstrates these different options.
First, the lower resolution input image to be upscaled:
(Photo by Jon Tyson on Unsplash)
Then, the various methods can be applied. Click on the image below to get a closer look at each result, as well as the original image before it was downscaled.
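For reference, these classical methods are a one-liner in Python with Pillow; a minimal sketch (the file name lr_input.png is a placeholder):

```python
from PIL import Image

lr = Image.open("lr_input.png")              # hypothetical LR input image
target_size = (lr.width * 4, lr.height * 4)  # X4 upscaling

methods = {
    "nearest":  Image.NEAREST,
    "bilinear": Image.BILINEAR,
    "lanczos":  Image.LANCZOS,
}
for name, resample in methods.items():
    lr.resize(target_size, resample).save(f"upscaled_{name}.png")
```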
The objective is to improve the quality of the LR image so that it approaches the quality of the target, known as the ground truth. In this case, the ground truth is the original image which was downscaled to create the low-resolution image.
The standard approach to Super-Resolution using Deep Learning or Convolution Neural networks (CNNs) is to use a fully supervised approach where a low-resolution image is processed by a network comprising convolutional and up-sampling layers to produce a high-resolution image. This generated HR image is then matched against the original HR image using an appropriate loss function. This approach is commonly known as "paired setting" as it uses pairs of LR and corresponding HR images for training.
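As an illustration of the paired setting (not the network described in this post), here is a minimal SRCNN-style model and training step in PyTorch, assuming the LR input has already been upscaled to the HR dimensions and using an L1 loss against the ground truth; the random tensors stand in for real data:

```python
import torch
import torch.nn as nn

class TinySR(nn.Module):
    """SRCNN-style network: feature extraction, mapping, reconstruction.
    Expects an LR image already bilinearly upscaled to the HR size."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 3, 5, padding=2),
        )

    def forward(self, x):
        return self.body(x)

model = TinySR()
loss_fn = nn.L1Loss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a stand-in (LR, HR) batch.
lr_batch = torch.rand(8, 3, 128, 128)
hr_batch = torch.rand(8, 3, 128, 128)
loss = loss_fn(model(lr_batch), hr_batch)
opt.zero_grad()
loss.backward()
opt.step()
```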
More recently, following the introduction of generative adversarial networks (GANs), these have become one of the most widely used machine-learning architectures for Super-Resolution.
In generative adversarial networks, two networks train and compete against each other, resulting in mutual learning. The first network, called the generator, generates high-resolution inputs and tries to fool the second network, the discriminator, into accepting these as true high-quality inputs. The discriminator output predicts if an input is a real high-quality image (similar to the training set) or if it's a fake or bad upscaled image.
The technical details are considerably more complex, but they follow these general principles.
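A minimal sketch of one adversarial training step, assuming hypothetical `generator` and `discriminator` PyTorch modules where the discriminator outputs a single real/fake logit per image:

```python
import torch
import torch.nn as nn

# Binary cross-entropy on the discriminator's real/fake logit.
bce = nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, lr_batch, hr_batch, g_opt, d_opt):
    fake_hr = generator(lr_batch)
    real_labels = torch.ones(hr_batch.size(0), 1)
    fake_labels = torch.zeros(lr_batch.size(0), 1)

    # Discriminator: push real HR images toward 1, generated images toward 0.
    d_loss = (bce(discriminator(hr_batch), real_labels)
              + bce(discriminator(fake_hr.detach()), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator output 1 for its fakes.
    g_loss = bce(discriminator(fake_hr), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```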
The following shows different examples of X4 upsampling using our trained Deep Learning Super Resolution model. You can click on each image to view its original size. We also list the output of Nearest Neighbour, Bi-linear, and Lanczos interpolation for comparison.
The model adds details to the vegetables, the plates and the background. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
The model adds details to the sky and the signs. The hotel sign is not 100% accurate, but compared with the other upscaling methods it is a huge improvement. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
The model was able to add even fine details to the hair and cleaned up the overall image. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
Due to the complex lighting, the output is not as sharp as in the previous examples. Still, the model was able to bring back details to the shirt and face. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
Since the model was trained on animation videos as well, it works on various types of content. However, in our experiments a model trained on a specific content type showed even better results. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
In another animation example, compared with the other upscaling methods, our Super-Resolution model was able to add details to the background and to objects in the foreground. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
For more examples: https://medel.pages.collabora.com/super-resolution-examples/.
Super Resolution is one of the areas where we can fortunately rely on an almost infinite supply of data (high-quality images and videos) which we can use to create a training set. By down-sampling the high-quality images we can create low resolution and high-resolution image pairs needed to train our model.
The low-resolution image is created as a copy of the ground truth image at half the dimensions. It is then upscaled using a bi-linear transformation so that its dimensions match the target image, making it ready to serve as input for our model.
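A minimal sketch of this pair-creation step in Python with Pillow, assuming bilinear resampling is used for the initial downscale as well:

```python
from PIL import Image

def make_training_pair(path: str):
    """Create a (model input, target) pair: downscale the ground-truth
    image to half its dimensions, then bilinearly upscale it back so the
    input matches the target size."""
    hr = Image.open(path).convert("RGB")
    lr = hr.resize((hr.width // 2, hr.height // 2), Image.BILINEAR)
    model_input = lr.resize(hr.size, Image.BILINEAR)
    return model_input, hr
```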
To make the model robust against different forms of image degradation and to better generalize, the dataset can be further augmented, for example by applying additional degradations such as noise, blur, or compression artifacts to the training images.
There are also several publicly available datasets which can be used for training, such as Diverse 2K (DIV2K), which contains 800 2K-resolution images, as well as the Flickr2K and OutdoorSceneTraining (OST) datasets.
In our case we trained the model on images extracted from videos released under the Creative Commons license, such as Sintel, Elephants Dream, Spring and Arduino the Documentary.
One big question we need to answer is how to quantitatively evaluate the performance of our model.
Simply comparing video resolution doesn't reveal much about quality. In fact, it may be completely misleading. A 1080p movie of 500MB may look worse than a 720p movie at 500MB, because the former's bitrate may be too low, introducing various kinds of compression artifacts.
The same goes for comparing bitrates at similar frame sizes, as different encoders can deliver better quality at lower bitrates, or vice-versa. For example, a 720p 500MB video produced with XviD will look worse than a 500MB video produced with x264, because the latter is much more efficient.
To solve this problem, several methods have been introduced over the past decade, commonly classified as full-reference, reduced-reference, or no-reference, depending on how much information they use from a reference image of ostensibly pristine quality.
Video quality has traditionally been measured using either PSNR (peak signal-to-noise ratio) or SSIM (Structural Similarity Index Measure). However, PSNR doesn't take human perception into account, simply measuring the mean squared error between the original clean signal and the compressed signal. SSIM does consider human perception, but it was originally developed to analyze static images and doesn't account for human perception over time, although more recent versions of SSIM have started to address this issue.
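Both metrics are straightforward to compute; for example with scikit-image (0.19 or later, where the channel_axis argument is available; the random arrays stand in for real frames):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Stand-ins for the reference frame and the upscaled output (HxWx3, uint8).
reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
upscaled = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

print("PSNR:", peak_signal_noise_ratio(reference, upscaled))
print("SSIM:", structural_similarity(reference, upscaled, channel_axis=-1))
```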
With the rapid development of machine learning, important data-driven models have begun to emerge. One such is Netflix’s Video Multi-method Assessment Fusion (VMAF). VMAF combines multiple quality features to train a Support Vector Regressor to predict subjective judgments of video quality.
At Collabora, we use a combination of SSIM and VMAF to train and test our Deep Learning Super-Resolution models. SSIM is fast to calculate and serves as a basic indicator for how the model is performing. VMAF, on the other hand, delivers more accurate results, which are usually missed by traditional methods.
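VMAF scores can be computed with an ffmpeg build that includes libvmaf; a minimal invocation from Python might look like the following (file names are placeholders, and the first input is the distorted video, the second the reference):

```python
import subprocess

# Requires an ffmpeg build with libvmaf enabled; scores go to vmaf.json.
subprocess.run([
    "ffmpeg", "-i", "upscaled.mp4", "-i", "reference.mp4",
    "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
], check=True)
```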
As noted at the outset, deep learning backed Super-Resolution methods, despite their great upscaling performance, cannot be easily applied to real-world applications due to their heavy computational requirements. At Collabora we have addressed this issue by introducing an accurate and light-weight deep network for video super-resolution.
To achieve a good tradeoff between computational complexity and reproduction quality, we implemented a cascading mechanism on top of a standard network architecture, producing a light-weight solution. We also used a multi-tile approach in which we divide a large input into smaller tiles to better utilize memory bandwidth and overcome size constraints posed by certain frameworks and devices. Multi-tile significantly improves inference speed. This approach can be extended from single image SR to video SR where video frames are treated as a group of multiple tiles.
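A simplified sketch of the multi-tile idea (not our production code), assuming a hypothetical model that maps each tile to a X4 larger tile; a real implementation would also overlap tiles to avoid visible seams:

```python
import torch

def upscale_tiled(model, frame: torch.Tensor, tile: int = 128, scale: int = 4):
    """Split a CxHxW frame into square tiles, upscale each tile with the
    model, and reassemble the full-size output."""
    c, h, w = frame.shape
    out = torch.zeros(c, h * scale, w * scale)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = frame[:, y:y + tile, x:x + tile].unsqueeze(0)
            up = model(patch).squeeze(0)  # model upscales by `scale`
            out[:, y * scale:y * scale + up.shape[1],
                   x * scale:x * scale + up.shape[2]] = up
    return out
```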
We designed our solution on top of Panfrost, the open-source graphics driver for Mali GPUs, allowing us to offload compute to the GPU.
Coming up in Part 2 of this series, we'll take a deep dive into how our model works, and how you can use free, open source software to achieve a higher level of compression than existing video compression methods. Stay tuned!
Update (Sept. 24):
By popular demand, the code to train your own model and to reproduce the results from the blog-post can be found here: https://gitlab.collabora.com/medel/super-resolution.
Due to licensing issues (a large number of images used have a research license attached to them), we can't release the pre-trained model for the second stage of the Super-Resolution method at this point. However, we are currently re-training the model to solve the issue, and will be making the updated model checkpoint available soon!
Comments (9)
Giorgio B.:
Sep 22, 2020 at 10:09 AM
Hi, what happened to the hotel image: did the point of view also change? In the low resolution the hotel leans to the left, in the high resolution it leans to the right?
Marcus Edel:
Sep 22, 2020 at 08:10 PM
Hello, great question, in this case, it's actually an optical illusion. I created a simple example page that allows us to compare the two images - https://medel.pages.collabora.com/super-resolution-examples/compare.html
Charles Hill:
Nov 30, 2021 at 03:41 PM
It's an illusion caused by the sharpening effect.
MRT:
Jan 06, 2022 at 02:49 AM
It's just a plain old optical illusion. Has nothing to do with the sharpening effect.
You can mix and match the 2 images in any order/combination and the effect is always present.
https://imgur.com/a/9j9aREu
Daniel Pfeiffer:
Apr 07, 2021 at 02:15 PM
Interesting technology.
The AI has generated artificial pores on the girl's nose. Also, the hair strands look unnatural both in the man's picture as well as in the girl's. Visible in zoom view.
I'm curious to see how super-res technology will evolve in the near future.
Marcus Edel:
Apr 07, 2021 at 03:25 PM
Agreed, hair is very different from other image categories, mainly because hair has some unique textures (thousands of long, thin strands full of textured detail). And it's true; we observed that encoding the output image into a latent space often guarantees realistic output in a global fashion but lacks constraints on local sharpness, e.g. hair. We tried to compensate for the missing local sharpness by adding another stage on top of the existing model, but the output didn't improve much, so we left it out.
Rasmus Schultz:
Feb 17, 2022 at 01:07 PM
This looks amazing - still the best I've seen.
No part 2 was ever released - was the pretrained model ever released?
Is there any finished or commercial product available with this kind of quality? (why not?)
Marcus Edel:
Feb 18, 2022 at 02:02 PM
Thanks for the comment; we started to work on a merge request to integrate the Super-Resolution model into GStreamer - https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/-/merge_requests/2506. The plan is to tie up all loose ends by the end of the month. We are also preparing the second part with an updated model that improves on issues we have seen with gaming content.
Justin Peter:
Apr 26, 2023 at 09:19 PM
This clearly hasn't happened - is there still a plan for the public to be able to use this model?