Daniel Almeida
June 23, 2021
Earlier this year, from January to April 2021, I worked on adding support for stateless decoders for GStreamer as part of a multimedia internship at Collabora. The following is a recap about the completed work.
Before talking about stateless decoders, one must understand what stateful decoders are first. Here's a definition, sourced from the Linux kernel documentation:
A stateful video decoder takes complete chunks of the bytestream (e.g. Annex-B H.264/HEVC stream, raw VP8/9 stream) and decodes them into raw video frames in display order. The decoder is expected not to require any additional information from the client to process these buffers.
Stateful decoders do the processing of video data into actual frames directly, without requiring extra metadata to do so. A stateful decoder IP can parse the bitstream and extract the required data on its own. It will keep track of decoder-specific information by itself, producing video frames in the right display order.
These are the kind of things that make video decoding work for a particular codec, such as the current set of reference frames, the current position in the bitstream, whether any special decoding mode is enabled for the current frame and its current length, and also compression-specific metadata, if any.
This means that a client program can be much more lightweight, knowing that most of the responsibilities have been shifted onto another part of the decoding pipeline. It also means increasingly complex decoder hardware in what amounts to a black box running its own, mostly proprietary, firmware.
The current trend in the industry favors a different approach, centered around stateless codecs. In contrast with its stateful counterparts, a stateless codec will shift much of the bookkeeping onto userspace. The application becomes responsible for programming the chip with any metadata it may need in order to decode the bitstream and the resulting frames may not necessarily be in display order. This approach makes for simpler hardware at the cost of more complex userspace programs. Since userspace cannot directly program the underlying hardware, this drives the need for increasingly refined kernel APIs to abstract the ever-increasing amount of multimedia accelerators available on the market.
Stateless codecs need to be programmed with the necessary metadata at every frame in order to work, a use case that wasn't the focus of previous APIs in the Linux media subsystem. The control API was initially designed to set hardware parameters such as brightness, saturation and gain. Its evolution, the extended control API was designed to allow the implementation of more complex driver APIs for standards such as MPEG but was inflexible in the sense that the values would persist until explicitly changed. It was also impossible to link a particular set of controls with their corresponding bitstream and picture buffers, meaning that a given set of metadata could be associated with a totally different frame than it was originally intended for: a recipe for disaster!
Thankfully, the new Request API was designed from the ground up to support modern devices. It provided a way to associate bitstream buffers, picture buffers and controls together under a request object, such that applying per-frame metadata became possible. Requests themselves would now be queued, dequeued and recycled as necessary, providing userspace applications with a rich set of tools to program the underlying video decoding unit as it saw fit through a uAPI.
With the new APIs in place, it was only a matter of time before new userspace implementations emerged.
v4l2codecs is a GStreamer plugin written from the ground up by Nicolas Dufresne, GStreamer maintainer and Principal Multimedia developer at Collabora. It targets the new Request API, effectively adding GStreamer support for stateless codecs under Linux, ushering the platform into a new age for multimedia hardware.
v4l2codecs consists of classes that abstract the key kernel ioctls for request and buffer management, wrapping them into GStreamer objects that GStreamer decoders can use when processing a given bitstream. Here is a rundown of some of the features:

- Negotiation at the decoder element.
- BufferPools for efficient memory management.

For all practical purposes, v4l2codecs is a framework that allows other developers to quickly add support for new stateless codecs under Linux, leaving codec-specific details, such as bridging the gap between parsed values and their uAPI counterparts, to a codec-specific implementation.
Before I started the internship, two working codec implementations were already in place: the all-important H.264, and VP8.
I started at Collabora on 18th January 2021 as a Consultant Software Engineer intern to work on adding VP9 and MPEG2 support to v4l2codecs and on the related kernel side. Another goal was to improve conformance testing for VP9.
I had some previous experience with codecs from my previous work on vidtv, a kernel module I had written as a mentee under a Linux Foundation program geared towards the Linux kernel. It was during its development that I first came across a bitstream specification - MPEG-TS, anyone? - and I must say, I quite liked it. In fact, I was mesmerized to see the amount of careful work that went into the details of how my television worked and that I selfishly took for granted while watching it.
Coming into Collabora to do VP9 work seemed like a natural progression and frankly, I felt right at home.
The first thing that needed to be done was, of course, adding a corresponding class to deal with the particularities of VP9, relaying the calls to the rest of the v4l2codecs framework. That's GstV4l2CodecsVp9Dec.
The way these work is by subclassing a base decoder class, say GstVp9Decoder, which is itself a subclass of GstVideoDecoder. This base class will use a parser to extract data into a picture object, GstVp9Picture, and relay a few calls to our class. These are:
(a) start, stop, set_format, finish, flush, drain
(b) {new | start | decode | end | output}_picture
During the course of (b) we get the chance to negotiate, extract the relevant bits from GstVp9Picture to fill our V4L2 controls, and finally enqueue our buffers and the request object. We then poll on the request to retrieve a freshly decoded frame, at which point we pass the picture buffer downstream and the process repeats.
Naturally, even small projects come with their own set of obstacles.
To begin with, VP9 uses arithmetic encoding. This is a newer form of data compression, an evolution on top of the revered Huffman coding algorithm used extensively in video compression technology.
Arithmetic encoding essentially reduces the number of bits used to code symbols that repeat often in the bitstream. The VP9 specification goes as far as to say that some symbols can be encoded using a fraction of a bit. Naturally, new technologies do not come for free, and for arithmetic encoding the price one must pay is keeping track of the probability of any given symbol occurring in the bitstream, updating it accordingly as frames are decoded.
The VP9 specification provides two mechanisms for maintaining these probabilities:

1. Forward updates, where new probability values are explicitly signaled in the frame header.
2. Backward adaptation, where the probabilities are implicitly updated after each frame is decoded, based on counts of the symbols observed in it.

The problem, in this case, was that the current VP9 parser implementation in GStreamer did not parse (1), so support for that had to be added as well.
Although relatively straightforward given a well-written specification, writing or augmenting bitstream parsers can quickly become nightmarish to debug when a single mistake sneaks in somewhere. Often there is no indication of where the error originated other than the checks at the end that tell us whether we parsed the whole frame correctly or not. That meant diffing the results against a known working implementation in hopes I'd find the point where the failure originated. Welp, not fun at all...
A second major issue was uncovered by a colleague mid-way through development, and it had to do with the kernel uAPI we were trying to upstream.
It turns out the update method described in (2) was implemented by means of a bi-directional control: our userspace application had to read the symbol counts provided by the hardware in order to update the probabilities to be used on the current frame. This created a dependency: request n depended on request n-1.
Such a dependency was immediately at odds with the general workflow behind the Request API, in which users queue multiple requests and have them processed at a time convenient for the system. This discovery prompted a rework of the VP9 kernel uAPI to remove the bi-directionality and thus make the requests completely independent of one another. Since the destaging of the VP9 kernel uAPI is being undertaken by Collabora itself, our GStreamer implementation effectively helped us validate its design. This is why having a GStreamer implementation as part of a multimedia project is so valuable: it helps tremendously with validation.
After wrestling a bit with the VP9 decoder, I was very pleased to see that getting an MPEG2 decoder off the ground was very straightforward. I think this is a testament to how easy it is, under normal conditions, to add new codecs to v4l2codecs. It should mean the framework can house codecs for many other formats in the future, benefiting the Linux ecosystem in general.
I was very happy to see that this work prompted yet another Collaboran, Ezequiel Garcia, to move forward with the destaging of the MPEG2 uAPI: we were now more confident, since it worked just fine with our userspace implementation!
Once the MPEG2 work was completed, I looked to improve the v4l2codecs codebase itself. Since it is a new plugin, there are plenty of opportunities to add features that translate into more performance or just a better overall experience. A straightforward way to improve throughput without changing much of the plugin itself is to support so-called "render delays", an artificial delay introduced between parsing and decoding.
Render delays are basically a way to ensure the decoder never goes idle. The main idea is to enqueue a few requests before asking the hardware to start the decoding process, keeping a surplus of requests for the driver to process at all times. This increases throughput for transcoding, and it was implemented for the codecs in v4l2codecs, meaning users can transparently enjoy enhanced performance on VP8, VP9, and MPEG2 right out of the box; for H.264, support for this functionality was already available from the start.
As the end of my internship drew near, I started to focus on improving conformance testing for our newly-minted VP9 decoder.
Here's the thing about these kinds of tests: they are very useful for regression testing, but they are also useful on their own, because at some point you need to verify that your decoder actually does the right thing. Besides, you can't go anywhere these days without a strategy for regression testing, and for video codecs this means automatically comparing your results against a canonical reference by means of a test suite.
Owing to the work of another Collaboran, Andrzej Pietrasiewicz, support for VP9 was added to Fluster, a testing framework written in Python for decoder conformance. This means our implementation is tested by an automated framework using the official VP9 test vectors, and the results are on par with the competition, if not slightly better.
This project was fantastic to work on as an intern, and a true testament to Collabora's well-structured internship program, even in these new, uncertain times. It was a very comprehensive introduction to the world of multimedia with GStreamer, working alongside engineers with more than a decade of experience on the matter, ready to jump in and help at a moment's notice.
For more information on Collabora's internships, keep an eye on our careers page!
You can check out the code written during this internship by having a look at the merge requests below!