We're hiring!
*

High bitrate video streaming with GStreamer's RTP elements

Antonio Ospite avatar

Antonio Ospite
August 20, 2020

Share this post:

Reading time:

RTP is the dominant protocol for low latency audio and video transport. It sits at the core of many systems used in a wide array of industries, from WebRTC, to SIP (IP telephony), and from RTSP (security cameras) to RIST and SMPTE ST 2022 (broadcast TV backend). 

Being a flexible, Open Source framework, GStreamer is used in a variety of applications. Its RTP stack has been battle tested in multiple use-cases across all of the aforementioned industries, giving it the distinct advantage of being able to apply optimisations from one use case to another. Without a doubt, GStreamer has one of the most mature and complete RTP stacks available.

Additional unit tests, as well as key fixes and performance improvements to the GStreamer RTP elements, have recently landed in GStreamer 1.18:

The latter in particular provides an important boost in throughput, opening the gate to high bitrate video streaming.

Let's go deeper on that.

Pushing a buffer in GStreamer

One of the essential tasks of GStreamer is to move (push) buffers from an upstream element to the next downstream element, making the pipeline progress.

But what does pushing a buffer mean from a low level point of view?

Elements are connected through pads. Each element has a pad for each possible connection, a pad can either be a "source pad" which the element uses to output buffers or a "sink pad" that it uses to input buffers. To create a connection between two elements, the application programmer connects the source pad of one element to the sink pad of another. When an element wishes to send a buffer with data to the next element, it "pushes" it onto its source pad which then chains it to the sink pad which calls into the next element.

The basic tool that an element uses to push a buffer is the gst_pad_push function:

GstFlowReturn gst_pad_push (GstPad * pad, GstBuffer * buffer);

A buffer push is actually a series of intricate function calls and locks being taken, the sequence is as follows:

  1. The first element calls the gst_pad_push() function on its source pad.
  2. The source pad takes its own mutex, updates its internal state and releases it.
  3. The source pad takes a reference (increases an atomic counter) on the connected sink pad.
  4. The source pad calls the chain function of the connected sink pad.
  5. The sink pad takes its stream lock, which is a recursive mutex.
  6. The sink pad takes its own mutex, updates its internal state.
  7. The sink pad takes a reference on its parent (the second element), which means it increases an atomic counter.
  8. The sink pad releases its own mutex.
  9. The sink pad calls into the second element's chain function (the actual code of the element).
  10. The sink pad releases the reference (decreases an atomic counter) on onto its parent.
  11. The sink pad releases the stream lock, the recursive mutex.
  12. The source pad releases the reference (decreases an atomic counter) on the connected sink pad.
  13. The source pad takes its own mutex, updates its internal state and releases it.

As you can see from this incomplete list, each transfer of a buffer, even though it happens on one thread is actually a number of mutex locks and other atomic operations which are relatively costly on modern pipelined processors. When profiling a GStreamer pipeline, this is actually the part that causes the most overhead when transmitting a large number of small buffers.

Is it possible to do better?

Pushing buffer lists

GStreamer has a mechanism called "buffer list" which can be used to reduce the overhead of pushing a single buffer.

The entry point for an element to use this functionality is the gst_pad_push_list function.

GstFlowReturn gst_pad_push_list (GstPad * pad, GstBufferList * list);

What buffer lists do is to group together a number of buffers so that they are forwarded through the pipeline as one operation, which can significantly reduce this overhead as the sequence of operations described above will happen once per list and not once per buffer.

In case some elements do not support chaining buffer lists, GStreamer provides a fall-back mechanism like gst_pad_chain_list_default to push buffers one by one under the hood. This means that elements can always implement processing buffers in a list independently from the level of support in other elements.

This is nice for compatibility and allows incremental refinements, however to actually avoid the bottlenecks of pushing individual buffers and to get the biggest performance improvements all elements in a pipeline should natively support chaining buffer lists (i.e. have their own chainlist function installed on sink pads).

Buffer lists in rtpsession

The RTP specification, described in RFC 3550, defines a set of rules for the association of participants during a conversation using RTP, this is called an "RTP Session".

In GStreamer, the core element that implements the session management is rtpsession.

The rtpsession element already had support for buffer lists in its send path but not in its receive path.

Let's consider the following pipeline built around the rtpsession element:

gst-launch-1.0 -e \
    rtpsession name=rtpsess \
    videotestsrc ! imagefreeze num-buffers=10000 ! video/x-raw,format=RGB,width=320,height=240 ! rtpvrawpay ! rtpsess.recv_rtp_sink
    rtpsess.recv_rtp_src ! fakesink async=false sync=false

A test stream is generated (imagefreeze is used to reduce CPU usage in this case), split in RTP packets, processed by rtpsession, and consumed by a fakesink element.

The upstream element (rtpvrawpay) and downstream element (fakesink) could already chain buffer lists, but rtpsession could not.

After enabling buffer lists in rtpsession the element throughput improved dramatically:

A simplified visual interpretation can be obtained using flamegraphs.

⇨ Note: By clicking on the graphs below an interactive flamegraph will be opened in a new window.

When pushing individual buffers the call graph is deeper:

When pushing buffer lists the call graph is more balanced:

Real-world scenario considerations

To be fair this huge improvement is only achievable in controlled use cases, the boost in a generic real-world scenario is currently mitigated by other factors.

Usually the rtpsession element is not used directly but via rtpbin that, depending on the scenario, also connects it to other elements (like rtpjitterbuffer, rtpstorage, rtpssrcdemux); and the input may come from a remote source, like udpsrc.

Consider this more realistic pipeline:

gst-launch-1.0 -e '
    rtpbin name=rtpbin \
    udpsrc port=5000 caps=application/x-rtp,media=(string)video,clock-rate=(int)90000,encoding-name=RAW,payload=96,sampling=RGB,depth=(string)8,width=(string)320,height=(string)240 ! queue ! rtpbin.recv_rtp_sink_0 \
    rtpbin. ! fakesink async=false sync=false \
    udpsrc port=5001 caps=application/x-rtcp ! queue ! rtpbin.recv_rtcp_sink_0 \
    rtpbin.send_rtcp_src ! queue ! udpsink host=127.0.0.1 port=5003 sync=false async=false

This is the receiving pipeline for one sender, the two udpsink elements are one for RTP and one for RTCP, rtpbin handles all the RTP details and delivers media data to fakesink and RTCP replies for the other participant via udpsink.

Unless all elements support pushing buffer lists natively there will still be bottlenecks due to individual buffer pushes.

See a comparison of before and after using buffer lists in rtpsession with a pipeline that uses udpsrc and rtpbin:

The improvement is there but it is not as dramatic as in the controlled scenario.

Conclusion

The improvements in rtpsession available in GStreamer 1.18 are an important step towards a more efficient RTP implementation in high bitrate scenarios, but further work would be needed (e.g. enable buffer lists on udpsrc) to actually bring some of the theoretical improvements in for practical usage.

Comments (0)


Add a Comment






Allowed tags: <b><i><br>Add a new comment:


Search the newsroom

Latest Blog Posts

Faster inference: torch.compile vs TensorRT

19/12/2024

In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…

Mesa CI and the power of pre-merge testing

08/10/2024

Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…

A shifty tale about unit testing with Maxwell, NVK's backend compiler

15/08/2024

After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…

A journey towards reliable testing in the Linux Kernel

01/08/2024

We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…

Building a Board Farm for Embedded World

27/06/2024

With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…

Smart audio filters with WirePlumber 0.5

26/06/2024

WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…

Open Since 2005 logo

Our website only uses a strictly necessary session cookie provided by our CMS system. To find out more please follow this link.

Collabora Limited © 2005-2024. All rights reserved. Privacy Notice. Sitemap.