Antonio Ospite
August 20, 2020
RTP is the dominant protocol for low latency audio and video transport. It sits at the core of many systems used in a wide array of industries, from WebRTC, to SIP (IP telephony), and from RTSP (security cameras) to RIST and SMPTE ST 2022 (broadcast TV backend).
Being a flexible, Open Source framework, GStreamer is used in a variety of applications. Its RTP stack has been battle tested in multiple use-cases across all of the aforementioned industries, giving it the distinct advantage of being able to apply optimisations from one use case to another. Without a doubt, GStreamer has one of the most mature and complete RTP stacks available.
Additional unit tests, as well as key fixes and performance improvements to the GStreamer RTP elements, have recently landed in GStreamer 1.18:
Buffer list support in the receive path of rtpsession, in particular, provides an important boost in throughput, opening the gate to high bitrate video streaming.
Let's go deeper on that.
One of the essential tasks of GStreamer is to move (push) buffers from an upstream element to the next downstream element, making the pipeline progress.
But what does pushing a buffer mean from a low level point of view?
Elements are connected through pads. Each element has a pad for each possible connection. A pad can be either a "source pad", which the element uses to output buffers, or a "sink pad", which it uses to input buffers. To create a connection between two elements, the application programmer links the source pad of one element to the sink pad of another. When an element wishes to send a buffer with data to the next element, it "pushes" the buffer onto its source pad, which then chains it to the peer sink pad, which in turn calls into the next element.
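The pad mechanics described above can be sketched with a toy model in plain C. This is not GStreamer code: ToyPad, toy_pad_link, toy_pad_push and the integer standing in for a buffer are all hypothetical names made up for illustration. The real gst_pad_push also deals with locking, probes and reference counting, which this sketch leaves out on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for pads; NOT the GStreamer API. */
typedef struct ToyPad ToyPad;
typedef int (*ToyChainFunc) (ToyPad *sink, int buffer);

struct ToyPad {
  ToyPad *peer;        /* pad on the other element, NULL while unlinked */
  ToyChainFunc chain;  /* entry point into the element, sink pads only */
  int received;        /* how many buffers this sink pad has seen */
  int last_buffer;     /* most recent buffer value seen */
};

enum { TOY_FLOW_OK = 0, TOY_FLOW_NOT_LINKED = -1 };

/* Connect a source pad to a sink pad (both directions, as peers). */
static void toy_pad_link (ToyPad *src, ToyPad *sink)
{
  assert (src != NULL && sink != NULL);
  src->peer = sink;
  sink->peer = src;
}

/* Push a buffer on a source pad: hand it to the peer's chain function. */
static int toy_pad_push (ToyPad *src, int buffer)
{
  assert (src != NULL);
  ToyPad *sink = src->peer;
  if (sink == NULL || sink->chain == NULL)
    return TOY_FLOW_NOT_LINKED;
  return sink->chain (sink, buffer);
}

/* Example chain function: the downstream element just records the buffer. */
static int toy_counting_chain (ToyPad *sink, int buffer)
{
  sink->received++;
  sink->last_buffer = buffer;
  return TOY_FLOW_OK;
}
```

In this model the downstream element runs inside the call to toy_pad_push, on the caller's thread, which mirrors the way a push in GStreamer calls into the next element.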
The basic tool that an element uses to push a buffer is the gst_pad_push function:
GstFlowReturn gst_pad_push (GstPad * pad, GstBuffer * buffer);
A buffer push is actually a series of intricate function calls, with several locks being taken along the way. The sequence is as follows:
gst_pad_push() function on its source pad.
As you can see from this incomplete list, each transfer of a buffer, even though it happens on one thread, actually involves a number of mutex locks and other atomic operations, which are relatively costly on modern pipelined processors. When profiling a GStreamer pipeline, this is the part that causes the most overhead when transmitting a large number of small buffers.
Is it possible to do better?
GStreamer has a mechanism called "buffer list" which can be used to reduce the overhead of pushing a single buffer.
The entry point for an element to use this functionality is the gst_pad_push_list function.
GstFlowReturn gst_pad_push_list (GstPad * pad, GstBufferList * list);
What buffer lists do is to group together a number of buffers so that they are forwarded through the pipeline as one operation, which can significantly reduce this overhead as the sequence of operations described above will happen once per list and not once per buffer.
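The saving can be made concrete with a small accounting model in plain C. Everything here is an assumption for illustration: ToyStats, toy_push and toy_push_list are hypothetical names, and TOY_SYNC_OPS_PER_PUSH is an arbitrary placeholder for the number of lock and atomic operations paid per pad crossing, not a measured GStreamer figure.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed, illustrative cost of one pad crossing (locks + atomics).
 * The real number depends on the GStreamer version and the pipeline. */
#define TOY_SYNC_OPS_PER_PUSH 6UL

typedef struct {
  unsigned long sync_ops;   /* synchronisation operations paid so far */
  unsigned long delivered;  /* buffers that reached the next element */
} ToyStats;

/* Model of pushing a single buffer: pay the crossing cost, deliver one. */
static void toy_push (ToyStats *stats)
{
  assert (stats != NULL);
  stats->sync_ops += TOY_SYNC_OPS_PER_PUSH;
  stats->delivered += 1;
}

/* Model of pushing a buffer list: pay the crossing cost once,
 * deliver every buffer in the list. */
static void toy_push_list (ToyStats *stats, size_t n_buffers)
{
  assert (stats != NULL);
  stats->sync_ops += TOY_SYNC_OPS_PER_PUSH;
  stats->delivered += (unsigned long) n_buffers;
}
```

With these assumptions, 100 calls to toy_push cost 600 synchronisation operations, while a single toy_push_list of 100 buffers costs 6 and delivers the same amount of data. The ratio grows with the list length, which is why the gain is largest for streams made of many small buffers.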
In case some elements do not support chaining buffer lists, GStreamer provides a fall-back mechanism like gst_pad_chain_list_default to push buffers one by one under the hood. This means that elements can always implement processing buffers in a list independently from the level of support in other elements.
This is nice for compatibility and allows incremental refinements. However, to actually avoid the bottlenecks of pushing individual buffers and to get the biggest performance improvements, all elements in a pipeline should natively support chaining buffer lists (i.e. have their own chain list function installed on their sink pads).
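The fall-back behaviour can be pictured with the following toy model in plain C. It is a sketch of the idea only, not GStreamer's implementation: ToySinkPad, toy_deliver_list and the other names are hypothetical, and buffers are reduced to plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sink pad; NOT the GStreamer API. */
typedef struct ToySinkPad ToySinkPad;

struct ToySinkPad {
  int (*chain) (ToySinkPad *pad, int buffer);
  /* Optional native list handler, NULL when the element has none. */
  int (*chain_list) (ToySinkPad *pad, const int *buffers, size_t n);
  unsigned chain_calls;       /* times the per-buffer handler ran */
  unsigned chain_list_calls;  /* times the native list handler ran */
  long sum;                   /* stand-in for "work done on the data" */
};

/* Fall-back: feed the list to the per-buffer handler one item at a time,
 * stopping at the first error (non-zero return) in this model. */
static int toy_chain_list_default (ToySinkPad *pad, const int *buffers, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    int ret = pad->chain (pad, buffers[i]);
    if (ret != 0)
      return ret;
  }
  return 0;
}

/* Deliver a list: use the native handler if installed, else fall back. */
static int toy_deliver_list (ToySinkPad *pad, const int *buffers, size_t n)
{
  assert (pad != NULL && pad->chain != NULL);
  if (pad->chain_list != NULL)
    return pad->chain_list (pad, buffers, n);
  return toy_chain_list_default (pad, buffers, n);
}

/* Example per-buffer handler. */
static int toy_chain (ToySinkPad *pad, int buffer)
{
  pad->chain_calls++;
  pad->sum += buffer;
  return 0;
}

/* Example native list handler: processes the whole list in one call. */
static int toy_chain_list (ToySinkPad *pad, const int *buffers, size_t n)
{
  pad->chain_list_calls++;
  for (size_t i = 0; i < n; i++)
    pad->sum += buffers[i];
  return 0;
}
```

Both paths produce the same result for the data, which is what makes the fall-back safe. The difference is only in how many times the element is entered: once per list with a native handler, once per buffer without it.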
The RTP specification, described in RFC 3550, defines a set of rules for the association of participants during a conversation using RTP. This association is called an "RTP Session".
In GStreamer, the core element that implements the session management is rtpsession.
The rtpsession element already had support for buffer lists in its send path but not in its receive path.
Let's consider the following pipeline built around the rtpsession element:
gst-launch-1.0 -e \
  rtpsession name=rtpsess \
  videotestsrc ! imagefreeze num-buffers=10000 ! video/x-raw,format=RGB,width=320,height=240 ! rtpvrawpay ! rtpsess.recv_rtp_sink \
  rtpsess.recv_rtp_src ! fakesink async=false sync=false
A test stream is generated (imagefreeze is used to reduce CPU usage in this case), split into RTP packets, processed by rtpsession, and consumed by a fakesink element.
The upstream element (rtpvrawpay) and downstream element (fakesink) could already chain buffer lists, but rtpsession could not.
After enabling buffer lists in rtpsession, the element throughput improved dramatically:
A simplified visual interpretation can be obtained using flamegraphs.
When pushing individual buffers the call graph is deeper:
When pushing buffer lists the call graph is more balanced:
To be fair, this huge improvement is only achievable in controlled use cases. In a generic real-world scenario the boost is currently limited by other factors.
Usually the rtpsession element is not used directly but via rtpbin, which, depending on the scenario, also connects it to other elements (like rtpjitterbuffer, rtpstorage, rtpssrcdemux), and the input may come from a remote source, like udpsrc.
Consider this more realistic pipeline:
gst-launch-1.0 -e \
  rtpbin name=rtpbin \
  udpsrc port=5000 caps="application/x-rtp,media=(string)video,clock-rate=(int)90000,encoding-name=RAW,payload=96,sampling=RGB,depth=(string)8,width=(string)320,height=(string)240" ! queue ! rtpbin.recv_rtp_sink_0 \
  rtpbin. ! fakesink async=false sync=false \
  udpsrc port=5001 caps=application/x-rtcp ! queue ! rtpbin.recv_rtcp_sink_0 \
  rtpbin.send_rtcp_src ! queue ! udpsink host=127.0.0.1 port=5003 sync=false async=false
This is the receiving pipeline for one sender. The two udpsrc elements receive RTP and RTCP respectively, while rtpbin handles all the RTP details, delivering media data to fakesink and RTCP replies for the other participant via udpsink.
Unless all elements support pushing buffer lists natively, there will still be bottlenecks due to individual buffer pushes.
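A rough way to see why a single element without list support matters is to count pad pushes along a linear chain of elements. The function below is a toy model in plain C under stated assumptions: toy_count_pushes is a hypothetical name, the source is assumed to emit one list, and no element downstream is assumed to regroup individual buffers into a list again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the pad pushes needed to move one list of n_buffers through a
 * linear chain. supports_lists[i] tells whether element i handles lists
 * natively and forwards them intact (assumption of this model). */
static unsigned long toy_count_pushes (const bool *supports_lists,
    size_t n_elements, size_t n_buffers)
{
  unsigned long total = 0;
  bool batched = true;  /* assumed: the source emits a single list */

  assert (supports_lists != NULL || n_elements == 0);

  for (size_t i = 0; i < n_elements; i++) {
    /* Push into element i: one push while the data is still a list,
     * otherwise one push per buffer. */
    total += batched ? 1UL : (unsigned long) n_buffers;

    /* An element without native list support is fed buffer by buffer
     * and, in this model, forwards the buffers one by one as well. */
    if (!supports_lists[i])
      batched = false;
  }
  return total;
}
```

Applied to the controlled pipeline discussed earlier, with rtpsession and fakesink as the two elements and a list of 100 buffers, the model gives 101 pushes when rtpsession lacks list support and 2 pushes once it has it. The earlier the non-supporting element sits in the chain, the more links pay the per-buffer cost.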
See a comparison of before and after using buffer lists in rtpsession with a pipeline that uses udpsrc and rtpbin:
The improvement is there but it is not as dramatic as in the controlled scenario.
The improvements in rtpsession available in GStreamer 1.18 are an important step towards a more efficient RTP implementation in high bitrate scenarios, but further work is needed (e.g. enabling buffer lists on udpsrc) to carry some of the theoretical improvements over into practical usage.