Gert Wollny
May 17, 2021
Collabora has been investing in Perfetto to enable driver authors and users to get deep insights into driver internals and GPU performance that were not previously visible. This post shows how we applied this work and other performance analysis tools to study a number of workloads on the virtualized VirGL implementation, and how we used this insight to improve performance by up to 6.2%.
Back in August 2019, I wrote about running games in a virtual machine by using virglrenderer. Now, let's look at how the code can be tweaked to squeeze out the last bit of performance.
As a first step in analyzing performance-related hot-spots in virglrenderer, perfetto, a tool for tracing and analyzing guest-host performance, was used to obtain a run-time profile of virglrenderer in conjunction with the OpenGL calls on the host and the guest side. Perfetto was already discussed in another blog post.
Here, the virtualization was provided by CrosVM. The whole analysis was done within a Docker environment that can be created with scripts provided with the virglrenderer source code and that is based on running apitrace on trimmed traces.
Figure 1: Screenshot of the perfetto trace visualization focusing on the command decoding loop. Note that many short commands are executed here, with the white gaps comprising the loop overhead.
This analysis revealed that the command stream decoding loses a lot of time between the actual function calls. From the performance point of view two things came into play:
Firstly, the command buffer query always dereferenced the context and the decode buffer pointers:
    static inline uint32_t get_buf_entry(struct vrend_decode_ctx *ctx, uint32_t offset)
    {
        return ctx->ds->buf[ctx->ds->buf_offset + offset];
    }
and secondly, in the decoding loop the next buffer offset was evaluated two times and errors were checked two times per command (loop skeleton):
    while (gdctx->ds->buf_offset < gdctx->ds->buf_total) {
        ...
        /* check if the guest is doing something bad */
        if (gdctx->ds->buf_offset + len + 1 > gdctx->ds->buf_total) {
            break;
        }
        ret = ... /* decode and run command */
        if (ret == EINVAL)
            goto out;
        if (ret == ENOMEM)
            goto out;
        gdctx->ds->buf_offset += (len) + 1;
    }
To improve the code, the pointer dereferencing was moved out of the get_buf_entry function to the beginning of command decoding, the decode loop was refactored to check for errors only once per command in the no-error case, and the position of the next command in the buffer is now evaluated only once. A few other refactorings were applied as well: the interface of the decode functions was unified, and the switch statement was replaced by a callback table. With that, the buffer query is now:
    static inline uint32_t get_buf_entry(const uint32_t *buf, uint32_t offset)
    {
        return buf[offset];
    }
and the loop (skeleton) reads:
    ... /* sanitize loop parameters */
    while (buf_offset < buf_total) {
        ...
        const uint32_t *buf = &typed_buf[buf_offset];
        buf_offset += len + 1;
        /* check if the guest is doing something bad */
        if (buf_offset > buf_total) {
            break;
        }
        ret = ... /* decode and run command */
        if (ret)
            return ret;
    }
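To make the combination of the refactored loop and the callback table concrete, here is a minimal, self-contained sketch. It is not virglrenderer's code: the header layout (command ID in the low 8 bits, payload length in the upper 16 bits), the two toy commands, and the names decode_cmd_fn, decode_table, and decode_buffer are assumptions made for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header layout: command ID in bits 0-7, payload length in bits 16-31. */
#define CMD_ID(header)  ((header) & 0xffu)
#define CMD_LEN(header) ((header) >> 16)

enum { CMD_NOP = 0, CMD_ADD = 1, CMD_COUNT };

/* Unified decode interface: every handler receives its state, a pointer to
 * the command's first word (the header), and the payload length in dwords. */
typedef int (*decode_cmd_fn)(uint32_t *state, const uint32_t *buf, uint32_t len);

static int decode_nop(uint32_t *state, const uint32_t *buf, uint32_t len)
{
    (void)state; (void)buf; (void)len;
    return 0;
}

static int decode_add(uint32_t *state, const uint32_t *buf, uint32_t len)
{
    if (len != 1)
        return EINVAL;
    *state += buf[1];   /* buf[0] is the header, buf[1] the payload word */
    return 0;
}

/* Callback table that takes the place of a switch statement. */
static const decode_cmd_fn decode_table[CMD_COUNT] = {
    [CMD_NOP] = decode_nop,
    [CMD_ADD] = decode_add,
};

static int decode_buffer(uint32_t *state, const uint32_t *typed_buf, uint32_t buf_total)
{
    uint32_t buf_offset = 0;

    while (buf_offset < buf_total) {
        const uint32_t *buf = &typed_buf[buf_offset];
        uint32_t len = CMD_LEN(buf[0]);
        uint32_t cmd = CMD_ID(buf[0]);

        /* Advance once; the new offset doubles as the bounds check. */
        buf_offset += len + 1;
        if (buf_offset > buf_total)
            break;
        if (cmd >= CMD_COUNT)
            return EINVAL;

        /* Single error check per command in the no-error case. */
        int ret = decode_table[cmd](state, buf, len);
        if (ret)
            return ret;
    }
    return 0;
}
```

Because each handler gets a pointer to the start of its own command, no handler needs to go through a context structure to find the buffer, which is the point of moving the dereferencing out of get_buf_entry.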
To also test a different environment, the following analysis was done using Qemu, and perf was used for instrumentation in order to zoom in on the instruction level. To obtain a performance profile, the Unigine Heaven benchmark was run in the guest and perf on the host.
The selection of the shader program was identified as the main hot-spot. In particular, in each draw call all already available shader programs are checked to see whether the current combination of shader stages and dual source state is already available as a linked program:
    {
        struct vrend_linked_shader_program *ent;
        LIST_FOR_EACH_ENTRY(ent, &ctx->sub->programs, head) {
            if (ent->dual_src_linked != dual_src)
                continue;
            if (ent->ss[PIPE_SHADER_COMPUTE])
                continue;
            if (ent->ss[PIPE_SHADER_VERTEX]->id != vs_id)
                continue;
            if (ent->ss[PIPE_SHADER_FRAGMENT]->id != fs_id)
                continue;
            ...
            return ent;
        }
        return NULL;
    }
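Various opportunities for optimization are immediately visible: the separate comparisons of the dual source state and of the vertex and fragment shader IDs can be folded into the comparison of a single combined 64-bit key, the single program list can be split into several shorter lists that are selected by the low bits of the vertex shader ID, and a program that was just used can be moved to the front of its list so that it is found sooner the next time.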
With these changes, lookup_shader_program now reads:
    #define VREND_PROGRAM_NQUEUE_MASK (VREND_PROGRAM_NQUEUES - 1)
    ...
    {
        uint64_t vs_fs_key = (((uint64_t)fs_id) << 32) |
                             (vs_id & ~VREND_PROGRAM_NQUEUE_MASK) |
                             (dual_src ? 1 : 0);
        struct vrend_linked_shader_program *ent;
        struct list_head *programs = &ctx->sub->gl_programs[vs_id & VREND_PROGRAM_NQUEUE_MASK];
        LIST_FOR_EACH_ENTRY(ent, programs, head) {
            if (likely(ent->vs_fs_key != vs_fs_key))
                continue;
            ...
            /* put the entry in front */
            if (programs->next != &ent->head) {
                list_del(&ent->head);
                list_add(&ent->head, programs);
            }
            return ent;
        }
        return NULL;
    }
An analysis with a program that uses reasonably complex 3D scenes, namely Unigine Heaven, showed that with just one program list, the body of the loop to find a program was run on average about 120 times. By re-inserting used programs at the front, this number was brought down to about 60. Since struct list_head is a struct of just two pointers, the memory overhead per additional array element in gl_programs is low compared with the run-time that can be saved by shortening the program lists that need to be searched linearly. In light of the numbers obtained from running the Unigine Heaven benchmark, a value of VREND_PROGRAM_NQUEUES = 64 was chosen. (Using a hash table to manage the shader programs was also considered, but its overhead resulted in a considerable performance regression.)
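The scheme of several short lists with move-to-front can be tried out in isolation. The following sketch assumes a minimal list implementation and the hypothetical names struct program, queues, add_program, and lookup_program; only the key construction and the move-to-front step follow the lookup_shader_program excerpt above. Because the low bits of the vertex shader ID select the queue, they carry no information within a queue, so bit 0 of the key is free to hold the dual source flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NQUEUES 64                      /* must be a power of two */
#define NQUEUE_MASK (NQUEUES - 1)

struct list_head { struct list_head *prev, *next; };

struct program {                        /* hypothetical stand-in */
    struct list_head head;              /* must stay the first member */
    uint64_t vs_fs_key;
};

static struct list_head queues[NQUEUES];

static void list_del(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
}

/* Insert item directly behind the list head, i.e. at the front. */
static void list_add(struct list_head *item, struct list_head *list)
{
    item->prev = list;
    item->next = list->next;
    list->next->prev = item;
    list->next = item;
}

static uint64_t make_key(uint32_t vs_id, uint32_t fs_id, bool dual_src)
{
    return ((uint64_t)fs_id << 32) |
           (vs_id & ~(uint32_t)NQUEUE_MASK) |
           (dual_src ? 1 : 0);
}

static void programs_init(void)
{
    for (int i = 0; i < NQUEUES; i++)
        queues[i].prev = queues[i].next = &queues[i];
}

static void add_program(struct program *p, uint32_t vs_id, uint32_t fs_id, bool dual_src)
{
    p->vs_fs_key = make_key(vs_id, fs_id, dual_src);
    list_add(&p->head, &queues[vs_id & NQUEUE_MASK]);
}

static struct program *lookup_program(uint32_t vs_id, uint32_t fs_id, bool dual_src)
{
    uint64_t key = make_key(vs_id, fs_id, dual_src);
    struct list_head *programs = &queues[vs_id & NQUEUE_MASK];

    for (struct list_head *it = programs->next; it != programs; it = it->next) {
        struct program *ent = (struct program *)it;  /* head is the first member */
        if (ent->vs_fs_key != key)
            continue;
        /* Move the hit to the front so that the next lookup finds it sooner. */
        if (programs->next != it) {
            list_del(it);
            list_add(it, programs);
        }
        return ent;
    }
    return NULL;
}
```

Each extra queue costs one struct list_head, that is, two pointers, which is why a fairly generous queue count is cheap in terms of memory.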
In virglrenderer, most functions used to take a vrend_context as a parameter, only to dereference the current vrend_sub_context later and never use the parent context; that is, the code was littered with statements containing ctx->sub->. To improve code clarity and to avoid this dereferencing, which the compiler might not always be able to optimize away, the code was refactored to pass the vrend_sub_context directly where possible, and to use a helper pointer for similar pointer dereferences that were used multiple times in functions or loops.
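The shape of this refactoring can be illustrated with a small before/after pair. The types and functions below (struct context, struct sub_context, count_enabled_before, count_enabled_after) are hypothetical stand-ins, not the real virglrenderer structures.

```c
#include <stddef.h>

#define NUM_SLOTS 4

/* Hypothetical stand-ins for vrend_sub_context and vrend_context. */
struct sub_context {
    int slot_enabled[NUM_SLOTS];
};

struct context {
    struct sub_context *sub;
};

/* Before: the parent context is passed although only ctx->sub is used,
 * and ctx->sub is spelled out again in every iteration. */
static int count_enabled_before(const struct context *ctx)
{
    int n = 0;
    for (size_t i = 0; i < NUM_SLOTS; i++)
        if (ctx->sub->slot_enabled[i])
            n++;
    return n;
}

/* After: the caller resolves ctx->sub once and passes it directly. */
static int count_enabled_after(const struct sub_context *sub)
{
    int n = 0;
    for (size_t i = 0; i < NUM_SLOTS; i++)
        if (sub->slot_enabled[i])
            n++;
    return n;
}
```

In a loop this small a compiler would most likely hoist the load of ctx->sub by itself. The explicit form matters more when calls that the compiler cannot see through sit between the accesses, because it then has to assume that ctx->sub may have changed.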
In a final round of optimizations, the hash function for virgl resources was changed to use not only xor but also a bit rotation, so as to distribute the input bits better and thereby reduce hash collisions, and a series of if-conditions was combined into one condition so that its evaluation exits as soon as the boolean result is known. Finally, the VBO setup used to be checked in each draw call in order to work around a bug in older Intel graphics driver versions. Since this bug can no longer be reproduced, the check was dropped.
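Why a rotation helps can be seen with a generic example. The actual virglrenderer hash function is not reproduced here; the functions hash_xor and hash_xor_rot and the rotation amount of 5 bits are assumptions chosen purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left; r must be in the range 1..31. */
static inline uint32_t rotl32(uint32_t v, unsigned r)
{
    return (v << r) | (v >> (32 - r));
}

/* Xor only: the result does not depend on the order of the input words,
 * so permutations of the same words always collide. */
static uint32_t hash_xor(const uint32_t *words, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h ^= words[i];
    return h;
}

/* Xor plus rotation: every word ends up at a different bit position
 * depending on where it appears in the input. */
static uint32_t hash_xor_rot(const uint32_t *words, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = rotl32(h, 5) ^ words[i];
    return h;
}
```

For the inputs {1, 2} and {2, 1}, hash_xor returns 3 in both cases, whereas hash_xor_rot returns 34 and 65 respectively.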
To analyze the performance improvements obtained by applying all these optimizations, a series of benchmarks was run. The benchmarks were executed on a computer running Gentoo Linux with an AMD FX-6300 processor and a Radeon RX 580 graphics card. Virtualization was provided by Qemu git-7c79721606b compiled to support the SDL interface. The Mesa version on both host and guest was 21.1.0-devel git103beecd36, and the guest OS was Ubuntu/Linux 20.10. The VM ran at a graphical resolution of 1440x900.
To get reproducible performance numbers, the Phoronix Test Suite was used to run these benchmarks, and a suite was created comprising four Unigine benchmarks, GLmark2, Open Arena, Xonotic, and GPUtest/FurMark.
The Unigine benchmarks cover different levels of OpenGL with high-quality texturing and graphical effects, Open Arena and Xonotic are two open source computer games with rather low requirements that can run at very high frame rates, and GLmark2 focuses on general-purpose 3D graphics. GPUtest/FurMark, on the other hand, is a GPU stress test that handles all relevant computations in shaders.
The results of running the benchmarks before and after applying the optimizations can be found at openbenchmark.org and are summarized in the following table:
Benchmark | Baseline FPS/Score | Optimized FPS/Score | Change (%) |
---|---|---|---|
Open Arena | 89.7 ± 0.2 | 89.3 ± 0.8 | -0.4 |
GLmark2 | 1273 | 1312 | 3.1 |
GPUtest/FurMark | 6293 ± 15.4 | 6492 ± 76.0 | 3.2 |
Unigine Heaven | 60.7 ± 0.6 | 64.5 ± 0.2 | 6.2 |
Unigine Sanctuary | 141.9 ± 1.5 | 145.8 ± 2.0 | 2.7 |
Unigine Tropics | 118.4 ± 0.2 | 121.9 ± 0.2 | 2.9 |
Unigine Valley | 41.6 ± 0.0 | 42.5 ± 0.1 | 2.1 |
Xonotic | 76.6 ± 0.2 | 77.9 ± 0.3 | 1.7 |
Seven out of the eight selected benchmarks showed an increase in the framerate/benchmark score, and only one regression can be seen. A look at the confidence intervals of the results shows that this one regression is actually not significant, but six of the seven reported improvements are.
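The post does not state how significance was judged. One simple reading, assumed here, is to call a change significant when the reported ± intervals of baseline and optimized run do not overlap. The sketch below implements that test together with the relative change; struct result, change_percent, and intervals_disjoint are illustrative names.

```c
#include <stdbool.h>

/* A benchmark result as given in the table: mean and reported +/- value. */
struct result {
    double mean;
    double err;
};

/* Relative change of the optimized run with respect to the baseline, in percent. */
static double change_percent(struct result base, struct result opt)
{
    return 100.0 * (opt.mean - base.mean) / base.mean;
}

/* True if the intervals [mean - err, mean + err] of the two runs do not overlap. */
static bool intervals_disjoint(struct result base, struct result opt)
{
    return base.mean + base.err < opt.mean - opt.err ||
           opt.mean + opt.err < base.mean - base.err;
}
```

With the table values, the Open Arena intervals (89.5 to 89.9 and 88.5 to 90.1) overlap, so its regression does not pass this test, whereas the Unigine Sanctuary intervals (140.4 to 143.4 and 143.8 to 147.8) are disjoint. For a row without a reported ± value, such as GLmark2, the test cannot be applied.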
A number of micro-optimizations were applied to virglrenderer that, each taken on its own, would probably not give a notable performance improvement, but all taken together show an increase in performance for most of the selected benchmarks. With these changes, perf no longer shows any performance hot-spots in the code that can easily be optimized.
Future work to improve the performance of virglrenderer will focus on further reducing small overheads, e.g. by re-arranging and compressing data structures to optimize cache usage. On a higher level, optimizing the guest-host synchronization, reducing one-time overheads like recompiling shaders, and optimizing the command stream are currently being investigated.