Louis-Francis Ratté-Boulianne
July 09, 2020
Earlier this year, we announced a new project in partnership with Microsoft: the implementation of OpenCL and OpenGL to DirectX 12 translation layers (Git repository). Time for a summer update! In this blog post, I will explain a little more about the OpenGL part of this work and more specifically the steps that have been taken to improve the performance of the OpenGL-On-D3D12 driver.
In the initial steps of this project, we quickly realized that the best way forward was to build on top of Mesa. Zink, a project started by Erik Faye-Lund, has already proven that we could achieve a similar goal: translating OpenGL to a lower-level graphics API (s/Vulkan/DirectX12/). People familiar with that project will therefore experience a strong déjà-vu feeling when looking at the architecture of our current effort:
The Mesa state tracker is responsible for translating OpenGL state (blend modes, texture state, etc) and drawing commands (like glDrawArrays and glDrawPixels) into objects and operations that map well to modern GPU hardware features (Gallium API). The "D3D12 Driver" is thus an implementation of that interface.
On the shader side, the state tracker is able to convert OpenGL fixed-function operations, traditionally implemented directly by the hardware, into shaders. Mesa will also translate GLSL shaders into an intermediate representation named NIR. We use that representation to produce the DXIL bytecode consumed by DirectX. I'm not going to focus on the NIR-to-DXIL compiler here as it definitely deserves its own blog post.
Finally, a different component of Mesa, the WGL State tracker, is handling WGL calls (API between OpenGL and the windowing system interface of Windows). Internally, an existing implementation of the windowing system was using GDI (Graphics Device Interface) to actually display the rendered frames on the screen. We added a new implementation using DXGI Swapchains. More on that later.
In order to better understand the next sections, let's dive a little more into the details of DirectX 12.
DirectX 12 requires that we record commands (e.g. clearing the render target, draw calls, etc.) into an ID3D12GraphicsCommandList and then call ID3D12CommandQueue::ExecuteCommandLists() to actually process them. But before we can record drawing commands, we first need to set some state on the command list, including (not an exhaustive list) the pipeline state object (PSO), the root signature, descriptor heaps, vertex and index buffers, viewports and render targets.
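To make the record-then-execute model concrete, here is a minimal C++ sketch (not taken from the driver; the function name and the amount of state shown are simplified) of recording a couple of commands into a command list and handing it to the queue:

```cpp
#include <windows.h>
#include <d3d12.h>

// Minimal sketch: record a clear and a draw, then submit. Root signature,
// viewport, vertex buffers, etc. are assumed to be set up elsewhere.
void record_and_submit(ID3D12CommandQueue *queue,
                       ID3D12CommandAllocator *allocator,
                       ID3D12GraphicsCommandList *cmdlist,
                       ID3D12PipelineState *pso,
                       D3D12_CPU_DESCRIPTOR_HANDLE rtv)
{
    // Re-using a command list means resetting both it and its allocator,
    // which also wipes all previously set pipeline state.
    allocator->Reset();
    cmdlist->Reset(allocator, pso);

    const float clear_color[4] = { 0.f, 0.f, 0.f, 1.f };
    cmdlist->OMSetRenderTargets(1, &rtv, FALSE, nullptr);
    cmdlist->ClearRenderTargetView(rtv, clear_color, 0, nullptr);
    cmdlist->DrawInstanced(3, 1, 0, 0);   // one triangle

    cmdlist->Close();

    // Nothing actually runs on the GPU until the list reaches the queue.
    ID3D12CommandList *lists[] = { cmdlist };
    queue->ExecuteCommandLists(1, lists);
}
```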
Vulkan follows a similar model, with VkPipeline objects encapsulating state such as image formats, render-target attachments, blend modes, and shaders, all bound into a single object. Like recorded DirectX command lists, Vulkan pipelines are immutable once created. This is one of the biggest sources of impedance mismatch when translating from GL, where applications set global state parameters and the final state is only known when a draw call is submitted.
Our initial implementation of the driver was as straightforward as possible, as we wanted to validate our approach and not focus too much on performance early on. Mission accomplished: it was really slow! For each draw call, we were setting the entire pipeline state, filling the descriptor heaps, recording the draw command, immediately executing the command list, and waiting for it to finish.
When drawing a scene with 6 textured triangles (red, green, blue, red, green, blue), the sequence of events would look like this:
This is of course extremely inefficient, and one easy way to reduce latency is to batch multiple commands (clear, draw, etc.) together. Concretely, we create multiple batch objects, each of which contains a command allocator and a set of descriptor heaps (sampler and CBV/SRV heaps). Commands are recorded into the batch until the descriptor heaps are full, at which point we can simply execute (send the commands to the GPU), create a fence (for future CPU/GPU synchronization) and start a new batch. Given that queuing a command and its actual execution are now decoupled, this optimization also requires that we keep track of the resources each batch needs. For example, we need to make sure not to delete a texture that a queued draw call samples from before that call has actually executed.
Batch |
---|
Command Allocator |
Sampler Descriptor Heap |
CBV/SRV Descriptor Heap |
Tracked Objects |
Fence |
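In code, the batch bookkeeping could look roughly like the following sketch, mirroring the table above (field names are mine; the real structure in the Mesa d3d12 driver differs in detail):

```cpp
#include <d3d12.h>
#include <cstdint>
#include <vector>

// Hypothetical per-batch state.
struct batch {
   ID3D12CommandAllocator *cmd_allocator;  // backing storage for recorded commands
   ID3D12DescriptorHeap   *sampler_heap;   // shader-visible sampler descriptors
   ID3D12DescriptorHeap   *view_heap;      // shader-visible CBV/SRV descriptors
   std::vector<ID3D12Resource *> tracked;  // resources kept alive until execution finishes
   ID3D12Fence            *fence;          // signaled by the GPU when the batch completes
   uint64_t                fence_value;    // value to wait for before recycling the batch
};
```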
When all of the batch objects are in use (wrap-around), we wait for the oldest submitted batch to complete. It is then safe to unreference all of its tracked resources and re-use the allocator and heaps. Assuming a maximum of two active batches and descriptor heaps just big enough for two draw calls (let's start small), drawing our 6 triangles looks like this:
It is important to note that additional flushing (waiting for some or all of the commands to finish) is needed in some scenarios. The main one is mapping a resource currently used by a queued command (texturing, blit source/destination) for access by the CPU.
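A hedged sketch of that flush-on-map path, re-using the hypothetical batch structure from the sketch above (batch_references() and flush_batch() are invented helper names, not driver functions):

```cpp
#include <windows.h>
#include <d3d12.h>

// Hypothetical helpers assumed to exist elsewhere in the driver.
bool batch_references(struct batch *b, ID3D12Resource *res);
void flush_batch(struct batch *b);   // ExecuteCommandLists() + fence Signal()

void *map_for_cpu(struct batch *batches, unsigned num_batches,
                  ID3D12Resource *res, HANDLE event)
{
   for (unsigned i = 0; i < num_batches; i++) {
      if (!batch_references(&batches[i], res))
         continue;

      // The resource is used by pending commands: submit them and wait.
      flush_batch(&batches[i]);
      batches[i].fence->SetEventOnCompletion(batches[i].fence_value, event);
      WaitForSingleObject(event, INFINITE);
   }

   // No more pending GPU access, so the CPU can safely touch the memory.
   void *ptr = nullptr;
   res->Map(0, nullptr, &ptr);
   return ptr;
}
```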
In real-life situations, it is really rare that ALL of the state bits change between draw calls. It is probably safe to assume, for example, that the viewport stays constant during a frame and that the blending state won't budge much either. So, in a similar fashion to how (real) hardware drivers are implemented, we use dirty-state flags to keep track of which state has changed. We still need to re-assert the entire state when starting a new batch (resetting a command list also resets the entire pipeline state), but it saves us some CPU cycles when recording multiple draw commands in one batch (the common case).
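As an illustration, dirty tracking boils down to something like the following sketch (flag and function names are invented, not the driver's):

```cpp
#include <cstdint>

enum : uint32_t {
   DIRTY_VIEWPORT = 1u << 0,
   DIRTY_BLEND    = 1u << 1,
   DIRTY_SHADER   = 1u << 2,
   // ... one bit per piece of state we care about
   DIRTY_ALL      = ~0u,
};

struct context {
   uint32_t dirty;
   // ... shadow copies of the current state live here
};

// State setters only record the change and flip a bit.
void set_viewport(context *ctx /*, new viewport */)
{
   ctx->dirty |= DIRTY_VIEWPORT;
}

// Called before each draw: re-emit only what actually changed.
void validate_state(context *ctx, bool fresh_command_list)
{
   if (fresh_command_list)
      ctx->dirty = DIRTY_ALL;   // a reset command list has lost all state

   if (ctx->dirty & DIRTY_VIEWPORT) { /* cmdlist->RSSetViewports(...) */ }
   if (ctx->dirty & DIRTY_BLEND)    { /* feeds into the PSO key, see below */ }

   ctx->dirty = 0;
}
```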
In addition, given that PSO (Pipeline State Object) creation is relatively costly, we cache PSOs. If any of the PSO-related dirty flags are set, we search the cache and, in case of a miss, create a new PSO.
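A rough sketch of such a cache follows; hashing over the real D3D12_GRAPHICS_PIPELINE_STATE_DESC contents is simplified to a single precomputed hash, and the names are mine:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <cstdint>
#include <unordered_map>

struct pso_key {
   uint64_t hash;   // stands in for blend/raster/depth state, shaders, RT formats, ...
   bool operator==(const pso_key &o) const { return hash == o.hash; }
};

struct pso_key_hash {
   size_t operator()(const pso_key &k) const { return (size_t)k.hash; }
};

static std::unordered_map<pso_key, ID3D12PipelineState *, pso_key_hash> pso_cache;

ID3D12PipelineState *get_pso(ID3D12Device *dev, const pso_key &key,
                             const D3D12_GRAPHICS_PIPELINE_STATE_DESC &desc)
{
   auto it = pso_cache.find(key);
   if (it != pso_cache.end())
      return it->second;                  // hit: skip the costly creation

   ID3D12PipelineState *pso = nullptr;
   dev->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
   pso_cache.emplace(key, pso);
   return pso;
}
```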
The total rendering time hasn't changed much in our example scenario, but the CPU usage is lowered. Another effect of not re-asserting the descriptor heaps on each draw call is that we can sometimes fit more commands in one batch without allocating bigger heaps.
Initially, only CPU-driven GDI winsys integration was implemented. There are two downsides to that approach.
Let's zoom out and see what is happening when drawing 4 frames. For the purpose of this next diagram, we'll assume that we can draw an entire frame using only one batch:
By implementing integration with DXGI swapchains (only supported for double-buffered pixel formats; we still rely on the old GDI code path otherwise), we can solve these two issues. The swapchain provides us with a back buffer into which the GPU can render the current frame. It also keeps track of a front buffer that is used for the display. When the application wants to present the next frame (wglSwapBuffers), the swapchain simply flips these two buffers.
Please note that this diagram is only valid for full-screen applications with throttling disabled (wglSwapInterval(0)). When syncing with V-Sync, the GPU might also introduce some stalling to make sure it doesn't render over the currently displayed buffer. When rendering in windowed mode, the window manager will use the front buffer to compose the final display scene. In some situations, it can also use the buffer directly, without any blitting, if the hardware supports overlays.
One final caveat: the application will suffer a performance hit when drawing to the front buffer (glDrawBuffer(GL_FRONT)). The buffer-flip presentation model rules out that possibility; the front buffer needs to stay intact for scanout. If that happens, we have no choice but to create a fake front buffer and perform some copies before drawing and swapping buffers.
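For reference, the flip itself is very little code on the DXGI side. Here is a simplified sketch of what wglSwapBuffers roughly maps to in this path (swapchain creation and render-target transitions are omitted, and the buffer bookkeeping is hypothetical):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>

// Present the finished back buffer and find out which buffer to render into next.
void swap_buffers(IDXGISwapChain3 *swapchain, ID3D12Resource *buffers[2],
                  ID3D12Resource **next_backbuffer)
{
   // SyncInterval 0 matches the wglSwapInterval(0) case discussed above;
   // 1 would throttle to V-Sync.
   swapchain->Present(0, 0);

   // With the flip model, the swapchain tells us which buffer is now free.
   UINT index = swapchain->GetCurrentBackBufferIndex();
   *next_backbuffer = buffers[index];
}
```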
Resource state transition barriers require that we specify both the current state of the resource and the desired new state. The naive approach is to add a state barrier before and after each resource usage (COMMON -> RENDER_TARGET, draw, RENDER_TARGET -> COMMON). But that solution has a real performance cost: each transition may involve layout changes, resolves, or even copies to shadow resources, so whilst this is an acceptable crutch for short-term development, the cost of always falling back to the lowest common denominator is too high for real-world usage. Getting the details right for a better solution is tricky, however.
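For clarity, the naive pattern described above looks roughly like this in code (a sketch of what we wanted to get away from, not the current driver):

```cpp
#include <windows.h>
#include <d3d12.h>

// Naive approach: transition the render target out of COMMON before the draw
// and straight back afterwards. Correct, but every pair of barriers may cost
// layout changes, resolves or shadow-resource copies.
void naive_draw(ID3D12GraphicsCommandList *cmdlist, ID3D12Resource *rt)
{
   D3D12_RESOURCE_BARRIER barrier = {};
   barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
   barrier.Transition.pResource = rt;
   barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
   barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_COMMON;
   barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_RENDER_TARGET;
   cmdlist->ResourceBarrier(1, &barrier);

   cmdlist->DrawInstanced(3, 1, 0, 0);

   barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
   barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_COMMON;
   cmdlist->ResourceBarrier(1, &barrier);
}
```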
Luckily for us, the Microsoft team had already worked on a solution to a very similar problem. They previously developed a project named D3D11On12 (yep, you guessed it: translating the Direct3D 11 API to Direct3D 12), which itself relies on the D3D12TranslationLayer. They were able to adapt some code from the latter into our Mesa tree, fixing all of the problems mentioned above.
It is not always optimal to create a new committed resource for each buffer. To speed up resource allocation, we don't immediately destroy unreferenced buffers but instead try to re-use them for new allocations. Allocating a whole resource for a small buffer is also inefficient because of alignment requirements, so we create buffer slabs on demand and sub-allocate smaller buffers from them. In the future, it might be possible to implement a similar scheme for textures, but this approach is strictly used for buffers as of now.
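A simplified sketch of the slab idea (a plain bump allocator with invented names; the real code also tracks freed ranges and per-slab lifetimes):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <cstdint>

// One large committed resource backs many small GL buffers.
struct buffer_slab {
   ID3D12Resource *resource;
   uint64_t        size;
   uint64_t        next_offset;
};

struct suballocation {
   ID3D12Resource *resource;   // the shared slab resource
   uint64_t        offset;     // where this buffer lives inside it
   uint64_t        size;
};

// Returns false when the slab is exhausted; the caller then creates a new slab.
bool slab_alloc(buffer_slab *slab, uint64_t size, uint64_t alignment,
                suballocation *out)
{
   uint64_t offset = (slab->next_offset + alignment - 1) & ~(alignment - 1);
   if (offset + size > slab->size)
      return false;

   slab->next_offset = offset + size;
   *out = { slab->resource, offset, size };
   return true;
}
```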
TL;DR: all of these incremental changes add up to an amazing result: less CPU time wasted waiting on completions (pipelining), less CPU time wasted on overhead (batching), less CPU time wasted on redundant operations (caching), more efficient memory usage (suballocation), a zero-copy presentation pipeline (DXGI), and more efficient GPU usage (explicit image states that aren't COMMON).
For those who just came for the screenshot:
And some numbers I compiled for the Doom 3 timedemo benchmark. In this very specific scenario, on my system (mobile Intel CPU/GPU), the cumulative gain of our changes is around 40x!
Step | FPS |
---|---|
Initial State | 1.0 |
Command Batching | 4.2 |
Dirty State & PSO Cache | 12.9 |
DXGI Swapchain * | 14.8 |
Resource State Manager | 24.6 |
Buffer Caching/Suballoc | 42.5 |
* The improvement from switching to DXGI is small because resource creation requires taking a kernel lock.
Disclaimer: you might not be able to replicate these numbers with the main repository, as I disabled the debug layer and changed the descriptor heap sizes for my benchmark.
Here are some of the ideas we could consider going forward:
Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Erik Faye-Lund, Gert Wollny) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).
Comments (9)
theuserbl:
Jul 09, 2020 at 10:46 PM
On Linux there are ways to run Direct3D on top of Vulkan and OpenGL. And on Windows it is the other way around, OpenGL and OpenCL on top of DirectX12.
And then there are existing graphics cards which support DirectX and OpenGL.
Why isn't it possible to support OpenGL and OpenCL with Mesa directly on Windows, without sitting on top of DirectX12?
On Linux, Mesa also works directly. Why not on Windows?
Erik Faye-Lund:
Jul 10, 2020 at 07:04 AM
There are several reasons why it's beneficial to build this on top of Direct3D 12 on Windows rather than building native OpenGL GPU drivers for Windows:
1. Existing drivers: The existing drivers in Mesa depend on a lot of non-Windows infrastructure (like DRM/DRI). Porting every driver over to support WGL natively would, among other things, require the introduction of a new Windows kernel driver per GPU. This is obviously a lot of work.
2. New drivers: Future GPUs need future work. With work like Zink in place for Linux, there's a reasonable chance some future GPUs won't get full native OpenGL support in Mesa. That would mean more work in enabling support for new GPUs, including writing new kernel drivers. Combine this with the fact that OpenGL is unlikely to get much further development, and this isn't a very attractive value proposition.
3. Ecosystem: Microsoft's graphics ecosystem is based on D3D12. This is the API they support, document, ask their vendors to implement support for, and certify the implementations. It's only natural for them to build on that rather than starting over and having to certify drivers for a new API.
Just to be clear, there's nothing forcing users who have a GPU with a native OpenGL implementation to use the D3D12 based one. This simply adds an option to run OpenGL applications for those who don't.
I hope this clears things up a bit.
lostpixel:
Jul 17, 2020 at 10:29 PM
I have some questions:
1. OpenGL drivers often come with a certain penalty; some features are disabled in "game" versions and full/unlocked hardware implementations of OpenGL usually come in "pro" versions (like GeForce vs Quadro). Do you think that OpenGL over DX12 can give more "hardware" acceleration than what the vendor's driver implements? Is the DX12 driver also crippled in the same way, or is it possible to implement OpenGL in shaders or some DX functionality with as much hardware acceleration as possible?
2. Some optimizations: the 90's cruft, as you call it, is actually a really nice conceptual model ('old-school' immediate mode). Unfortunately it is also very inefficient. I am talking about glBegin() & glEnd() and friends. A software implementation on top of DX12 could implement glVertex* & Co. calls as inlined functions, macros, whatever, that do not result in a myriad of function calls, stuff data into some hidden VBO, and render everything as 'retained' behind the user's back. Are there any such plans, if that is possible?
3. If used on platforms without DX (i.e. Linux), will it be possible to "pass through" GL calls to the vendor driver if such is available? Or maybe a VK implementation?
Daniel Stone:
Jul 20, 2020 at 10:14 AM
1) Interesting question but I expect the answer is no. If vendors want to differentiate on price point, then I expect they'd do that uniformly across all their drivers. The only hypothetical exception is if a vendor wants to differentiate on feature enablement when the hardware support is really uniform, _but_ DirectX requires certain features and OpenGL makes them optional. In that case, GLon12 could hypothetically unlock more features. In many cases though, even if the hardware is uniform, the difference is QA: for example, if you have a consumer and a workstation product which do have a uniform hardware base, often the workstation dies are the ones which passed QA everywhere, but the consumer version failed for some part of it, so those bits of the hardware are masked off as being broken. I don't think this is hugely likely.
2) This is exactly what we already do! :) Mesa implements support for immediate mode by recording the state into caches, recording the user data into buffers (e.g. vertices into VBOs), etc, so at the end our driver does execute exactly like a modern DX12 client, with all the immediate-mode stuff left as an upper-level implementation detail. This is true of all modern Mesa drivers, as no-one has immediate-mode support anymore: if you run Mesa's AMD or Intel (or Arm Mali, Qualcomm Snapdragon, etc) drivers on Linux, you'll get the exact same thing.
3) Yeah, we have some prior art for that as well, in different drivers. VirGL is a virtualised GL-on-GL driver (used in ChromeOS amongst others), and Zink is a GL-on-Vulkan driver which shares a lot of code and concepts with GLon12.
MATHEUS EDUARDO GARBELINI:
Sep 14, 2020 at 08:34 AM
Is there some guide/tutorial on how to try this on WSL2, or is building and installing Mesa from the referenced repository enough?
Erik Faye-Lund:
Sep 14, 2020 at 03:29 PM
WSL2 support is still not implemented. It should be coming up soon, though.
Michael L:
Nov 15, 2020 at 04:50 AM
Pretty sure thousands right now are putting their faith in this for the new Radeon release. I guess everyone expects OpenGL performance with emulators and Minecraft to still remain... questionable. I've been linked to this page many times now.
Of course, you guys are doing god's work, though I think people wonder when this will fully come to fruition.
Daniel Stone:
Nov 15, 2020 at 02:31 PM
There’s been a lot of development to improve performance and there’ll be more still. I don’t think our work will ever be done as such, but it’s now well past the proof-of-concept stage and into something you can actually use.
MATHEUS EDUARDO GARBELINI:
Nov 16, 2020 at 07:04 AM
This is really good, as this is crucial to many applications that rely on OpenGL 3 functionality and previously would only run under a complete virtual machine.