Louis-Francis Ratté-Boulianne
July 09, 2020
Earlier this year, we announced a new project in partnership with Microsoft: the implementation of OpenCL and OpenGL to DirectX 12 translation layers (Git repository). Time for a summer update! In this blog post, I will explain a little more about the OpenGL part of this work and more specifically the steps that have been taken to improve the performance of the OpenGL-On-D3D12 driver.
In the initial steps of this project, we quickly realized that the best way forward was to build on top of Mesa. Zink, a project started by Erik Faye-Lund, has already proven that we could achieve a similar goal: translating OpenGL to a lower-level graphics API (s/Vulkan/DirectX12/). People familiar with that project will therefore experience a strong déjà-vu feeling when looking at the architecture of our current effort:
The Mesa state tracker is responsible for translating OpenGL state (blend modes, texture state, etc) and drawing commands (like glDrawArrays and glDrawPixels) into objects and operations that map well to modern GPU hardware features (Gallium API). The "D3D12 Driver" is thus an implementation of that interface.
On the shader side, the state tracker is able to convert OpenGL fixed-function operations, traditionally implemented directly by the hardware, into shaders. Mesa will also translate GLSL shaders into an intermediate representation named NIR. We use that representation to produce the DXIL bytecode consumed by DirectX. I'm not gonna focus on the NIR-to-DXIL compiler here as it definitely deserves its own blog post.
Finally, a different component of Mesa, the WGL state tracker, handles WGL calls (the API between OpenGL and the windowing system interface of Windows). Internally, the existing windowing system implementation used GDI (Graphics Device Interface) to actually display the rendered frames on the screen. We added a new implementation using DXGI swapchains. More on that later.
In order to better understand the next sections, let's dive a little more into the details of DirectX 12.
DirectX 12 requires that we record commands (e.g. clearing the render target, draw calls, etc.) into an ID3D12GraphicsCommandList and then submit the list to a command queue with ExecuteCommandLists() to actually process the commands. But before we can record drawing commands, we first need to set some state on the command list, including (not an exhaustive list) the pipeline state object (shaders, blend, rasterizer and depth-stencil state), the root signature, the descriptor heaps, the vertex and index buffers, the viewports and scissor rectangles, and the render targets.
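To make this model concrete, here is a minimal sketch (not the driver's actual code) of recording and submitting a single draw with the D3D12 API; the function name and the elided state setup are illustrative only:

```cpp
// Minimal sketch of the D3D12 submission model (illustrative; error handling omitted).
#include <d3d12.h>

void record_and_submit(ID3D12GraphicsCommandList *list,
                       ID3D12CommandAllocator *allocator,
                       ID3D12PipelineState *pso,
                       ID3D12CommandQueue *queue)
{
    // A command list only records commands...
    list->Reset(allocator, pso);
    //   ... state setup: root signature, descriptor heaps, viewports, render targets, ...
    list->DrawInstanced(3, 1, 0, 0);   // one triangle
    list->Close();

    // ... nothing reaches the GPU until the list is executed on a command queue.
    ID3D12CommandList *lists[] = { list };
    queue->ExecuteCommandLists(1, lists);
}
```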
Vulkan follows a similar model, with VkPipeline objects encapsulating state such as image formats, render-target attachments, blend modes, and shaders, all bound into a single object. Like D3D12 pipeline state objects, Vulkan pipelines are immutable once created. This is one of the biggest sources of impedance mismatch when translating from OpenGL, where applications set global state parameters and the final state is only known when a draw call is submitted.
Our initial implementation of the driver was as straightforward as possible, as we wanted to validate our approach and not focus too much on performance early on. Mission accomplished, it was really slow! For each draw call, we were setting the entire pipeline state, filling the descriptor heaps, recording the draw command, immediately executing the command list, and waiting for it to finish.
When drawing a scene with 6 textured triangles (red, green, blue, red, green, blue), the sequence of events would look like this:
This is of course extremely inefficient, and one easy way to reduce latency is to batch multiple commands (clear, draw, etc.) together. Concretely, we create multiple batch objects that each contain a command allocator and a set of descriptor heaps (sampler and CBV/SRV heaps). Commands are recorded into the current batch until its descriptor heaps are full, at which point we execute it (send the commands to the GPU), create a fence (for future CPU/GPU synchronization) and start a new batch. Given that queuing a command and its actual execution are now decoupled, this optimization also requires that we keep track of the resources each batch needs. For example, we must not delete a texture that a queued draw call will still sample from when it eventually executes.
| Batch |
|---|
| Command Allocator |
| Sampler Descriptor Heap |
| CBV/SRV Descriptor Heap |
| Tracked Objects |
| Fence |
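In code, a batch boils down to something like this (a simplified sketch; the exact types and field names in the driver differ):

```cpp
// Simplified sketch of a batch object (field names are illustrative).
#include <cstdint>
#include <vector>
#include <d3d12.h>

struct batch {
    ID3D12CommandAllocator *cmd_allocator;   // backs the recorded commands
    ID3D12DescriptorHeap   *sampler_heap;    // sampler descriptors
    ID3D12DescriptorHeap   *view_heap;       // CBV/SRV descriptors
    std::vector<IUnknown *> tracked_objects; // kept referenced until execution completes
    ID3D12Fence            *fence;           // signalled when the GPU is done with the batch
    uint64_t                fence_value;     // value to wait for on wrap-around
};
```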
When all of the batch objects are in use (wrap-around), we wait for the oldest submitted batch to complete. It is then safe to unreference all of its tracked resources and re-use its allocator and heaps. Assuming a maximum of two active batches and heaps just big enough for two draw calls (let's start small), drawing our 6 triangles looks like this:
It is important to note that some additional flushing (waiting for some or all of the commands to finish) is still needed in a few scenarios, the main one being when mapping a resource currently used by a queued command (as a texture, or as a blit source/destination) for access by the CPU.
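Both the wrap-around wait and these explicit flushes boil down to waiting on a batch's fence, roughly like this (a sketch; the event handle is assumed to come from CreateEvent()):

```cpp
// Block the CPU until the GPU has reached the given fence value (sketch only).
#include <cstdint>
#include <windows.h>
#include <d3d12.h>

void wait_for_fence(ID3D12Fence *fence, uint64_t value, HANDLE event)
{
    if (fence->GetCompletedValue() < value) {
        fence->SetEventOnCompletion(value, event);
        WaitForSingleObject(event, INFINITE);
    }
}
```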
In real-life situations, it is really rare that ALL of the state bits change between draw calls. It is probably safe to assume, for example, that the viewport stays constant during a frame and that the blending state won't budge much either. So, in a similar fashion to how (real) hardware drivers are implemented, we use dirty-state flags to keep track of which state has changed. We still need to re-assert the entire state when starting a new batch (resetting a command list also resets the entire pipeline state), but it saves us some CPU cycles when recording multiple draw commands in one batch (the common case).
In addition to that, and given that PSO (Pipeline State Object) creation is relatively costly, we cache PSOs. If any of the PSO-related dirty flags are set, we search the cache and, on a miss, create a new PSO.
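The mechanics look roughly like this (a hand-wavy sketch; the flag names and helper functions are made up for illustration):

```cpp
// Sketch of dirty-state tracking and PSO caching (names are illustrative only).
#include <cstdint>
#include <unordered_map>
#include <d3d12.h>

enum : uint32_t {
    DIRTY_BLEND    = 1u << 0,
    DIRTY_VIEWPORT = 1u << 1,
    DIRTY_SHADERS  = 1u << 2,
    /* ... one bit per piece of OpenGL state ... */
    DIRTY_PSO_BITS = DIRTY_BLEND | DIRTY_SHADERS,
};

struct context {
    uint32_t dirty = ~0u;    // everything is dirty when a new batch starts
    std::unordered_map<uint64_t, ID3D12PipelineState *> pso_cache;
};

// Hypothetical helpers: hash the PSO-related state / build a new PSO from it.
uint64_t hash_pso_state(const context &ctx);
ID3D12PipelineState *create_pso(const context &ctx);

ID3D12PipelineState *get_pso(context &ctx)
{
    if (!(ctx.dirty & DIRTY_PSO_BITS))
        return nullptr;                        // the currently bound PSO is still valid

    const uint64_t key = hash_pso_state(ctx);  // cheap compared to creating a PSO
    auto it = ctx.pso_cache.find(key);
    if (it == ctx.pso_cache.end())
        it = ctx.pso_cache.emplace(key, create_pso(ctx)).first;
    return it->second;
}
```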
The total rendering time hasn't changed much in our example scenario, but CPU usage is lower. Another effect of not re-asserting the descriptor heaps on each draw call is that we can sometimes fit more commands into one batch without allocating bigger heaps.
Initially, only CPU-driven GDI winsys integration was implemented. There are two downsides to that approach: the rendered frame has to be copied back through system memory before it can be displayed, and the CPU has to wait for the GPU to finish rendering before performing that copy, which prevents any pipelining between frames.
Let's zoom out and see what is happening when drawing 4 frames. For the purpose of this next diagram, we'll assume that we can draw an entire frame using only one batch:
By implementing integration with DXGI swapchains (only supported for double-buffered pixel formats; we still fall back to the old GDI code path otherwise), we can solve these two issues. The swapchain provides us with a back buffer into which the GPU renders the current frame, and it keeps track of a front buffer used for the display. When the application wants to present the next frame (wglSwapBuffers), the swapchain simply flips these two buffers.
Please note that this diagram is only valid for full-screen applications with throttling disabled (wglSwapInterval(0)). When syncing to V-Sync, the GPU might also stall to make sure it doesn't render over the buffer currently being displayed. When rendering in windowed mode, the window manager uses the front buffer to compose the final scene; in some situations it can even use the buffer directly, without any blitting, if the hardware supports overlays.
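Setting up such a flip-model swapchain for a D3D12 command queue looks roughly like this (a minimal sketch; formats, sizes, and error handling are simplified):

```cpp
// Minimal sketch of creating a flip-model DXGI swapchain for a D3D12 queue.
#include <d3d12.h>
#include <dxgi1_4.h>

IDXGISwapChain1 *create_swapchain(IDXGIFactory2 *factory, ID3D12CommandQueue *queue,
                                  HWND window, UINT width, UINT height)
{
    DXGI_SWAP_CHAIN_DESC1 desc = {};
    desc.Width = width;
    desc.Height = height;
    desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.BufferCount = 2;                            // front + back buffer
    desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD; // flip presentation model

    IDXGISwapChain1 *swapchain = nullptr;
    // With D3D12, the "device" argument is actually the command queue.
    factory->CreateSwapChainForHwnd(queue, window, &desc, nullptr, nullptr, &swapchain);
    return swapchain;
}

// wglSwapBuffers() then boils down to flipping the buffers:
//   swapchain->Present(0 /* interval: 0 disables V-Sync throttling */, 0);
```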
One final caveat: the application will suffer a performance hit when drawing to the front buffer (glDrawBuffer(GL_FRONT)). The buffer-flip presentation model rules out that possibility, as the front buffer needs to stay intact for scanout. If that happens, we have no choice but to create a fake front buffer and to perform some copies before drawing and swapping buffers.
Resource state transition barriers require that we specify both the initial state of the resource and the desired new state. The naive approach is to add a pair of barriers around each resource usage (COMMON -> RENDER_TARGET, draw, RENDER_TARGET -> COMMON). But that solution has a real performance cost: each transition may involve layout changes, resolves, or even copies to shadow resources. So whilst this is an acceptable crutch for short-term development, the cost of always falling back to the lowest common denominator is too high for real-world usage. However, getting the details right for a better solution is tricky.
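For reference, the naive per-use transition described above looks roughly like this (a sketch using the D3D12 API directly; not what the final driver does):

```cpp
// Naive approach: transition in and out of COMMON around every resource use.
#include <d3d12.h>

void transition(ID3D12GraphicsCommandList *list, ID3D12Resource *res,
                D3D12_RESOURCE_STATES before, D3D12_RESOURCE_STATES after)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource = res;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = before;
    barrier.Transition.StateAfter = after;
    list->ResourceBarrier(1, &barrier);
}

// Around every draw touching a render target:
//   transition(list, rt, D3D12_RESOURCE_STATE_COMMON, D3D12_RESOURCE_STATE_RENDER_TARGET);
//   ... draw ...
//   transition(list, rt, D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_COMMON);
```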
Luckily for us, the Microsoft team had already worked on a solution to a very similar problem. They previously developed a project named D3D11on12 (yep, you guessed it, translating the DirectX 11 API to DirectX 12) which itself relies on the D3D12TranslationLayer. They were able to adapt some code from the latter into our Mesa tree, fixing the problems mentioned above.
It is not always optimal to create a new committed resource for each buffer. To speed up resource allocation, we don't immediately destroy unreferenced buffers; instead, we try to re-use them for new allocations. Allocating a whole resource for each small buffer is also inefficient because of alignment requirements, so we create buffer slabs on demand and sub-allocate smaller buffers from them. In the future, it might be possible to implement a similar scheme for textures, but for now this approach is strictly used for buffers.
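As an illustration, sub-allocation essentially hands out aligned ranges of a bigger buffer (a simplified sketch; the slab size, alignment handling, and structure names are made up):

```cpp
// Simplified sketch of slab sub-allocation for small buffers.
#include <cstdint>
#include <d3d12.h>

struct slab {
    ID3D12Resource *buffer;    // one big committed resource shared by many buffers
    uint64_t        capacity;  // e.g. 1 MiB
    uint64_t        offset;    // bump pointer for the next sub-allocation
};

struct sub_alloc {
    ID3D12Resource *buffer;    // the shared slab resource
    uint64_t        offset;    // where this buffer lives inside the slab
};

// Returns false when the slab is full and a new one must be created on demand.
// `align` is assumed to be a power of two.
bool slab_alloc(slab &s, uint64_t size, uint64_t align, sub_alloc *out)
{
    const uint64_t start = (s.offset + align - 1) & ~(align - 1);
    if (start + size > s.capacity)
        return false;
    out->buffer = s.buffer;
    out->offset = start;
    s.offset = start + size;
    return true;
}
```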
TL;DR: All of these incremental changes add up to an amazing result: less CPU time wasted waiting on completions (pipelining), less CPU time wasted on overhead (batching), less CPU time wasted on redundant operations (caching), more efficient memory usage (sub-allocation), a zero-copy presentation pipeline (DXGI), and more efficient GPU hardware usage (explicit image states instead of COMMON).
For those who just came for the screenshot:
And some numbers I compiled for the Doom 3 timedemo benchmark. In this very specific scenario, on my system (mobile Intel CPU/GPU), the cumulative gain of our changes is around 40x!
| Step | FPS (relative) |
|---|---|
| Initial State | 1.0 |
| Command Batching | 4.2 |
| Dirty State & PSO Cache | 12.9 |
| DXGI Swapchain * | 14.8 |
| Resource State Manager | 24.6 |
| Buffer Caching/Suballoc | 42.5 |
* The improvement from switching to DXGI is small because resource creation requires taking a kernel lock.
Disclaimer: you might not be able to replicate these numbers if you use the main repository, as I disabled the debug layer and changed the descriptor heap sizes for my benchmark.
Here are some of the ideas we could consider going forward:
Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Erik Faye-Lund, Gert Wollny) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).
Comments (9)
theuserbl:
Jul 09, 2020 at 10:46 PM
There are ways on Linux to run Direct3D on top of Vulkan and OpenGL. And on Windows it is the other way around: OpenGL and OpenCL on top of DirectX 12.
And then there are existing graphics cards which support DirectX and OpenGL.
Why isn't it possible to support OpenGL and OpenCL with Mesa directly on Windows, without sitting on top of DirectX 12?
On Linux, Mesa also works directly. Why not on Windows?
Erik Faye-Lund:
Jul 10, 2020 at 07:04 AM
There are several reasons why it's beneficial to build this on top of Direct3D 12 on Windows rather than building native OpenGL GPU drivers for Windows:
1. Existing drivers: The existing drivers in Mesa depend on a lot of non-Windows infrastructure (like DRM/DRI). Porting every driver over to support WGL natively would, among other things, require the introduction of a new Windows kernel driver per GPU. This is obviously a lot of work.
2. New drivers: Future GPUs need future work. With work like Zink in place for Linux, there's a reasonable chance some future GPUs won't get full native OpenGL support in Mesa. That would mean more work to enable support for new GPUs, including writing new kernel drivers. Combine this with the fact that OpenGL is unlikely to see much further development, and this isn't a very attractive value proposition.
3. Ecosystem: Microsoft's graphics ecosystem is based on D3D12. This is the API they support, document, ask their vendors to implement, and certify implementations of. It's only natural for them to build on that rather than starting over and having to certify drivers for a new API.
Just to be clear, there's nothing forcing users who have a GPU with a native OpenGL implementation to use the D3D12-based one. This simply adds an option to run OpenGL applications for those who don't.
I hope this clears things up a bit.
lostpixel:
Jul 17, 2020 at 10:29 PM
I have some questions:
1. OpenGL drivers often come with a certain penalty; some features are disabled in the "game" versions, and full/unlocked hardware OpenGL implementations usually come in the "pro" versions (like GeForce vs Quadro). Do you think that OpenGL over DX12 can give more "hardware" acceleration than what the vendor's driver implements? Is the DX12 driver crippled in the same way, or is it possible to implement OpenGL in shaders or some DX functionality with as much hardware acceleration as possible?
2. Some optimizations: the 90's cruft, as you call it, is actually a really nice conceptual model ('old-school' immediate mode). Unfortunately, it is also very inefficient. I am talking about glBegin() & glEnd() and friends. A software implementation on top of DX12 could implement glVertex* & co. as inlined functions, macros, whatever does not result in a myriad of function calls, stuff the data into some hidden VBO and render everything as 'retained' behind the user's back. Are there any such plans, if that is possible?
3. If used on platforms without DX (i.e. Linux), will it be possible to "pass through" GL calls to the vendor driver if one is available? Or maybe to a VK implementation?
Daniel Stone:
Jul 20, 2020 at 10:14 AM
1) Interesting question but I expect the answer is no. If vendors want to differentiate on price point, then I expect they'd do that uniformly across all their drivers. The only hypothetical exception is if a vendor wants to differentiate on feature enablement when the hardware support is really uniform, _but_ DirectX requires certain features and OpenGL makes them optional. In that case, GLon12 could hypothetically unlock more features. In many cases though, even if the hardware is uniform, the difference is QA: for example, if you have a consumer and a workstation product which do have a uniform hardware base, often the workstation dies are the ones which passed QA everywhere, but the consumer version failed for some part of it, so those bits of the hardware are masked off as being broken. I don't think this is hugely likely.
2) This is exactly what we already do! :) Mesa implements support for immediate mode by recording the state into caches, recording the user data into buffers (e.g. vertices into VBOs), etc, so at the end our driver does execute exactly like a modern DX12 client, with all the immediate-mode stuff left as an upper-level implementation detail. This is true of all modern Mesa drivers, as no-one has immediate-mode support anymore: if you run Mesa's AMD or Intel (or Arm Mali, Qualcomm Snapdragon, etc) drivers on Linux, you'll get the exact same thing.
3) Yeah, we have some prior art for that as well, in different drivers. VirGL is a virtualised GL-on-GL driver (used in ChromeOS amongst others), and Zink is a GL-on-Vulkan driver which shares a lot of code and concepts with GLon12.
MATHEUS EDUARDO GARBELINI:
Sep 14, 2020 at 08:34 AM
Is there some guide/tutorial on how to try this on WSL2, or is building and installing Mesa from the referenced repository enough?
Erik Faye-Lund:
Sep 14, 2020 at 03:29 PM
WSL2 support is still not implemented. It should be coming up soon, though.
Michael L:
Nov 15, 2020 at 04:50 AM
Pretty sure thousands right now are putting their faith in this for the new Radeon release. I guess everyone expects OpenGL performance with emulators and Minecraft to still remain... questionable. I've been linked to this page many times now.
Of course, you guys are doing god's work, though I think people wonder when this will fully come to fruition.
Daniel Stone:
Nov 15, 2020 at 02:31 PM
There’s been a lot of development to improve performance and there’ll be more still. I don’t think our work will ever be done as such, but it’s now well past the proof-of-concept stage and into something you can actually use.
MATHEUS EDUARDO GARBELINI:
Nov 16, 2020 at 07:04 AM
This is really good, as this is crucial to many applications that rely on OpenGL 3 functionality and previously would only run under a complete virtual machine.