Faith Ekstrand
September 07, 2022
3D rendering APIs such as OpenGL, D3D, and Vulkan involve a lot of state to drive the 3D pipeline. Even though most of the heavy lifting these days is done by programmable shaders, there are still many fixed-function pieces used to glue those shaders together. This includes things such as fetching vertex data and loading it into the vertex shader at the start of the pipeline, viewport transforms and clipping that sit between the end of the geometry pipeline and rasterization, and depth/stencil testing and color blending that happen at the end of the pipeline before writing the final image to the output buffers. Each of these fixed-function pieces is configurable and so has some amount of state associated with it.
In OpenGL, the 3D rendering pipeline is modeled as one giant blob of state where everything is re-configurable at any time. It's left to the driver to track state changes and re-configure the hardware as needed. With Vulkan, we improved this situation quite a bit by baking much of the state into immutable objects. Images and samplers, for instance, have all their parameters provided at the time the image or sampler is created and they are immutable from then on. (The color or depth/stencil data pointed to by an image is mutable but the core parameters such as width, height, number of miplevels, etc. are not.) The only state mutability with respect to these objects is the ability to change which images/samplers are bound at any given time. Compiled shaders, along with the state for fixed function pieces such as depth/stencil testing, are all rolled up into a single monolithic pipeline object. Because fully monolithic pipeline objects can be cumbersome, Vulkan also provides the option to make some of that state dynamic, meaning that you set it manually via a vkCmdSet*()
command instead of baking it into the pipeline. This allows the client to use the same pipeline object with, for instance, different blend constants.
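For example, a client that wants to reuse one pipeline with several different blend constants might mark that state dynamic at pipeline creation and then set it while recording commands. Here's a minimal sketch; the surrounding pipeline setup and the cmd and pipeline handles are assumed to already exist:

/* Mark the blend constants as dynamic when creating the pipeline. */
const VkDynamicState dynamic_states[] = {
   VK_DYNAMIC_STATE_BLEND_CONSTANTS,
};
const VkPipelineDynamicStateCreateInfo dynamic_info = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO,
   .dynamicStateCount = 1,
   .pDynamicStates = dynamic_states,
};
/* ... pointed to by VkGraphicsPipelineCreateInfo::pDynamicState ... */

/* At record time, the same pipeline can be used with different constants. */
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
vkCmdSetBlendConstants(cmd, (float[4]){ 1.0f, 0.0f, 0.0f, 1.0f });
vkCmdDraw(cmd, 3, 1, 0, 0);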
Even though the Vulkan model is simpler in many ways than the OpenGL model, it poses its own challenges for drivers, and the mess has only gotten worse over the years as various extensions have been released. The recently released VK_EXT_graphics_pipeline_library extension, while good for software developers, has provided its own unique set of challenges for implementers. Fortunately, most of the drivers in Mesa take approximately the same approach to solving these problems, and I recently began unifying some of that code to provide a common state tracking solution.
Before digging into where and why the Vulkan model often doesn't match real hardware, let's consider a theoretical GPU which matches Vulkan perfectly. This will help with understanding the mental model of Vulkan and why the API is designed the way it is. In the next section, we'll show where real GPUs differ from this theoretical model and how each of the assumptions we're about to make doesn't actually hold in the real world.
At the heart of most desktop GPUs is a command stream processor. This is a small processing unit, often with a custom architecture, which is responsible for managing state within the GPU. The Vulkan command buffer is a wrapper around a bit of GPU memory that contains commands to be executed by the command stream processor. The commands parsed by this unit fall into roughly four categories: commands that set state, draw and dispatch commands, copy/blit/clear (DMA) operations, and stall or cache-management commands used for synchronization.
Setting state is often done using MMIO register writes from the command stream processor. (Intel GPUs are a bit weird here in that they have lots of custom packets for things.) In our ideal world, each bit of state lives in its own register and can be set independently of any other state. We'll also assume that the DMA engine is capable of handling all copy, blit, and clear operations in the Vulkan API. (If you have experience working on actual GPU drivers, you're laughing hysterically right now.)
On this idealized GPU, implementing Vulkan is fairly straightforward. Each vkCmd*()
API call writes a series of commands into the command buffer for processing by the command stream processor. The vkCmdDraw*()
and vkCmdDispatch*()
calls turn into hardware draw or dispatch commands. Often this looks like writing the draw or dispatch parameter registers (number of vertices and instances, etc.) and then writing a final register to kick off the actual draw. The draw is done entirely based on whatever state is set at the time that the kick-off register is written. If the hardware wishes to pipeline things and have multiple draws going at a time (every competent GPU does), it handles buffering of the state internally so that, as soon as the kick-off register is written, you can begin modifying state for the next draw.
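On such hardware, a driver's vkCmdDraw() could boil down to a handful of register writes, roughly like this sketch; the command-stream helpers, register names, and struct types are all invented for illustration:

/* Hypothetical command-stream interface: emit a "write this value to this
 * register" packet into the command buffer. */
static void
cmd_emit_reg_write(struct cmd_buffer *cmd, uint32_t reg, uint32_t value)
{
   uint32_t *p = cmd_buffer_alloc_dwords(cmd, 3);
   p[0] = CS_OP_REG_WRITE;
   p[1] = reg;
   p[2] = value;
}

void
driver_CmdDraw(struct cmd_buffer *cmd, uint32_t vertex_count,
               uint32_t instance_count, uint32_t first_vertex,
               uint32_t first_instance)
{
   /* Write the draw parameters into their (made-up) registers... */
   cmd_emit_reg_write(cmd, REG_DRAW_VERTEX_COUNT,   vertex_count);
   cmd_emit_reg_write(cmd, REG_DRAW_INSTANCE_COUNT, instance_count);
   cmd_emit_reg_write(cmd, REG_DRAW_FIRST_VERTEX,   first_vertex);
   cmd_emit_reg_write(cmd, REG_DRAW_FIRST_INSTANCE, first_instance);

   /* ...then kick off the draw.  It consumes whatever state is currently
    * set; the hardware buffers that state internally so the driver can
    * immediately begin setting state for the next draw. */
   cmd_emit_reg_write(cmd, REG_DRAW_KICK, 1);
}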
For state management, each vkCmdSet*()
command which sets some bit of state maps to a set of MMIO writes that put those values into registers. For example, vkCmdSetScissor()
would write 4 values into registers to describe the x/y dimensions of the two corners of the scissor box. Depending on the hardware, it might be two corners or an offset and size. It may be four 32-bit registers with one value in each register or two 32-bit registers with each value first converted to 16 bits and then packed into the registers in pairs. These sorts of hardware-specific translations are why we have drivers. Each of the other vkCmdSet*()
commands would look similar, translating the API values into hardware values and writing them to the appropriate registers.
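As a sketch of that translation, here's what a scissor setter might look like on hypothetical hardware that packs each corner's 16-bit x/y pair into one 32-bit register, reusing the invented cmd_emit_reg_write() helper and made-up register names from the draw sketch above:

void
driver_CmdSetScissor(struct cmd_buffer *cmd, uint32_t first_scissor,
                     uint32_t scissor_count, const VkRect2D *scissors)
{
   /* Assume a single-scissor GPU for simplicity. */
   const VkRect2D *s = &scissors[0];

   const uint32_t xmin = s->offset.x;
   const uint32_t ymin = s->offset.y;
   const uint32_t xmax = s->offset.x + s->extent.width - 1;
   const uint32_t ymax = s->offset.y + s->extent.height - 1;

   /* Translate API values to hardware form: each corner's 16-bit x/y pair
    * packed into a single 32-bit register. */
   cmd_emit_reg_write(cmd, REG_SCISSOR_MIN, (ymin << 16) | (xmin & 0xffff));
   cmd_emit_reg_write(cmd, REG_SCISSOR_MAX, (ymax << 16) | (xmax & 0xffff));
}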
Pipeline objects are sort of like subroutines or macros for the command stream processor. Instead of being a single command or even a small handful of commands, each pipeline contains many commands. For each state that's provided in the pipeline and not dynamic, the pipeline contains the commands required to set that state. These are identical to what the relevant vkCmdSet*()
command would emit except that they're baked into the pipeline and get set as part of the single vkCmdBindPipeline()
call. This is a convenience for developers and avoids the API overhead of dozens of little vkCmdSet*()
calls. In addition to API-visible state, there is some amount of state associated with each programmable shader stage. There is always a pointer to the compiled shader program but there are often other things which look like state to the GPU but are derived from compiling the shader. For instance, the hardware may need additional information about the number of registers used by the compiled program or whether or not the fragment shader does a discard. All of this is baked into the pipeline object. For the sake of our example, we'll assume the pipeline object has a bit of memory in it containing these commands and vkCmdBindPipeline()
does a memcpy()
to copy the pipeline commands into the command buffer. This should be faster than doing all the API translation of every state every single time.
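On our imaginary GPU, the bind could be as simple as this sketch, again with invented helpers and struct fields:

void
driver_CmdBindPipeline(struct cmd_buffer *cmd, struct pipeline *pipeline)
{
   /* The pipeline holds pre-baked command-stream packets covering all of
    * its non-dynamic state, the shader pointers, and any state derived
    * from shader compilation.  Binding is just a copy. */
   uint32_t *p = cmd_buffer_alloc_dwords(cmd, pipeline->cs_dword_count);
   memcpy(p, pipeline->cs_dwords, pipeline->cs_dword_count * sizeof(uint32_t));
}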
For our ideal case, we can assume each of the vkCmdCopy*()
and vkCmdClear*()
calls as well as vkCmdBlitImage()
maps nicely onto one of our DMA operations. Like draw and dispatch commands, these probably look like a set of MMIO writes to set up the parameters for the copy, blit, or clear followed by a MMIO write which kicks off the DMA operation. Importantly, we get to assume that none of these operations share any state with the 3D pipeline. This is almost never true in the real world, as we'll see shortly, but it's a fun assumption for our land of make-believe.
Finally, pipeline barriers and events are implemented in terms of stalling and cache-management commands. A typical example, transitioning an image from being used as a render target to being used as a texture, would involve a stall to ensure the rendering is done before any texturing occurs, as well as perhaps flushing the image out of the render cache and invalidating it in the sampler cache.
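Such a barrier might then be emitted roughly like so, still using invented register and flag names:

void
driver_emit_rt_to_texture_barrier(struct cmd_buffer *cmd)
{
   /* Stall until all outstanding rendering is complete so nothing samples
    * from a half-written image. */
   cmd_emit_reg_write(cmd, REG_CS_STALL, CS_STALL_RENDER_DONE);

   /* Flush the render cache so results land in memory, then invalidate the
    * sampler cache so texturing sees the fresh data. */
   cmd_emit_reg_write(cmd, REG_CACHE_CTRL,
                      CACHE_FLUSH_RENDER | CACHE_INVALIDATE_SAMPLER);
}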
For our theoretical GPU above, every vkCmd*()
operation maps to one or more MMIO writes from the command stream processor. Because we got to assume that API states map nicely to hardware states, everything is 100% independent and we don't have to do any state tracking whatsoever inside the driver. Sadly, the real world is not so nice. There are five major problems that need to be solved.
If you've been hanging around the graphics industry for very long, even as just an avid gamer who reads too much internet, you've likely heard of the woes of state-based shader re-compiles. This is because the process of compiling a shader from the source format (such as DXBC, DXIL, GLSL, or SPIR-V) to the final binary that gets executed by the GPU is often affected by some of the state. For instance, Intel needs to know whether the framebuffer is multisampled or not and whether the fragment shader will be executed per-sample. These things affect which shader instructions are valid and where certain pieces of data appear in the register file at the start of each thread execution. AMD hardware needs to know the color format of the render targets so it can use the right render target write messages to get good performance. Mobile hardware is famous for implementing color blending as part of the fragment shader.
In OpenGL, the only way to handle this is for the driver to guess how various state bits will be set and compile assuming those guesses. At draw time, it checks and, if it guessed wrong, it has to re-compile. This can result in what's called "hitching" where the game hangs for a few milliseconds while it waits on a shader to compile. Vulkan solves (Hah! Nice try...) this problem by baking all the state required to compile shaders into the pipeline object. While successful at making sure shaders only compile inside vkCreate*Pipelines()
and not at draw time, this approach has other serious limitations but those are unrelated to my recent Mesa state tracking work and a topic for a different blog post.
One big assumption we made for our theoretical GPU is that the various hardware states written via MMIO registers are independent and map nicely to the API. In reality, states are often packed together, with more than one state in a single register or command. We gave one example of this above with scissors, where two of the four scissor values may be packed into a single 32-bit register. That case doesn't actually cause any problems, though, because the API always provides whole scissors, never just the X offset by itself.
Packing states together becomes a problem when multiple, mostly unrelated, states are in the same command or MMIO register with no ability to set them separately. Intel GPUs are especially bad about this as they often have upwards of a dozen different states set by a single command. Take, for instance, Intel's 3DSTATE_WM_DEPTH_STENCIL
packet which contains all of the state for the depth and stencil tests. Some of this state may come from the pipeline and some of it may come from dynamic state via vkCmdSetStencilReference()
or similar. In order to handle all this, the driver records the depth/stencil state in a CPU data structure attached to the command buffer in each of the relevant vkCmdSet*()
commands and emits the actual 3DSTATE_WM_DEPTH_STENCIL
packet as part of vkCmdDraw*()
if any depth/stencil state has changed. A more extreme example is Mali Bifrost hardware, where the majority of the state required by the GPU is packed into a single RENDER_STATE structure.
Worse than combined states are cases where multiple API states are required to compute a state that then gets put into the hardware. One example of this is on Mali where there's a bit specifying when to do early versus late depth/stencil testing. This is a combination of information from the fragment shader (whether or not it writes gl_FragDepth
or uses discard
), render targets (whether depth or stencil are used), and the depth and stencil test state. Another example is the multisampling rasterization mode on Intel which is dependent on the primitive type (triangles vs. lines vs. points) and the line mode if lines are being rendered. The solution for computed states is the same as combined states: they have to be stashed CPU-side and the actual state calculation delayed until draw time.
While some GPUs, such as those from AMD and NVIDIA, manage all the state internally as described above, others, such as early Intel GPUs and Arm's Mali GPUs, push most of the state management off onto the driver through heavy use of indirect state. Indirect state is where, instead of setting the state directly in a register and having the GPU manage state buffering, the state lives in an in-memory data structure and a pointer to that state is written to a register. Those indirect data structures may also contain pointers to other data structures, as needed. Modern Intel GPUs use a mixture of direct and indirect state, with indirect state used for bigger things like viewports, scissors, and color blend state.
Using indirect state reduces the overall amount of state the hardware has to manage and buffer internally, since those pointers are expected to remain valid for the entire duration of any draws which reference them. For a tiled architecture like Mali, this is especially important because the hardware needs to maintain access to all of the state ever used in an entire render pass so that it can run the render pass per-tile. Doing this internally to the GPU would require a lot of internal memory and would require flushing the render pass whenever that memory ran out. Using indirect state lets it all be managed by the userspace driver, which has access to all the memory in the system. Indirect state also makes it possible to re-use identical state objects, reducing memory consumption and maybe even improving caching.
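As a rough sketch of the indirect pattern, a driver might write viewport state into a chunk of GPU-visible memory and only put a pointer to it in the command stream; the allocation helper, state struct, pointer-emit helper, and register name below are invented:

void
driver_emit_viewport_state(struct cmd_buffer *cmd,
                           const VkViewport *viewports, uint32_t count)
{
   /* Allocate GPU-visible state memory that must stay valid for every draw
    * (and, on a tiler, every tile pass) that references it. */
   struct state_alloc vp =
      cmd_buffer_alloc_state(cmd, count * 4 * sizeof(float), 32);

   float *f = vp.cpu_map;
   for (uint32_t i = 0; i < count; i++) {
      f[i * 4 + 0] = viewports[i].x;
      f[i * 4 + 1] = viewports[i].y;
      f[i * 4 + 2] = viewports[i].width;
      f[i * 4 + 3] = viewports[i].height;
   }

   /* Only a pointer to the state goes into the command stream; the state
    * itself lives in memory managed by the userspace driver. */
   cmd_emit_state_pointer(cmd, REG_VIEWPORT_STATE_POINTER, vp.gpu_address);
}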
On the driver side, indirect state poses many of the same problems as the combined states mentioned above, only it's worse because suddenly everything needs to be delayed until draw time. In theory, if your indirect state data structures are broken up enough, Vulkan pipeline objects can help a bit here as they give the opportunity to pre-bake certain states. However, anything which is affected by dynamic (not baked into the pipeline) state will need CPU management.
When talking about our ideal GPU, we made the simplifying assumption that the DMA engine could be used for all copy, blit, and clear commands. I don't know of a single GPU where this is actually true. There is almost always some case where you need to use the 3D engine or a compute shader to implement the blit. Multisample resolves and vkCmdBlitImage()
almost always have to be implemented using 3D or compute and some GPUs have no dedicated DMA hardware at all. In the case of older Intel GPUs, the hardware has a blit engine, but it has its own command stream processor and thus can't be used in the same command buffer as 3D and compute.
Why does using the 3D hardware for DMA operations pose a state tracking problem? Because doing so requires binding different shaders, setting different state, and maybe even changing render targets. Once the DMA operation is complete, all that state needs to be restored back to what the client set before continuing to render. While it may be possible with some GPUs to do a save and restore of various registers directly on the command stream processor, it's often easier to just keep a CPU copy of everything and save and restore that.
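In practice that tends to look like a save/restore wrapped around the meta operation, roughly like this sketch, reusing the hypothetical driver_cmd_state from the earlier example (the blit itself is elided):

void
driver_meta_blit(struct cmd_buffer *cmd, const struct blit_params *params)
{
   /* Save a CPU copy of everything the blit is about to clobber. */
   struct driver_cmd_state saved = cmd->state;

   /* Bind the internal blit shaders/state and draw a full-screen rect. */
   driver_bind_blit_pipeline(cmd, params);
   driver_emit_blit_draw(cmd, params);

   /* Restore the client's state and mark it all dirty so it gets
    * re-emitted before the next application draw. */
   cmd->state = saved;
   cmd->state.dirty = ALL_STATE_DIRTY;
}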
The way Vulkan pipeline objects are designed, there is some state which must always be provided when creating the pipeline object because it may be required for shader compilation. For possibly dynamic state, the client can choose whether the state gets baked into the pipeline or if they would rather manage it themselves. If the client requests that a particular piece of state be dynamic, it will not be changed by vkCmdBindPipeline()
and the client must instead set it via the relevant vkCmdSet*()
command.
In our ideal GPU where every state is independent, this isn't a big deal. All you have to do is conditionally emit the MMIO write in the pipeline based on whether or not the state is dynamic. In a real GPU where state isn't fully independent, you may have to combine pipeline state with dynamic state at draw time. The fact that the client controls which state is dynamic and which gets baked into the pipeline makes this even more complicated. The approach taken by most Mesa drivers today is to treat any state which may be dynamic as always being dynamic and to make vkCmdBindPipeline()
implicitly do a bunch of vkCmdSet*()
for any pipeline state which the driver allows to be dynamic. (Note that we can actually do this a bit more efficiently inside the driver than actually calling the driver's vkCmdSet*()
entrypoints.)
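Conceptually, the bind on real hardware then looks more like this sketch than the simple memcpy() above, with the state struct and dirty handling standing in for whatever each driver actually uses:

void
driver_CmdBindPipeline(struct cmd_buffer *cmd, struct pipeline *pipeline)
{
   /* Shaders and truly static state still come straight from the pipeline. */
   driver_emit_pipeline_shaders(cmd, pipeline);

   /* Any state the driver treats as dynamic is copied into the command
    * buffer's CPU-side state, exactly as if the client had called the
    * corresponding vkCmdSet*(), unless the client declared it dynamic. */
   if (!(pipeline->client_dynamic_mask & DYNAMIC_DEPTH_STENCIL)) {
      cmd->state.front = pipeline->front;
      cmd->state.back = pipeline->back;
      cmd->state.depth_test_enable = pipeline->depth_test_enable;
      cmd->state.dirty |= DEPTH_STENCIL_DIRTY;
   }
   /* ... and so on for every other possibly-dynamic state group ... */
}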
With the new VK_EXT_graphics_pipeline_library extension which shipped a few weeks ago, the pipeline state situation got more complex. The extension allows a client to create a pipeline that contains only a subset of the shaders and state and then link those partial pipelines together to form a complete graphics pipeline. In order to ensure that drivers still have all the information they need to compile shaders, shaders and pipeline state are now split into four categories: vertex input, pre-rasterization shaders, fragment shader, and fragment output interface. A partial pipeline can contain any set of state groups but it must contain whole state groups. For instance, a client can't have a pipeline which is just a vertex shader and expect to link it with one that is just a geometry shader.
Pipeline linking poses many of the same state problems. In an ideal GPU, each partial pipeline would contain the MMIO writes for the state contained in that piece of the pipeline and linking would be a simple concatenation. However, thanks to combined and computed state, even if you can compile shaders independently, you may not be able to emit all the pipeline state until link time. To handle this, we need a data structure that can encapsulate all of the state provided when each of the partial pipelines was created so we can combine it all together and have a complete copy when it comes time to create the final pipeline.
Does all this sound exhausting? It should because it is! Worse, every driver in Mesa has to solve all of these problems. When we wrote the original Intel driver, we had a struct anv_dynamic_state
which encapsulated all of the Vulkan 1.0 dynamic state (there were only nine dynamic state groups at the time). One anv_dynamic_state
was embedded in the command buffer for tracking CPU copies of dynamic state and one was embedded in the pipeline to store pipeline state. In vkCmdBindPipeline()
, we would copy any states which had not been declared dynamic from the pipeline into the command buffer. Other drivers have copied+pasted the code from the Intel driver almost verbatim and updated it as more things have become dynamic.
To help drivers sort all this out, we recently landed a common Vulkan graphics state tracking framework in Mesa (MR, docs). This new framework contains a set of structs for gathering graphics pipeline state as well as managing dynamic graphics state.
The pipeline state collection is designed to make it as easy as possible to implement pipeline libraries. Accumulating all the graphics pipeline state, including handling any libraries, looks something like this:
/* Assuming we have a vk_graphics_pipeline_state in pipeline */
memset(&pipeline->state, 0, sizeof(pipeline->state));

const VkPipelineLibraryCreateInfoKHR *lib_info =
   vk_find_struct_const(pCreateInfo, PIPELINE_LIBRARY_CREATE_INFO_KHR);
if (lib_info != NULL) {
   for (uint32_t i = 0; i < lib_info->libraryCount; i++) {
      VK_FROM_HANDLE(drv_graphics_pipeline_library, lib,
                     lib_info->pLibraries[i]);
      vk_graphics_pipeline_state_merge(&pipeline->state, &lib->state);
   }
}

/* This assumes you have a void **state_mem in pipeline */
result = vk_graphics_pipeline_state_fill(&device->vk,
                                         &pipeline->state,
                                         pCreateInfo,
                                         NULL, NULL,
                                         pAllocator,
                                         VK_SYSTEM_ALLOCATION_SCOPE_OBJECT,
                                         &pipeline->state_mem);
if (result != VK_SUCCESS)
   return result;
The vk_graphics_pipeline_state_fill()
function populates a vk_graphics_pipeline_state
structure from a VkGraphicsPipelineCreateInfo
. It's aware of pipeline libraries and automatically checks for VK_PIPELINE_CREATE_LIBRARY_BIT_KHR
, figures out what bits of state it needs based on the VK_GRAPHICS_PIPELINE_LIBRARY_*
bits, provided shader stages, etc. and only fills out those portions. The logic to sort out exactly which bits are needed is quite tricky, especially in the presence of more and more dynamic state, so it's good to have it all centralized now. It's also important that we get it right because Vulkan allows clients to pass in garbage pointers under certain circumstances and we don't want to accidentally dereference anything we're not going to use.
The vk_graphics_pipeline_state_merge()
function accumulates state from a pipeline library. If a vk_graphics_pipeline_state
is already partially populated by vk_graphics_pipeline_state_merge()
, vk_graphics_pipeline_state_fill()
will only add those states which are currently missing, allowing for this accumulation pattern. It's still up to the individual driver to sort out any shader compiler issues but state collection and merging between the different state groups is now handled fairly automatically.
We also added shared dynamic state handling in the form of vk_dynamic_graphics_state
and related helpers. A vk_dynamic_graphics_state
is now embedded in every vk_command_buffer
and all of the vkCmdSet*()
state setter functions are now implemented in common code. In order to use the new common functionality, the driver must also embed a vk_dynamic_graphics_state
in its pipeline object and call vk_cmd_set_dynamic_graphics_state()
in its implementation of vkCmdBindPipeline()
. It also needs to be modified to pull dynamic state from the common vk_command_buffer::dynamic_graphics_state
instead of its own state struct. Once this is complete, the driver can delete all its vkCmdSet*()
entrypoints. For the Intel Vulkan driver, switching to the common dynamic state tracking dropped about 1000 lines of code from the driver.
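Wiring that up in a driver's bind-pipeline entrypoint might look roughly like this; the drv_* types and the names of the embedded fields are placeholders, while vk_cmd_set_dynamic_graphics_state() and vk_command_buffer::dynamic_graphics_state are the runtime pieces named above:

void
drv_CmdBindPipeline(VkCommandBuffer commandBuffer,
                    VkPipelineBindPoint pipelineBindPoint,
                    VkPipeline _pipeline)
{
   /* drv_cmd_buffer is assumed to embed a vk_command_buffer named "vk" and
    * drv_graphics_pipeline to store its baked-in state in a
    * vk_dynamic_graphics_state named "dynamic". */
   VK_FROM_HANDLE(drv_cmd_buffer, cmd, commandBuffer);
   VK_FROM_HANDLE(drv_graphics_pipeline, pipeline, _pipeline);

   if (pipelineBindPoint == VK_PIPELINE_BIND_POINT_GRAPHICS) {
      /* Copies every state the pipeline owns into
       * vk_command_buffer::dynamic_graphics_state, diffing against the
       * values already there and only dirtying what actually changed. */
      vk_cmd_set_dynamic_graphics_state(&cmd->vk, &pipeline->dynamic);
   }

   /* ... bind shaders and any other driver-specific pipeline state ... */
}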
Not only is the new code shared but it's also a bit smarter than the Intel driver code was. The common code checks if the client is setting the same value as was already set and doesn't dirty the state in that case. It does this both for normal vkCmdSet*()
entrypoints as well as vk_cmd_set_dynamic_graphics_state()
. This means that, for pipeline switches, we're automatically diffing the state in the pipelines and only re-emitting state that actually changes. Especially for drivers that frequently need to combine different states together in the same packet, this should substantially reduce redundant state emissions caused by pipeline switches.
Like most of the Vulkan runtime code in Mesa, the dynamic state tracking framework is optional. If a driver is targeting hardware where everything is neatly separated and redundant state is harmless, it can simply ignore it; the framework takes a bit of CPU memory per command buffer but otherwise lies dormant. For most drivers, though, that's not the case, and the shared code is a significant boon.
While not the most important part of the common Vulkan runtime in Mesa, this is a substantial quality of life improvement for Mesa driver developers. For any new Vulkan drivers added to the tree, this is 1000 lines of code they don't have to type and makes implementing pipeline libraries substantially easier. For existing drivers, it should eventually reduce their maintenance burden and make enabling new features faster once they switch to the new framework. Yet one more way we're trying to make Mesa the best place for 3D graphics driver development!
Comments (1)
David Neto:
Oct 01, 2022 at 04:46 PM
Bravo.
Great insight into how real drivers work. Thanks for sharing!