How to share code between Vulkan and Gallium

Faith Ekstrand
January 16, 2024

One of the key high-level challenges of building Mesa drivers these days is figuring out how to best share code between a Vulkan driver and a Gallium driver. In the old Gallium-only world, the answer was simple: Implement Gallium and you get all the APIs for free. However, because Gallium isn't really capable of implementing Vulkan, that answer no longer works.

I used these organizational principles when creating NVK, but they're broadly applicable to other drivers as well. Code organization has come up a few times recently when talking with other driver developers, so instead of answering every individual question, I thought it was worth a blog post.

Most of my examples will come from the NVIDIA and Intel drivers in Mesa. However, the advice should be roughly applicable to pretty much any driver stack in Mesa. These are also recommendations for getting a clean separation. I don't expect anyone to respond to this blog post by doing a massive refactor of an existing, mature driver. However, if you're having trouble with code sharing turning into a mess, this may give you a sense of direction to get it all sorted out.

Device info

This is where it all starts. It's also one of the first things I see people get wrong when they come from the Gallium world to Vulkan.

You need some sort of structure to share information about the hardware device between different components. Most driver stacks have a bunch of shared components like an image layout library, a compiler, etc. They all need to know information about the device such as the hardware generation, number of shader cores, etc. Instead of passing all of that information around all the time, it's easier to have it all wrapped up in a single struct that you can pass in whenever needed. As an example, here's what the nv_device_info struct currently looks like:

enum PACKED nv_device_type {
   NV_DEVICE_TYPE_IGP,
   NV_DEVICE_TYPE_DIS,
   NV_DEVICE_TYPE_SOC,
};

struct nv_device_info { 
   enum nv_device_type type;

   uint16_t device_id;
   uint16_t chipset;

   char device_name[64];
   char chipset_name[16];

   /* Populated if type == NV_DEVICE_TYPE_DIS */
   struct {
      uint16_t domain;
      uint8_t bus;
      uint8_t dev;
      uint8_t func;
      uint8_t revision_id;
   } pci;

   uint8_t sm; /**< Shader model */

   uint8_t gpc_count;
   uint16_t tpc_count;
   uint8_t mp_per_tpc;
   uint8_t max_warps_per_mp;

   uint16_t cls_copy;
   uint16_t cls_eng2d;
   uint16_t cls_eng3d;
   uint16_t cls_m2mf;
   uint16_t cls_compute;

   uint64_t vram_size_B;
};

The information here describes the various attributes of the device. This includes the type of device (discrete, integrated, or Arm SoC), which chipset, versions for each of the classes (compute, 3D, etc.), numbers of shader cores, etc. On Intel, we have a similar `intel_device_info` struct that contains the same sort of information, only using Intel parlance.

Importantly, all of this information is about the hardware. Nothing in here is about our kernel version or Mesa software features. Some of this information needs to be queried from the kernel (most of it, in Nouveau's case), but it fundamentally describes the hardware configuration. There may be a need to put kernel or firmware information in there, such as fixed address carve-outs, but one should be careful. One reason to avoid bloat is that different APIs likely use the kernel driver quite differently and care about different features, so it doesn't make sense to centralize. Another is that it's often useful to fill out this struct without any kernel driver at all, for things such as offline shader compilers or image layout unit tests. The less data you have to fake in this case, the better.
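
For example, an image layout unit test or an offline compiler might fill the struct by hand, with no kernel driver in sight. Here's a minimal sketch of what that could look like; the values are made up for illustration and don't describe any real chip:

/* Hand-filled device info for a unit test or offline compiler run.
 * No kernel driver is involved; every field is pure hardware description.
 * (The values below are made up for illustration.) */
static const struct nv_device_info fake_dev_info = {
   .type = NV_DEVICE_TYPE_DIS,
   .chipset = 0x172,
   .sm = 86,
   .gpc_count = 6,
   .tpc_count = 36,
   .mp_per_tpc = 2,
   .max_warps_per_mp = 48,
   .cls_eng3d = 0xc797,
   .cls_compute = 0xc7c0,
   .vram_size_B = 8ull * 1024 * 1024 * 1024,
};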

You can think of this struct as your driver's internal version of VkPhysicalDevice and it's used quite similarly throughout your driver code. In fact, in NVK, this is almost the entire physical device. The current version of nvk_physical_device looks like this:

struct nvk_physical_device {
   struct vk_physical_device vk;
   struct nv_device_info info;
   dev_t render_dev;
   dev_t primary_dev;
   struct wsi_device wsi_device;

   uint8_t device_uuid[VK_UUID_SIZE];

   VkMemoryHeap mem_heaps[2];
   VkMemoryType mem_types[2];
   uint8_t mem_heap_cnt;
   uint8_t mem_type_cnt;

   struct vk_sync_type syncobj_sync_type;
   const struct vk_sync_type *sync_types[2];
};

There's a bit of information in there to help map to Vulkan such as the memory type/heap stuff and sync types. However, most of what we need to support vkGetPhysicalDeviceFeatures() and vkGetPhysicalDeviceProperties() is just the nv_device_info. As the driver matures and we add kernel features for new Vulkan features, I'm sure we'll grow a few kernel bits in there but we don't need any today.

The mistake I see a lot of people make coming from the Gallium world is to confuse this with pipe_screen. A Gallium screen is a very different beast. In particular, a device info struct exists to tell your driver about the hardware while pipe_screen exists to tell Gallium about your driver. Though it's a subtle distinction, those aren't the same thing. Also, a Gallium screen is a stateful object, containing buffer and shader caches, whereas a device info struct is stateless and immutable once created. The state in a screen is necessary because of the insanity that is threaded OpenGL. Vulkan, on the other hand, handles threading quite well and neither needs nor wants that state being shared across devices. When trying to mentally map Gallium to Vulkan, a better model would be to map pipe_screen to VkDevice and map pipe_context to VkQueue.

Image layout

I could write a whole book on just this topic. For the sake of brevity, I won't go into all of the details about how to structure such a library. For now, I'll point you at the NIL and ISL code. There is also some quite good documentation for ISL which I wrote shortly before leaving Intel. It is by no means complete but gives a decent overview of the mental model employed by ISL. Some of that is hardware-specific but a lot of it applies more broadly.

For now, we'll focus on where such a library sits in the hierarchy of shared code components. The first thing to know is that you need such a library. I know you think you don't. You're wrong. ;p More seriously, the ability to have it broken out and write unit tests is pretty killer. Also, things like packing texture descriptors can get annoying and fiddly, and it's best to not duplicate that.

As with the device info struct discussed above, the image layout library should be about the hardware, not about the software needs of OpenGL or Vulkan. You may have software tricks that you need to play such as YCbCr emulation or shadow surfaces for certain cases. Those do not belong in your image layout library, at least not at the lower levels. The needs of Gallium and Vulkan are likely to be quite different there. Also, when doing any sort of emulation, it helps to have some separation between the description of a hardware image or surface and API-level image.

With NVK, even though I currently have no plans to write a Gallium driver, I wrote NIL as a separate library anyway for exactly that reason. When implementing emulated YCbCr, for instance, each VkImage can have up to three nil_image structures, one for each plane in the YCbCr image. We also need shadow surfaces for VK_FORMAT_D32_SFLOAT_S8_UINT images in order to work around the limitations of the DMA hardware. All of this is more straightforward when there is a clear separation between hardware capabilities and software emulation.
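
As a rough sketch (field and struct names here are illustrative, not NVK's exact layout), that separation ends up looking something like this: the API-level image owns one hardware image description per plane, plus whatever shadow surfaces the driver needs.

/* Sketch: an API-level image owning one hardware image description per
 * plane.  Names are illustrative, not NVK's exact layout. */
#define NVK_MAX_IMAGE_PLANES 3

struct nvk_image_plane {
   struct nil_image nil;   /* Hardware layout of this plane */
   uint64_t addr;          /* GPU address, filled in at bind time */
};

struct nvk_image {
   struct vk_image vk;     /* Common Vulkan runtime image */

   uint8_t plane_count;    /* 1 for most formats, up to 3 for YCbCr */
   struct nvk_image_plane planes[NVK_MAX_IMAGE_PLANES];

   /* Optional shadow surface, e.g. for D32_SFLOAT_S8_UINT copies */
   struct nvk_image_plane stencil_copy_temp;
};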

One thing that can be tricky about building an image library is that when you look at things from the perspective of an API driver, it looks like it needs to be aware of GPU memory, which is typically driver-managed. Even in Vulkan, a VkImage gets bound to memory and may possess a VA range for sparse binding. If we ignore vkBindImageMemory(), though, Vulkan presents us with a decent model to follow. Before you bind memory, a VkImage is just an immutable description of what a surface would look like, were it placed somewhere in memory. That's what you want for a shared image layout library. You want a description of what the image would look like. The library shouldn't know anything about memory until you go to fill an image view descriptor, at which point the driver can pass it a 64-bit GPU address.

Keeping the image description separate from memory has other benefits as well. One is that you may want to describe the image in multiple ways. For instance, in NIL (the NVIDIA image layout library), we have helpers to return a single LOD of a 3D image as a 2D array image. This makes it easy to implement the EXT_image_2d_view_of_3d extension. When the client requests a 2D array image view of a 3D image, we ask NIL to give us a nil_image describing the requested LOD as a 2D array and pass that to nil_image_fill_tic() to fill out the descriptor. This keeps the descriptor code simple (it doesn't have to know about such transforms) while also keeping the NVK code reasonably simple. We do a similar trick when binding 3D images as storage images (the hardware doesn't like 3D storage images) and for creating views of block-compressed images with an uncompressed format.
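
In code, that flow might look roughly like the sketch below. Apart from nil_image and nil_image_fill_tic, which are mentioned above, the helper and struct names and their signatures are illustrative:

/* Sketch: filling a texture descriptor for a 2D-array view of one LOD of
 * a 3D image.  Helper names and signatures are approximate. */
static void
fill_tic_for_2d_view_of_3d(const struct nv_device_info *dev,
                           const struct nil_image *img_3d,
                           const struct nil_view *view,
                           uint64_t img_addr, void *desc_out)
{
   /* Re-describe the requested LOD of the 3D image as a 2D array image
    * and adjust the image address to point at that LOD. */
   struct nil_image lvl_img = *img_3d;
   nil_image_3d_level_as_2d_array(&lvl_img, view->base_level, &img_addr);

   /* The descriptor packing code only ever sees a plain 2D array image. */
   nil_image_fill_tic(dev, &lvl_img, view, img_addr, desc_out);
}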

Overall, a well-constructed image layout library, with useful helpers for things like image re-description, makes writing the driver and getting all the corner cases correct way easier. The fact that you can share it between Vulkan and Gallium is an added bonus.

Compiler

I think most people realize that you want to share the compiler between GL and Vulkan. It's a huge pile of code and 99% of it has nothing to do with the driver. The real question is how. How do you abstract things such that GL and Vulkan can efficiently use the same back-end? While it's impossible to answer that question in general—there are just too many hardware details—I'll try to at least provide some guidelines.

The tricky part here is resource binding. Most of the rest of the compiler doesn't really care what API it's being used for. There is no difference between how arithmetic operations or even more complex things like subgroup operations work between OpenGL and Vulkan (unless you're Intel). Resource binding, however, is where the API and the driver end up being fairly tightly coupled. As a general rule, my recommendation is to make the back-end compiler consume texture ops and intrinsics that match the hardware binding model as closely as possible and have NIR lowering passes in each driver which lower to that model.

For example, a lot of hardware lacks native SSBO support and just loads and stores directly from/to 64-bit addresses. Instead of making the back-end try to support NIR's load/store_ssbo intrinsics, make the back-end handle load/store_global and have lowering passes inside each driver which fetch the base address and size of the SSBO from somewhere, do the requisite arithmetic, and then load/store from/to the computed address. This lets the compiler be entirely ignorant of anything having to do with descriptor sets or with pushing addresses into shaders for GL. Instead, it just has to know how to do load/store_global.
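
To make that concrete, here's a rough sketch of what such a driver-side lowering pass could look like for loads. The load_ssbo_base_addr helper is hypothetical and is exactly the driver-specific part: a Vulkan driver would pull the address out of a descriptor set while a GL driver would read a magic uniform. Exact NIR builder helper names and signatures vary between Mesa versions, so treat this as illustrative:

#include "nir.h"
#include "nir_builder.h"

/* Hypothetical, driver-specific helper: return the 64-bit base address of
 * the SSBO with the given index, however the driver chooses to provide it. */
static nir_def *
load_ssbo_base_addr(nir_builder *b, nir_def *index, void *cb_data);

static bool
lower_load_ssbo(nir_builder *b, nir_instr *instr, void *cb_data)
{
   if (instr->type != nir_instr_type_intrinsic)
      return false;

   nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
   if (intrin->intrinsic != nir_intrinsic_load_ssbo)
      return false;

   b->cursor = nir_before_instr(instr);

   /* For load_ssbo, src[0] is the buffer index and src[1] is the offset */
   nir_def *base = load_ssbo_base_addr(b, intrin->src[0].ssa, cb_data);
   nir_def *addr = nir_iadd(b, base, nir_u2u64(b, intrin->src[1].ssa));

   /* Replace the SSBO load with a load from the computed 64-bit address */
   nir_def *val = nir_load_global(b, addr, nir_intrinsic_align(intrin),
                                  intrin->def.num_components,
                                  intrin->def.bit_size);
   nir_def_rewrite_uses(&intrin->def, val);
   nir_instr_remove(instr);

   return true;
}

/* store_ssbo and SSBO atomics would be lowered the same way. */
bool
xyz_nir_lower_ssbo(nir_shader *shader, void *cb_data)
{
   return nir_shader_instructions_pass(shader, lower_load_ssbo,
                                       nir_metadata_block_index |
                                       nir_metadata_dominance,
                                       cb_data);
}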

The harder choices come when dealing with things that have fixed hardware bindings. For instance, Mali v9+ hardware has hardware descriptor sets which get bound to fixed slots in the hardware. In the shader, they're accessed by set number and a table index. In Vulkan, we likely want hardware descriptor sets to map to API descriptor sets. In GL, we may want a descriptor set per stage or something like that. In the compiler, though, there's no clear mapping in NIR today. The back-end compiler doesn't want to care. Ideally, we would get the set and index straight from the NIR texture/image instruction and pass that through to the hardware instruction. We haven't found the ultimate solution there yet (panvk for Valhall and later is still a work in progress) but it will probably look something like that.

Another example of this is Intel hardware, which supports both a bindless model and hardware binding tables. The bindless model uses a 20-bit index into the bindless surface heap. The binding table model uses an index into a 240-entry per-stage binding table. The original GL driver provided the layout of this table as an output of the compiler and assumed the GL binding limits (128 textures, 32 images, 32 samplers, etc.). Resource indices were assumed to be relative to the start of the texture, image, or sampler section, not the start of the table. When we brought up Vulkan, this caused a lot of headaches because it doesn't really map to the Vulkan binding model at all. Now, with the Iris Gallium driver and the modern Vulkan driver, both drivers compute the binding table layout up front and pass complete, absolute table indices to the back-end compiler. This is closer to what the hardware expects and removes the need for the back-end to compute a table layout which Vulkan doesn't want anyway.

This brings me to my second recommendation: Limit the amount of side-band data returned from the back-end compiler. In an ideal world, you would pass NIR into the back-end and get a shader binary out and that would be all there is to it. We don't live in that world, unfortunately. There's almost always something that you need to get out, like the number of registers used or information that affects whether or not you can use an early depth test. However, this information should be kept to a minimum.

In a GL driver, it's really tempting to make the back-end compiler produce all sorts of extra stuff. GL is all magic when it comes to the shader compile flow, after all. Shaders can just use whatever built-ins and resources they want and the driver has to adapt. Often, this adaptation involves a bunch of extra magic uniforms that may or may not exist depending on what the shader uses. Many GL drivers end up producing a table of such uniforms as an output of the back-end compiler and expecting the driver to fill it out. The problem is that these values often make no sense whatsoever to a Vulkan driver or at least the needs of Vulkan are very different. For instance, you may want to push SSBO base addresses as magic uniform values in GL but they need to come from the descriptor set in Vulkan. If the back-end compiler is to be used by both GL and Vulkan, then anything which isn't truly universal needs to be lowered by the driver to something API-agnostic before going into the back-end.

This brings us to the third point: compiler flow. How you structure the flow of compiler passes from SPIR-V parsing to the back-end compiler is important. A good structure makes implementing these lowering passes easier, or at least makes it more obvious how they should work. In the Intel driver stack, as well as the stack I'm building for NVK, we have roughly four steps (a sketch of how they fit together follows the list):

  1. Pre-process: This step is entirely API/driver-agnostic. We take the NIR from either GLSL (for GL) or SPIR-V and run a set of optimizations and lowering passes. The objective of this stage is to optimize and lower as much as we can without knowing anything about the API driver. We do know about the hardware at that stage, though, so we can know how to lower things like texture projectors or unsupported ALU instructions. We just can't lower anything that requires API/driver knowledge.
  2. Linking (optional): After initial lowering and optimization, it may be desirable to link shaders across stages. Some hardware has combined shaders where, for instance, you may run vertex and geometry shaders as a single shader when both are present. Even for hardware which has fully separate shaders, such as Intel or NVIDIA, cross stage linking lets you run cross-stage dead code elimination which may substantially reduce the complexity of earlier shader stages when some of their outputs are never used. We also have NIR passes for cross-stage constant folding which lets you propagate constants that are written directly to shader outputs in an earlier stage into a later stage. In Vulkan, this should be done by the Vulkan driver but may use helpers provided by the compiler component. In a Gallium driver, this is done by Gallium itself.
  3. Lowering: This step happens inside the driver and is mostly unknown to the main compiler stack. In the Vulkan driver, this is where we lower input attachments, YCbCr conversions, descriptor sets, etc. In the Gallium driver, this is where we build binding tables and the magic uniform table.
  4. Finalize: This is the final step before we go into the back-end compiler. Again, this step is API/driver-agnostic. By this point, we should have turned any API- or driver-specific idioms into generic or HW-specific idioms that the back-end understands. This step is focused on the very last bits of lowering and optimization. For instance, this is where we lower 64-bit integer arithmetic, because the Vulkan descriptor set lowering may produce 64-bit pointer math and we want to optimize that a bit before lowering to 32-bit instructions. We also apply shader keys at this point. The sizes of our shader keys have shrunk a lot over the years, but we still have a few things which need to get lowered based on API state but don't make sense to lower in the driver-specific lowering step.
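
Put together, the flow for a Vulkan driver ends up looking roughly like the sketch below. Every xyz_* name is made up; real drivers have their own entrypoints, and a Gallium driver would let Gallium do the linking and substitute its own lowering (binding tables, magic uniforms) for step 3.

/* Sketch of how the four steps fit together in a Vulkan driver.  All of
 * the xyz_* functions and types are hypothetical. */
static VkResult
xyz_compile_shader(struct xyz_physical_device *pdev,
                   struct vk_pipeline_layout *layout,
                   const struct xyz_shader_key *key,
                   nir_shader *nir,
                   struct xyz_shader_bin **bin_out)
{
   /* 1. Pre-process: API- and driver-agnostic optimization and lowering.
    *    The result of this step is the natural thing to cache. */
   xyz_preprocess_nir(&pdev->info, nir);

   /* 2. Cross-stage linking (optional) would happen here, once all the
    *    stages in the pipeline have been pre-processed. */

   /* 3. Driver-specific lowering: descriptor sets, YCbCr conversions,
    *    input attachments, etc. */
   xyz_nir_lower_descriptors(nir, layout);
   xyz_nir_lower_input_attachments(nir);

   /* 4. Finalize: last API-agnostic lowering (e.g. the 64-bit pointer math
    *    produced by descriptor lowering) plus the shader key, then hand
    *    the NIR to the back-end compiler. */
   xyz_finalize_nir(&pdev->info, key, nir);

   return xyz_backend_compile(&pdev->info, nir, bin_out);
}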

Not only does this separation help sort out GL vs. Vulkan driver differences but it has a secondary benefit as well. Because the pre-processing step knows little to nothing about the driver, right after the pre-processing step is a natural place to cache shaders. If you use the same SPIR-V and same specialization constants and just change descriptor set layouts or API bits, the Vulkan driver can cache all of that early optimization work and only do it once.

Conclusion

As with any engineering problem, the devil is in the details. Every vendor's hardware is a bit different and has a different set of problems to solve. How exactly to structure any particular driver is up to the engineers developing it and will depend on the hardware being targeted.

What I've presented here is an overall framework to help think about the problem of code sharing between drivers. It's worked well for the Intel driver stack and it seems to be working nicely for NVK. I've seen a similar approach employed by a few other driver stacks in Mesa to good effect as well. Hopefully, this framework will help others as they try to solve the specific problems posed by the hardware they're targeting.
