Faith Ekstrand
May 16, 2024
This week we merged support for the VK_EXT_image_drm_format_modifier extension in NVK, the new open-source Vulkan driver for NVIDIA hardware. We've also back-ported the code to the Mesa 24.1 staging branch so it will be part of the upcoming Mesa 24.1 release.
DRM format modifier support is one of the most important features we've landed in NVK in a while. Though it's not a very interesting feature to most Vulkan applications or game developers, it's critical to the Linux display pipeline. For users, the key point is that this is the last piece required to support GameScope. It's also an important piece in making Zink+NVK a robust OpenGL solution.
Passing images around between applications is complicated. In an average desktop scenario, the app you're working in renders its contents to an image and passes the memory for that image on to the compositor. The compositor could be an X11 or Wayland compositor such as GNOME Shell, KWin, or Sway. It could also be Xorg itself if you're running a bare X server without a separate compositor. The compositor then composites those images together using OpenGL or Vulkan to create the final image you see on your screen. Finally, the compositor passes that image to the kernel via the KMS API to put on the screen.
Each of the components in this example lives in a separate process and has limited ability to communicate with the others. The dma-bufs that get passed from process to process are just handles to memory. Information about how the image is laid out in that memory is passed separately. The EGL dma-buf import/export APIs, for instance, take a width, height, and stride for each plane of the image. This ensures that all parties sharing the image know how pixels map to memory addresses.
There are other cases in which images might get shared between processes or components. Video decode is one such example. Let's say you're watching a YouTube video in your web browser. The video bitstream will be decoded through a video encode/decode API such as VA-API or VDPAU and then passed to OpenGL to composite it with the rest of the web content. Even though it might all happen in a single process, the image must be shared between two different APIs, and possibly different hardware units, in order to do the decode and display. The same decoded images may also be shared directly with the display hardware via KMS, providing a more efficient display path when a full composite isn't needed.
Things get complicated when image tiling and compression get involved. The description I gave above of the EGL APIs describing things in terms of a width, height, and stride assumes that the image is linear, meaning that pixel data is laid out in row-major order from left to right, then top to bottom. However, most GPUs don't actually like to work with linear images directly. Instead, most GPU images are tiled, meaning that the image is carved up into tiles and the data in the tiles is shuffled around in some HW-specific pattern. Tiling images is a standard trick that significantly improves performance by improving cache locality. Most GPUs have multiple different tiling patterns and select the best tiling pattern based on the image dimensions, format, and how it will be used. On top of this, many GPUs support some form of on-the-fly data compression which further reduces bandwidth within the GPU.
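To make the linear-versus-tiled distinction concrete, here is a minimal sketch in Python. The linear formula is the standard row-major one implied by a width, height, and stride. The tiled layout is entirely made up for illustration (4x4 tiles stored one after another); it is not the pattern used by any real GPU, and the function names are hypothetical.

```python
def linear_offset(x, y, stride, bytes_per_pixel):
    """Byte offset of pixel (x, y) in a linear (row-major) image."""
    return y * stride + x * bytes_per_pixel


def toy_tiled_offset(x, y, width, bytes_per_pixel, tile_w=4, tile_h=4):
    """Byte offset of pixel (x, y) in a made-up tiled layout.

    Tiles are stored one after another in row-major order, and the pixels
    inside each tile are also row-major. Real hardware uses more elaborate,
    hardware-specific patterns. Assumes width is a multiple of tile_w.
    """
    tiles_per_row = width // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    within_tile = (y % tile_h) * tile_w + (x % tile_w)
    return (tile_index * tile_w * tile_h + within_tile) * bytes_per_pixel


# Distance in memory between two vertically adjacent pixels
# in a 1024-pixel-wide image with 4 bytes per pixel.
width, bpp = 1024, 4
stride = width * bpp
linear_gap = linear_offset(0, 1, stride, bpp) - linear_offset(0, 0, stride, bpp)
tiled_gap = toy_tiled_offset(0, 1, width, bpp) - toy_tiled_offset(0, 0, width, bpp)
print(linear_gap, tiled_gap)  # 4096 16
```

In the linear image, moving down one row jumps a full stride ahead in memory. In the toy tiled layout, the pixel below is only a few bytes away, which is the cache-locality benefit described above. It also shows why the importer must know the exact pattern: the same (x, y) lands at a different address under each scheme.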
All this means that the simplistic model used by the EGL dma-buf import/export APIs is a bit of a lie. It looks like everything is linear on the surface but it may actually be tiled underneath. In the distant past, this was all done via driver magic. On Intel, for instance, there was a convention that all shared images were X tiled where the stride passed through the API was the stride in tiles multiplied by the tile width. On AMD, the kernel driver maintained a small blob of metadata attached to each buffer object (BO) in which the driver instance that created the image stored the tiling information. The driver instance that imported the image would read that metadata and use it to construct image descriptors. Nouveau worked similarly by associating a tile mode and PTE kind with every BO.
The difficulty comes in getting all of the various components of the system to agree on these tiling details. In the past, we trusted in magic driver heuristics and magic side-band data. This was fragile and didn't allow for any sort of improvement. Once the heuristic was decided and baked into drivers, it could never be changed because doing so might cause things to break if two components were at different versions. While it's tempting to think that you can just rely on everyone in the system using the same system-installed driver, VMs and container technologies like Flatpak mean that an old version of the driver may be bundled with the app. Also, one of the components involved in the image sharing is the display driver in the kernel, which often lags behind userspace by 6 months or so. This means that we really do need to keep everything backwards-compatible.
Instead of trusting in magic heuristics, DRM format modifiers provide an explicit mechanism for negotiating tiling and compression information and ensuring that all of the components involved in sharing the image agree. A DRM format modifier is a 64-bit integer, defined in drm_fourcc.h in the Linux kernel, which exactly describes an image layout scheme, including tiling and any compression which may be used. The details of a modifier definition originate from a specific piece of hardware, but the definition of a modifier is complete and unconditional. A single modifier has a specific meaning and cannot be interpreted differently by different hardware or drivers. There are a few modifiers, such as the Arm AFBC modifiers, which are supported by multiple Arm GPUs and display controllers. However, most modifiers only apply to one particular GPU vendor. The only modifier that is intended to be implemented by everyone is DRM_FORMAT_MOD_LINEAR. These modifiers are passed along with the width, height, and stride information in the EGL dma-buf import/export APIs. This allows communicating the tiling and compression information in an unambiguous way without relying on driver heuristics.
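For readers who haven't looked at drm_fourcc.h, the sketch below mirrors in Python how the kernel's fourcc_mod_code() macro packs a modifier: an 8-bit vendor ID in the top byte and a 56-bit vendor-defined value below it. The vendor IDs and the two example modifiers are the ones I believe the header defines; the helper modifier_vendor() is a hypothetical convenience, not a kernel API.

```python
# A subset of the vendor IDs from drm_fourcc.h.
DRM_FORMAT_MOD_VENDOR_NONE = 0x00
DRM_FORMAT_MOD_VENDOR_INTEL = 0x01
DRM_FORMAT_MOD_VENDOR_NVIDIA = 0x03
DRM_FORMAT_MOD_VENDOR_ARM = 0x08


def fourcc_mod_code(vendor, value):
    """Pack a vendor ID (top 8 bits) and vendor-defined value (low 56 bits)."""
    return ((vendor & 0xFF) << 56) | (value & 0x00FFFFFFFFFFFFFF)


def modifier_vendor(modifier):
    """Hypothetical helper: extract the vendor ID from a 64-bit modifier."""
    return (modifier >> 56) & 0xFF


DRM_FORMAT_MOD_LINEAR = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_NONE, 0)
I915_FORMAT_MOD_X_TILED = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 1)

print(hex(DRM_FORMAT_MOD_LINEAR))    # 0x0
print(hex(I915_FORMAT_MOD_X_TILED))  # 0x100000000000001
```

The vendor byte only namespaces the value so that vendors can't collide. Components that merely pass modifiers around still treat the whole 64-bit number as an opaque token.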
Using DRM format modifiers requires the application to perform a negotiation step before any images are created. Each HW driver is queried for the full list of modifiers it supports. Then, the client takes the intersection of those modifier sets to determine the set of modifiers supported by everyone. That list of modifiers is then passed into the image creation step and the driver selects the best modifier from it. Importantly, the client doesn't need to understand what those modifiers actually mean. It just knows what modifiers are supported by what components in the system and blindly passes them around. More information can be found in the Linux kernel documentation.
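The negotiation step boils down to a set intersection, which can be sketched as follows. This is an illustration only: negotiate_modifiers() is a hypothetical helper, and the TILED_* values are made-up placeholders rather than real modifiers. In a real system the per-component lists would come from queries such as the Vulkan or EGL modifier enumeration APIs and from KMS plane properties.

```python
def negotiate_modifiers(*supported):
    """Return the modifiers every component supports.

    The order of the first component's list is preserved so that any
    preference it expresses survives the intersection.
    """
    if not supported:
        return []
    common = set(supported[0]).intersection(*supported[1:])
    return [m for m in supported[0] if m in common]


DRM_FORMAT_MOD_LINEAR = 0
# Made-up placeholder values standing in for opaque vendor modifiers.
TILED_A = 0xFE00000000000001
TILED_B = 0xFE00000000000002
TILED_C = 0xFE00000000000003

renderer = [TILED_B, TILED_A, DRM_FORMAT_MOD_LINEAR]
compositor = [TILED_A, TILED_C, DRM_FORMAT_MOD_LINEAR]
display = [DRM_FORMAT_MOD_LINEAR, TILED_A]

usable = negotiate_modifiers(renderer, compositor, display)
print([hex(m) for m in usable])  # ['0xfe00000000000001', '0x0']
```

The client never inspects the values. It hands the resulting list to the allocating driver at image creation time and lets that driver pick whichever entry it considers best.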
For most applications, this is all handled behind the client's back by the EGL or Vulkan WSI implementation and the compositor. They simply create a VkSurfaceKHR and a VkSwapchainKHR on that surface and render to it like normal. The only components which have to work with modifiers directly are those that are directly involved in the sharing process, such as compositors and toolkits. Frameworks such as GStreamer may also need to know about modifiers because they share images between different APIs.
On most GPUs, memory is just memory. Pages are allocated in system RAM or out of the VRAM pool and mapped into the GPU address space via per-context page tables. This is very similar to the way that virtual memory works on CPUs. When an image is placed in memory, a descriptor is then created which contains the virtual address of the start of the image as well as image dimensions, strides, tiling, and any other information required to describe the image's layout in memory. Any access to the image from the GPU goes through such a descriptor. This maps nicely to the modifiers model: The dma-buf contains the memory and the width, height, stride, and modifier are a condensed version of the descriptor.
On NVIDIA GPUs, however, a small amount of information about the image is encoded into the page tables. Each combination of format, sample count, and compression mode maps to an 8-bit PTE kind. On Turing (RTX 2000-series and GTX 1600-series) and later GPUs, there are three PTE kinds for all color images and two for each depth/stencil format. On earlier GPUs, the sample count is also taken into account when selecting the PTE kind. Unfortunately, this means that memory is no longer "just memory" because it now has image information attached. In the GL world, this wasn't much of a problem because the memory is owned by the image and so the driver can compute the PTE kind and pass that to the kernel when it allocates the memory.
Vulkan makes this all a bit more difficult. Because Vulkan allows images of different kinds to overlap in memory, we need some way to have different PTE kinds for the same memory, depending on which image is used. To handle this, we assign a unique virtual address range to each image and bind those address ranges to memory using VM_BIND, with the PTE kind provided at VM_BIND time. Because page table entries are per-virtual-page, this lets us have different PTE kinds for different images even though they're backed by the same physical pages. This uses the same mechanism as sparse binding through the Vulkan API.
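The idea of per-virtual-page PTE kinds can be modeled with a toy page table. Everything here is a simplification for illustration: ToyAddressSpace, its vm_bind() method, the page size, and the kind values are hypothetical and do not reflect the actual Nouveau UAPI or real NVIDIA PTE kind numbers.

```python
PAGE_SIZE = 4096


class ToyAddressSpace:
    """Toy GPU address space: virtual page -> (physical page, PTE kind)."""

    def __init__(self):
        self.page_table = {}

    def vm_bind(self, va, first_phys_page, num_pages, pte_kind):
        """Map num_pages virtual pages starting at va, tagging each PTE."""
        first_virt_page = va // PAGE_SIZE
        for i in range(num_pages):
            self.page_table[first_virt_page + i] = (first_phys_page + i, pte_kind)

    def lookup(self, va):
        """Return (physical page, PTE kind) for a virtual address."""
        return self.page_table[va // PAGE_SIZE]


# Made-up kind values for two images that alias the same memory.
KIND_COLOR = 0x06
KIND_DEPTH = 0x51

vm = ToyAddressSpace()
# Each image gets its own virtual range, both backed by physical pages 100..103.
vm.vm_bind(va=0x100000, first_phys_page=100, num_pages=4, pte_kind=KIND_COLOR)
vm.vm_bind(va=0x200000, first_phys_page=100, num_pages=4, pte_kind=KIND_DEPTH)

print(vm.lookup(0x100000))  # (100, 6)
print(vm.lookup(0x200000))  # (100, 81)
```

Both lookups resolve to the same physical page, but the kind seen by the hardware depends on which virtual range, and therefore which image, the access goes through.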
This makes handling modifiers in NVK relatively easy. Because we already have to have that separation between image layout and memory for Vulkan, it translates fairly directly to modifiers. It's not that tiling on NVIDIA is easy, it just means that we've already paid all that complexity cost up-front in NVK because we needed to for Vulkan.
While implementing VK_EXT_image_drm_format_modifier for NVK, I discovered two bugs in the current modifiers implementation in the old Nouveau GL driver that make this all more challenging.
The first bug is that the Nouveau GL driver entirely ignores the modifier for imported images. Instead, it uses the same path as for legacy dma-buf imports and looks at the tile mode set on the BO to determine the memory layout. This is wrong because the whole point of DRM format modifiers is to communicate that kind of information explicitly, rather than depending on magic BO properties. However, it works in practice as long as the image was allocated by the Nouveau GL driver, because the Nouveau GL driver always allocates the BO with the same tile mode as it communicates through the modifier. It broke the moment we started trying to import images allocated from NVK into Nouveau GL.
The second issue is related but more subtle. The Nouveau GL driver has no way to override the PTE kind on the BO. The Nouveau GL driver is designed around the assumption that images own their memory and that the PTE kind gets set directly on the BO at allocation time. Without converting the entire GL driver to use the new VM_BIND kernel APIs, there is no way for it to override the PTE kind set on the BO with the PTE kind specified in the modifier. Because NVK defaults to a PTE kind of zero on all allocations, this means that Nouveau GL tries to access the images with the wrong PTE kind and the GPU throws an exception.
The first of these two bugs would be straightforward to fix. Fixing PTE kinds, however, would require reworking the entire GL driver to use the new VM_BIND API and explicit synchronization. This would be a significant undertaking and likely not worth the effort in the current codebase. Also, even if we fixed both of these bugs, we wouldn't be able to back-port them far enough to ensure that NVK would never be paired with a buggy GL driver. Even though the current plan for OpenGL on NVIDIA going forward is to use Zink on NVK, instances of the old Nouveau GL driver will be around for a while and we need to be able to inter-operate.
The good news is that the Nouveau kernel display driver doesn't suffer from either of these issues. When an image with modifiers is handed off to the display driver via KMS, it pulls all the relevant information from the modifier. This means that NVK will have no issues inter-operating with the display driver. Also, though dma-buf import is broken in the old Nouveau GL driver, export works fine. The modifier returned as part of the export correctly matches the PTE kind and tile mode set on the BO and NVK can import it correctly. The only truly broken case is the one where the image is allocated by NVK and imported into the Nouveau GL driver. Unfortunately, this is a really common case if your compositor uses OpenGL and your system OpenGL is the old Nouveau GL driver.
Because fixing the Nouveau GL driver isn't a practical solution, we had to make some concessions somewhere. Dave Airlie, Karol Herbst, and I have been talking about this for months, trying to come up with a good solution. Eventually, we settled on making the following two concessions in the VK_EXT_image_drm_format_modifier implementation in NVK:
1. For DRM format modifier images we return prefersDedicatedAllocation = true when the client queries dedicated allocation properties.
2. When the client does a dedicated allocation for a DRM format modifier image, we pass the PTE kind and tile mode of the image through to the kernel as part of the allocation.
As long as the client uses dedicated allocations when requested, NVK will emulate the behavior of the old Nouveau GL driver and set the same tile mode and PTE kind on the BO that it communicates through the modifier. This allows the image to be correctly imported into the Nouveau GL driver. When the image is imported into NVK, however, the PTE kind and tile mode on the BO are ignored as they're supposed to be and NVK uses the PTE kind and tiling information from the modifier.
NVK does not, however, require dedicated allocations for DRM format modifier images. This means that the NVK modifiers implementation is not restricted in the same way as the old Nouveau GL driver. Applications are free to place multiple DRM format modifier images in a single allocation if they wish. Those images simply won't be compatible with the old Nouveau GL driver. Fortunately, all of the cases anyone is likely to care about, such as the Vulkan WSI code in Mesa, Zink, and various compositors, are already using dedicated allocations by default, so they shouldn't run into any issues if the old Nouveau GL driver happens to be in the pipeline somewhere.
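The interaction between the two concessions and the two importers can be summarized with a small behavioral model. This is not driver code: the Layout and BufferObject types and the three functions are hypothetical stand-ins, and the numeric values are arbitrary. The only facts taken from the text are that NVK defaults to a PTE kind of zero on ordinary allocations, that the old GL driver reads the layout from the BO, and that NVK reads it from the modifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical stand-in for the layout info a modifier describes."""
    pte_kind: int
    tile_mode: int


DEFAULT_BO_LAYOUT = Layout(pte_kind=0, tile_mode=0)


@dataclass
class BufferObject:
    """Toy BO carrying the side-band layout metadata set at allocation."""
    layout: Layout = DEFAULT_BO_LAYOUT


def nvk_allocate(image_layout, dedicated):
    """Model of NVK: only dedicated allocations stamp the layout on the BO."""
    return BufferObject(layout=image_layout) if dedicated else BufferObject()


def old_gl_import(bo, modifier_layout):
    """Model of the old GL driver: ignores the modifier, trusts the BO."""
    return bo.layout


def nvk_import(bo, modifier_layout):
    """Model of NVK: ignores the BO metadata, trusts the modifier."""
    return modifier_layout


image = Layout(pte_kind=0x06, tile_mode=4)  # arbitrary example values

dedicated_bo = nvk_allocate(image, dedicated=True)
shared_bo = nvk_allocate(image, dedicated=False)

print(old_gl_import(dedicated_bo, image) == image)  # True
print(old_gl_import(shared_bo, image) == image)     # False
print(nvk_import(shared_bo, image) == image)        # True
```

In this model, the only failing combination is a non-dedicated NVK allocation imported by the old GL driver, which matches the compatibility story described above.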
Unfortunately, these concessions required us to change the Nouveau kernel UAPI yet again. Previously, the kernel driver would reject any memory allocations with an associated PTE kind and tile mode when running in VM_BIND mode. The assumption at the time was that allowing this was pointless since the client would set the PTE kind at bind time. However, because the old Nouveau GL driver requires the PTE kind and tile mode to be set on all BOs containing images, including those imported from other driver instances, we need to be able to set it from the Vulkan driver so that the GL driver can correctly use the BO. The kernel patches enabling this should make it into Linux 6.10 and we're also planning to back-port it to older, stable kernels since it's a fairly simple change.
With DRM format modifiers landed, the last major piece required for good window-system integration in NVK is now in place. Previously, all window-system images in NVK went through the PRIME path, typically used for cross-GPU sharing, where a linear image is shared and a copy is done from the tiled image to the linear image at vkQueuePresentKHR() time. With modifiers, we can now share tiled images between the client and the compositor, avoiding the copy. While a single copy on a discrete GPU isn't usually a huge problem, it can get significant with 4K displays.
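To put a rough number on the cost of that extra copy, here is a back-of-the-envelope calculation. The figures are assumptions for illustration, not measurements: 4 bytes per pixel, a 60 Hz refresh rate, one copy per presented frame, and a copy that reads each byte once and writes it once.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one frame in bytes, ignoring any padding or alignment."""
    return width * height * bytes_per_pixel


def copy_traffic_per_second(width, height, refresh_hz, bytes_per_pixel=4):
    """Memory traffic from one full-frame copy per refresh (read + write)."""
    return 2 * frame_bytes(width, height, bytes_per_pixel) * refresh_hz


fhd = copy_traffic_per_second(1920, 1080, 60)
uhd = copy_traffic_per_second(3840, 2160, 60)

print(frame_bytes(3840, 2160))  # 33177600
print(fhd)                      # 995328000
print(uhd)                      # 3981312000
```

Under these assumptions, a 4K frame is about 33 MB, and copying one per refresh costs roughly 4 GB/s of memory traffic, four times the 1080p figure, before the application has drawn anything.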
This was also the last major piece required for competent OpenGL support through Zink. While Zink has code-paths for drivers that don't support VK_EXT_image_drm_format_modifier, they are not nearly as well-tested or robust as the modifiers path.
Finally, as mentioned at the start of this post, DRM format modifiers were the last piece required for GameScope to work. This should improve the Linux gaming experience with NVK.