NVK update: Enabling new extensions, conformance status & more

Faith Ekstrand
June 26, 2023

It's been a while since I've written about NVK. Rebecca, my intern, has written a couple of blog posts about her NVK work but I've been mostly quiet. Part of that is because I've been primarily focused on something else NVK will need but we'll get to that in a bit. That doesn't mean nothing has happened in NVK, though. Quite a bit has landed in the main NVK branch since October and we're long overdue for an update.

NVK Features

Along with Rebecca's work which you may have seen on the Collabora blog, we've seen a number of community contributions and I've done a bit of work here and there. Here are some highlights since my last post in October:

Support for Maxwell and Kepler GPUs. Karol Herbst posted a merge request back in August adding handling for various state setup bits needed for shaders on older hardware. However, even though Karol's branch existed for a while, it stopped working when I added a hard dependency on the MME (Macro Method Expander) for draws and clears. Pre-Turing hardware support was finally unblocked by Mary and her work enabling the MME on older hardware. I also spent a week or two in December on Maxwell bug-fixing and it's not too far off Turing these days in terms of CTS pass rates.
Geometry, tessellation, and transform feedback. While not required for Vulkan 1.0, these are pretty important features for running modern games. The codegen compiler back-end already supports them so that part was done. In November or so, George Ouzounoudis did the work to plumb them through NVK and enable the corresponding Vulkan features.
Thomas Anderson implemented VK_KHR_draw_indirect_count and flipped on a bunch of smaller extensions.
Echo has been playing around with NVK+DXVK a bit and has succeeded in getting some games playing. It's still early days and requires some hacks. However, there are a few titles working and I was able to demo Hollow Knight and F1 2017 at the Collabora meet-up in May.
Mohamed Ahmed is currently working on VK_KHR_sampler_ycbcr_conversion as his GSoC internship project.

Here's a fairly complete list of extensions and notable features that have been enabled since my "Introducing NVK" blog post in October:

geometryShader
tessellationShader
shaderImageGatherExtended
shaderStorageImageReadWithoutFormat
VK_KHR_bind_memory2
VK_KHR_buffer_device_address
VK_KHR_depth_stencil_resolve
VK_KHR_device_group
VK_KHR_draw_indirect_count
VK_KHR_driver_properties
VK_KHR_dynamic_rendering
VK_KHR_maintenance2
VK_KHR_maintenance3
VK_KHR_maintenance4
VK_KHR_map_memory2
VK_KHR_multiview
VK_KHR_relaxed_block_layout
VK_KHR_shader_draw_parameters
VK_KHR_spirv_1_4
VK_KHR_uniform_buffer_standard_layout
VK_EXT_4444_formats
VK_EXT_buffer_device_address
VK_EXT_image_2d_view_of_3d
VK_EXT_image_robustness
VK_EXT_image_view_min_lod
VK_EXT_index_type_uint8
VK_EXT_mutable_descriptor_type
VK_EXT_non_seamless_cube_map
VK_EXT_provoking_vertex
VK_EXT_robustness2
VK_EXT_sample_locations
VK_EXT_sampler_filter_minmax
VK_EXT_separate_stencil_usage
VK_EXT_shader_demote_to_helper_invocation
VK_EXT_shader_viewport_index_layer
VK_EXT_transform_feedback
VK_EXT_vertex_attribute_divisor

Importantly, we are getting very close to being able to bump our Vulkan core version. Currently, we're advertising Vulkan 1.0 but now we have most of what's needed to get us to Vulkan 1.2 or maybe even 1.3. The only two major features missing from Vulkan 1.2 are timeline semaphores and VK_KHR_sampler_ycbcr_conversion. Proper timeline semaphore work is waiting on the new kernel uAPI (more on that in a moment) and Mohamed is working on YCbCr support right now.

Conformance status

Before NVK can be considered a conformant Vulkan implementation, it will need to pass the Vulkan CTS (conformance test suite). We've been testing heavily with the CTS during the entire development period. My latest CTS run on an RTX 2060 had the following results:

Pass: 313758, Fail: 1282, Crash: 161, Skip: 1672133, Flake: 41

The biggest difference between this and the results shared in my earlier blog post in October is that we fixed about a thousand crashes and the added features enabled about 60% more tests to be run. There's still a way to go fixing the remaining failures but it's good enough that we have decent regression testing.
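To put the summary line above in perspective, here is a minimal sketch that parses it and computes the pass rate among tests that actually ran. The helper names are hypothetical, and counting flakes as executed tests is an assumption made for illustration; this is not how conformance is officially judged.

```python
import re

def parse_cts_summary(line):
    """Parse a 'Pass: N, Fail: N, ...' summary line into a dict of ints."""
    return {name.lower(): int(count)
            for name, count in re.findall(r"(\w+):\s*(\d+)", line)}

def executed_tests(results):
    """Number of tests that ran at all (everything except skips)."""
    return sum(count for name, count in results.items() if name != "skip")

def pass_rate(results):
    """Fraction of executed tests that passed."""
    return results["pass"] / executed_tests(results)

summary = "Pass: 313758, Fail: 1282, Crash: 161, Skip: 1672133, Flake: 41"
results = parse_cts_summary(summary)
print(f"executed:  {executed_tests(results)}")
print(f"pass rate: {pass_rate(results):.2%}")
```

Running the same helpers over two summary lines, one per driver revision, gives a quick way to spot regressions between runs.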

Upstreaming

Probably the single most common question I get from folks is, "When will NVK be in upstream Mesa?" The short answer is that it'll be upstreamed along with the new kernel API. The new API is going to be required in order to implement Vulkan correctly in a bunch of cases. Even though NVK mostly works on top of the current upstream nouveau API, I don't want to be maintaining support for that interface for another 10 years when it only partially works.

We don't yet have an exact timetable for when the new API will be ready. I'm currently hoping that we get it all upstream this year but I can't say when exactly.

Performance

Performance is still far from where it needs to be. So far, I've been more focused on getting something that's correct than getting maximum speed. As far as I know, there are no architectural problems with the driver that will prevent us from achieving good performance, but NVK is still maturing and there are a few things that are a bit naive at the moment. Here are a few of the performance issues we know about:

Uniforms currently go through global memory. Long-term, we want to be using bindless constant buffer access on Turing and later and possibly promoting things to bound cbufs on earlier hardware. Bindless constant buffers are going to require significant compiler plumbing and will probably have to wait on the compiler work I'm doing at the moment (more on that in a bit).
Descriptor access requires too much indirection. Currently, any resource access other than inline UBOs requires two dependent memory reads before you can even begin to access the resource: One to look up the descriptor set address and one to fetch the descriptor. We can reduce this by detecting certain descriptor cases and promoting them to bound and doing some binding work at draw time. Unfortunately, this makes all the descriptor set logic more complicated and I've been hesitant to do it until we have the ability to adequately regression-test and evaluate performance trade-offs.
We're stalling like crazy. Currently, each vkCmdPipelineBarrier() call does a full wait-for-idle no matter what barriers are requested. This is way more aggressive than we actually need and relaxing the stall rules is going to be necessary for good performance. However, we need to be very careful when relaxing it lest we end up causing data races. As with the descriptor changes, the ability to regression test is very important here.
Memory access optimization. There are a lot of cases where we could be loading, say, 16B at a time but are instead loading in 4B chunks which is much less efficient. There is a nouveau compiler pass that tries to fix this but it's buggy and we have to disable it for CTS runs and some games. There are NIR passes for it which are correct as far as we know but we haven't done the analysis to be able to turn them on. Currently, the way the nouveau NIR path works, even when the NIR provides us with a nice wide load, we split it down into components and then hope the nouveau back-end pass optimizes it. With the pass disabled, this means you always get single-component access even when the SPIR-V uses nice wide loads.
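To make the memory access point from the list above concrete, here is a toy sketch of merging adjacent 4B loads into wider ones. It is not the nouveau back-end pass or any of the NIR passes; `coalesce_loads` is a hypothetical helper, and the rule that a wide load must be naturally aligned to its size is an assumption made for illustration.

```python
def coalesce_loads(offsets, max_bytes=16):
    """Merge adjacent 4-byte loads into naturally aligned 8- or 16-byte loads.

    `offsets` are byte offsets (multiples of 4) from a common base address.
    Returns a list of (offset, size_in_bytes) tuples in ascending order.
    """
    present = set(offsets)
    covered = set()
    merged = []
    for off in sorted(present):
        if off in covered:
            continue  # already folded into an earlier, wider load
        size = max_bytes
        # Try the widest load first, then fall back to narrower ones.
        while size > 4:
            aligned = off % size == 0
            complete = all((off + i) in present for i in range(0, size, 4))
            if aligned and complete:
                break
            size //= 2
        merged.append((off, size))
        covered.update(range(off, off + size, 4))
    return merged

print(coalesce_loads([0, 4, 8, 12]))   # four dwords at an aligned base
print(coalesce_loads([4, 8, 12, 16]))  # same count, but misaligned start
```

The first call collapses into a single 16B load, while the second only manages one 8B load in the middle, which is why alignment information matters as much as adjacency for this kind of optimization.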

Those are the big performance holes I'm aware of off-hand. I'm sure we'll find many more along the way. We also still have issues with reclocking on upstream kernels. Those will be solved on Turing and later with the GSP firmware but older hardware is still problematic.

Nouveau kernel status

Speaking of GSP firmware... It's probably worth a few words about what's happening kernel-side. Broadly speaking, the ongoing kernel work breaks down into three categories:

Enabling the use of GSP firmware for Turing and later GPUs. This is required in order to get re-clocking support. It should also improve the overall stability of the stack as we'll be running the same firmware that NVIDIA uses and that means they've actually tested it and, in theory, fixed the bugs.
A new kernel API based on userspace-controlled VM bindings and DRM sync objects. This is necessary for the correct handling of depth and stencil buffers as well as MSAA pre-Turing. We currently have hacks that are good enough for passing most of the CTS but they're just that: hacks. If we want a correct implementation that behaves the way clients expect, we need the ability to control page tables from userspace. The new API will also bring proper timeline semaphore support.
General refactoring, bug-fixing, and stability improvements.

All this work is still ongoing and is being done by the lovely folks at Red Hat. My role is mostly in advising the kernel API design. Otherwise, I'm trusting them to do that part. As far as I understand, both GSP and the new kernel API are mostly working in some form if you have the right development branch but getting it all put together upstream is still ongoing.
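For readers unfamiliar with timeline semaphores, the following toy model shows the semantics Vulkan clients expect: a counter that only moves forward, with waiters blocking until it reaches a target value. This is a conceptual sketch only; the `TimelineSemaphore` class is hypothetical and says nothing about how the new kernel API or DRM sync objects implement it.

```python
import threading

class TimelineSemaphore:
    """Toy model of a timeline semaphore: a strictly increasing counter."""

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    @property
    def value(self):
        with self._cond:
            return self._value

    def signal(self, value):
        """Advance the counter; values must strictly increase."""
        with self._cond:
            if value <= self._value:
                raise ValueError("timeline values must strictly increase")
            self._value = value
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        """Block until the counter reaches `value`; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= value, timeout)

sem = TimelineSemaphore()
# A worker thread stands in for the GPU finishing some work.
worker = threading.Thread(target=lambda: sem.signal(2))
worker.start()
reached = sem.wait(2, timeout=5.0)
worker.join()
print(f"reached value 2: {reached}, counter is now {sem.value}")
```

A single counter like this can stand in for many binary semaphores, since each submission just waits on or signals a different point on the same timeline.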

A new back-end compiler

I mentioned earlier that I haven't been very focused on NVK itself lately because I've been working on something else. That something else is a new back-end compiler for NVIDIA hardware, code-named NAK or Nvidia Awesome Kompiler. It's written in Rust and is intended to eventually be a replacement for the old nv50 codegen, at least for modern hardware.

Currently, I'm only targeting Turing GPUs. It will be expanded to more hardware eventually. Unfortunately, unlike with the command stream, the shader encoding does change from generation to generation so it makes sense to fix everything to a particular generation for initial development.

Getting into all the details is probably a topic for another blog post but here are a few highlights:

First Mesa back-end compiler written in Rust
SSA-based all the way through to register allocation
Makes best-practice use of NIR. We trust NIR to do most of the heavy optimization and lowering, keeping the back-end relatively simple.

Overall, I've been very happy with Rust as a language for back-end compiler development. It's way more fun writing Rust code than C or C++ and I can already feel it guiding me away from mistakes. There are a few things that were tricky to get right but I'm pretty happy with the overall design.

The current development status of NAK is that the core seems to be in pretty good shape at this point. It's near parity for CTS runs as long as it's only enabled for compute shaders. My latest CTS run with NAK enabled for compute shaders had the following results:

Pass: 313148, Fail: 906, Crash: 1150, Skip: 1672133, Flake: 38

There are a few holes to be filled in yet but the biggest thing we're currently missing is support for spilling when the number of temporary values gets to be more than we can fit in registers. Also, the above assumes you're running compute-only. While some of the other shader stages do work to some extent, more debugging needs to be done for 3D shaders. There are also a few opcodes that have yet to be implemented.
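As a rough illustration of when spilling becomes necessary, the sketch below computes the peak number of simultaneously live SSA values in a straight-line program and compares it to a register budget. It is not NAK's algorithm; the instruction format and `max_live_values` are hypothetical, and a real allocator also has to deal with control flow, register classes, and vector registers.

```python
def max_live_values(instrs):
    """Return the peak number of simultaneously live SSA values.

    `instrs` is a straight-line list of (dest, sources) pairs. Each dest is
    defined exactly once (SSA) and is live from its definition to its last
    use. A dest of None marks an instruction with no result, such as a store.
    """
    last_use = {}
    for idx, (_dest, sources) in enumerate(instrs):
        for src in sources:
            last_use[src] = idx

    live = set()
    peak = 0
    for idx, (dest, sources) in enumerate(instrs):
        # Assumption: a source that dies here frees its register in time
        # for the destination of the same instruction to reuse it.
        for src in sources:
            if last_use[src] == idx:
                live.discard(src)
        if dest in last_use:  # values that are never read are ignored
            live.add(dest)
        peak = max(peak, len(live))
    return peak

program = [
    ("a", []),          # a = load
    ("b", []),          # b = load
    ("c", ["a", "b"]),  # c = a + b
    ("d", ["a", "c"]),  # d = a * c
    ("e", ["b", "d"]),  # e = b - d
    (None, ["e"]),      # store e
]
pressure = max_live_values(program)
for num_regs in (2, 3):
    verdict = "needs spilling" if pressure > num_regs else "fits"
    print(f"{num_regs} registers, peak pressure {pressure}: {verdict}")
```

When the peak exceeds the register budget, some values have to be written out to memory and reloaded later, which is the missing piece described above.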

Conclusion

Looking back, it's amazing how much has happened in NVK in just the last 7 months. If development continues at this crazy pace, we may be looking at a pretty decent driver before too much longer.
