NVK update: Enabling new extensions, conformance status & more

Faith Ekstrand
June 26, 2023

It's been a while since I've written about NVK. Rebecca, my intern, has written a couple of blog posts about her NVK work, but I've been mostly quiet. Part of that is because I've been primarily focused on something else NVK will need, but we'll get to that in a bit. That doesn't mean nothing has happened in NVK, though. Quite a bit has landed in the main NVK branch since October, and we're long overdue for an update.

NVK Features

Along with Rebecca's work, which you may have seen on the Collabora blog, we've seen a number of community contributions, and I've done a bit of work here and there. Here are some highlights since my last post in October:

  • Support for Maxwell and Kepler GPUs. Back in August, Karol Herbst posted a merge request adding handling for the various state setup bits needed for shaders on older hardware. However, even though Karol's branch existed for a while, it stopped working when I added a hard dependency on the MME (Macro Method Expander) for draws and clears. Pre-Turing hardware support was finally unblocked by Mary and her work enabling the MME on older hardware. I also spent a week or two in December on Maxwell bug-fixing, and it's not too far off from Turing these days in terms of CTS pass rates.
  • Geometry, tessellation, and transform feedback. While not required for Vulkan 1.0, these are pretty important features for running modern games. The codegen compiler back-end already supports them so that part was done. In November or so, George Ouzounoudis did the work to plumb them through NVK and enable the corresponding Vulkan features.
  • Thomas Anderson implemented VK_KHR_draw_indirect_count and flipped on a bunch of smaller extensions.
  • Echo has been playing around with NVK+DXVK a bit and has succeeded in getting some games running. It's still early days and requires some hacks. However, there are a few titles working, and I was able to demo Hollow Knight and F1 2017 at the Collabora meet-up in May.
  • Mohamed Ahmed is currently working on VK_KHR_sampler_ycbcr_conversion as his GSoC internship project.

Here's a fairly complete list of extensions and notable features that have been enabled since my "Introducing NVK" blog post in October:

  • geometryShader
  • tessellationShader
  • shaderImageGatherExtended
  • shaderStorageImageReadWithoutFormat
  • VK_KHR_bind_memory2 
  • VK_KHR_buffer_device_address
  • VK_KHR_depth_stencil_resolve
  • VK_KHR_device_group
  • VK_KHR_draw_indirect_count
  • VK_KHR_driver_properties
  • VK_KHR_dynamic_rendering
  • VK_KHR_maintenance2
  • VK_KHR_maintenance3
  • VK_KHR_maintenance4
  • VK_KHR_map_memory2
  • VK_KHR_multiview
  • VK_KHR_relaxed_block_layout
  • VK_KHR_shader_draw_parameters
  • VK_KHR_spirv_1_4
  • VK_KHR_uniform_buffer_standard_layout
  • VK_EXT_4444_formats
  • VK_EXT_buffer_device_address
  • VK_EXT_image_2d_view_of_3d
  • VK_EXT_image_robustness
  • VK_EXT_image_view_min_lod
  • VK_EXT_index_type_uint8
  • VK_EXT_mutable_descriptor_type
  • VK_EXT_non_seamless_cube_map
  • VK_EXT_provoking_vertex
  • VK_EXT_robustness2
  • VK_EXT_sample_locations
  • VK_EXT_sampler_filter_minmax
  • VK_EXT_separate_stencil_usage
  • VK_EXT_shader_demote_to_helper_invocation
  • VK_EXT_shader_viewport_index_layer
  • VK_EXT_transform_feedback
  • VK_EXT_vertex_attribute_divisor

Importantly, we are getting very close to being able to bump our Vulkan core version. We're currently advertising Vulkan 1.0, but we now have most of what's needed to get to Vulkan 1.2 or maybe even 1.3. The only two major features missing from Vulkan 1.2 are timeline semaphores and VK_KHR_sampler_ycbcr_conversion. Proper timeline semaphore work is waiting on the new kernel uAPI (more on that in a moment) and Mohamed is working on YCbCr support right now.
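
For readers who haven't used them, here's a rough idea of what timeline semaphores give applications once NVK can expose them. This is plain Vulkan 1.2 (or VK_KHR_timeline_semaphore) API usage from the application's side, not NVK internals, and it assumes you already have a valid VkDevice:

    /* Minimal timeline semaphore sketch: standard Vulkan 1.2 API usage,
     * not NVK code.  Assumes `device` is an already-created VkDevice. */
    #include <vulkan/vulkan.h>

    VkSemaphore create_timeline_semaphore(VkDevice device)
    {
        VkSemaphoreTypeCreateInfo type_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
            .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
            .initialValue = 0,
        };
        VkSemaphoreCreateInfo info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
            .pNext = &type_info,
        };
        VkSemaphore sem = VK_NULL_HANDLE;
        vkCreateSemaphore(device, &info, NULL, &sem);
        return sem;
    }

    /* Block on the CPU until the GPU (or another queue) has signaled
     * the given point on the timeline. */
    void wait_for_timeline_point(VkDevice device, VkSemaphore sem, uint64_t value)
    {
        VkSemaphoreWaitInfo wait_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
            .semaphoreCount = 1,
            .pSemaphores = &sem,
            .pValues = &value,
        };
        vkWaitSemaphores(device, &wait_info, UINT64_MAX);
    }

The application-facing side is the easy part; the reason this is blocked is the driver side, which needs the sync object support coming with the new kernel uAPI described below.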

Conformance status

Before NVK can be considered a conformant Vulkan implementation, it will need to be able to pass the Vulkan CTS (conformance test suite). We've been testing with the CTS heavily during the entire development period. My latest CTS run on an RTX 2060 had the following results:

Pass: 313758, Fail: 1282, Crash: 161, Skip: 1672133, Flake: 41

The biggest difference between this and the results shared in my earlier blog post in October is that we fixed about a thousand crashes and the added features enabled about 60% more tests to be run. There's still a ways to go in fixing the remaining failures, but it's good enough that we have decent regression testing.

Upstreaming

Probably the single most common question I get from folks is, "When will NVK be in upstream Mesa?" The short answer is that it'll be upstreamed along with the new kernel API. The new API is going to be required in order to implement Vulkan correctly in a bunch of cases. Even though NVK mostly works on top of the current upstream nouveau interface, I don't want to be maintaining support for that interface for another 10 years when it only partially works.

We don't yet have an exact timetable for when the new API will be ready. I'm currently hoping that we get it all upstream this year but I can't say when exactly.

Performance

Performance is still far from where it needs to be. So far, I've been more focused on getting something that's correct than getting maximum speed. As far as I know, there are no architectural problems with the driver that will prevent us from achieving good performance, but NVK is still maturing and there are a few things that are a bit naive at the moment. Here are a few of the performance issues we know about:

  • Uniforms currently go through global memory. Long-term, we want to be using bindless constant buffer access on Turing and later and possibly promoting things to bound cbufs on earlier hardware. Bindless constant buffers are going to require significant compiler plumbing and will probably have to wait on the compiler work I'm doing at the moment (more on that in a bit).
  • Descriptor access requires too much indirection. Currently, any resource access other than inline UBOs requires two dependent memory reads before you can even begin to access the resource: one to look up the descriptor set address and one to fetch the descriptor (see the sketch after this list). We can reduce this by detecting certain descriptor cases, promoting them to bound descriptors, and doing some binding work at draw time. Unfortunately, this makes all the descriptor set logic more complicated, and I've been hesitant to do it until we have the ability to adequately regression-test and evaluate the performance trade-offs.
  • We're stalling like crazy. Currently, each vkCmdPipelineBarrier() call does a full wait-for-idle no matter what barriers are requested. This is way more aggressive than we actually need and relaxing the stall rules is going to be necessary for good performance. However, we need to be very careful when relaxing it lest we end up causing data races. As with the descriptor changes, the ability to regression test is very important here.
  • Memory access optimization. There are a lot of cases where we could be loading, say, 16B at a time but are instead loading in 4B chunks, which is much less efficient. There is a nouveau compiler pass that tries to fix this, but it's buggy and we have to disable it for CTS runs and some games. There are NIR passes for it which are correct as far as we know, but we haven't done the analysis needed to be able to turn them on. Currently, the way the nouveau NIR path works, even when the NIR provides us with a nice wide load, we split it down into components and then hope the nouveau back-end pass optimizes it. With that pass disabled, this means you always get single-component access even when the SPIR-V uses nice wide loads.
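
To make the descriptor indirection point above more concrete, here's a small illustrative sketch of the pointer chasing involved. None of these types or names come from NVK; they're hypothetical stand-ins for a root table of descriptor set addresses:

    /* Hypothetical illustration of the double indirection described above:
     * one read to find the descriptor set, a second dependent read to fetch
     * the descriptor, and only then can the resource itself be accessed.
     * These types are made up for the example; they are not NVK code. */
    #include <stdint.h>
    #include <stdio.h>

    struct descriptor { uint64_t resource_addr; uint32_t size; uint32_t flags; };
    struct descriptor_set { struct descriptor descs[16]; };

    struct root_table {
        /* Addresses of the bound descriptor sets */
        struct descriptor_set *sets[8];
    };

    static uint64_t resolve_resource(const struct root_table *root,
                                     uint32_t set_idx, uint32_t binding)
    {
        /* Read 1: where does the descriptor set live? */
        const struct descriptor_set *set = root->sets[set_idx];
        /* Read 2 (dependent on read 1): fetch the descriptor itself. */
        const struct descriptor *desc = &set->descs[binding];
        /* Only now can the shader start touching the actual resource. */
        return desc->resource_addr;
    }

    int main(void)
    {
        struct descriptor_set set0 = { .descs[3] = { .resource_addr = 0x1000 } };
        struct root_table root = { .sets[0] = &set0 };
        printf("resource lives at 0x%llx\n",
               (unsigned long long)resolve_resource(&root, 0, 3));
        return 0;
    }

Promoting commonly hit descriptors to bound state, as described in the list above, would remove the first of those dependent reads at the cost of extra work at draw time.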

Those are the big performance holes I'm aware of off-hand. I'm sure we'll find many more along the way. We also still have issues with reclocking on upstream kernels. Those will be solved on Turing and later with the GSP firmware but older hardware is still problematic.

Nouveau kernel status

Speaking of GSP firmware... It's probably worth a few words about what's happening kernel-side. Broadly speaking, the ongoing kernel work breaks down into three categories:

  1. Enabling the use of GSP firmware for Turing and later GPUs. This is required in order to get reclocking support. It should also improve the overall stability of the stack, as we'll be running the same firmware that NVIDIA uses, which means they've actually tested it and, in theory, fixed the bugs.
  2. A new kernel API based on userspace-controlled VM bindings and DRM sync objects (a rough sketch of the sync object primitive follows this list). This is necessary for the correct handling of depth and stencil buffers as well as MSAA on pre-Turing hardware. We currently have hacks that are good enough for passing most of the CTS, but they're just that: hacks. If we want a correct implementation that behaves the way clients expect, we need the ability to control page tables from userspace. The new API will also bring proper timeline semaphore support.
  3. General refactoring, bug-fixing, and stability improvements.
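
For the curious, DRM sync objects aren't specific to nouveau; they're an existing kernel/libdrm primitive, and timeline sync objects are roughly what a Vulkan timeline semaphore maps onto. Here's a rough, self-contained sketch using the generic libdrm API. This is not the new nouveau uAPI, and the render node path is just an example:

    /* Generic DRM timeline sync object usage via libdrm, shown here only to
     * illustrate the primitive; this is not the new nouveau uAPI.
     * Build with -ldrm.  The render node path is just an example. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xf86drm.h>

    int main(void)
    {
        int fd = open("/dev/dri/renderD128", O_RDWR);
        if (fd < 0)
            return 1;

        uint32_t syncobj;
        if (drmSyncobjCreate(fd, 0, &syncobj)) {
            close(fd);
            return 1;
        }

        /* Normally a GPU submission would signal this point; here we signal
         * it ourselves so the wait below returns immediately. */
        uint64_t point = 1;
        drmSyncobjTimelineSignal(fd, &syncobj, &point, 1);

        /* Wait for the timeline to reach the given point. */
        if (drmSyncobjTimelineWait(fd, &syncobj, &point, 1,
                                   INT64_MAX, 0, NULL) == 0)
            printf("timeline point %llu signaled\n", (unsigned long long)point);

        drmSyncobjDestroy(fd, syncobj);
        close(fd);
        return 0;
    }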

All this work is still ongoing and is being done by the lovely folks at Red Hat. My role is mostly advising on the kernel API design. Otherwise, I'm trusting them to do that part. As far as I understand, both GSP and the new kernel API are mostly working in some form if you have the right development branch, but getting it all put together upstream is still ongoing.

A new back-end compiler

I mentioned earlier that I haven't been very focused on NVK itself lately because I've been working on something else. That something else is a new back-end compiler for NVIDIA hardware, code-named NAK or Nvidia Awesome Kompiler. It's written in Rust and is intended to eventually be a replacement for the old nv50 codegen, at least for modern hardware.

Currently, I'm only targeting Turing GPUs. It will be expanded to more hardware eventually. Unfortunately, unlike with the command stream, the shader encoding does change from generation to generation, so it makes sense to pin initial development to a particular generation.

Getting into all the details is probably a topic for another blog post but here are a few highlights:

  • First Mesa back-end compiler written in Rust
  • SSA-based all the way through to register allocation
  • Makes best-practice use of NIR. We trust NIR to do most of the heavy optimization and lowering, keeping the back-end relatively simple.

Overall, I've been very happy with Rust as a language for back-end compiler development. It's way more fun writing Rust code than C or C++ and I can already feel it guiding me away from mistakes. There are a few things that were tricky to get right but I'm pretty happy with the overall design.

The current development status of NAK is that the core seems to be in pretty good shape. It's near parity with the results above for CTS runs, as long as NAK is only enabled for compute shaders. My latest CTS run with NAK enabled for compute shaders had the following results:

Pass: 313148, Fail: 906, Crash: 1150, Skip: 1672133, Flake: 38

There are a few holes yet to be filled in, but the biggest thing we're currently missing is support for spilling when there are more temporary values than we can fit in registers. Also, the above assumes you're running compute-only. While some of the other shader stages do work to some extent, more debugging needs to be done for 3D shaders. There are also a few opcodes that have yet to be implemented.

Conclusion

Looking back, it's amazing how much has happened in NVK in just the last 7 months. If development continues at this crazy pace, we may be looking at a pretty decent driver before too much longer.

 
