We're hiring!
*

Implementing Vulkan extensions for NVK

Rebecca McKeever avatar

Rebecca McKeever
March 10, 2023

Share this post:

Reading time:

Last November, I joined the graphics team at Collabora as a Software Engineering Intern. Prior to that, I was an Outreachy Intern working on the memblock testing suite of the Linux kernel. I chose to work at Collabora because I wanted to continue working on open-source software. While there are many companies where this is possible, Collabora appealed to me because open-source software is the primary focus of the entire company. Since joining, I have been very impressed by the supportive and inclusive environment.

I have been working on NVK, an open-source Vulkan driver for NVIDIA hardware that is part of Mesa, with my mentor Faith Ekstrand as well as Gustavo Noronha. During my first three months at Collabora, I implemented several Vulkan API extensions for NVK. I also contributed to the upstream Mesa project.

I verified that the extensions met the specifications in the Vulkan API using the Vulkan Conformance Test suite (CTS).

Enable VK_KHR_device_group

The VK_KHR_device_group Vulkan extension provides functionality for interacting with multiple physical devices. I implemented three entrypoints for this extension: vkCmdDispatchBase() and dummy versions of vkCmdSetDeviceMask() and vkGetDeviceGroupPeerMemoryFeatures().

One issue I ran into when implementing this extension was a failed assertion during the dEQP-VK.compute.device_group.device_index test:

SPIR-V WARNING:
    In file ../src/compiler/spirv/spirv_to_nir.c:4628
    Unsupported SPIR-V capability: SpvCapabilityDeviceGroup (4437)
    28 bytes into the SPIR-V binary
deqp-vk: ../src/compiler/nir/nir.c:2489: nir_intrinsic_from_system_value: Assertion `!"system value does not directly correspond to intrinsic"' failed.

Looking at nir_intrinsic_from_system_value() in nir.c reveals that the function consists of a switch statement that checks for system values from enum gl_system_value. A comment in the declaration of gl_system_value indicates that SYSTEM_VALUE_DEVICE_INDEX is required for VK_KHR_device_group. However, nir_intrinsic_from_system_value() does not check for SYSTEM_VALUE_DEVICE_INDEX.

The solution was to support nir_intrinsic_load_workgroup_id_zero_base and set lower_device_index_to_zero in nir_shader_compiler_options to true.

--- a/src/nouveau/codegen/nv50_ir_from_nir.cpp
+++ b/src/nouveau/codegen/nv50_ir_from_nir.cpp
@ -1727,6 +1727,7 @@ Converter::convert(nir_intrinsic_op intr)
   case nir_intrinsic_load_vertex_id:
      return SV_VERTEX_ID;
   case nir_intrinsic_load_workgroup_id:
+  case nir_intrinsic_load_workgroup_id_zero_base:
      return SV_CTAID;
   case nir_intrinsic_load_work_dim:
      return SV_WORK_DIM;
@ -1986,6 +1987,7 @@ Converter::visit(nir_intrinsic_instr *insn)
   case nir_intrinsic_load_tess_level_outer:
   case nir_intrinsic_load_vertex_id:
   case nir_intrinsic_load_workgroup_id:
+  case nir_intrinsic_load_workgroup_id_zero_base:
   case nir_intrinsic_load_work_dim: {
      const DataType dType = getDType(insn);
      SVSemantic sv = convert(op);
--- a/src/nouveau/codegen/nv50_ir_from_nir.cpp
+++ b/src/nouveau/codegen/nv50_ir_from_nir.cpp
@ -3575,7 +3575,7 @@ nvir_nir_shader_compiler_options(int chipset, uint8_t shader_type, bool prefer_n
   op.optimize_sample_mask_in = false;
   op.lower_cs_local_index_to_id = true;
   op.lower_cs_local_id_to_index = false;
-  op.lower_device_index_to_zero = false; // TODO
+  op.lower_device_index_to_zero = true;
   op.lower_wpos_pntc = false; // TODO
   op.lower_hadd = true; // TODO
   op.lower_uadd_sat = true; // TODO

This is necessary for a no-op implementation of VK_KHR_device_group, where only one device group is supported. Moreover, the way lower_system_value_instr() in the NIR compiler is currently written only handles a device index of 0:

      case SYSTEM_VALUE_DEVICE_INDEX:
         if (b->shader->options->lower_device_index_to_zero)
            return nir_imm_int(b, 0);
         break;

The code also needed to handle the intrinsic nir_intrinsic_load_base_workgroup_id in lower_intrin():

+  case nir_intrinsic_load_base_workgroup_id:
+     return lower_load_base_workgroup_id(b, intrin, ctx);

lower_load_base_workgroup_id() lowers the load_base_workgroup_id intrinsic to a load_ubo system value:

static bool
lower_load_base_workgroup_id(nir_builder *b, nir_intrinsic_instr *load,
                             const struct lower_descriptors_ctx *ctx)
{
   const uint32_t root_table_offset =
      nvk_root_descriptor_offset(cs.base_group);

   b->cursor = nir_instr_remove(&load->instr);

   nir_ssa_def *val = nir_load_ubo(b, 3, 32,
                                   nir_imm_int(b, 0),
                                   nir_imm_int(b, root_table_offset),
                                   .align_mul = 4,
                                   .align_offset = 0,
                                   .range = root_table_offset + 3 * 4);

   assert(load->dest.is_ssa);
   nir_ssa_def_rewrite_uses(&load->dest.ssa, val);

   return true;
}

Implementing this extension in NVK also lead to an upstream contribution to the main Mesa project.

Add common code for VK_KHR_device_group entrypoints to upstream Mesa

According to the Vulkan spec, vkCmdDispatch() is equivalent to calling vkCmdDispatchBase() with baseGroupX, baseGroupY, and baseGroupZ set to 0:

vkCmdDispatchBase(0, 0, 0, groupCountX, groupCountY, groupCountZ)

Since this will be the same regardless of the driver it is implemented in, it makes sense to implement it in common Vulkan code rather than independently in each driver. Additionally, several drivers were implementing no-op versions of vkCmdSetDeviceMask() and vkGetDeviceGroupPeerMemoryFeatures(), so those could also be implemented in common Vulkan code.

I added vk_common_ versions of these functions to Mesa Vulkan runtime and then deleted independent implementations of these entrypoints from each driver where possible. I also had to make sure to update any driver-specific instances of these functions to the vk_common_ version. However, in one case I had to use the driver's CmdDispatchBase() instead of using vk_common_CmdDispatch():

--- a/src/amd/vulkan/layers/radv_sqtt_layer.c
+++ b/src/amd/vulkan/layers/radv_sqtt_layer.c
@ -484,7 +484,7 @@ sqtt_CmdDrawIndexedIndirectCount(VkCommandBuffer commandBuffer, VkBuffer buffer,
   VKAPI_ATTR void VKAPI_CALL
   sqtt_CmdDispatch(VkCommandBuffer commandBuffer, uint32_t x, uint32_t y, uint32_t z)
   {
-     EVENT_MARKER(Dispatch, commandBuffer, x, y, z);
+     EVENT_MARKER_ALIAS(DispatchBase, Dispatch, commandBuffer, 0, 0, 0, x, y, z);
   }

   VKAPI_ATTR void VKAPI_CALL

Support A4B4G4R4 EXT_4444_formats

The VK_EXT_4444_formats extension defines two image formats in which the A, B, G, and R components are each 4 bits:

VK_FORMAT_A4R4G4B4_UNORM_PACK16_EXT
VK_FORMAT_A4B4G4R4_UNORM_PACK16_EXT

The A4R4G4B4 format was already implemented. I added the following lines to the nil_format_info table to implement the A4B4G4R4 format:

C4(A, R4G4B4A4_UNORM,   NONE,       R, G, B, A, UNORM,   A4B4G4R4,   T),
F3(A, R4G4B4X4_UNORM,   NONE,       R, G, B, x, UNORM,   A4B4G4R4,   T),

The R4G4B4A4_UNORM parameter is the pipe format that is returned from vk_format_to_pipe_format(VK_FORMAT_A4B4G4R4_UNORM_PACK16):

case VK_FORMAT_A4B4G4R4_UNORM_PACK16:
   return PIPE_FORMAT_R4G4B4A4_UNORM;

I also added the VK_FORMAT_B4G4R4A4_UNORM_PACK16 format to the table. This involved a less-straightforward GBAR swizzle:

C4(A, A4R4G4B4_UNORM,   NONE,       G, B, A, R, UNORM,   A4B4G4R4,   T),

Support VK_EXT_non_seamless_cube_map

The VK_EXT_non_seamless_cube_map extension allows for disabling cube map edge handling in the sampler. The CTS tests for this extension required shadow sampling, so I first had to enable shadow sampling in nvk_get_image_format_features():

if (vk_format_has_depth(vk_format)) {
   features |= VK_FORMAT_FEATURE_2_SAMPLED_IMAGE_DEPTH_COMPARISON_BIT;
}

Next, I attempted to map the corresponding VkSamplerCreateFlags bit (VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT) to one of the CUBEMAP_INTERFACE_FILTERING sampler hardware bits. On my first attempt at this, some tests were passing and some were failing:

   assert(device->ctx->eng3d.cls >= KEPLER_A);
   SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM);
+  if (pCreateInfo->flags & VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT) {
+     SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, USE_WRAP);
+  }

There was already an existing line setting the CUBEMAP_INTERFACE_FILTERING field, so the field was being set twice. To get the failing tests working, I revised the code to avoid setting the field twice:

--- a/src/nouveau/vulkan/nvk_sampler.c
+++ b/src/nouveau/vulkan/nvk_sampler.c
@ -224,7 +224,11 @@ nvk_CreateSampler(VkDevice _device,
   assert(device->ctx->eng3d.cls >= KEPLER_A);
-  SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM);
+  if (pCreateInfo->flags & VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT) {
+     SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, USE_WRAP);
+  } else {
+     SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM);
+  }

   if (device->ctx->eng3d.cls >= MAXWELL_B) {
      switch (vk_sampler_create_reduction_mode(pCreateInfo)) {

Implement VK_EXT_image_view_min_lod

The VK_EXT_image_view_min_lod extension allows for clamping the minimum LOD (level of detail) by a given float minLod value. When working on this, I encountered the following failed assert:

deqp-vk: ../src/util/bitpack_helpers.h:61: util_bitpack_uint: Assertion `v <= max' failed.

The MIN_LOD_CLAMP field in the NVIDIA texture header is unsigned and 12-bit fixed point, but the available macros in nil/nil_image_tic.c for handling unsigned values were for 32-bit values. I added a set of macros for handling unsigned fixed values to address this. This allowed the tests to run without crashing, but 11 tests were failing.

I noticed that the failing tests all had base_level in their names, whereas only one of the passing tests did. So I suspected that I was not accounting for the base level. To account for the base level, I updated the code to subtract view->base_level from view->min_lod_clamp:

--- a/src/nouveau/nil/nil_image_tic.c
+++ b/src/nouveau/nil/nil_image_tic.c
@ -371,6 +371,9 @@ nv9097_nil_image_fill_tic(const struct nil_image *image,
      TH_NV9097_SET_U(th, 7, MULTI_SAMPLE_COUNT,
                      nil_to_nv9097_multi_sample_count(image->sample_layout));
 
+     TH_NV9097_SET_UF(th, 7, MIN_LOD_CLAMP,
+                      view->min_lod_clamp - view->base_level);
+
      memcpy(desc_out, th, sizeof(th));
   }

@ -446,6 +449,9 @@ nvb097_nil_image_fill_tic(const struct nil_image *image,
      TH_NVB097_SET_U(th, BL, MULTI_SAMPLE_COUNT,
                      nil_to_nvb097_multi_sample_count(image->sample_layout));
 
+     TH_NVB097_SET_UF(th, BL, MIN_LOD_CLAMP,
+                      view->min_lod_clamp - view->base_level);
+
      memcpy(desc_out, th, sizeof(th));
   }

Support VK_EXT_mutable_descriptor_type

The VK_EXT_mutable_descriptor_type extension allows applications to specify a list of descriptor types that a given descriptor may mutate into to reduce descriptor memory usage.

The first issue I encountered was a segfault when running any of the relevant tests. A backtrace revealed that the tests are calling vkGetDescriptorSetLayoutSupport(), which was yet to be implemented in NVK. After I implemented vkGetDescriptorSetLayoutSupport(), some tests were crashing on a failed assert, either mutable_info != NULL or dst_binding_layout->type == src_binding_layout->type.

To address the mutable_info != NULL assert, I replaced the asserts in nvk_CreateDescriptorPool() with a conditional to allow for an application that specifies a NULL descriptor type. The Vulkan spec requires a descriptor type list for each mutable descriptor set in vkCreateDescriptorSetLayout(). In vkCreateDescriptorPool(), however, it allows the client to request mutable descriptor types and no descriptor type list and the driver is expected to assume the worst case.

--- a/src/nouveau/vulkan/nvk_descriptor_set.c
+++ b/src/nouveau/vulkan/nvk_descriptor_set.c
@ -371,11 +369,9 @@ nvk_CreateDescriptorPool(VkDevice _device,
   uint32_t max_align = 0;
   for (unsigned i = 0; i < pCreateInfo->poolSizeCount; ++i) {
      const VkMutableDescriptorTypeListEXT *type_list = NULL;
-     if (pCreateInfo->pPoolSizes[i].type == VK_DESCRIPTOR_TYPE_MUTABLE_EXT) {
-        assert(mutable_info != NULL);
-        assert(i <= mutable_info->mutableDescriptorTypeListCount);
-        type_list = &mutable_info->pMutableDescriptorTypeLists[i];
-     }
+     if (pCreateInfo->pPoolSizes[i].type == VK_DESCRIPTOR_TYPE_MUTABLE_EXT &&
+         mutable_info && i < mutable_info->mutableDescriptorTypeListCount)
+           type_list = &mutable_info->pMutableDescriptorTypeLists[i];

In nvk_descriptor_stride_align_for_type(), I set stride and align to NVK_MAX_DESCRIPTOR_SIZE when type_list is NULL.

I addressed the dst_binding_layout->type == src_binding_layout->type assert by removing it since the types do not need to be the same when doing a copy, and the code is already handling descriptors of different sizes.

VK_EXT_image_robustness and VK_EXT_robustness2

The VK_EXT_image_robustness and VK_EXT_robustness2 extensions add stricter requirements for how out of bounds reads from images are handled.

I first enabled both extensions to see which tests, if any, were failing. There were some failing tests, and I noticed that they were failing this assert in write_buffer_view_desc():

assert(view->desc_index < (1 << 20));

The code wasn't handling the case where view was VK_NULL_HANDLE. I updated the code to only assert when view is not NULL:

--- a/src/nouveau/vulkan/nvk_descriptor_set.c
+++ b/src/nouveau/vulkan/nvk_descriptor_set.c
@ -133,12 +133,13 @@ write_buffer_view_desc(struct nvk_descriptor_set *set,
                          const VkBufferView bufferView,
                          uint32_t binding, uint32_t elem)
   {
-     VK_FROM_HANDLE(nvk_buffer_view, view, bufferView);
+     struct nvk_image_descriptor desc = { };
+     if (bufferView != VK_NULL_HANDLE) {
+        VK_FROM_HANDLE(nvk_buffer_view, view, bufferView);
 
-     assert(view->desc_index < (1 << 20));
-     struct nvk_image_descriptor desc = {
-        .image_index = view->desc_index,
-     };
+        assert(view->desc_index < (1 << 20));
+        desc.image_index = view->desc_index;
+     }
      write_desc(set, binding, elem, &desc, sizeof(desc));
   }

Conclusion

For the second half of my internship, I will be working on implementing VK_KHR_multiview as described in NVK issue #52. Multiview is a rendering technique for working more efficiently with multiple views. One application of this is rendering stereoscopic views for VR displays.

I am very grateful to my mentor Faith Ekstrand, who has offered crucial support and guidance on top of reviewing my code. Additionally, I am thankful to Gustavo Noronha, who encouraged and advised me throughout the application and onboarding process. I am also grateful for the feedback I have received from others in the Mesa community.

Comments (1)

  1. Jakub:
    Mar 10, 2023 at 08:16 PM

    Great work Rebecca, very interesting read!

    Reply to this comment

    Reply to this comment


Add a Comment






Allowed tags: <b><i><br>Add a new comment:


Search the newsroom

Latest Blog Posts

Faster inference: torch.compile vs TensorRT

19/12/2024

In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…

Mesa CI and the power of pre-merge testing

08/10/2024

Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…

A shifty tale about unit testing with Maxwell, NVK's backend compiler

15/08/2024

After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…

A journey towards reliable testing in the Linux Kernel

01/08/2024

We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…

Building a Board Farm for Embedded World

27/06/2024

With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…

Smart audio filters with WirePlumber 0.5

26/06/2024

WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…

Open Since 2005 logo

Our website only uses a strictly necessary session cookie provided by our CMS system. To find out more please follow this link.

Collabora Limited © 2005-2024. All rights reserved. Privacy Notice. Sitemap.