Rebecca McKeever
March 10, 2023
Reading time:
Last November, I joined the graphics team at Collabora as a Software Engineering Intern. Prior to that, I was an Outreachy Intern working on the memblock
testing suite of the Linux kernel. I chose to work at Collabora because I wanted to continue working on open-source software. While there are many companies where this is possible, Collabora appealed to me because open-source software is the primary focus of the entire company. Since joining, I have been very impressed by the supportive and inclusive environment.
I have been working on NVK, an open-source Vulkan driver for NVIDIA hardware that is part of Mesa, with my mentor Faith Ekstrand as well as Gustavo Noronha. During my first three months at Collabora, I implemented several Vulkan API extensions for NVK. I also contributed to the upstream Mesa project.
I verified that the extensions met the specifications in the Vulkan API using the Vulkan Conformance Test suite (CTS).
The VK_KHR_device_group
Vulkan extension provides functionality for interacting with multiple physical devices. I implemented three entrypoints for this extension: vkCmdDispatchBase()
and dummy versions of vkCmdSetDeviceMask()
and vkGetDeviceGroupPeerMemoryFeatures()
.
One issue I ran into when implementing this extension was a failed assertion during the dEQP-VK.compute.device_group.device_index
test:
SPIR-V WARNING: In file ../src/compiler/spirv/spirv_to_nir.c:4628 Unsupported SPIR-V capability: SpvCapabilityDeviceGroup (4437) 28 bytes into the SPIR-V binary deqp-vk: ../src/compiler/nir/nir.c:2489: nir_intrinsic_from_system_value: Assertion `!"system value does not directly correspond to intrinsic"' failed.
Looking at nir_intrinsic_from_system_value()
in nir.c reveals that the function consists of a switch
statement that checks for system values from enum gl_system_value
. A comment in the declaration of gl_system_value
indicates that SYSTEM_VALUE_DEVICE_INDEX
is required for VK_KHR_device_group
. However, nir_intrinsic_from_system_value()
does not check for SYSTEM_VALUE_DEVICE_INDEX
.
The solution was to support nir_intrinsic_load_workgroup_id_zero_base
and set lower_device_index_to_zero
in nir_shader_compiler_options
to true
.
--- a/src/nouveau/codegen/nv50_ir_from_nir.cpp +++ b/src/nouveau/codegen/nv50_ir_from_nir.cpp @ -1727,6 +1727,7 @@ Converter::convert(nir_intrinsic_op intr) case nir_intrinsic_load_vertex_id: return SV_VERTEX_ID; case nir_intrinsic_load_workgroup_id: + case nir_intrinsic_load_workgroup_id_zero_base: return SV_CTAID; case nir_intrinsic_load_work_dim: return SV_WORK_DIM; @ -1986,6 +1987,7 @@ Converter::visit(nir_intrinsic_instr *insn) case nir_intrinsic_load_tess_level_outer: case nir_intrinsic_load_vertex_id: case nir_intrinsic_load_workgroup_id: + case nir_intrinsic_load_workgroup_id_zero_base: case nir_intrinsic_load_work_dim: { const DataType dType = getDType(insn); SVSemantic sv = convert(op);
--- a/src/nouveau/codegen/nv50_ir_from_nir.cpp +++ b/src/nouveau/codegen/nv50_ir_from_nir.cpp @ -3575,7 +3575,7 @@ nvir_nir_shader_compiler_options(int chipset, uint8_t shader_type, bool prefer_n op.optimize_sample_mask_in = false; op.lower_cs_local_index_to_id = true; op.lower_cs_local_id_to_index = false; - op.lower_device_index_to_zero = false; // TODO + op.lower_device_index_to_zero = true; op.lower_wpos_pntc = false; // TODO op.lower_hadd = true; // TODO op.lower_uadd_sat = true; // TODO
This is necessary for a no-op implementation of VK_KHR_device_group
, where only one device group is supported. Moreover, the way lower_system_value_instr()
in the NIR compiler is currently written only handles a device index of 0:
case SYSTEM_VALUE_DEVICE_INDEX: if (b->shader->options->lower_device_index_to_zero) return nir_imm_int(b, 0); break;
The code also needed to handle the intrinsic nir_intrinsic_load_base_workgroup_id
in lower_intrin()
:
+ case nir_intrinsic_load_base_workgroup_id: + return lower_load_base_workgroup_id(b, intrin, ctx);
lower_load_base_workgroup_id()
lowers the load_base_workgroup_id
intrinsic to a load_ubo
system value:
static bool lower_load_base_workgroup_id(nir_builder *b, nir_intrinsic_instr *load, const struct lower_descriptors_ctx *ctx) { const uint32_t root_table_offset = nvk_root_descriptor_offset(cs.base_group); b->cursor = nir_instr_remove(&load->instr); nir_ssa_def *val = nir_load_ubo(b, 3, 32, nir_imm_int(b, 0), nir_imm_int(b, root_table_offset), .align_mul = 4, .align_offset = 0, .range = root_table_offset + 3 * 4); assert(load->dest.is_ssa); nir_ssa_def_rewrite_uses(&load->dest.ssa, val); return true; }
Implementing this extension in NVK also lead to an upstream contribution to the main Mesa project.
According to the Vulkan spec, vkCmdDispatch()
is equivalent to calling vkCmdDispatchBase()
with baseGroupX
, baseGroupY
, and baseGroupZ
set to 0:
vkCmdDispatchBase(0, 0, 0, groupCountX, groupCountY, groupCountZ)
Since this will be the same regardless of the driver it is implemented in, it makes sense to implement it in common Vulkan code rather than independently in each driver. Additionally, several drivers were implementing no-op versions of vkCmdSetDeviceMask()
and vkGetDeviceGroupPeerMemoryFeatures()
, so those could also be implemented in common Vulkan code.
I added vk_common_
versions of these functions to Mesa Vulkan runtime and then deleted independent implementations of these entrypoints from each driver where possible. I also had to make sure to update any driver-specific instances of these functions to the vk_common_
version. However, in one case I had to use the driver's CmdDispatchBase()
instead of using vk_common_CmdDispatch()
:
--- a/src/amd/vulkan/layers/radv_sqtt_layer.c +++ b/src/amd/vulkan/layers/radv_sqtt_layer.c @ -484,7 +484,7 @@ sqtt_CmdDrawIndexedIndirectCount(VkCommandBuffer commandBuffer, VkBuffer buffer, VKAPI_ATTR void VKAPI_CALL sqtt_CmdDispatch(VkCommandBuffer commandBuffer, uint32_t x, uint32_t y, uint32_t z) { - EVENT_MARKER(Dispatch, commandBuffer, x, y, z); + EVENT_MARKER_ALIAS(DispatchBase, Dispatch, commandBuffer, 0, 0, 0, x, y, z); } VKAPI_ATTR void VKAPI_CALL
The VK_EXT_4444_formats
extension defines two image formats in which the A, B, G, and R components are each 4 bits:
VK_FORMAT_A4R4G4B4_UNORM_PACK16_EXT VK_FORMAT_A4B4G4R4_UNORM_PACK16_EXT
The A4R4G4B4 format was already implemented. I added the following lines to the nil_format_info
table to implement the A4B4G4R4 format:
C4(A, R4G4B4A4_UNORM, NONE, R, G, B, A, UNORM, A4B4G4R4, T), F3(A, R4G4B4X4_UNORM, NONE, R, G, B, x, UNORM, A4B4G4R4, T),
The R4G4B4A4_UNORM
parameter is the pipe format that is returned from vk_format_to_pipe_format(VK_FORMAT_A4B4G4R4_UNORM_PACK16)
:
case VK_FORMAT_A4B4G4R4_UNORM_PACK16: return PIPE_FORMAT_R4G4B4A4_UNORM;
I also added the VK_FORMAT_B4G4R4A4_UNORM_PACK16
format to the table. This involved a less-straightforward GBAR swizzle:
C4(A, A4R4G4B4_UNORM, NONE, G, B, A, R, UNORM, A4B4G4R4, T),
The VK_EXT_non_seamless_cube_map extension
allows for disabling cube map edge handling in the sampler. The CTS tests for this extension required shadow sampling, so I first had to enable shadow sampling in nvk_get_image_format_features()
:
if (vk_format_has_depth(vk_format)) { features |= VK_FORMAT_FEATURE_2_SAMPLED_IMAGE_DEPTH_COMPARISON_BIT; }
Next, I attempted to map the corresponding VkSamplerCreateFlags
bit (VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT
) to one of the CUBEMAP_INTERFACE_FILTERING
sampler hardware bits. On my first attempt at this, some tests were passing and some were failing:
assert(device->ctx->eng3d.cls >= KEPLER_A); SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM); + if (pCreateInfo->flags & VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT) { + SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, USE_WRAP); + }
There was already an existing line setting the CUBEMAP_INTERFACE_FILTERING
field, so the field was being set twice. To get the failing tests working, I revised the code to avoid setting the field twice:
--- a/src/nouveau/vulkan/nvk_sampler.c +++ b/src/nouveau/vulkan/nvk_sampler.c @ -224,7 +224,11 @@ nvk_CreateSampler(VkDevice _device, assert(device->ctx->eng3d.cls >= KEPLER_A); - SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM); + if (pCreateInfo->flags & VK_SAMPLER_CREATE_NON_SEAMLESS_CUBE_MAP_BIT_EXT) { + SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, USE_WRAP); + } else { + SAMP_SET_E(samp, NVA097, 1, CUBEMAP_INTERFACE_FILTERING, AUTO_SPAN_SEAM); + } if (device->ctx->eng3d.cls >= MAXWELL_B) { switch (vk_sampler_create_reduction_mode(pCreateInfo)) {
The VK_EXT_image_view_min_lod
extension allows for clamping the minimum LOD (level of detail) by a given float minLod
value. When working on this, I encountered the following failed assert:
deqp-vk: ../src/util/bitpack_helpers.h:61: util_bitpack_uint: Assertion `v <= max' failed.
The MIN_LOD_CLAMP
field in the NVIDIA texture header is unsigned and 12-bit fixed point, but the available macros in nil/nil_image_tic.c
for handling unsigned values were for 32-bit values. I added a set of macros for handling unsigned fixed values to address this. This allowed the tests to run without crashing, but 11 tests were failing.
I noticed that the failing tests all had base_level
in their names, whereas only one of the passing tests did. So I suspected that I was not accounting for the base level. To account for the base level, I updated the code to subtract view->base_level
from view->min_lod_clamp
:
--- a/src/nouveau/nil/nil_image_tic.c +++ b/src/nouveau/nil/nil_image_tic.c @ -371,6 +371,9 @@ nv9097_nil_image_fill_tic(const struct nil_image *image, TH_NV9097_SET_U(th, 7, MULTI_SAMPLE_COUNT, nil_to_nv9097_multi_sample_count(image->sample_layout)); + TH_NV9097_SET_UF(th, 7, MIN_LOD_CLAMP, + view->min_lod_clamp - view->base_level); + memcpy(desc_out, th, sizeof(th)); } @ -446,6 +449,9 @@ nvb097_nil_image_fill_tic(const struct nil_image *image, TH_NVB097_SET_U(th, BL, MULTI_SAMPLE_COUNT, nil_to_nvb097_multi_sample_count(image->sample_layout)); + TH_NVB097_SET_UF(th, BL, MIN_LOD_CLAMP, + view->min_lod_clamp - view->base_level); + memcpy(desc_out, th, sizeof(th)); }
The VK_EXT_mutable_descriptor_type
extension allows applications to specify a list of descriptor types that a given descriptor may mutate into to reduce descriptor memory usage.
The first issue I encountered was a segfault when running any of the relevant tests. A backtrace revealed that the tests are calling vkGetDescriptorSetLayoutSupport()
, which was yet to be implemented in NVK. After I implemented vkGetDescriptorSetLayoutSupport()
, some tests were crashing on a failed assert, either mutable_info != NULL
or dst_binding_layout->type == src_binding_layout->type
.
To address the mutable_info != NULL
assert, I replaced the asserts in nvk_CreateDescriptorPool()
with a conditional to allow for an application that specifies a NULL
descriptor type. The Vulkan spec requires a descriptor type list for each mutable descriptor set in vkCreateDescriptorSetLayout()
. In vkCreateDescriptorPool()
, however, it allows the client to request mutable descriptor types and no descriptor type list and the driver is expected to assume the worst case.
--- a/src/nouveau/vulkan/nvk_descriptor_set.c +++ b/src/nouveau/vulkan/nvk_descriptor_set.c @ -371,11 +369,9 @@ nvk_CreateDescriptorPool(VkDevice _device, uint32_t max_align = 0; for (unsigned i = 0; i < pCreateInfo->poolSizeCount; ++i) { const VkMutableDescriptorTypeListEXT *type_list = NULL; - if (pCreateInfo->pPoolSizes[i].type == VK_DESCRIPTOR_TYPE_MUTABLE_EXT) { - assert(mutable_info != NULL); - assert(i <= mutable_info->mutableDescriptorTypeListCount); - type_list = &mutable_info->pMutableDescriptorTypeLists[i]; - } + if (pCreateInfo->pPoolSizes[i].type == VK_DESCRIPTOR_TYPE_MUTABLE_EXT && + mutable_info && i < mutable_info->mutableDescriptorTypeListCount) + type_list = &mutable_info->pMutableDescriptorTypeLists[i];
In nvk_descriptor_stride_align_for_type()
, I set stride
and align
to NVK_MAX_DESCRIPTOR_SIZE
when type_list
is NULL
.
I addressed the dst_binding_layout->type == src_binding_layout->type
assert by removing it since the types do not need to be the same when doing a copy, and the code is already handling descriptors of different sizes.
The VK_EXT_image_robustness
and VK_EXT_robustness2
extensions add stricter requirements for how out of bounds reads from images are handled.
I first enabled both extensions to see which tests, if any, were failing. There were some failing tests, and I noticed that they were failing this assert in write_buffer_view_desc()
:
assert(view->desc_index < (1 << 20));
The code wasn't handling the case where view
was VK_NULL_HANDLE
. I updated the code to only assert when view
is not NULL
:
--- a/src/nouveau/vulkan/nvk_descriptor_set.c +++ b/src/nouveau/vulkan/nvk_descriptor_set.c @ -133,12 +133,13 @@ write_buffer_view_desc(struct nvk_descriptor_set *set, const VkBufferView bufferView, uint32_t binding, uint32_t elem) { - VK_FROM_HANDLE(nvk_buffer_view, view, bufferView); + struct nvk_image_descriptor desc = { }; + if (bufferView != VK_NULL_HANDLE) { + VK_FROM_HANDLE(nvk_buffer_view, view, bufferView); - assert(view->desc_index < (1 << 20)); - struct nvk_image_descriptor desc = { - .image_index = view->desc_index, - }; + assert(view->desc_index < (1 << 20)); + desc.image_index = view->desc_index; + } write_desc(set, binding, elem, &desc, sizeof(desc)); }
For the second half of my internship, I will be working on implementing VK_KHR_multiview
as described in NVK issue #52. Multiview is a rendering technique for working more efficiently with multiple views. One application of this is rendering stereoscopic views for VR displays.
I am very grateful to my mentor Faith Ekstrand, who has offered crucial support and guidance on top of reviewing my code. Additionally, I am thankful to Gustavo Noronha, who encouraged and advised me throughout the application and onboarding process. I am also grateful for the feedback I have received from others in the Mesa community.
15/01/2025
With VirGL, Venus, and vDRM, virglrenderer offers three different approaches to obtain access to accelerated GFX in a virtual machine. Here…
19/12/2024
In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
Comments (1)
Jakub:
Mar 10, 2023 at 08:16 PM
Great work Rebecca, very interesting read!
Reply to this comment
Reply to this comment
Add a Comment