Rebecca Mckeever
May 03, 2023
NVK, an open-source Vulkan driver for NVIDIA hardware that is part of Mesa, now supports the Vulkan extension VK_KHR_multiview.
Multiview is a rendering technique originally designed for VR. To improve efficiency when rendering multiple views, a single set of commands is recorded and then executed slightly differently in each view. Each view has a different ViewIndex, which corresponds to one of the bits set in the view mask. The view mask is specified in VkRenderPassMultiviewCreateInfo during render pass creation.
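To make the relationship between the view mask and ViewIndex concrete, here is a minimal C sketch that runs on the CPU. The helper names `view_count` and `view_indices` are made up for illustration; they are not part of Vulkan or NVK.

```c
#include <stdint.h>

/* Hypothetical helper: number of views enabled by a 32-bit view mask. */
static uint32_t view_count(uint32_t view_mask)
{
    uint32_t count = 0;
    /* Each iteration clears the lowest set bit of the mask. */
    for (; view_mask != 0; view_mask &= view_mask - 1)
        count++;
    return count;
}

/* Hypothetical helper: writes the ViewIndex of every enabled view into
 * out[] (room for 32 entries) in ascending order and returns how many
 * entries were written. */
static uint32_t view_indices(uint32_t view_mask, uint32_t out[32])
{
    uint32_t n = 0;
    for (uint32_t bit = 0; bit < 32; bit++) {
        if (view_mask & (UINT32_C(1) << bit))
            out[n++] = bit;
    }
    return n;
}
```

For a view mask of 5 (binary 0101), `view_indices` yields the view indices 0 and 2.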
NVIDIA hardware has some features that make multiview easier to implement than on hardware from other vendors, such as Intel. Intel's implementation of multiview uses instanced rendering. In instanced rendering, within each draw call, the number of instances is multiplied by the number of views in the subpass. In the shader, gl_InstanceId is divided by the number of views to recover the original instance index, and a compacted view index is calculated as gl_InstanceId % view_count. If there are gaps in the view mask, further operations are performed to convert the compacted view index to the actual view index. This involves defining a map from the compacted view index to the actual view index. If primitive replication can be used, instanced rendering is not necessary. In that case, the shader makes gl_Position an array and fills it with values for each view that is set.
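The instance arithmetic described above can be sketched on the CPU as follows. This is only a model of the division, the modulo, and the compacted-to-actual map; the struct and function names are hypothetical, and it is not Intel's shader code.

```c
#include <stdint.h>

struct view_lookup {
    uint32_t instance;   /* instance index the application asked for */
    uint32_t view_index; /* actual ViewIndex (bit position in the mask) */
};

/* Hypothetical sketch: the draw is issued with
 * instance_count * view_count instances, and each expanded instance ID
 * is split back into an (instance, view) pair. */
static struct view_lookup
decode_instance(uint32_t expanded_instance_id, uint32_t view_mask)
{
    /* Map from compacted view index to actual view index. */
    uint32_t map[32];
    uint32_t view_count = 0;
    for (uint32_t bit = 0; bit < 32; bit++) {
        if (view_mask & (UINT32_C(1) << bit))
            map[view_count++] = bit;
    }

    struct view_lookup r = { expanded_instance_id, 0 };
    if (view_count == 0)
        return r; /* no multiview: nothing to decode */

    uint32_t compacted = expanded_instance_id % view_count;
    r.instance = expanded_instance_id / view_count;
    r.view_index = map[compacted];
    return r;
}
```

With a view mask of 10 (binary 1010) there are two views, so expanded instance 5 decodes to instance 2 and compacted view 1, which the map turns into actual view index 3.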
By contrast, on NVIDIA hardware, the instance looping is implemented by us using macros. Because we control the macros, we can add a second level of looping for multiview without affecting instanced rendering. This means we don't have to modify the shaders or adjust vertex inputs, simplifying the whole implementation. We do have to implement ViewIndex, but that is fairly simple in this scheme.
We added multiview support in the CmdDraw*() functions. To do this, we placed a loop that iterates over the view mask inside the MME (Macro Method Expander) macros that are called by the CmdDraw*() functions. First, we added view_index to nvk_root_descriptor_table::draw. Then, inside the loop, for each view that is set in the view mask, draw.view_index in the root descriptor table is set to the index of that bit within the view mask, and SET_RT_LAYER is used to select the layer.
static void
nvk_mme_emit_view_index(struct mme_builder *b, struct mme_value view_index)
{
   /* Set the push constant */
   mme_mthd(b, NV9097_LOAD_CONSTANT_BUFFER_OFFSET);
   mme_emit(b, mme_imm(nvk_root_descriptor_offset(draw.view_index)));
   mme_mthd(b, NV9097_LOAD_CONSTANT_BUFFER(0));
   mme_emit(b, view_index);

   /* Set the layer to the view index */
   STATIC_ASSERT(DRF_LO(NV9097_SET_RT_LAYER_V) == 0);
   STATIC_ASSERT(NV9097_SET_RT_LAYER_CONTROL_V_SELECTS_LAYER == 0);
   mme_mthd(b, NV9097_SET_RT_LAYER);
   mme_emit(b, view_index);
}
Then, the draw was executed as usual.
Our initial attempt did not work for CTS tests where the view mask was 5, 10, or 15, although it worked for all other relevant tests. The .qpa logs generated by the tests revealed that only the first slice was rendering for a view mask of 15, but it should have had 4 slices since $15_{10} = 1111_2$. In the tests where the subpasses alternated between a view mask of 5 and 10, only the first 2 slices rendered, but these should also have had 4 slices since $5_{10} = 0101_2$ and $10_{10} = 1010_2$. In both cases, only the view for the lowest set bit of each view mask was rendering.
qpa log rendered as XML, showing CTS results for dEQP-VK.multiview.masks.15 and dEQP-VK.multiview.masks.5_10_5_10.
The problem was that the loop over the view mask was in nvk_mme_draw(), and nvk_mme_build_draw() was being called inside the loop for iterations where the bit was set. nvk_mme_build_draw() was loading the draw parameters every time it was called, but the parameters were only passed once, so there weren't any parameters available to load after the first iteration.
We moved the loop over the view mask inside nvk_mme_build_draw(), after the parameters were loaded. The inner loop over instance_count was moved to a new helper function, nvk_mme_build_draw_loop(). This allowed all the relevant CTS tests to pass on Turing architecture.
struct mme_value view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
mme_if(b, ieq, view_mask, mme_zero()) {
   mme_free_reg(b, view_mask);

   nvk_mme_build_draw_loop(b, instance_count,
                           first_vertex, vertex_count);
}

view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
mme_if(b, ine, view_mask, mme_zero()) {
   mme_free_reg(b, view_mask);

   struct mme_value view = mme_mov(b, mme_zero());
   mme_while(b, ine, view, mme_imm(32)) {
      view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
      struct mme_value has_view = mme_bfe(b, view_mask, view, 1);
      mme_free_reg(b, view_mask);
      mme_if(b, ine, has_view, mme_zero()) {
         mme_free_reg(b, has_view);
         nvk_mme_emit_view_index(b, view);
         nvk_mme_build_draw_loop(b, instance_count,
                                 first_vertex, vertex_count);
      }

      mme_add_to(b, view, view, mme_imm(1));
   }
   mme_free_reg(b, view);
}
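The difference between loading the draw parameters once per view and loading them once per draw can be illustrated with a toy model. Everything here is hypothetical: the parameter stream is a plain array that can be read only once, and an exhausted stream is assumed to return 0, which is a simplification and not a statement about what the hardware does.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the macro's parameter stream: each value can be
 * read exactly once. */
struct param_fifo {
    const uint32_t *data;
    size_t count;
    size_t next;
};

/* Returns the next parameter, or 0 once the stream is exhausted
 * (an assumption made only for this sketch). */
static uint32_t fifo_load(struct param_fifo *f)
{
    return f->next < f->count ? f->data[f->next++] : 0;
}

/* Shape of the first attempt: vertex_count is loaded again for every
 * view. Returns the total number of vertices drawn across all views. */
static uint32_t draw_load_per_view(struct param_fifo *f, uint32_t view_mask)
{
    uint32_t drawn = 0;
    for (uint32_t view = 0; view < 32; view++) {
        if (view_mask & (UINT32_C(1) << view))
            drawn += fifo_load(f); /* only the first load sees real data */
    }
    return drawn;
}

/* Shape of the fix: load vertex_count once, then loop over the views. */
static uint32_t draw_load_once(struct param_fifo *f, uint32_t view_mask)
{
    uint32_t vertex_count = fifo_load(f);
    uint32_t drawn = 0;
    for (uint32_t view = 0; view < 32; view++) {
        if (view_mask & (UINT32_C(1) << view))
            drawn += vertex_count;
    }
    return drawn;
}
```

With a single parameter of 3 vertices and a view mask of 15, the per-view variant draws 3 vertices in total (only the first view gets real data), while the load-once variant draws 12.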
Since the view index is in the root descriptor table, we use a NIR pass to lower the view index system value to a load_ubo of draw.view_index:
static bool
lower_load_view_index(nir_builder *b, nir_intrinsic_instr *load,
                      const struct lower_descriptors_ctx *ctx)
{
   const uint32_t root_table_offset =
      nvk_root_descriptor_offset(draw.view_index);

   b->cursor = nir_instr_remove(&load->instr);

   nir_ssa_def *val = nir_load_ubo(b, 1, 32,
                                   nir_imm_int(b, 0),
                                   nir_imm_int(b, root_table_offset),
                                   .align_mul = 4,
                                   .align_offset = 0,
                                   .range = root_table_offset + 4);

   assert(load->dest.is_ssa);
   nir_ssa_def_rewrite_uses(&load->dest.ssa, val);

   return true;
}
We set maxMultiviewViewCount in VkPhysicalDeviceVulkan11Properties to 32, which is the maximum possible value since the view mask is 32 bits.
When multiview is enabled, queries must use N consecutive query indices in the query pool, where N is the number of bits set in the view mask of the subpass in which the query is used. In NVK, only the first query is used, so we emitted zeros for the remaining queries.
if (cmd->state.gfx.render.view_mask != 0) {
   const uint32_t num_queries =
      util_bitcount(cmd->state.gfx.render.view_mask);
   if (num_queries > 1)
      emit_zero_queries(cmd, pool, query + 1, num_queries - 1);
}

/**
 * Goes through a series of consecutive query indices in the given pool,
 * setting all element values to 0 and emitting them as available.
 */
static void
emit_zero_queries(struct nvk_cmd_buffer *cmd, struct nvk_query_pool *pool,
                  uint32_t first_index, uint32_t num_queries)
{
   switch (pool->vk.query_type) {
   case VK_QUERY_TYPE_OCCLUSION:
   case VK_QUERY_TYPE_TIMESTAMP:
   case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
      for (uint32_t i = 0; i < num_queries; i++) {
         uint64_t addr = nvk_query_available_addr(pool, first_index + i);

         struct nv_push *p = nvk_cmd_buffer_push(cmd, 5);
         P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
         P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
         P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
         P_NV9097_SET_REPORT_SEMAPHORE_C(p, 1);
         P_NV9097_SET_REPORT_SEMAPHORE_D(p, {
            .operation = OPERATION_RELEASE,
            .release = RELEASE_AFTER_ALL_PRECEEDING_WRITES_COMPLETE,
            .pipeline_location = PIPELINE_LOCATION_ALL,
            .structure_size = STRUCTURE_SIZE_ONE_WORD,
         });
      }
      break;
   }
   default:
      unreachable("Unsupported query type");
   }
}
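The effect on the query pool can be pictured with a small CPU-side model. The `toy_query` struct and the helper names are invented for this sketch and do not reflect NVK's actual query pool layout; the model only shows which of the N consecutive slots end up holding the result and which are reported as zero.

```c
#include <stdint.h>

/* Toy query slot: a result value plus an availability flag. */
struct toy_query {
    uint64_t value;
    uint32_t available;
};

/* Counts the bits set in a 32-bit view mask. */
static uint32_t toy_bitcount(uint32_t mask)
{
    uint32_t n = 0;
    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Hypothetical model: the first slot receives the real result, and the
 * remaining N - 1 slots are written as zero and marked available. */
static void toy_end_query(struct toy_query *pool, uint32_t query,
                          uint32_t view_mask, uint64_t result)
{
    pool[query].value = result;
    pool[query].available = 1;

    if (view_mask != 0) {
        uint32_t num_queries = toy_bitcount(view_mask);
        for (uint32_t i = 1; i < num_queries; i++) {
            pool[query + i].value = 0;
            pool[query + i].available = 1;
        }
    }
}
```

With a view mask of 11 (binary 1011, three views) and a query starting at index 2, slot 2 holds the result, slots 3 and 4 are zero and available, and slot 5 is untouched.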
Render passes can use a set of image resources called attachments. Input attachments are attachments that a subpass within a render pass can read from. Input attachments read from layers of the framebuffer; when multiview is enabled, the current layer corresponds to the view index.
In nir_lower_input_attachments.c, we see that nir_input_attachment_options::use_layer_id_sysval and nir_input_attachment_options::use_view_id_for_layer must both be true for an input attachment to load the view index system value when load_layer_id() is called during input attachment lowering. When use_layer_id_sysval is true and use_view_id_for_layer is false, the layer id system value will be loaded when load_layer_id() is called. So use_layer_id_sysval and use_view_id_for_layer were added in the NIR pass for input attachments and set to true when multiview is enabled.
--- a/src/nouveau/vulkan/nvk_shader.c
+++ b/src/nouveau/vulkan/nvk_shader.c
@@ -343,7 +343,10 @@ nvk_lower_nir(struct nvk_device *device, nir_shader *nir,
       NIR_PASS(_, nir, nir_shader_instructions_pass, lower_fragcoord_instr,
                nir_metadata_block_index | nir_metadata_dominance, NULL);
       NIR_PASS(_, nir, nir_lower_input_attachments,
-               &(nir_input_attachment_options) { });
+               &(nir_input_attachment_options) {
+                  .use_layer_id_sysval = is_multiview,
+                  .use_view_id_for_layer = is_multiview,
+               });
    }

    nir_lower_compute_system_values_options csv_options = {
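The way the two options combine can be summarized as a small decision function. This is a paraphrase of the behaviour described above, not the code in nir_lower_input_attachments.c; the enum and function names are hypothetical, and the case where use_layer_id_sysval is false is deliberately left unspecified because the text does not cover it.

```c
#include <stdbool.h>

enum layer_source {
    LAYER_SOURCE_UNSPECIFIED,       /* not covered by the description above */
    LAYER_SOURCE_LAYER_ID_SYSVAL,   /* layer id system value */
    LAYER_SOURCE_VIEW_INDEX_SYSVAL, /* view index system value */
};

/* Hypothetical summary of which value load_layer_id() ends up loading
 * for a given combination of options. */
static enum layer_source
pick_layer_source(bool use_layer_id_sysval, bool use_view_id_for_layer)
{
    if (!use_layer_id_sysval)
        return LAYER_SOURCE_UNSPECIFIED;

    return use_view_id_for_layer ? LAYER_SOURCE_VIEW_INDEX_SYSVAL
                                 : LAYER_SOURCE_LAYER_ID_SYSVAL;
}
```

Setting both options to is_multiview, as the diff does, corresponds to `pick_layer_source(true, true)` whenever multiview is enabled.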
Additionally, maxPerStageDescriptorInputAttachments and maxDescriptorSetInputAttachments in nvk_GetPhysicalDeviceProperties2() were both set to UINT32_MAX.
Some additional modifications to the MME code were needed to support multiview on pre-Turing architecture, which has fewer registers available. To free more registers, some variables were moved to the shadow scratch so that they could be freed when not needed and restored from the shadow scratch when they were needed again. To make it easier to work with the shadow scratch, we introduced some new helper functions.
static inline struct mme_value
_nvk_mme_load_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch)
{
   return mme_state(b, 0x3400 + scratch * 4);
}
#define nvk_mme_load_scratch(b, S) \
   _nvk_mme_load_scratch(b, NVK_MME_SCRATCH_##S)

static inline void
_nvk_mme_store_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch,
                       struct mme_value data)
{
   mme_mthd(b, 0x3400 + scratch * 4);
   mme_emit(b, data);
}
#define nvk_mme_store_scratch(b, S, v) \
   _nvk_mme_store_scratch(b, NVK_MME_SCRATCH_##S, v)

static inline void
_nvk_mme_load_to_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch)
{
   struct mme_value val = mme_load(b);
   _nvk_mme_store_scratch(b, scratch, val);
   mme_free_reg(b, val);
}
#define nvk_mme_load_to_scratch(b, S) \
   _nvk_mme_load_to_scratch(b, NVK_MME_SCRATCH_##S)
These helpers were used to add begin, draw_count, pad_dw, draw_idx, and view_mask to the shadow scratch.
enum nvk_mme_scratch {
NVK_MME_SCRATCH_CS_INVOCATIONS_HI = 0,
NVK_MME_SCRATCH_CS_INVOCATIONS_LO,
+ NVK_MME_SCRATCH_DRAW_BEGIN,
+ NVK_MME_SCRATCH_DRAW_COUNT,
+ NVK_MME_SCRATCH_DRAW_PAD_DW,
+ NVK_MME_SCRATCH_DRAW_IDX,
+ NVK_MME_SCRATCH_VIEW_MASK,
/* Must be at the end */
NVK_MME_NUM_SCRATCH,
};
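As a quick illustration of how a scratch slot maps to a method address, the sketch below mirrors the enum order above with `TOY_`-prefixed names (hypothetical, to avoid implying these are the driver's definitions) and reuses the `0x3400 + scratch * 4` computation from the helpers.

```c
#include <stdint.h>

/* Mirrors the order of nvk_mme_scratch shown above; the TOY_ names are
 * illustrative only. */
enum toy_scratch {
    TOY_SCRATCH_CS_INVOCATIONS_HI = 0,
    TOY_SCRATCH_CS_INVOCATIONS_LO,
    TOY_SCRATCH_DRAW_BEGIN,
    TOY_SCRATCH_DRAW_COUNT,
    TOY_SCRATCH_DRAW_PAD_DW,
    TOY_SCRATCH_DRAW_IDX,
    TOY_SCRATCH_VIEW_MASK,
    TOY_NUM_SCRATCH,
};

/* Each scratch slot is one 32-bit word, so consecutive slots are four
 * bytes apart starting at the base used by the helpers above. */
static uint32_t toy_scratch_method(enum toy_scratch scratch)
{
    return 0x3400 + (uint32_t)scratch * 4;
}
```

Under this layout the view mask is slot 6, so it lives at method address 0x3418.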
To ensure that values restored from the shadow scratch land in the same register they occupied before being spilled, a new helper for re-allocating registers was added.
static inline void
mme_realloc_reg(struct mme_builder *b, struct mme_value value)
{
   mme_reg_alloc_realloc(&b->reg_alloc, value);
}

static inline void
mme_reg_alloc_realloc(struct mme_reg_alloc *a, struct mme_value val)
{
   assert(val.type == MME_VALUE_TYPE_REG);

   assert(val.reg < 32);
   assert(a->exists & (1u << val.reg));
   assert(!(a->alloc & (1u << val.reg)));

   a->alloc |= (1u << val.reg);
}
Then, nvk_mme_spill() and nvk_mme_fill() from nvk_cmd_draw.c were replaced with nvk_mme_spill() and nvk_mme_unspill() in nvk_mme.h. The new versions take an nvk_mme_scratch instead of an index to avoid overlaps. nvk_mme_unspill() uses mme_realloc_reg() to ensure that unspilled values use the same register as spilled values. nvk_mme_spill() now also frees the register.
static void
_nvk_mme_spill(struct mme_builder *b, enum nvk_mme_scratch scratch,
               struct mme_value val)
{
   if (val.type == MME_VALUE_TYPE_REG) {
      _nvk_mme_store_scratch(b, scratch, val);
      mme_free_reg(b, val);
   }
}
#define nvk_mme_spill(b, S, v) \
   _nvk_mme_spill(b, NVK_MME_SCRATCH_##S, v)

static void
_nvk_mme_unspill(struct mme_builder *b, enum nvk_mme_scratch scratch,
                 struct mme_value val)
{
   if (val.type == MME_VALUE_TYPE_REG) {
      mme_realloc_reg(b, val);
      _nvk_mme_load_scratch_to(b, val, scratch);
   }
}
#define nvk_mme_unspill(b, S, v) \
   _nvk_mme_unspill(b, NVK_MME_SCRATCH_##S, v)
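The spill and unspill pattern can be modelled on the CPU with two bitmasks, in the spirit of the exists and alloc fields used by mme_reg_alloc_realloc(). This is a toy under several assumptions: the struct, the function names, and the "lowest free register first" allocation policy are all invented for illustration, and where the real helper asserts, the sketch returns false.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mme {
    uint32_t exists;     /* bit i set: register i exists */
    uint32_t alloc;      /* bit i set: register i is currently in use */
    uint32_t regs[32];   /* register contents */
    uint32_t scratch[8]; /* stand-in for the shadow scratch */
};

/* Allocates the lowest free register (assumed policy), or -1 if none. */
static int toy_alloc_reg(struct toy_mme *m)
{
    for (int r = 0; r < 32; r++) {
        uint32_t bit = UINT32_C(1) << r;
        if ((m->exists & bit) && !(m->alloc & bit)) {
            m->alloc |= bit;
            return r;
        }
    }
    return -1;
}

static void toy_free_reg(struct toy_mme *m, int reg)
{
    m->alloc &= ~(UINT32_C(1) << reg);
}

/* Saves the register's value to a scratch slot and frees the register. */
static void toy_spill(struct toy_mme *m, unsigned slot, int reg)
{
    m->scratch[slot] = m->regs[reg];
    toy_free_reg(m, reg);
}

/* Re-allocates the same register and restores its value. Fails if the
 * register does not exist or is still in use by something else. */
static bool toy_unspill(struct toy_mme *m, unsigned slot, int reg)
{
    uint32_t bit = UINT32_C(1) << reg;
    if (!(m->exists & bit) || (m->alloc & bit))
        return false;
    m->alloc |= bit;
    m->regs[reg] = m->scratch[slot];
    return true;
}
```

The point of re-allocating the same register is that code built earlier may already refer to that register number, so the value has to come back to exactly that register; in this model that only succeeds once any temporary user of the register has released it.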
Even with all the edits mentioned so far, the CTS tests were still failing on pre-Turing architecture. Debugging this led to the discovery of a bug in the existing pre-Turing mme_while() code where the condition was checked on the wrong line, causing the XOR on the previous line (which should have been the condition) to be skipped. The result was that while loops would not work unless the condition was comparing against zero: the loop would terminate either immediately or never. Fixing this bug allowed the CTS tests to pass on pre-Turing architecture.
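One way to picture this class of bug is that a "not equal" comparison against a non-zero value can be expressed as an XOR followed by a test against zero. The sketch below is a loose CPU-side interpretation under that assumption, not the actual MME code generation; the function name and the `max_iters` safety cap are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counts how many times a loop of the form
 *    while (view != limit) view++;
 * runs when the comparison is implemented as "(view ^ limit) != 0".
 * If apply_xor is false, the XOR step is skipped and the raw value is
 * tested against zero instead. max_iters caps runaway loops. */
static uint32_t count_iterations(uint32_t start, uint32_t limit,
                                 bool apply_xor, uint32_t max_iters)
{
    uint32_t view = start;
    uint32_t iters = 0;
    for (;;) {
        uint32_t cond = apply_xor ? (view ^ limit) : view;
        if (cond == 0 || iters == max_iters)
            break;
        view++;
        iters++;
    }
    return iters;
}
```

In this model, a loop from 0 to 32 runs 32 times with the XOR and zero times without it, while a loop starting at 1 runs until the cap when the XOR is skipped. When the limit is zero, both variants agree, which matches the observation that only comparisons against zero worked.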
This work did not include support for geometry or tessellation shaders. However, support for geometry shaders was added to NVK soon after basic multiview support was merged.
We noticed some test coverage holes in the multiview CTS. There were no tests for multiview support for:
CmdDrawIndirectCount
CmdDrawIndexedIndirectCount
CmdDrawIndirectByteCountEXT
We filed an issue about the coverage holes. Then, Ricardo Garcia from Igalia added this test coverage to the CTS. Our NVK implementation passed the new tests on the first try.
I am grateful to my mentor Faith Ekstrand for her help, particularly with debugging and testing pre-Turing support.