Introducing Multiview for NVK

Rebecca Mckeever
May 03, 2023

NVK, an open-source Vulkan driver for NVIDIA hardware that is part of Mesa, now supports the Vulkan extension VK_KHR_multiview.

Multiview is a rendering technique originally designed for VR. To improve efficiency when rendering multiple views, a single set of commands is recorded and then executed slightly differently in each view. Each view has a different ViewIndex, which corresponds to one of the bits set in the view mask. The view mask is specified in VkRenderPassMultiviewCreateInfo during render pass creation.

Nvidia hardware has some features that make multiview easier to implement than on hardware from other vendors, such as Intel. Intel's implementation of multiview uses instanced rendering: within each draw call, the number of instances is multiplied by the number of views in the subpass. In the shader, gl_InstanceId is divided by the number of views, and a compacted view index is calculated as gl_InstanceId % view_count. If there are gaps in the view mask, further operations are needed to convert the compacted view index to the actual view index, which involves defining a map from one to the other. If primitive replication can be used, instanced rendering is not necessary. In that case, the shader makes gl_Position an array and fills it with values for each view that is set.
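To make the index arithmetic in the instanced-rendering approach concrete, here is a minimal sketch in plain C. It is not taken from any driver: the names view_count(), compact_to_view_index() and split_instance_id() are hypothetical, and the code only assumes what the paragraph above describes (division and modulo by the number of views, followed by a compacted-to-actual mapping when the view mask has gaps).

```c
#include <assert.h>
#include <stdint.h>

/* Number of views enabled in a view mask (population count). */
static unsigned view_count(uint32_t view_mask)
{
   unsigned count = 0;
   for (; view_mask != 0; view_mask &= view_mask - 1)
      count++;
   return count;
}

/* Map a compacted view index (0 .. view_count - 1) to the actual view
 * index, i.e. the bit position of the n-th set bit in the mask. */
static unsigned compact_to_view_index(uint32_t view_mask, unsigned compact)
{
   assert(compact < view_count(view_mask));
   for (unsigned bit = 0; bit < 32; bit++) {
      if (!(view_mask & (UINT32_C(1) << bit)))
         continue;
      if (compact == 0)
         return bit;
      compact--;
   }
   return 0; /* not reached when the assertion above holds */
}

/* Hypothetical result type: the original instance and the actual view. */
struct split_instance {
   unsigned instance;
   unsigned view;
};

/* Split an instance ID that was expanded by the number of views. */
static struct split_instance
split_instance_id(uint32_t view_mask, unsigned expanded_instance_id)
{
   unsigned views = view_count(view_mask);
   assert(views > 0);

   struct split_instance s = {
      .instance = expanded_instance_id / views,
      .view = compact_to_view_index(view_mask, expanded_instance_id % views),
   };
   return s;
}
```

With a view mask of 10 (binary 1010) there are two views, so an expanded instance ID of 5 corresponds to original instance 2 and compacted view 1, which maps to actual view index 3.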

By contrast, on Nvidia hardware, we implement the instance looping ourselves using macros. Because we control the macros, we can add a second level of looping for multiview without affecting instanced rendering. This means we don't have to modify the shaders or adjust vertex inputs, simplifying the whole implementation. We do have to implement ViewIndex, but that is fairly simple in this scheme.

Draw functions

We added multiview support in the CmdDraw*() functions. To do this, we placed a loop that iterates over the view mask inside the MME (Macro Method Expander) macros that are called by the CmdDraw*() functions. First, we added view_index to nvk_root_descriptor_table::draw. Then, inside the loop, for each view that was set in the view mask, draw.view_index in the root descriptor table was set to the index of that bit within the view mask, and SET_RT_LAYER was used to select the layer.

static void
nvk_mme_emit_view_index(struct mme_builder *b, struct mme_value view_index)
{
   /* Set the push constant */
   mme_mthd(b, NV9097_LOAD_CONSTANT_BUFFER_OFFSET);
   mme_emit(b, mme_imm(nvk_root_descriptor_offset(draw.view_index)));
   mme_mthd(b, NV9097_LOAD_CONSTANT_BUFFER(0));
   mme_emit(b, view_index);

   /* Set the layer to the view index */
   STATIC_ASSERT(DRF_LO(NV9097_SET_RT_LAYER_V) == 0);
   STATIC_ASSERT(NV9097_SET_RT_LAYER_CONTROL_V_SELECTS_LAYER == 0);
   mme_mthd(b, NV9097_SET_RT_LAYER);
   mme_emit(b, view_index);
}

Then, the draw was executed as usual.

Our initial attempt at this was not working for CTS tests where the view mask was 5, 10, or 15, but it was working for all other relevant tests. Looking at the .qpa logs generated by the tests revealed that only the first slice was rendering for a view mask of 15, but it should have 4 slices since 15 is $1111_2$. In the tests where the subpasses alternated between a view mask of 5 and 10, only the first 2 slices rendered, but these should also have 4 slices since $5_{10} = 0101_2$ and $10_{10} = 1010_2$. So only the first slice of each view mask was rendering.

qpa log rendered as XML, showing CTS results for dEQP-VK.multiview.masks.15 and dEQP-VK.multiview.masks.5_10_5_10.
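As a quick check of the arithmetic above, the following sketch lists which layers a given view mask should render to. It is an illustration only; view_mask_to_layers() is a hypothetical helper, not part of NVK or the CTS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the view indices enabled in view_mask into out[] (which must
 * have room for 32 entries) and return how many were written. */
static size_t view_mask_to_layers(uint32_t view_mask, unsigned out[32])
{
   assert(out != NULL);

   size_t n = 0;
   for (unsigned view = 0; view < 32; view++) {
      if (view_mask & (UINT32_C(1) << view))
         out[n++] = view;
   }
   return n;
}
```

A mask of 15 yields layers 0, 1, 2 and 3. A mask of 5 yields layers 0 and 2, and a mask of 10 yields layers 1 and 3, so subpasses alternating between 5 and 10 together cover all four slices.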

The problem was that the loop over the view mask was in nvk_mme_draw(), and nvk_mme_build_draw() was being called inside the loop for iterations where the bit was set. nvk_mme_build_draw() was loading the draw parameters every time it was called, but the parameters were only passed once, so there weren't any parameters available to load after the first iteration.

We moved the loop over the view mask inside nvk_mme_build_draw(), after the parameters were loaded. The inner loop over instance_count was moved to a new helper function, nvk_mme_build_draw_loop(). This allowed all the relevant CTS tests to pass on Turing architecture.

   struct mme_value view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
   mme_if(b, ieq, view_mask, mme_zero()) {
      mme_free_reg(b, view_mask);

      nvk_mme_build_draw_loop(b, instance_count,
                              first_vertex, vertex_count);
   }

   view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
   mme_if(b, ine, view_mask, mme_zero()) {
      mme_free_reg(b, view_mask);

      struct mme_value view = mme_mov(b, mme_zero());
      mme_while(b, ine, view, mme_imm(32)) {
         view_mask = nvk_mme_load_scratch(b, VIEW_MASK);
         struct mme_value has_view = mme_bfe(b, view_mask, view, 1);
         mme_free_reg(b, view_mask);
         mme_if(b, ine, has_view, mme_zero()) {
            mme_free_reg(b, has_view);
            nvk_mme_emit_view_index(b, view);
            nvk_mme_build_draw_loop(b, instance_count,
                                    first_vertex, vertex_count);
         }

         mme_add_to(b, view, view, mme_imm(1));
      }
      mme_free_reg(b, view);
   }
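The ordering problem described above can be modeled on the host in a few lines of C. This is a simplified sketch, not MME code: struct param_fifo and the two draw functions are hypothetical, and the model assumes that an exhausted parameter stream simply results in no draw for that view.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the macro parameter stream: each value can
 * be read exactly once. */
struct param_fifo {
   const uint32_t *data;
   unsigned count;
   unsigned next;
};

/* Returns 1 and stores the next parameter, or 0 when none are left. */
static int fifo_load(struct param_fifo *f, uint32_t *out)
{
   if (f->next >= f->count)
      return 0;
   *out = f->data[f->next++];
   return 1;
}

/* Initial ordering: the parameters are loaded again for every view.
 * Returns the number of views that were actually drawn. */
static unsigned draw_views_load_inside(struct param_fifo *f, uint32_t view_mask)
{
   unsigned drawn = 0;
   for (unsigned view = 0; view < 32; view++) {
      if (!(view_mask & (UINT32_C(1) << view)))
         continue;

      uint32_t vertex_count;
      if (!fifo_load(f, &vertex_count))
         continue; /* nothing left to load, so this view is not drawn */
      (void)vertex_count;
      drawn++;
   }
   return drawn;
}

/* Fixed ordering: load the parameters once, then loop over the views. */
static unsigned draw_views_load_first(struct param_fifo *f, uint32_t view_mask)
{
   uint32_t vertex_count;
   if (!fifo_load(f, &vertex_count))
      return 0;
   (void)vertex_count;

   unsigned drawn = 0;
   for (unsigned view = 0; view < 32; view++) {
      if (view_mask & (UINT32_C(1) << view))
         drawn++;
   }
   return drawn;
}
```

With a single set of parameters and a view mask of 15, the first ordering draws one view and the second draws all four, which matches the symptom seen in the CTS logs.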
Lowering view index

Since the view index is stored in the root descriptor table, we use NIR to lower loads of the view index system value to a load_ubo from draw.view_index:

static bool
lower_load_view_index(nir_builder *b, nir_intrinsic_instr *load,
                      const struct lower_descriptors_ctx *ctx)
{
   const uint32_t root_table_offset =
      nvk_root_descriptor_offset(draw.view_index);

   b->cursor = nir_instr_remove(&load->instr);

   nir_ssa_def *val = nir_load_ubo(b, 1, 32,
                                   nir_imm_int(b, 0),
                                   nir_imm_int(b, root_table_offset),
                                   .align_mul = 4,
                                   .align_offset = 0,
                                   .range = root_table_offset + 4);

   assert(load->dest.is_ssa);
   nir_ssa_def_rewrite_uses(&load->dest.ssa, val);

   return true;
}
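The offset and range passed to nir_load_ubo() above come down to a byte offset into the root table plus the size of one 32-bit value. The sketch below shows that idea with offsetof(). The struct layout is made up for illustration and does not match the real nvk_root_descriptor_table; example_root_offset() and example_load_u32() are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified root table. The real layout differs. */
struct example_root_table {
   uint32_t other_state[4];
   struct {
      uint32_t base_vertex;
      uint32_t base_instance;
      uint32_t view_index;
   } draw;
};

/* Byte offset of a (possibly nested) member within the table. */
#define example_root_offset(member) \
   offsetof(struct example_root_table, member)

/* Read one 32-bit value at a 4-byte-aligned offset, checking that the
 * read stays inside the buffer (the "range" of the load). */
static uint32_t
example_load_u32(const void *buffer, size_t size, size_t offset)
{
   uint32_t value;
   assert(offset % 4 == 0);
   assert(offset + sizeof(value) <= size);
   memcpy(&value, (const char *)buffer + offset, sizeof(value));
   return value;
}
```

In this made-up layout, draw.view_index sits at byte offset 24, so the load covers bytes 24 through 27 and its range ends at offset + 4.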

We set maxMultiviewViewCount in VkPhysicalDeviceVulkan11Properties to 32, which is the maximum possible value since the view mask is 32 bits.

Queries

When multiview is enabled, queries must use N consecutive query indices in the query pool, where N is the number of bits set in the view mask in the subpass the query is used in. In NVK, only the first query is used, so we emitted zeros for the remaining queries.

   if (cmd->state.gfx.render.view_mask != 0) {
      const uint32_t num_queries =
         util_bitcount(cmd->state.gfx.render.view_mask);
      if (num_queries > 1)
         emit_zero_queries(cmd, pool, query + 1, num_queries - 1);
   }

/**
 * Goes through a series of consecutive query indices in the given pool,
 * setting all element values to 0 and emitting them as available.
 */
static void
emit_zero_queries(struct nvk_cmd_buffer *cmd, struct nvk_query_pool *pool,
                  uint32_t first_index, uint32_t num_queries)
{
   switch (pool->vk.query_type) {
   case VK_QUERY_TYPE_OCCLUSION:
   case VK_QUERY_TYPE_TIMESTAMP:
   case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
      for (uint32_t i = 0; i < num_queries; i++) {
         uint64_t addr = nvk_query_available_addr(pool, first_index + i);

         struct nv_push *p = nvk_cmd_buffer_push(cmd, 5);
         P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
         P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
         P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
         P_NV9097_SET_REPORT_SEMAPHORE_C(p, 1);
         P_NV9097_SET_REPORT_SEMAPHORE_D(p, {
            .operation = OPERATION_RELEASE,
            .release = RELEASE_AFTER_ALL_PRECEEDING_WRITES_COMPLETE,
            .pipeline_location = PIPELINE_LOCATION_ALL,
            .structure_size = STRUCTURE_SIZE_ONE_WORD,
         });
      }
      break;
   }
   default:
      unreachable("Unsupported query type");
   }
}
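To illustrate which query indices end up zeroed, here is a small standalone sketch of the index arithmetic. struct query_range and zero_query_range() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a run of consecutive query indices. */
struct query_range {
   uint32_t first;
   uint32_t count;
};

/* Given the subpass view mask and the index of the query that holds the
 * real result, return the range of extra queries to fill with zeros. */
static struct query_range zero_query_range(uint32_t view_mask, uint32_t query)
{
   struct query_range range = { 0, 0 };

   /* Count the views enabled in the mask. */
   uint32_t views = 0;
   for (uint32_t m = view_mask; m != 0; m &= m - 1)
      views++;

   if (views > 1) {
      range.first = query + 1;
      range.count = views - 1;
   }
   return range;
}
```

For a view mask of 15 and a query at index 0, the real result lands in query 0 and queries 1 through 3 are zeroed. With multiview disabled (a mask of 0) or a single view, nothing extra is emitted.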

Input attachments

Render passes can use a set of image resources called attachments. Input attachments are attachments that a subpass within a render pass can read from. Input attachments read from layers of the framebuffer; when multiview is enabled, the current layer corresponds to the view index.

In nir_lower_input_attachments.c, load_layer_id() is called during input attachment lowering. It loads the view index system value only when nir_input_attachment_options::use_layer_id_sysval and nir_input_attachment_options::use_view_id_for_layer are both true. When use_layer_id_sysval is true and use_view_id_for_layer is false, it loads the layer ID system value instead. So, in NVK's call to the NIR input attachment pass, use_layer_id_sysval and use_view_id_for_layer are both set to true when multiview is enabled.

--- a/src/nouveau/vulkan/nvk_shader.c
+++ b/src/nouveau/vulkan/nvk_shader.c
@@ -343,7 +343,10 @@ nvk_lower_nir(struct nvk_device *device, nir_shader *nir,
      NIR_PASS(_, nir, nir_shader_instructions_pass, lower_fragcoord_instr,
               nir_metadata_block_index | nir_metadata_dominance, NULL);
      NIR_PASS(_, nir, nir_lower_input_attachments,
-              &(nir_input_attachment_options) { });
+              &(nir_input_attachment_options) {
+                 .use_layer_id_sysval = is_multiview,
+                 .use_view_id_for_layer = is_multiview,
+              });
   }

   nir_lower_compute_system_values_options csv_options = {
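The option combinations described above can be summarized as a small decision function. This is a sketch of the behavior as described in this post, not a copy of the NIR code. The enum and function names are hypothetical, and the case where use_layer_id_sysval is false is deliberately left as "not covered" because the text does not describe it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for what load_layer_id() ends up reading. */
enum layer_source {
   LAYER_SOURCE_VIEW_INDEX_SYSVAL,
   LAYER_SOURCE_LAYER_ID_SYSVAL,
   LAYER_SOURCE_NOT_COVERED, /* combinations not described in the text */
};

static enum layer_source
pick_layer_source(bool use_layer_id_sysval, bool use_view_id_for_layer)
{
   if (!use_layer_id_sysval)
      return LAYER_SOURCE_NOT_COVERED;

   return use_view_id_for_layer ? LAYER_SOURCE_VIEW_INDEX_SYSVAL
                                : LAYER_SOURCE_LAYER_ID_SYSVAL;
}

/* Mirrors the diff above: both options follow is_multiview. */
static enum layer_source nvk_style_layer_source(bool is_multiview)
{
   return pick_layer_source(is_multiview, is_multiview);
}
```

Setting both options from the same is_multiview flag means a multiview render pass always reads the view index, which is what selects the matching framebuffer layer.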

Additionally, maxPerStageDescriptorInputAttachments and maxDescriptorSetInputAttachments in nvk_GetPhysicalDeviceProperties2() were both set to UINT32_MAX.

Supporting pre-Turing

Some additional modifications to the MME code were needed to support multiview on pre-Turing architecture, which has fewer registers available. To free more registers, some variables were moved to the shadow scratch so that they could be freed when not needed and restored from the shadow scratch when they were needed again. To make it easier to work with the shadow scratch, we introduced some new helper functions.

static inline struct mme_value
_nvk_mme_load_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch)
{
   return mme_state(b, 0x3400 + scratch * 4);
}
#define nvk_mme_load_scratch(b, S) \
   _nvk_mme_load_scratch(b, NVK_MME_SCRATCH_##S)

static inline void
_nvk_mme_store_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch,
                       struct mme_value data)
{
   mme_mthd(b, 0x3400 + scratch * 4);
   mme_emit(b, data);
}
#define nvk_mme_store_scratch(b, S, v) \
   _nvk_mme_store_scratch(b, NVK_MME_SCRATCH_##S, v)

static inline void
_nvk_mme_load_to_scratch(struct mme_builder *b, enum nvk_mme_scratch scratch)
{
   struct mme_value val = mme_load(b);
   _nvk_mme_store_scratch(b, scratch, val);
   mme_free_reg(b, val);
}
#define nvk_mme_load_to_scratch(b, S) \
   _nvk_mme_load_to_scratch(b, NVK_MME_SCRATCH_##S)

These helpers were used to add begin, draw_count, pad_dw, draw_idx, and view_mask to the shadow scratch.

enum nvk_mme_scratch {
   NVK_MME_SCRATCH_CS_INVOCATIONS_HI = 0,
   NVK_MME_SCRATCH_CS_INVOCATIONS_LO,
+  NVK_MME_SCRATCH_DRAW_BEGIN,
+  NVK_MME_SCRATCH_DRAW_COUNT,
+  NVK_MME_SCRATCH_DRAW_PAD_DW,
+  NVK_MME_SCRATCH_DRAW_IDX,
+  NVK_MME_SCRATCH_VIEW_MASK,

   /* Must be at the end */
   NVK_MME_NUM_SCRATCH,
};
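Each scratch slot maps to a method address of 0x3400 + scratch * 4, as seen in the helpers above. The sketch below works through that arithmetic for the enum as it stands after this change. The EXAMPLE_ names are local copies made for illustration and are not the driver's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the scratch slots listed above, for illustration only. */
enum example_scratch {
   EXAMPLE_SCRATCH_CS_INVOCATIONS_HI = 0,
   EXAMPLE_SCRATCH_CS_INVOCATIONS_LO,
   EXAMPLE_SCRATCH_DRAW_BEGIN,
   EXAMPLE_SCRATCH_DRAW_COUNT,
   EXAMPLE_SCRATCH_DRAW_PAD_DW,
   EXAMPLE_SCRATCH_DRAW_IDX,
   EXAMPLE_SCRATCH_VIEW_MASK,

   /* Must be at the end */
   EXAMPLE_NUM_SCRATCH,
};

/* Base address used by the load/store helpers in this post. */
#define EXAMPLE_SCRATCH_BASE 0x3400u

/* Each slot is one 32-bit word, so slots are 4 bytes apart. */
static uint32_t example_scratch_addr(enum example_scratch scratch)
{
   assert(scratch < EXAMPLE_NUM_SCRATCH);
   return EXAMPLE_SCRATCH_BASE + (uint32_t)scratch * 4;
}
```

With this ordering, CS_INVOCATIONS_HI lives at 0x3400 and VIEW_MASK, the seventh slot, at 0x3418. Using enum values rather than raw indices is what keeps two users from picking the same slot.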

To ensure that a value restored from the shadow scratch lands in the same register it occupied before it was spilled, a new helper for re-allocating registers was added.

static inline void
mme_realloc_reg(struct mme_builder *b, struct mme_value value)
{
   mme_reg_alloc_realloc(&b->reg_alloc, value);
}

static inline void
mme_reg_alloc_realloc(struct mme_reg_alloc *a, struct mme_value val)
{
   assert(val.type == MME_VALUE_TYPE_REG);

   assert(val.reg < 32);
   assert(a->exists & (1u << val.reg));
   assert(!(a->alloc & (1u << val.reg)));

   a->alloc |= (1u << val.reg);
}
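The effect of re-allocating a specific register can be shown with a toy bitmask allocator. This is a simplified model that assumes only the exists and alloc bitmasks visible above. The example_ functions are hypothetical, and the lowest-free-register policy is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy allocator: one bit per register in each mask. */
struct example_reg_alloc {
   uint32_t exists; /* registers the hardware provides */
   uint32_t alloc;  /* registers currently in use */
};

/* Allocate the lowest free register, or return -1 if none is free. */
static int example_alloc(struct example_reg_alloc *a)
{
   uint32_t free_regs = a->exists & ~a->alloc;
   for (int reg = 0; reg < 32; reg++) {
      if (free_regs & (UINT32_C(1) << reg)) {
         a->alloc |= UINT32_C(1) << reg;
         return reg;
      }
   }
   return -1;
}

/* Release a register that is currently allocated. */
static void example_free(struct example_reg_alloc *a, int reg)
{
   assert(reg >= 0 && reg < 32);
   assert(a->alloc & (UINT32_C(1) << reg));
   a->alloc &= ~(UINT32_C(1) << reg);
}

/* Re-claim one specific register, which must exist and be free. */
static void example_realloc(struct example_reg_alloc *a, int reg)
{
   assert(reg >= 0 && reg < 32);
   assert(a->exists & (UINT32_C(1) << reg));
   assert(!(a->alloc & (UINT32_C(1) << reg)));
   a->alloc |= UINT32_C(1) << reg;
}
```

A spill followed by an unspill then amounts to example_free() and example_realloc() on the same register number, so code generated before the spill and after the unspill refers to the same register.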

Then, nvk_mme_spill() and nvk_mme_fill() from nvk_cmd_draw.c were replaced with nvk_mme_spill() and nvk_mme_unspill() in nvk_mme.h. The new versions take an nvk_mme_scratch instead of an index to avoid overlaps. nvk_mme_unspill() uses mme_realloc_reg() to ensure that unspilled values use the same register as spilled values. nvk_mme_spill() now also frees the register.

static void
_nvk_mme_spill(struct mme_builder *b, enum nvk_mme_scratch scratch,
               struct mme_value val)
{
   if (val.type == MME_VALUE_TYPE_REG) {
      _nvk_mme_store_scratch(b, scratch, val);
      mme_free_reg(b, val);
   }
}
#define nvk_mme_spill(b, S, v) \
   _nvk_mme_spill(b, NVK_MME_SCRATCH_##S, v)

static void
_nvk_mme_unspill(struct mme_builder *b, enum nvk_mme_scratch scratch,
                 struct mme_value val)
{
   if (val.type == MME_VALUE_TYPE_REG) {
      mme_realloc_reg(b, val);
      _nvk_mme_load_scratch_to(b, val, scratch);
   }
}
#define nvk_mme_unspill(b, S, v) \
   _nvk_mme_unspill(b, NVK_MME_SCRATCH_##S, v)

Even with all the edits mentioned so far, the CTS tests were still failing on pre-Turing architecture. Debugging this led to the discovery of a bug in the existing pre-Turing mme_while() code where the condition was checked on the wrong line, causing the XOR on the previous line (which should have been the condition) to be skipped. As a result, while loops would not work unless the condition was a comparison against zero: the loop would terminate either immediately or never. Fixing this bug allowed the CTS tests to pass on pre-Turing architecture.
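One way to picture this bug is as a host-side model in which "a != b" is computed as an XOR followed by a test against zero. The sketch below is only an interpretation of the description above and does not reproduce the actual MME instruction sequence. All names are hypothetical, and the iteration cap exists purely so the "never terminates" case can be observed safely.

```c
#include <assert.h>
#include <stdint.h>

/* Intended condition: "a != b", expressed as XOR then test for nonzero. */
static int cond_with_xor(uint32_t a, uint32_t b)
{
   return (a ^ b) != 0;
}

/* Modeled bug: the XOR is skipped, so a is tested against zero and the
 * second operand is ignored. */
static int cond_without_xor(uint32_t a, uint32_t b)
{
   (void)b;
   return a != 0;
}

/* Run "while (i != limit) i++" from start using the given condition,
 * stopping after max_iters iterations. Returns the iterations executed. */
static unsigned
count_iterations(int (*cond)(uint32_t, uint32_t),
                 uint32_t start, uint32_t limit, unsigned max_iters)
{
   unsigned iters = 0;
   uint32_t i = start;
   while (iters < max_iters && cond(i, limit)) {
      i++;
      iters++;
   }
   return iters;
}
```

In this model, a loop counting from 0 up to 32 runs 32 times with the intended condition but exits immediately without the XOR, and a loop starting from 1 runs until it hits the cap. When the limit is zero the two conditions agree, which fits the observation that only comparisons against zero worked.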

Conclusion

This work did not include support for geometry or tessellation shaders. However, support for geometry shaders was added to NVK soon after basic multiview support was merged.

We noticed some test coverage holes in the multiview CTS. There were no tests for multiview support for: 

  • CmdDrawIndirectCount 
  • CmdDrawIndexedIndirectCount 
  • CmdDrawIndirectByteCountEXT

We filed an issue about the coverage holes. Then, Ricardo Garcia from Igalia added this test coverage to the CTS. Our NVK implementation passed the new tests on the first try.

I am grateful to my mentor Faith Ekstrand for her help, particularly with debugging and testing pre-Turing support.
