Re-converging control flow on NVIDIA GPUs - What went wrong, and how we fixed it

Faith Ekstrand
April 25, 2024

In a recent update to NVK, I landed support in Mesa for the VK_KHR_shader_maximal_reconvergence and VK_KHR_shader_subgroup_uniform_control_flow extensions. Properly implementing these required a good bit of compiler engineering. My first attempt at implementing control flow re-convergence from last year was fundamentally flawed and we had to throw it away entirely. This is the story of what went wrong and how we fixed it.

What is re-convergence?

Modern 3D rendering and compute APIs like Vulkan are built around shaders, little programs that control various aspects of the rendering process. For instance, vertex shaders take the raw geometry data from the client and output a set of per-vertex attributes. One of those per-vertex attributes is a projected 2D position on your screen. Meanwhile, fragment shaders (or pixel shaders in D3D) take those per-vertex attributes and compute a color value for each pixel. There are also several other optional shader stages which provide additional control over the geometry pipeline such as geometry, tessellation, and task/mesh shaders. Finally, compute shaders provide a more general-purpose shader that runs independently of the 3D pipeline and just reads and writes memory.

Over the years, a lot of engineering effort has been put into making these shaders run incredibly quickly. On a 4K screen, for instance, you have to run a minimum of half a billion fragment shader invocations per second in order to achieve a frame rate of 60 FPS. One of the techniques used by modern GPUs to achieve this speed is to execute shaders using a single instruction multiple thread (SIMT) model where as many as 128 invocations of the same shader are executed in lock-step with a single hardware instruction processing the data for all 128 invocations. These groups of invocations are called a subgroup in Vulkan or a wave in D3D or a warp on NVIDIA hardware. (I will be using the Vulkan terminology throughout this post.) The actual size of a subgroup is dependent on the hardware, with 128 being the maximum. On NVIDIA hardware, subgroups are always 32 invocations.

Historically, subgroups have been mostly invisible to applications. Shader programs are written in a language that describes how to transform the input data for a single vertex or pixel into the output data for that vertex or pixel. The fact that the hardware processes the data for many vertices or pixels simultaneously is mostly an implementation detail. I keep saying "mostly" because there is one notable exception: fragment shader derivatives. Within the fragment stage, there are derivative opcodes which take a value and give you a coarse screen-space derivative of that value as the output. This is only really possible because each 2x2 group of pixels is executed within the same subgroup and the hardware can do a bit of arithmetic across the subgroup to compute that derivative. Derivatives are also implicitly used by texture sampling operations to compute the level of detail (LOD) of the texture to be sampled.
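
For illustration, here is a minimal GLSL fragment shader sketch showing both the explicit and implicit forms; the sampler and varying names (u_tex, v_uv) are made up for this example:

glsl
#version 450

layout(set = 0, binding = 0) uniform sampler2D u_tex;  // hypothetical texture binding
layout(location = 0) in vec2 v_uv;                     // hypothetical interpolated UV
layout(location = 0) out vec4 f_color;

void main() {
    // Explicit coarse screen-space derivatives of the interpolated UV.
    // These are computed across the 2x2 pixel quad within the subgroup.
    vec2 dx = dFdx(v_uv);
    vec2 dy = dFdy(v_uv);

    // texture() implicitly uses the same derivatives to select a mip level (LOD).
    f_color = texture(u_tex, v_uv) + vec4(length(dx) + length(dy));
}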

With HLSL shader model 6.0, Microsoft added support for explicit wave operations to D3D shaders. Vulkan followed quickly with the addition of subgroup support to Vulkan 1.1 and SPIR-V 1.3. These operations allow applications to take advantage of subgroup-level parallelism within shaders for even greater performance. For instance, a compute shader doing a reduction operation can first reduce within a subgroup and then have only a single invocation of that subgroup participate in the global reduction. Since the subgroup reduction happens entirely in registers, this is significantly faster than if every invocation of the shader had to participate in the global reduction.
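
As a rough sketch of the reduction pattern described above (the buffer layout and names here are hypothetical, not taken from any particular shader):

glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 128) in;

layout(set = 0, binding = 0) buffer Data  { uint values[]; };  // hypothetical input
layout(set = 0, binding = 1) buffer Total { uint total; };     // hypothetical result

void main() {
    uint v = values[gl_GlobalInvocationID.x];

    // Reduce within the subgroup first; this happens entirely in registers.
    uint subgroup_sum = subgroupAdd(v);

    // Only one invocation per subgroup participates in the global reduction.
    if (subgroupElect())
        atomicAdd(total, subgroup_sum);
}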

So what is re-convergence and what does any of this have to do with control flow? With straight-line code, it's fairly easy to see how subgroup operations work. The shader core executes each instruction one at a time with each instruction processing an entire subgroup's worth of data. When you execute a subgroup operation, it works like any other arithmetic operation except it takes its values from the same register in each of the different invocations within a subgroup rather than from different registers in a single invocation. For disabled invocations, the subgroup operations are smart enough to know to ignore those values. (Disabled invocations may happen as a result of control flow or if the number of invocations is not a multiple of the subgroup size.)

The trouble arises the moment you hit an if statement. What happens if the if statement condition is true for some of the invocations in the subgroup and false for others? This is what we call "divergence", where part of the subgroup goes one way and part goes the other. In this case, the set of invocations for which the condition is false are disabled when the then side of the branch is taken. When the else side of the branch is taken, the invocations for which the condition was true are disabled. The order in which the then and else sides of the branch are executed is implementation-dependent. On recent NVIDIA hardware, the two sides of an if may even be executed simultaneously since the two sides act on disjoint sets of invocations.

So what happens to subgroup operations after we've taken a divergent branch? What invocations are involved in the subgroup operation? Consider the following snippet of theoretical GLSL code:

glsl
int x;
if (cond) {
    x = thing1();
} else {
    x = thing2();
}
int sum = subgroupAdd(x);

One might expect, as the shader author, that the value of x gets assigned the result of either thing1() or thing2() depending on the condition and then the subgroupAdd() would sum the value of x, regardless of whether thing1() or thing2() was executed. That is to say that one might expect control flow to re-converge before the subgroupAdd(), ensuring that all the invocations participate in the subgroupAdd(). Unfortunately, the truth may not always be so simple.

Many compilers don't model control flow in terms of the kinds of constructs you might have in a high-level programming language, such as if statements and for loops. Instead, they model control flow as a directed graph where each chunk of straight-line code (typically called a basic block) is a node in the graph and the paths between those blocks are edges. This representation is commonly called a control flow graph (CFG). The CFG for the above example would look something like this:

The code example above as a directed graph.


For a simple example like this, it's still fairly obvious in this form where and how the re-convergence should happen. You can see the control flow diverging after the if and coming back together before the subgroupAdd(). With loops, however, things get much more complicated and not at all obvious. Consider the following snippet of GLSL code:

glsl
int s;
loop {
    int x = thing();
    if (x > 200) {
        s = subgroupAdd(x);
        break;
    }
}
int m = subgroupMin(s);

Looking at this as GLSL code, it's evident what is supposed to happen. It calls thing() until it gets a value over 200. Whenever it gets a value over 200, it adds all of them across the active lanes of the subgroup and then breaks out of the loop. All of the invocations which got a value over 200 in this loop iteration will participate in the subgroupAdd(). After the loop completes, it takes the minimum of those sums across the subgroup. This time, the subgroupMin() includes all of the invocations in the subgroup. (Why would you ever want that? You probably wouldn't. It's a made up example.) Importantly, this would behave differently if we moved the subgroupAdd() out of the loop as follows:

glsl
int x;
loop {
    x = thing();
    if (x > 200)
        break;
}
int s = subgroupAdd(x);
int m = subgroupMin(s);

We get a third, different behavior if we move the subgroupMin() into the loop:

glsl
int m;
loop {
    int x = thing();
    if (x > 200) {
        int s = subgroupAdd(x);
        m = subgroupMin(s);
        break;
    }
}

In both of these latter cases, the subgroupMin() does nothing because all of the results of a subgroupAdd() are the same across the subgroup. However, they still differ in which invocations participate in the add. In the second version, the add happens outside the loop after every invocation has had a call to thing() return more than 200. In the third version, the subgroupAdd() happens inside the loop and only involves those invocations which had thing() return more than 200 in the same iteration of the loop.

Unfortunately, if you aren't careful, these three code snippets look almost identical when looking at the shader as a directed graph of basic blocks. The first version of our loop looks like this:

The loop code example as a directed graph with the subgroupAdd() and the subgroupMin() in separate basic blocks.


While the second and third versions look like this:

The loop code example as a directed graph with the subgroupAdd() and the subgroupMin() in the same basic block.


In fact, it is a fairly common code transformation in compilers to observe that every path of execution which executes the subgroupAdd() then goes on to execute the subgroupMin() and collapse the first graph into the second. On a CPU or when compiling a language that does not have explicit SIMT concepts like subgroups, this is an entirely valid code transformation. However, with modern shading languages it isn't.

It's also important to note that the second and third versions of our loop example have exactly the same CFG even though they have different behavior. The placement of re-convergence is crucial.

The way this is solved in the 3D graphics dialect of SPIR-V that is used by Vulkan is by representing the original high-level structure of the program in SPIR-V directly. SPIR-V has concepts of if statements (which it calls selection constructs) and loops which provide this information to the driver's compiler. With the VK_KHR_shader_maximal_reconvergence and VK_KHR_shader_subgroup_uniform_control_flow extensions, Vulkan shader compilers are now required to respect those constructs and ensure that control flow re-converges in the natural way.

In Mesa's NIR optimizing compiler core, we preserve the structure directly in our representation of the CFG. Instead of simply representing it as a flat graph of basic blocks and edges, our CFG data structure has node types for basic blocks, if statements, loops, and functions, and each of those contains lists of other control flow nodes. Each basic block also has lists of successor and predecessor blocks which provide the edges in the more classic sense of a CFG. In this way, NIR is able to represent both the classic CFG and the structure from the original source program and both structures are preserved as we transform the shader.

This is not only useful for tracking control flow re-convergence but it's also useful for hardware back-ends. On some hardware such as Intel, Imagination, Apple, and older NVIDIA GPUs, these high level concepts of ifs and loops are actually encoded in hardware instructions in some form. On Arm and Qualcomm GPUs, control flow re-convergence is based on the ordering of blocks within the shader program with lower instruction addresses taking priority and re-convergence happens when two groups of invocations reach the same, higher instruction address. In this case, NIR's natural block ordering from the structured CFG gives exactly the instruction order they need in order to re-converge properly. On AMD GPUs, the hardware isn't actually capable of diverging and it's the responsibility of the compiler to juggle execution masks and predicate instructions as needed. Again, NIR's structure is useful as it provides a very natural place to compute these masks and make it easy to reason about uniform jumps.

Control flow re-convergence on NVIDIA

On NVIDIA GPUs starting with their Volta architecture, this all works quite differently. Instead of structured control flow instructions or instruction address based merging, NVIDIA Volta+ hardware is entirely unstructured. It has a simple branch instruction which can be predicated to provide a conditional branch. To handle subgroup re-convergence, there is a barrier register file together with a set of instructions for setting, breaking out of, and waiting on these barriers. The hardware has no fundamental notion of control flow structure and instead it's up to the compiler to construct what structure it wishes from these barrier primitives.

The reason for this shift is an important advancement made in the Volta architecture which allows for independent forward progress of individual shader invocations. Instead of requiring things to follow a particular structure or prioritizing invocations with low instruction addresses like most other GPUs, groups of invocations on Volta+ run independently with some sort of fair scheduler. This allows for CPU-like threading primitives such as mutexes. With a fixed scheduling strategy, you run the risk of the hardware prioritizing the invocations that are spinning while waiting on the mutex and never executing the code which would eventually unlock the mutex. With a fair scheduler, every set of unblocked invocations continues to execute until they exit the shader, regardless of what other invocations may be active.

My first attempt at implementing control flow re-convergence on Volta+ was fairly naive. Before each NIR control flow structure, we emitted a bmov.clear instruction followed by a bssy to populate the barrier register. After each control flow structure, we emitted a bsync instruction to wait on the barrier, ensuring re-convergence. Before each loop break or continue, we emitted a sequence of bbreak instructions to break out of any barriers up to and including the one for the nearest loop.

The barrier placement pass wasn't totally naive, though. It made a decent attempt at avoiding unnecessary barriers by only inserting subgroup barriers around control flow that would actually increase divergence. It also included an optimization to avoid inserting barriers if the control flow would re-converge anyway. For instance, if there is an if statement at the end of a loop body, the merge for the loop continue re-converges control flow anyway, so we can elide the barriers for the if statement.

This scheme appeared to work for a while, and it probably does work as long as all of the divergence is limited to if statements. However, it had a fatal flaw in the presence of non-uniform loops. We honestly might never have noticed the flaw if it weren't for a few Vulkan CTS tests for the new VK_KHR_shader_subgroup_uniform_control_flow extension. When we first tried to turn it on, I saw four test failures:

  • dEQP-VK.reconvergence.subgroup_uniform_control_flow_ballot.compute.nesting4.1.2
  • dEQP-VK.reconvergence.subgroup_uniform_control_flow_elect.compute.nesting4.1.2
  • dEQP-VK.reconvergence.workgroup_uniform_control_flow_ballot.compute.nesting4.1.2
  • dEQP-VK.reconvergence.workgroup_uniform_control_flow_elect.compute.nesting4.1.2

The person who made the initial merge request (MR) missed this because the tests did legitimately pass for him. I accepted the MR because I was pretty sure I'd implemented proper re-convergence and I believed him that the tests passed. I didn't see the failure until I was doing some regression testing on a different branch. Upon further investigation, it turned out that he was running a newer version of the Vulkan CTS where the tests were slightly different. Normally, I would just write this off as a CTS bugfix but by this point I had already started investigating and discovered the deficiency in the NVK compiler. After fixing the NVK compiler, we now pass all the different versions of those tests.

To understand this flaw, we need to first look at how I plumbed these barriers through NIR. In NIR, I represented the bssy, bsync, and bbreak instructions as a set of NIR intrinsics: bar_set_nv, bar_sync_nv, and bar_break_nv. These NIR intrinsics fairly directly map to the hardware instructions except that they copy the barrier value through a GPR to preserve proper NIR semantics. The compiler can usually eliminate this copy so it doesn't hurt anything. Because bar_break_nv modifies a barrier, it both takes a barrier as a source and returns a new barrier with the break applied. Even though bbreak only takes a single read/write barrier source, I had to model it this way in NIR because SSA form does not allow for modifying values. In the register allocator, we ensure that both the source and destination of the bbreak get assigned the same barrier register and insert bmov instructions as needed.

When we inserted barrier instructions for loops in the original control flow re-convergence pass, it looked something like this:

%bar_reg = decl_reg 32 1
%b0 = bar_set_nv
store_reg %b0 %bar_reg
loop {
    if (cond1) {
        %b1 = load_reg %bar_reg
        %b2 = bar_break_nv %b1
        store_reg %b2 %bar_reg
        break;
    }

    if (cond2) {
        continue;
    }
}
%b3 = load_reg %bar_reg
bar_sync_nv %b3

When we run load/store elimination to convert back into proper SSA form, this becomes:

%b0 = bar_set_nv
loop {
    if (cond1) {
        %b2 = bar_break_nv %b0
        break;
    }

    if (cond2) {
        continue;
    }
    ...
}
bar_sync_nv %b2

There are a few obvious problems with this. The first is that there is no bar_sync_nv inside the loop to ensure that early continues re-converge at the top of the loop. Hardware with built-in structured control flow constructs typically re-converges early continues automatically. On NVIDIA, however, we need a bar_sync_nv at the top of the loop to ensure that all continues arrive before we repeat the loop. The second problem is that, the way SSA works out, the value consumed by the bar_sync_nv intrinsic after the loop is always a break value which, by definition, doesn't contain the current thread. This meant that the bar_sync_nv instruction after the loop also did nothing. The naive solution to these problems would be to re-structure the barrier intrinsics like this:

%b0 = bar_set_nv
loop {
    %b1 = phi %b0 %b2
    bar_sync_nv %b1
    if (cond1) {
        %b2 = bar_break_nv %b1
        break;
    }

    if (cond2) {
        continue;
    }
    ...
}
bar_sync_nv %b0

At first glance, this looks like it does what we want. The bar_sync_nv after the loop applies to all the threads that entered the loop, ensuring re-convergence because it uses the value from the original bar_set_nv. Also, we have a bar_sync_nv at the top of the loop to ensure that all the various loop continues re-converge and it takes its barrier value from the phi so it takes into account the bar_break_nv. However, if you look a little closer, you'll observe that the phi at the top of the loop is bogus. The %b2 = bar_break_nv %b1 value has no way to make its way back to the top of the loop because it's only defined on the break path.

This all speaks to a much more fundamental problem with control flow re-convergence on NVIDIA Volta+ hardware: barrier instructions have no implicit intra-subgroup communication. All communication has to follow the normal shader data flow. On hardware with structured control flow instructions such as Intel, the loop break instruction does not always jump to the end of the loop. Instead, it removes the active invocations from some sort of internal execution mask to ensure that they get disabled on the next iteration of the loop. Once the final invocation is disabled (the execution mask goes to zero), the hardware jumps to the end of the loop. In order to handle nested loops, there is typically some sort of internal stack of such execution masks. In order for this to work, each iteration of the loop must know which invocations have broken out in previous iterations. Otherwise, it wouldn't know when the final invocation breaks so it can pop the stack and jump to the end of the loop. This effectively means that the invocations that break out of the loop communicate their break status to the invocations that continue.

The barrier instructions on Volta have no such internal stack or implicit intra-subgroup communication. The values stored in the barrier registers are invocation masks and there is an implicit subgroup ballot that happens as part of bssy and bbreak. However, the barrier register value is per-invocation and the modification of the barrier register happens only in the active invocations. A bbreak instruction, for instance, will disable the barrier for active invocations based on the provided predicate but the barrier values for inactive invocations are unmodified. This means that, in order to break out of a barrier, all invocations which are included in the barrier must be present at the execution of the bbreak instruction.

But doesn't this defeat the purpose of the bbreak instruction? How do we break if the bbreak has to happen in all the invocations? The solution is that we have to re-converge before we break. In the new barrier placement pass that landed last week, I defined a concept called scopes, which are single-entry, single-exit regions in the control-flow graph. Each scope begins with a bar_set_nv and concludes with a bar_sync_nv. Because we need to support breaking out of more than a single scope, each scope end optionally cascades to the end of the next scope, re-converging at each step. After the bar_sync_nv, it checks a break value from the incoming scope break edges and jumps to the end of the next scope, if needed. In our loop example above, this looks like the following:

%b0 = bar_set_nv
scope 1 {
    loop_header:
    %b1 = bar_set_nv
    scope 2 {
        if (cond1) {
            // break;
            %c0 = 1
            goto scope_2_merge;
        }

        if (cond2) {
            // continue;
            %c1 = 2
            goto scope_2_merge;
        }
        ...
        %c2 = 2
        goto scope_2_merge;
    }
    scope_2_merge:
    %c = phi %c0 %c1 %c2
    bar_sync_nv %b1

    if (%c < 2) {
        goto scope_1_merge;
    }

    goto loop_header;
}
scope_1_merge:
bar_sync_nv %b0

In this new formulation, all control flow paths that flow through scope 2 eventually end up at scope_2_merge: and the bar_sync_nv intrinsic that ensures everything inside scope 2 re-converges. The cascade value %c specifies how far we intend to break. For the loop continue, we set %c to 2 to indicate that we don't intend to break any further than scope 2. For the loop break, we set %c to 1 and the check after the bar_sync_nv causes us to jump again to scope_1_merge:, effecting a loop break. By handling breaks via this sort of cascade, we ensure that control flow re-converges at each level before breaking to the next level. By converging each scope one at a time, we avoid the communication problem because every invocation which is involved in the bar_set_nv also ends up at the corresponding bar_sync_nv, regardless of which path it takes through the scope. We never need to use bar_break_nv because we just create a new barrier value at the top of the loop.

The one exception to this is for nir_jump_halt, which is an instruction that entirely exits the shader program. (This is similar to exit() in a C program.) The nir_jump_halt instruction maps to exit on NVIDIA hardware. Threads which have exited never block barriers so we don't need to worry about the cascade for them. The invocation mask stored in the barrier register may still contain that invocation but the bsync instruction knows better than to wait on exited threads.

NIR improvements

If you're familiar with NIR and paying close attention to the pseudo-code, you may have noticed the goto statements. But didn't I say earlier that NIR is structured? Yes and no. NIR has limited support for both structured and unstructured forms. While most NIR passes currently require the structured form, the core IR can be unstructured. The unstructured form was originally added to support the OpenCL variant of SPIR-V for use by rusticl and Microsoft's CLOn12 layer. Because most NIR passes don't support unstructured NIR, we call nir_lower_goto_ifs() to structurize the control flow immediately after parsing the SPIR-V.

However, because NVIDIA hardware wants so badly to be unstructured and because converting to this scope-based form of re-convergence is such a substantial re-structuring of the CFG, it made sense to convert to unstructured NIR before going into the NVIDIA back-end. Also, because of the invasive control flow manipulations, I needed to temporarily lower phis to registers and run the SSA repair pass after re-structuring the CFG. Using unstructured NIR allows us to re-use NIR's SSA repair passes instead of having to repair SSA in the back-end compiler. The changes to the NVK back-end compiler were fairly minor as it already does unstructured control flow natively. All I had to do was to make it support nir_jump_goto and nir_jump_goto_if.

The rest of NIR, however, needed some enhancement. In particular, I had to teach NIR's dominance analysis, SSA repair, and register load/store elimination passes to work with unstructured control-flow. None of these fixes were particularly difficult. The most important thing that structured NIR provides is a natural dominance-preserving ordering of the basic blocks. In other words, when you walk the CFG from top to bottom, you are guaranteed to encounter each SSA value's definition before any of its uses. The natural ordering of basic blocks in NIR's structured control flow form has this property. With unstructured control flow, the standard dominance-preserving ordering used in most of the literature is a reverse post-order depth-first search (DFS) of the CFG.

Instead of trying to directly iterate with a reverse post-order DFS or sorting the blocks for every iteration, I opted to require that unstructured NIR always has its blocks ordered. I added a pass which sorts the blocks according to a reverse post-order DFS and added validation that ensures that unstructured blocks are always sorted. Any passes which modify the control flow of an unstructured NIR shader simply need to call the sorting pass before they finish to ensure this invariant. Because iteration is common but modifying control-flow is not, it's a reasonable trade-off to simply maintain this invariant. This will also make it easy and efficient to update most NIR passes in the future to support unstructured control flow. Passes which modify the CFG in some way will need more work but those are uncommon.

Conclusion

With everything reworked and finally correct, we could turn the VK_KHR_shader_maximal_reconvergence and VK_KHR_shader_subgroup_uniform_control_flow extensions back on in NVK. However, that's not the end of the story.

A day or two after landing the MR, I received a message on Mastodon telling me that it had fixed the rendering corruptions in Genshin Impact:

A screenshot of Genshin Impact showing rendering artifacts (left) versus a screenshot of Genshin Impact rendering correctly (right).


If you look at the first screenshot, you can see all these rectangular regions that aren't quite the right color. In the second screenshot, everything looks fine. Does Genshin Impact use subgroup ops? Not to my knowledge, no. Genshin Impact is a D3D11 title and real subgroup ops weren't added until D3D12.

So what's going on? Do you remember at the beginning of this whole discussion where I mentioned derivatives? Derivatives are a shader feature that goes back to OpenGL 2.0 and D3D9, but they're also subgroup ops and depend on re-convergence to get the correct values. Somewhere in some post-processing shader, they're probably using derivatives inside or after a loop, and we weren't re-converging properly, leading to wrong derivatives. So by fixing a brand-new Vulkan extension, I also fixed a bug with an ancient 3D graphics feature in an actual game.
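
To illustrate the kind of pattern that can go wrong, here's a hypothetical GLSL sketch (not the game's actual shader) where an implicit derivative depends on re-convergence after a divergent loop:

glsl
#version 450

layout(set = 0, binding = 0) uniform sampler2D u_src;  // hypothetical post-process input
layout(location = 0) in vec2 v_uv;
layout(location = 0) out vec4 f_color;

void main() {
    // A data-dependent loop: different pixels may iterate different numbers of times.
    vec2 uv = v_uv;
    for (int i = 0; i < 8; i++) {
        if (textureLod(u_src, uv, 0.0).a > 0.5)
            break;
        uv += vec2(0.01);
    }

    // texture() implicitly computes derivatives of uv across the 2x2 quad to pick
    // an LOD. If the quad hasn't re-converged after the divergent loop, those
    // derivatives (and thus the sampled mip level) can be wrong, which shows up
    // as blocky artifacts like the ones in the screenshot above.
    f_color = texture(u_src, uv);
}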
