A shifty tale about unit testing with Maxwell, NVK's backend compiler

Faith Ekstrand
August 15, 2024

A few weeks ago, I put a bit of time into the Maxwell compiler back-end for NVK. The original backend compiler for NVK targeted Volta and later GPUs but Daniel Almeida at Collabora and a few community members have been chipping away at Maxwell support for a while now. They made good progress and had most of the instructions wired up, but it needed someone with a bit more experience to debug some of the remaining issues.

This turned into a bit of a rabbit hole (because, of course it did) and resulted in me adding an awesome unit-testing framework to the compiler.

Better abstractions

The story actually starts with me trying to make some better abstractions for the different code generators.

While the Maxwell and Volta instruction sets look very similar at the text assembly level, they have entirely different binary encodings and a different set of per-opcode restrictions. For instance, most arithmetic instructions on Volta allow a 32-bit immediate or a constant buffer value in either the second or third source. Maxwell, on the other hand, usually only has 20-bit immediates (high 20 bits for float, low 20 bits for integers) and the rules for when you can use an immediate or constant buffer value are very ad-hoc.
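To make the "high 20 bits for float, low 20 bits for integers" rule concrete, here is a rough sketch of the kind of check a legalizer might perform. The helper names are hypothetical and this is not NAK's code. It also treats the integer case as unsigned, which glosses over whatever sign handling the real encoding has.

```rust
/// Hypothetical helper: can this f32 be represented using only the high
/// 20 bits of its IEEE-754 bit pattern? That holds when the low 12 bits
/// are all zero.
fn f32_fits_high20(value: f32) -> bool {
    value.to_bits() & 0x0000_0fff == 0
}

/// Hypothetical helper: does this integer fit in the low 20 bits?
/// Simplifying assumption: the value is treated as unsigned.
fn u32_fits_low20(value: u32) -> bool {
    value >> 20 == 0
}
```

Under these assumptions, 1.0 (bit pattern 0x3f800000) fits in a float immediate while 0.1 (0x3dcccccd) does not and would have to be copied to a register first.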

The way we deal with these restrictions in NAK (the code-name for the NVK compiler) is through a legalization pass. In most of the compiler, we assume that you can use any type of value in any source of an arithmetic instruction. Then the legalize pass comes through and applies the restrictions. The pass is generally pretty smart about how it applies them, too. For instance, if it sees an add instruction with an immediate in the first source but a register in the second, it will simply swap the two sources since it knows that addition is commutative. It can do a similar swap on comparison instructions, only we have to flip the comparison in that case so 5 < x becomes x > 5. This lets us keep most of the compiler generic while handling HW restrictions in one place that intimately understands those restrictions.
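The two swaps described above can be sketched on a toy IR. The Src and CmpOp types and the function names below are simplified stand-ins invented for illustration, not NAK's actual definitions.

```rust
/// Toy source operand (hypothetical, much simpler than NAK's Src).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Src {
    Reg(u32),
    Imm(u32),
}

/// Toy comparison operator (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CmpOp {
    Lt,
    Le,
    Gt,
    Ge,
    Eq,
    Ne,
}

impl CmpOp {
    /// The comparison that gives the same answer once the operands swap.
    fn flip(self) -> CmpOp {
        match self {
            CmpOp::Lt => CmpOp::Gt,
            CmpOp::Le => CmpOp::Ge,
            CmpOp::Gt => CmpOp::Lt,
            CmpOp::Ge => CmpOp::Le,
            CmpOp::Eq => CmpOp::Eq,
            CmpOp::Ne => CmpOp::Ne,
        }
    }
}

/// True when the first source is an immediate and the second a register.
fn imm_then_reg(srcs: &[Src; 2]) -> bool {
    matches!(srcs[0], Src::Imm(_)) && matches!(srcs[1], Src::Reg(_))
}

/// Addition is commutative, so swapping the sources is always safe.
fn legalize_add(srcs: &mut [Src; 2]) {
    if imm_then_reg(srcs) {
        srcs.swap(0, 1);
    }
}

/// Comparisons can swap too, as long as the operator is flipped.
fn legalize_cmp(op: &mut CmpOp, srcs: &mut [Src; 2]) {
    if imm_then_reg(srcs) {
        srcs.swap(0, 1);
        *op = op.flip();
    }
}
```

Calling legalize_cmp on Lt with sources [Imm(5), Reg(1)] leaves Gt with [Reg(1), Imm(5)], which is the 5 < x to x > 5 rewrite.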

Something I observed when people were working on Maxwell was that they didn't really understand what the legalization pass did or when they needed to legalize something. The legalization pass was also constantly getting out of sync with the encoder. This led to all sorts of bugs and assertions where someone would validly assert some condition in the encoder but wouldn't add legalization code to ensure that condition was satisfied.

This wasn't really their fault, though. The code was not well-structured and it wasn't really clear when and what to legalize. The legalization pass was also in a different file and structured totally differently from the encoder. It was a mess.

To clean this up, I rewrote both the Volta and Maxwell back-ends to work in terms of a per-op trait, SM50Op or SM70Op respectively. Here is the Volta version:

trait SM70Op {
    fn legalize(&mut self, b: &mut LegalizeBuilder);
    fn encode(&self, e: &mut SM70Encoder<'_>);
}

The first method in this trait is the per-op part of the legalization pass and the second encodes the instruction to the binary format the hardware interprets. By restructuring everything to use this trait, the legalization and encoding of an opcode are now right next to each other in the code:

impl SM70Op for OpFMul {
    fn legalize(&mut self, b: &mut LegalizeBuilder) {
        let gpr = op_gpr(self);
        let [src0, src1] = &mut self.srcs;
        swap_srcs_if_not_reg(src0, src1, gpr);
        b.copy_alu_src_if_not_reg(src0, gpr, SrcType::F32);
    }

    fn encode(&self, e: &mut SM70Encoder<'_>) {
        e.encode_alu(
            0x020,
            Some(&self.dst),
            Some(&self.srcs[0]),
            Some(&self.srcs[1]),
            Some(&Src::new_zero()),
        );
        e.set_bit(76, self.dnz);
        e.set_bit(77, self.saturate);
        e.set_rnd_mode(78..80, self.rnd_mode);
        e.set_bit(80, self.ftz);
        e.set_field(84..87, 0x4_u8); // TODO: PDIV
    }
}

It also forces the developer to consider legalization as they add instruction encodings. You have to implement both methods in the trait. Sure, you could just make the legalize method a no-op but then it'd be pretty obvious something was missing.

This refactor also improved the legalization pass overall because it forced me to separate hardware opcodes from the virtual opcodes used by register allocation. We have special cases for handling opcodes like copy, swap, and phis, and those are the same across the different hardware generations. However, because I do most of my development on Ampere, I hadn't added Maxwell support for those ops. With this new structure, those opcodes are legalized in a common path and the per-back-end path is only used for hardware opcodes, meaning that Maxwell got fixed implicitly.

Once this refactor was complete, it was fairly straightforward to audit the Maxwell back-end and find tons of bugs where legalization and encoding mismatched. I also cleaned up the code-base and made things more consistent while I was at it.

Oh, shift!

After my initial audit, I started looking at test failures and fixing bugs left and right. There were a bunch of little encoding bugs here and there, an issue with 64-bit type conversions, and a few other minor issues. And then there were shifts... In particular, the 64-bit left-shift instruction was causing the hardware to throw an illegal instruction encoding error. After a bit of poking and prodding, I discovered that a left-shift with the high modifier bit set wasn't allowed. Okay, fine, I'll just figure out some other way to do left-shifts. But how?

The problem here wasn't that I didn't know how to implement shifts. I can tell you about three different ways to implement a 64-bit left-shift in terms of 32-bit shifts. That part is fine. The real problem is that I didn't know what the shf instruction actually does. This is one of those areas where the few bits of documentation I have access to just aren't enough. The PTX docs have a description of the PTX shf instruction but that doesn't have half the modifier bits that the hardware instruction does. The overall description doesn't really fill in many details, either.
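For reference, one generic way to build a 64-bit left-shift out of 32-bit shifts looks like the sketch below. It uses plain CPU shifts rather than shf, the function name is made up for illustration, and it assumes the shift amount wraps at 64.

```rust
/// Shift the 64-bit value hi:lo left by `shift`, using only 32-bit shifts.
/// Assumption: the shift amount wraps at 64. Returns (lo, hi).
fn shl64_via_32(lo: u32, hi: u32, shift: u32) -> (u32, u32) {
    let s = shift & 63;
    if s == 0 {
        // Handled separately because `lo >> 32` is not a valid 32-bit shift.
        (lo, hi)
    } else if s < 32 {
        // The top `s` bits of lo spill into the bottom of hi.
        (lo << s, (hi << s) | (lo >> (32 - s)))
    } else {
        // Everything in lo moves into hi and the low word becomes zero.
        (0, lo << (s - 32))
    }
}
```

The three branches are what make this awkward on a GPU, which is why a single funnel-shift instruction is so attractive for the job.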

The shf instruction is a 64-bit barrel-shift. It takes two 32-bit sources and concatenates them together into a 64-bit value. It then shifts that value to the right or left (based on a modifier bit) by the shift amount and returns either the high or low 32 bits of the result. There are also modifier bits to specify the data type (signed or unsigned, 32 or 64-bit) and how to handle out-of-bound shift values (either wrap or clamp). All that I knew. However, there are a lot of details in there that really matter. What does it do with 32-bit values? Does it ignore the high 32 bits or does it still do a 64-bit shift? What is the behavior for clamp vs. wrap for 32 vs. 64 bits? We don't know! There are no docs.

Unfortunately, without knowing those details and without being able to trust the hardware (I could tell Maxwell had some quirks, though I didn't know what they were yet), I was flying blind.

Also, while it may sound like 64-bit left-shift is quite the edge case, left-shifts come up all the time in address calculations. Given that every UBO or SSBO access on Maxwell goes through the global memory path with 64-bit addresses, this was a load-bearing bug.

Unit tests to the rescue!

So how do you figure out detailed hardware behavior? You test it! I went through a similar exercise about a year ago with the iadd3 opcode on Volta. The iadd3 and iadd3.x opcodes are 32-bit add instructions that are carefully designed to be chained together to form 64-bit or larger adds. The carry bits, which propagate overflow from one 32-bit add to the next, are communicated through predicate registers and the details of those carry bits were something I needed to understand. In particular, I needed to understand how those carry bits interacted with the negate source modifier used to do subtractions.
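The chaining idea can be modeled on the CPU as in the sketch below. This is only a two-source model with a single carry bit and invented function names. It does not capture the real iadd3 semantics (three sources, predicate registers, negate modifiers), which are exactly what the hardware tests were needed to pin down.

```rust
/// Low-half add: returns the 32-bit sum and the carry out.
fn add_carry_out(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_add(b)
}

/// High-half add: consumes the carry produced by the low half.
fn add_carry_in(a: u32, b: u32, carry: bool) -> u32 {
    a.wrapping_add(b).wrapping_add(carry as u32)
}

/// A 64-bit add built from two chained 32-bit adds.
fn add64(a: u64, b: u64) -> u64 {
    let (lo, carry) = add_carry_out(a as u32, b as u32);
    let hi = add_carry_in((a >> 32) as u32, (b >> 32) as u32, carry);
    ((hi as u64) << 32) | (lo as u64)
}
```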

My solution at the time was to build a little unit test framework which used a back-door in the Vulkan driver to execute arbitrary shader binaries. I added a chain-in struct to VkPipelineShaderStageCreateInfo, which lets you specify a pre-compiled binary. I then wrote a little library which provided a simple interface and used Vulkan with this back-door to run a shader binary on a data set. Using this library as the back-end, I wrote a set of very targeted tests for iadd3, which allowed me to figure out the exact semantics of the carry bits. You can read more about my adventures with iadd3 in my social media thread on the topic.

I never liked my first attempt at opcode testing. Using Vulkan that way felt really clunky and there were a lot of issues trying to punch through to the driver like that. Since we implemented VK_EXT_shader_object and switched to the common Mesa implementation of VkPipeline, having a driver-specific punch-through has gotten much harder. Vulkan is also quite verbose and it takes about the same amount of code to run a compute shader by talking directly to the hardware as going through Vulkan. It also made for an awkward dependency where NVK depends on NAK, but NAK calls into Vulkan and then into NVK to test things.

This time around, I wrote a simple runner that talks directly to the hardware. The most complicated part of compute shader dispatch is populating the compute shader descriptors (QMD), and I moved that into NAK about three months ago. The only thing this new library had to do was open a device, allocate a buffer, copy the data in, and fire off a command buffer with about a half-dozen commands to set things up and fire off the shader. This hardware compute shader runner ended up being less than 500 lines of Rust code.

The tests themselves are regular Rust unit tests. When you configure Mesa with tests enabled, they get built alongside NAK and you can invoke them with $BUILD/src/nouveau/compiler/nak hw_tests. By default, Meson is configured to skip them so that they don't run on CI builders where we don't have NVIDIA GPUs. However, if a developer wants to test a new opcode or double-check that a piece of hardware works the way they expect, they're right there and easy to use.

The best part, though, is the way that the tests also serve as documentation. Once again, Rust traits are our friend here. Instead of just typing a bunch of test functions, I added a Foldable trait which can be implemented on an opcode. The Foldable trait has a single method, fold(), which is a software implementation of the opcode:

impl Foldable for OpIAbs {
    fn fold(&self, _sm: &dyn ShaderModel, f: &mut OpFoldData<'_>) {
        let src = f.get_u32_src(self, &self.src);
        let dst = (src as i32).abs() as u32;
        f.set_u32_dst(self, &self.dst, dst);
    }
}

The per-op tests are then implemented in terms of that trait. A tiny shader is generated which loads some data, runs the opcode on it, and writes out the result. The hardware is then flooded with random data and the results are verified to ensure that the CPU implementation in fold() matches the behavior of the hardware.
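The comparison loop can be sketched without any GPU by passing the "hardware" in as a closure. Everything here is a hypothetical stand-in: the real tests generate a shader and run it on the device, whereas this sketch only shows the shape of checking a CPU reference against another implementation on random inputs.

```rust
/// Small xorshift32 generator so the sketch needs no external crates.
struct XorShift32(u32);

impl XorShift32 {
    fn next(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

/// CPU reference for integer absolute value. Uses wrapping_abs so that
/// i32::MIN does not panic in debug builds.
fn iabs_reference(src: u32) -> u32 {
    (src as i32).wrapping_abs() as u32
}

/// Feed `count` random inputs to both implementations and return the
/// first input on which they disagree, if any.
fn first_mismatch(
    reference: impl Fn(u32) -> u32,
    device: impl Fn(u32) -> u32,
    count: usize,
) -> Option<u32> {
    let mut rng = XorShift32(0x1234_5678);
    (0..count)
        .map(|_| rng.next())
        .find(|&x| reference(x) != device(x))
}
```

Returning the offending input rather than a bare pass/fail is a deliberate choice: when the hardware disagrees with your model, the failing value is usually the fastest clue to what the opcode really does.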

This not only makes it easy to test new opcodes but it also documents the opcode. Right there next to the definition of OpIAbs is a Rust implementation of Foldable which says what OpIAbs does!

Back to shifts

With this little side-quest complete, I now had the tools I needed to take on shifts once more.

I started on Ampere (it's the same shader core as Volta) because that hardware is generally pretty sane, and because my 64-bit left shift worked there so I wasn't starting from zero. I learned that the shf opcode always does 64-bit shifts, the data type sign is only used to control whether or not to sign-extend right-shifts, and the data type size only affects the initial clamping or wrapping of the shift parameter:

impl Foldable for OpShf {
    fn fold(&self, sm: &dyn ShaderModel, f: &mut OpFoldData<'_>) {
        let low = f.get_u32_src(self, &self.low);
        let high = f.get_u32_src(self, &self.high);
        let shift = f.get_u32_src(self, &self.shift);

        // The data type size only affects how the shift amount is
        // wrapped or clamped.
        let bits: u32 = self.data_type.bits().try_into().unwrap();
        let shift = if self.wrap {
            shift & (bits - 1)
        } else {
            min(shift, bits)
        };

        // The shift itself is always done on the 64-bit concatenation
        // of the two sources.
        let x = ((high as u64) << 32) | (low as u64);
        let shifted = if self.data_type.is_signed() {
            if self.right {
                (x as i64).checked_shr(shift).unwrap_or(0) as u64
            } else {
                (x as i64).checked_shl(shift).unwrap_or(0) as u64
            }
        } else {
            if self.right {
                x.checked_shr(shift).unwrap_or(0) as u64
            } else {
                x.checked_shl(shift).unwrap_or(0) as u64
            }
        };

        // Return either the high or the low 32 bits of the result.
        let dst = if self.dst_high {
            (shifted >> 32) as u32
        } else {
            shifted as u32
        };

        f.set_u32_dst(self, &self.dst, dst);
    }
}

From there, I moved on to Maxwell and found several weird corner cases where Maxwell just doesn't implement certain things. I suspected this might be the case from the start but couldn't confirm it until I had a proper testing framework. For left shifts, I already knew that Maxwell throws an illegal instruction encoding error if the high modifier is set. What I learned is that Maxwell's shf.l always gives you the high bits so that modifier doesn't really exist. (Annoyingly, it does exist in the disassembler, which is why it took me so long to figure out.) For right shifts, the high modifier works exactly the same as on Volta, but Maxwell ignores the sign for 32-bit right shifts and always does a logical (unsigned) shift. In order to get an arithmetic (signed) shift, you have to use an i64 data type.

Armed with this new understanding, I was able to pretty quickly fix our 64-bit shift implementation. I was also able to reduce 64-bit shifts on Volta+ to two instructions (left shift is still three on Maxwell) because I now understand how the wrapping and clamping work.

And, of course, I wrote unit tests for 64-bit shifts as well.

Conclusion

When developing a compiler, your compiler will only ever be as good as your ISA documentation. When you don't have ISA documentation and you are reverse-engineering the hardware, it's only as good as your tools. While the Vulkan CTS was good enough for a lot of compiler development, sometimes you need something much more targeted. We now have that and I'm really excited about what this will enable for future efforts.

Eventually, I'd like to have all of the arithmetic opcodes in NAK unit tested this way along with all of the non-trivial builder helpers which generate different code on different platforms. However, for now I'm happy to have the framework in place so that the next time I need a more detailed understanding of an opcode, we have the tools to do it.

Another thing this potentially enables in the future is a back-end constant folding pass. Constant folding is common in compilers where you take instructions with compile-time known source values, evaluate the instruction on the CPU, and replace it with the compile-time result. I haven't added a constant folding pass to NAK because the cases where you can constant fold are pretty rare at that point in the compile process. NIR already has very powerful constant folding, and the NIR to NAK translation rarely adds things that can be folded. However, now that we have the trait, it wouldn't be hard to add the pass.
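As a rough idea of what such a pass could look like, here is a sketch on a toy IR. The Src and Instr types and the function names are invented for illustration and have nothing to do with NAK's real data structures. The only point is that a software implementation of an opcode is all a folding pass needs.

```rust
/// Toy source operand (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Src {
    Imm(u32),
    Reg(u32),
}

/// Toy instruction set (hypothetical): integer abs and a plain move.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Instr {
    IAbs { dst: u32, src: Src },
    Mov { dst: u32, src: Src },
}

/// Software implementation of the opcode, playing the role of fold().
fn fold_iabs(src: u32) -> u32 {
    (src as i32).wrapping_abs() as u32
}

/// Replace every IAbs whose source is known at compile time with a move
/// of the precomputed result. Everything else is left untouched.
fn constant_fold(instrs: &mut [Instr]) {
    for instr in instrs.iter_mut() {
        if let Instr::IAbs { dst, src: Src::Imm(value) } = *instr {
            *instr = Instr::Mov { dst, src: Src::Imm(fold_iabs(value)) };
        }
    }
}
```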
