Writing an open source GPU driver - without the hardware

Alyssa Rosenzweig
January 27, 2022

After six months of reverse-engineering, the new Arm “Valhall” GPUs (Mali-G57, Mali-G78) are getting free and open source Panfrost drivers. With a new compiler, driver patches, and some kernel hacking, these new GPUs are almost ready for upstream.

In 2021, there were no Valhall devices running mainline Linux. While a lack of devices poses an obvious obstacle to device driver development, there is no better time to write drivers than before hardware reaches end-users. Developing and distributing production-quality drivers takes time, and we don’t want users to be reliant on closed source blobs. If development doesn’t start until a device hits shelves, that device could reach “end-of-life” by the time there are mature open drivers. But with a head start, we can have drivers ready by the time devices reach end users.

Let’s see how.

Reverse-engineering without root

Over the summer, Collabora purchased an Android phone with a Mali-G78. The phone isn’t rooted, so we can’t replace its graphics drivers with our own. We can put the phone in developer mode, so we can run test applications with the proprietary graphics driver and inject our own code with LD_PRELOAD, allowing us to inspect the graphics memory prepared by the proprietary driver and “passively” reverse-engineer the hardware. This memory includes compiled shader binaries in the Valhall instruction set, as well as Valhall’s data structures controlling graphics state like textures, blending, and culling.

Reverse-engineering “actively” is possible, too. We can modify compiled shaders and GPU data structures, allowing us to experiment with individual bits. We can go further, constructing our own shaders and data structures and validating them against the hardware.

To motivate this technique, consider the reverse-engineering of Valhall’s “buffer descriptor”. This new data structure describes a buffer of memory, accessed by a new “load buffer” instruction (LD_BUFFER). After guessing the layout of the buffer descriptor and encoding of LD_BUFFER, we can build our own buffer descriptor and write a shader using LD_BUFFER to validate our guess and probe the low-level semantics.

When reverse-engineering Valhall’s new data structures, we have legacy to guide us. While Valhall reorganizes its data structures to reduce Vulkan driver overhead, the bit-level contents resemble older Mali GPUs. If we find the “contours” of new data structures, we can fill in the details by comparing with older hardware.

As we learn about the data structures, we document our findings in a formal XML hardware description. This file has the same format as the XML for older Mali architectures already supported by Panfrost. Since the Valhall data structures descend from these older architectures, we can fork an older Mali’s XML to save us some typing and keep naming consistent.

After enough reverse-engineering, we can slot our XML into Panfrost, automatically generating code to pack and unpack the data structures. Thanks to tireless work by Collaboran Boris Brezillon, Panfrost’s performance-critical code is specialized at compile-time to the target architecture, allowing us to add new architectures without adding overhead to existing hardware. So with our XML in hand, we can get started writing a Valhall driver.
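For readers who haven't seen generated pack and unpack code, the following sketch shows the general shape. Everything in it is hypothetical: `toy_descriptor` and its bit positions are made up for illustration and do not describe any real Mali data structure, and the code Panfrost actually generates from XML differs in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mask covering the low `size` bits, with size in 0..32. */
static inline uint32_t field_mask(unsigned size)
{
    return size == 32 ? UINT32_MAX : (UINT32_C(1) << size) - 1;
}

/* OR `value` into bits [start, start + size) of a zero-initialized array.
 * Fields may not straddle a 32-bit word in this simplified version. */
static inline void pack_field(uint32_t *words, unsigned start, unsigned size,
                              uint32_t value)
{
    unsigned shift = start % 32;

    assert(size >= 1 && shift + size <= 32);
    assert((value & ~field_mask(size)) == 0); /* value must fit the field */
    words[start / 32] |= value << shift;
}

static inline uint32_t unpack_field(const uint32_t *words, unsigned start,
                                    unsigned size)
{
    return (words[start / 32] >> (start % 32)) & field_mask(size);
}

/* Hypothetical descriptor: NOT a real hardware layout. */
struct toy_descriptor {
    uint8_t type;    /* bits 0..3 of word 0 */
    uint16_t stride; /* bits 8..23 of word 0 */
    uint32_t size;   /* all of word 1 */
};

#define TOY_DESCRIPTOR_WORDS 2

void toy_descriptor_pack(uint32_t words[TOY_DESCRIPTOR_WORDS],
                         const struct toy_descriptor *d)
{
    memset(words, 0, TOY_DESCRIPTOR_WORDS * sizeof(uint32_t));
    pack_field(words, 0, 4, d->type);
    pack_field(words, 8, 16, d->stride);
    pack_field(words, 32, 32, d->size);
}

void toy_descriptor_unpack(const uint32_t words[TOY_DESCRIPTOR_WORDS],
                           struct toy_descriptor *d)
{
    d->type = (uint8_t)unpack_field(words, 0, 4);
    d->stride = (uint16_t)unpack_field(words, 8, 16);
    d->size = unpack_field(words, 32, 32);
}
```

Because every field goes through the same two helpers, a guess about the layout lives in one place, and correcting it after another round of reverse-engineering only means changing a bit position.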

Writing drivers without hardware

It is November 2021. I’ve written a Valhall compiler. I’ve reverse-engineered enough to write a driver. I still have no Linux hardware to test my code.

That’s a major roadblock.

Good thing I know a detour.

We can develop the driver on any Linux machine, without testing against real hardware. To pull that off, unit testing is mandatory. With no hardware, we can’t run integration tests, but unit tests can run on any hardware. For the Valhall compiler, I wrote unit tests for everything from instruction packing to optimization. Although the coverage isn’t exhaustive, it caught numerous bugs early on.

There is a caveat: unit testing can’t tell us if our expectations of the hardware are correct. However, it can confirm that our code matches our expectations. If our reverse-engineering is thorough, these expectations should be correct.
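As an illustration of the style of test involved, here is a table-driven sketch for a toy optimization. The `toy_instr` representation and the `toy_fold_constants` pass are invented for this example and are far simpler than anything in the real compiler; the point is only that each case states an expectation, and the harness reports whether the code agrees with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical two-source IR, invented for this example. */
enum toy_op { TOY_MOV_IMM, TOY_IADD };

struct toy_instr {
    enum toy_op op;
    bool src_is_imm[2]; /* true: src[i] is an immediate; false: a register */
    uint32_t src[2];
};

/* Fold an add of two immediates into a move. Returns true on change. */
static bool toy_fold_constants(struct toy_instr *I)
{
    assert(I != NULL);
    if (I->op != TOY_IADD || !I->src_is_imm[0] || !I->src_is_imm[1])
        return false;

    I->op = TOY_MOV_IMM;
    I->src[0] += I->src[1]; /* unsigned, so the sum wraps modulo 2^32 */
    I->src[1] = 0;
    I->src_is_imm[1] = false;
    return true;
}

static bool toy_instr_equal(const struct toy_instr *a, const struct toy_instr *b)
{
    return a->op == b->op &&
           a->src_is_imm[0] == b->src_is_imm[0] &&
           a->src_is_imm[1] == b->src_is_imm[1] &&
           a->src[0] == b->src[0] && a->src[1] == b->src[1];
}

struct fold_case {
    const char *name;
    struct toy_instr input;
    struct toy_instr expected;
};

static const struct fold_case fold_cases[] = {
    { "two immediates fold",
      { TOY_IADD, { true, true }, { 2, 3 } },
      { TOY_MOV_IMM, { true, false }, { 5, 0 } } },
    { "sum wraps around",
      { TOY_IADD, { true, true }, { UINT32_MAX, 1 } },
      { TOY_MOV_IMM, { true, false }, { 0, 0 } } },
    { "register source is left alone",
      { TOY_IADD, { false, true }, { 1, 3 } },
      { TOY_IADD, { false, true }, { 1, 3 } } },
};

/* Run every case and return the number of failures. */
int run_fold_tests(void)
{
    int failures = 0;

    for (size_t i = 0; i < sizeof(fold_cases) / sizeof(fold_cases[0]); i++) {
        struct toy_instr actual = fold_cases[i].input;

        toy_fold_constants(&actual);
        if (!toy_instr_equal(&actual, &fold_cases[i].expected)) {
            fprintf(stderr, "FAIL: %s\n", fold_cases[i].name);
            failures++;
        }
    }
    return failures;
}
```

None of this touches a GPU, which is exactly why it runs anywhere. It also shows the caveat in miniature: if the expected column is wrong, the tests still pass.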

Even so, unit testing alone isn’t enough.

Enter drm-shim.

drm-shim

Mesa drivers like Panfrost can mock test hardware with drm-shim, a small library which stubs out the system calls used by userspace graphics drivers to communicate with the kernel. With drm-shim, unmodified userspace drivers think they’re running against real hardware – including Valhall hardware.

Graphics guru Emma Anholt designed drm-shim to run Mesa’s compilers as cross-compilers for use in continuous integration (CI). Outside of CI, drm-shim allows testing compilers on our development machines, which may be significantly faster than the embedded devices we target. But it’s not limited to compilers; we can run entire test suites under drm-shim, “cross-testing” for any hardware we please. The tests won’t pass, since drm-shim does no rendering; it is a shim, not an emulator. But it allows us to exercise new driver code paths without the constraints of real hardware.
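The core trick can be sketched in a few lines. This is not drm-shim's code: the request number, parameter number, and `fake_get_param` struct below are hypothetical stand-ins rather than the real kernel interface, and reading PAN_GPU_ID as a hexadecimal number is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request and parameter numbers, not the real kernel ABI. */
enum { FAKE_IOCTL_GET_PARAM = 1 };
enum { FAKE_PARAM_GPU_ID = 0 };

struct fake_get_param {
    uint32_t param;
    uint64_t value;
};

/*
 * Stand-in for the kernel side of an ioctl. It answers "which GPU is this?"
 * from the environment and pretends everything else succeeded without doing
 * any work, so a userspace driver keeps going but nothing is rendered.
 */
int fake_gpu_ioctl(unsigned long request, void *arg)
{
    switch (request) {
    case FAKE_IOCTL_GET_PARAM: {
        struct fake_get_param *gp = arg;
        const char *id;

        assert(gp != NULL);
        if (gp->param != FAKE_PARAM_GPU_ID) {
            errno = EINVAL;
            return -1;
        }
        /* Assumption: the ID is given in hex, e.g. PAN_GPU_ID=9091. */
        id = getenv("PAN_GPU_ID");
        gp->value = id ? strtoul(id, NULL, 16) : 0;
        return 0;
    }
    default:
        return 0; /* no-op: report success, do nothing */
    }
}
```

A userspace driver that queries the GPU ID through a handler like this cannot tell the difference from real hardware, until it looks at the rendered output.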

As drm-shim runs on any Linux machine, I wanted to use the fastest Linux machine I own: my Apple M1. Bizarrely, drm-shim didn’t work on my M1 Linux box, although it works on everyone else’s computers. That calls for a debugging session.

After some poking around, I stumbled on the offending code:

bo->addr = util_vma_heap_alloc(&heap, size, 4096);
mmap(NULL, ..., bo->addr);

This code allocates a chunk of memory aligned to 4096 bytes, the assumed page size, and uses its address as the offset in a call to mmap. On my system, the mmap call fails, so I consulted the man page for mmap:

offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).

The mmap in drm-shim works, because the page size on Linux is 4096 bytes (4K)…

Until it isn’t.

Apple’s input/output memory management unit uses larger pages of 16384 bytes (16K). As a consequence, when we run Linux on bare metal on Apple platforms, we configure Linux to use 16K pages everywhere to keep life simple. That means that on Apple platforms running Linux, sysconf(_SC_PAGE_SIZE) returns 16384, so the mmap fails whenever the 4096-aligned address is not also a multiple of 16384. The fix is easy:

bo->addr = util_vma_heap_alloc(&heap, size, sysconf(_SC_PAGE_SIZE));
mmap(NULL, ..., bo->addr);
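The rule from the man page is easy to check in isolation. The sketch below is independent of drm-shim: `align_up` and `try_mmap_at_offset` are helper names made up for this example, and the temporary file only exists to give mmap something to map.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Round `value` up to a multiple of `alignment`, which must be a power of two. */
size_t align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/*
 * Map one page of a temporary file starting at `offset`.
 * Returns 0 on success, or the errno value of whichever call failed.
 */
int try_mmap_at_offset(off_t offset)
{
    long page = sysconf(_SC_PAGE_SIZE);
    FILE *file = tmpfile();
    int result = 0;
    int fd;

    if (file == NULL)
        return errno;
    fd = fileno(file);

    if (ftruncate(fd, offset + page) != 0) {
        result = errno;
    } else {
        void *ptr = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, offset);

        if (ptr == MAP_FAILED)
            result = errno;
        else
            munmap(ptr, (size_t)page);
    }

    fclose(file);
    return result;
}
```

On any system, an offset of one page maps fine and an offset of half a page fails with EINVAL. A hard-coded 4096 is the half-page case in disguise on a 16K kernel, which is why asking sysconf for the alignment is the right fix.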

With that, drm-shim works on systems with page sizes larger than 4K, including my M1. That means I can compile thousands of shaders per second with the Valhall compiler, far more than any system with a Mali GPU could. I can also run Khronos’s OpenGL ES Conformance Test Suite:

PAN_MESA_DEBUG=valhall,trace LIBGL_DRIVERS_PATH=~/lib/dri/ LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so PAN_GPU_ID=9091 EGL_PLATFORM=surfaceless ./deqp-gles31 --deqp-surface-type=pbuffer --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 --deqp-surface-height=256

Long commands like this one run tests and produce pretty-printed dumps of GPU memory, ready for manual inspection. If the dumps look like the dumps from the proprietary driver, there’s a good chance the tests will pass on real hardware, too.

Code sharing

Since Valhall is similar to its predecessors, the years we’ve spent nurturing Panfrost mean we only need to modify the driver in areas where Valhall introduces breaking changes.

For example, Valhall’s instruction set resembles the older “Bifrost” instruction set, so we may embed the Valhall compiler as an additional backend in the existing Bifrost compiler. Shared compiler passes like instruction selection and register allocation “just work” on Valhall, even though they were developed and debugged for Bifrost.

Once we adapt Panfrost for Valhall, we’ll have a conformant, performant driver ready out-of-the-box.

…In theory.

Real hardware, real pain

I couldn’t test on real Valhall hardware until early January, when I procured a Chromebook with a MediaTek MT8192 system-on-chip and a matching serial cable. MT8192 sports a Valhall “Mali-G57” GPU, compatible with the Mali-G78 I’m reverse-engineering. Mainline kernel support for MT8192 is sparse, but Linux does boot. With patches by other Collaborans, USB works too. That’s enough to get to work on the GPU. Sure, the display doesn’t work, but who needs that?!

We’ll start by teaching Linux how to find the GPU. On desktops, ACPI and UEFI let the operating system discover any connected hardware. While these standards exist for Arm, in practice Arm systems require a device tree describing the hardware: what parts there are, which registers and clocks they use, and how they’re connected. We don’t know much about MT8192, but ChromeOS supports it, so ChromeOS has a complete device tree. Adapting that device tree for mainline, we soon see signs of life:

[  1.942843] panfrost 13000000.gpu: unknown id 0x9093 major 0x0 minor 0x0 status 0x0

The kernel cannot identify the connected Mali GPU, but that’s expected – after all, it has never seen a Mali-G57 before. We need to add a mapping from Mali-G57’s hardware ID to its name, feature list, and hardware bug list. Then the driver loads.

[  1.942843] panfrost 13000000.gpu: mali-g57 id 0x9093 major 0x0 minor 0x0 status 0x0
[  1.982322] [drm] Initialized panfrost 1.2.0 20180908 for 13000000.gpu on minor 0
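The mapping itself is conceptually just a table lookup. The sketch below is a simplification, not the kernel's code: the `gpu_model` struct and both function names are invented here, the lookup is an exact match on the ID from the log above, and the feature and issue fields are left as empty placeholders because their real contents are not covered in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model entry; the real driver's structures differ. */
struct gpu_model {
    uint32_t id;
    const char *name;
    uint64_t features; /* placeholder, left empty in this sketch */
    uint64_t issues;   /* placeholder, left empty in this sketch */
};

static const struct gpu_model gpu_models[] = {
    { 0x9093, "mali-g57", 0, 0 }, /* ID as reported in the log above */
};

/* Exact-match lookup; returns NULL for hardware we have never seen. */
const struct gpu_model *find_gpu_model(uint32_t id)
{
    for (size_t i = 0; i < sizeof(gpu_models) / sizeof(gpu_models[0]); i++) {
        if (gpu_models[i].id == id)
            return &gpu_models[i];
    }
    return NULL;
}

/* Format the start of a probe message in the style of the logs above. */
int format_probe_line(char *buf, size_t len, uint32_t id)
{
    const struct gpu_model *model = find_gpu_model(id);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s id 0x%x",
                    model ? model->name : "unknown", (unsigned)id);
}
```

Before the table has an entry for 0x9093, the lookup fails and the message starts with "unknown"; after adding one line, it starts with "mali-g57".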

Based on the downstream kernel module released by Arm, we know the parts of Valhall relevant to the kernel are backwards-compatible with Mali GPUs from a decade ago. Panfrost supports existing Mali hardware, so in theory, we can test drive the Mali-G57 right now.

When it comes to hardware, theory and practice never agree.

Let’s try submitting a “null job” to the hardware, a simple job that does nothing whatsoever:

struct mali_job_descriptor_header job = {
    .job_type = MALI_JOB_TYPE_NULL,
    .job_index = 1
};

Only 2 bits set in the entire data structure. We can even hard-code this job into the kernel and submit it as soon as the hardware powers on. Since this job is correct, the hardware will run it fine.

[   2.094748] panfrost 13000000.gpu: js fault, js=1, status=DATA_INVALID_FAULT, head=0x6087000, tail=0x6087000

What? The hardware claims the job is invalid, even though the job is clearly valid. Apparently, the hardware is reading something different from memory than we wrote.

That symptom is eerily familiar. When Collaboran Tomeu Vizoso and I added support for Mali-G52 two years ago, we observed the same symptoms on an Amlogic system-on-chip. The culprit was an Amlogic-specific cache coherency issue. That fix doesn’t apply here, so it’s time to hunt for MediaTek-specific bugs.

Crawling through ChromeOS code, I found that MediaTek submitted an unexplained change to the GPU driver, setting a single bit belonging to a clock on MT8192 in order to “disable ACP”, fixing bus faults. This change is the embodiment of a “fix everything” magic bit, the kind only rumoured to exist and the stuff of reverse-engineers’ nightmares.

…But setting that bit in our kernel makes our null job complete successfully.

…Wait, what?

It turns out ACP is the “Accelerator Coherency Port”, responsible for managing cache coherency between the CPU and the GPU. Apparently, ACP was not supposed to be enabled on MT8192, but due to a hardware bug, it was enabled accidentally. The kernel must set this bit to disable ACP as a workaround.

Again, what?

Pressing on, we can submit the same null job from userspace. To the hardware, kernelspace and userspace are the same, so this must work.

It does not.

The job times out before completing. Inspecting the kernel log, we notice an earlier timeout, waiting for the GPU to wake up after being reset.

Littering the kernel with printks, eventually we find that the GPU is powered off once Linux boots, and nothing we do will power it back on. No wonder everything times out.

For some problems, we can only hope for a leprechaun to whisper the solution in our ear. Our leprechaun comes in the form of kernel wizard Heiko Stuebner. Heiko suggested that Linux might be powering off the GPU. To save power, Linux turns off unused clocks and power domains. If Linux doesn’t know a clock or power domain is used by the GPU, it’ll turn off the GPU inadvertently.

For debugging, we can disable this mechanism by passing the clk_ignore_unused and pd_ignore_unused kernel arguments. Doing so makes our userspace tests work.

Sometimes the simplest solutions are in front of us.

What is the root cause? MediaTek has a complicated hierarchy of clocks and power domains, and we missed some in our device tree. We’ll need to update our code to teach Linux about the extra clocks and power domains to fix the issue properly.

Nevertheless, we can now test our driver on real hardware. It’s a rough start: the first job we submit returns a Data Invalid Fault. Experimenting, it seems Valhall requires greater pointer alignment of its data structures than Bifrost did. Increasing the alignment at which we allocate fixes the faults, and decreasing again lets us determine the minimum required alignment. This information is accessible once we can run code on the hardware, but inaccessible when studying hardware in vitro. Reverse-engineering and driver development are better together.
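The alignment experiment amounts to a small search. In this sketch, `job_ok_fn` stands in for "submit a job with data structures allocated at this alignment and see whether it faults"; the callback type, `find_min_alignment`, and the fake hardware are all hypothetical, and the search assumes that any multiple of the required alignment also works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if a job allocated at `alignment` completes without a fault. */
typedef bool (*job_ok_fn)(size_t alignment, void *ctx);

/*
 * Start from an alignment believed to work and halve it for as long as jobs
 * keep succeeding. Returns the smallest power-of-two alignment that worked,
 * or 0 if even the starting alignment fails.
 */
size_t find_min_alignment(size_t known_good, job_ok_fn job_ok, void *ctx)
{
    size_t best = known_good;

    assert(job_ok != NULL);
    if (!job_ok(known_good, ctx))
        return 0;

    while (best > 1 && job_ok(best / 2, ctx))
        best /= 2;
    return best;
}

/* Fake hardware for trying the search out: ctx points at the alignment it
 * secretly requires. */
bool fake_hw_job_ok(size_t alignment, void *ctx)
{
    size_t required = *(const size_t *)ctx;

    return alignment % required == 0;
}
```

With real hardware the callback is the expensive part, since every probe is a job submission, but halving from a known-good value needs only a handful of submissions.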

Success at last

With these fixes, we finally see our first passing test, running on real hardware, with data structures prepared by our open source Mesa driver and shaders compiled by our Valhall compiler. Woo!

It only took me a few days after getting the hardware and a serial cable to pass hundreds of tests on the new architecture. Months of speculatively developing the driver came with a huge payoff.

Sounds like we’ll have Valhall drivers in time for end-users after all.

Comments (17)

  1. wanderer_:
    Jan 28, 2022 at 12:00 AM

    I am blown away. Thanks, I now feel woefully inadequate :)

  2. Stuart Naylor:
    Jan 28, 2022 at 12:49 AM

    Amazing stuff Alyssa.

    Has Collabora ever thought about user led funding (donate) where on any discourse there is a simple link so a consumer can make an expression of ‘I would like some of that’ or ‘Thanks for what your doing’ that actually provides metadata feedback to what the community appreciates?
    Often donate modes are branded unitary affairs that doesn't allow donators much scope to what and who was of importance that has no dictate but does allow community expression and provides valuable metrics.

    Its a simple link with some tags in the wonderfully efficient gift exchange mechanism of opensource.

  3. Wade Mealing:
    Jan 28, 2022 at 01:00 AM

    I can appreciate the amount of effort that went into writing this. Very impressed.

  4. 0x0c:
    Jan 28, 2022 at 01:38 AM

    Awesome work!
    There is a question that has plagued me for so long, I'd really appreciate any pointers on it: [gpu in question is NVidia] So I know with envytools I can dump _physical_ memory of my gpu, but how can I find out the stuff in _virtual_ address space? That is: I want to know, when a cuda kernel is launched, how the memory layout looks like from this process's view? (So I want mapping from _virtual_ address to data) Is this possible to do, while running the proprietary drivers?

    1. Mark Filion:
      Jan 28, 2022 at 04:13 PM

      Unfortunately we have no idea, but if you join #nouveau on OFTC IRC they'll be able to help you out.

  5. MICHAL LAZO:
    Jan 28, 2022 at 07:27 AM

    It will be also nice to find why amdgpu and nouveau don't work on arm boards. And if that cache coherency problem is real source of problem or not

  6. Nikolaos Bezirgiannia:
    Jan 28, 2022 at 09:20 AM

    This is such exciting news! I am looking forward to trying it on some Valhall hardware in the future!

  7. Gideon "Gnafu" Mayhak:
    Jan 28, 2022 at 03:15 PM

    This is amazing work, and very timely! I was just reading about the MediaTek Kompanio 1380 with its AV1 hardware decoding, and I started looking to see whether there was open source support for the Mali-G57 graphics. And then your blog post was published! I appreciate the work, and I hope you're able to continue making progress.

    A related question: What is available to work with hardware video encoders/decoders using open source tools on chips like these? Is it far away yet, or could I be enjoying completely open source hardware-accelerated AV1 playback on a MediaTek-based device in a year or two? Is it at all part of this work, or a completely different stack?

    Thanks!

    1. Nicolas Dufresne:
      Jan 28, 2022 at 04:09 PM

      Thanks for your feedback, indeed work is in progress related to AV1 Stateless Decoding with mainline Linux Kernel. In fact, a very similar approach has been taken, but not to reverse something but to create a Kernel API for future drivers to use it. Inspired from VA API and DXVA implementation of AV1 Stateless decoding, Daniel Almeida drafted a kernel API for that [1]. In order to write a reference decoder, he implemented vivpu driver (also part of the RFC patchset). With this stub driver, he could implement and validate a GStreamer implementation for the reference decoder [2]. Similar to the GPU stubs, it will only produce blank frames, but it allow very early implementation and minimal validation.

      [1] https://lore.kernel.org/lkml/20210810220552.298140-1-daniel.almeida@collabora.com/

      [2] https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1011

      [+] https://www.phoronix.com/scan.php?page=news_item&px=Linux-Media-RFC-AV1-uAPI

      1. Gideon "Gnafu" Mayhak:
        Jan 29, 2022 at 12:10 AM

        Wonderful! Thank you so much for the additional info. I'm so excited these things are happening.

  8. Michael Torrie:
    Jan 29, 2022 at 07:58 PM

    Fascinating stuff, and well done. Sounds like a lot of hard work to go. And sadly this gets repeated for every new ARM system with a new GPU.

    While I found your report very interesting (I'm in awe of your abilities), sadly it reinforces my general frustration with the ARM ecosystem. Despite the tremendous potential and the advantages of low power consumption, as far as general Linux use is concerned ARM is continually crippled by proprietary, undocumented, hardware components like these GPUs. To say nothing of the lack of a standardized boot and discovery environment such as what UEFI provides. Seems like every SoC has a specialized kernel and often special distribution, and no universal, common method of booting, like from a USB stick, SD card, SATA, etc. There's no such thing as a universal aarch64 debian install ISO, for example.

    I've got a drawer full of ARM-based development boards, none of which have been a useful as I had hoped, or lived up to their potential. Sure they work well as little servers (which is what I've put most of them to work doing), but getting OpenGL of any kind for modern interactive, desktop-replacement sort of work has been a crap shoot. Even investigating amazing devices like the PineBook Pro tells me getting the GPU working remains problematic, despite your hard work.

    Obviously Android is really the only OS any of these companies design for and think about. Since it's Linux under the hood you'd think that providing open source drivers would be something they'd want to do but alas no. Maybe I should blame Google for that, as they could pressure their partners to contribute to the community that they are benefiting from.

    Maybe Apple's M1 line will pressure other ARM licensees to focus on standardization.

    Anyway, neat stuff. But for now I'm sticking with x86_64 as much as I can. Even thinking about getting a Rock Pi board for a project that involves a screen and needs working opengl.

    1. Wade Mealing:
      Jan 31, 2022 at 01:04 AM

      We can only hope. Maybe someone needs to write a 'how to ship a successful arm board with opensource' guide for these companies to be hounded with.

  9. Marcos:
    Jan 31, 2022 at 05:18 PM

    That's absolutely riveting! Digital adventures, by the digital 'Lara Croft'! I run Fatdog linux on an acer 714 chromebook. Everything works except touch. Checking dmesg we see timeout waiting for bus ready, and timeout in disabling hardware. Casual research indicated the bus seems to be unpowered.. I think it is basically the same problem you found but using those command line arguments didn't have an effect in this case. But reading your report was exciting and interesting. Thanks for your efforts.

  10. Callum:
    Oct 24, 2022 at 12:30 PM

    Really engaging read - fascinating to read about the overall approach, and the snags along the way. Thanks for writing!

  11. rickster:
    Feb 20, 2023 at 04:22 PM

    It's literally been decades and yet free/opensource proper (hardware-accelerated) GPU drivers, ..., are still NOT available from these big closed-source/proprietary ARM-based chip makers like Amlogic, Rockchip, ...., onto the Linux world.?
    There's no reason for this lack of driver/software support anymore, hence why x86_64-based embedded/SBC's like Intel and Ryzen are gaining huge traction like never before.
    Socio/Technological development worldwide has already proven that.
    Maybe RISC-V international is the better future way to go, but hey, we'll see eh?

    1. crantob:
      May 14, 2023 at 05:00 AM

      Often if 'this makes no sense' or 'there's no reason for this', you may find that the real reason for it is something your subconscious mind doesn't want to consider to be possible.
