Open Source OpenGL ES 3.1 on Mali GPUs with Panfrost

Alyssa Rosenzweig
June 11, 2021

Panfrost, the open source driver for Arm Mali, now supports OpenGL ES 3.1 on both Midgard (Mali T760 and newer) and Bifrost (Mali G31, G52, G72) GPUs. OpenGL ES 3.1 adds a number of features on top of OpenGL ES 3.0, notably including compute shaders. While Panfrost has had limited support for compute shaders on Midgard for use in TensorFlow Lite, the latest work extends the support to more GPUs and adds complementary features required by the OpenGL ES 3.1 specification, like indirect draws and no-attachment framebuffers.

The new feature support represents the cumulative effort of multiple Collaborans -- Boris Brezillon, Italo Nicola, and myself -- in tandem with the wider Mesa community. The OpenGL driver has seen over 1000 commits since the beginning of 2021, including several hundred targeting OpenGL ES 3.1 features. Our focus is Mali G52, where we are passing essentially all drawElements Quality Program and Khronos conformance tests and are aiming to become formally conformant. Nevertheless, thanks to a unified driver, many new features on Bifrost trickle down to Midgard, allowing the older architecture, still in wide use, to improve long after the vendor has dropped support. On Mali T860, we are passing about 99.5% of tests required for conformant OpenGL ES 3.1. That number can only grow, thanks to Mesa's continuous integration running these tests for every merge request and preventing Panfrost regressions. With a Vulkan driver in the works, Panfrost's API support is looking good.

Instruction scheduling

Since the last Panfrost update, we've added an instruction scheduler to the Bifrost compiler. To understand the motivation, recall the hardware design. The Bifrost instruction set pairs instructions into "tuples", one using the multipliers and the other using the adders. Up to 8 tuples are grouped into a "clause", a sequence of instructions with fixed latency that can execute back-to-back with no pipeline bubbles in the middle.

The benefit to the hardware designers was that Bifrost's pipeline could be statically filled by the compiler, rather than adding logic in the hardware to dynamically dispatch instructions to different parts of the units like a superscalar chip. Unfortunately, that means the compiler becomes significantly more complicated, as it has to group instructions itself while satisfying dozens of architectural invariants. If any condition fails to be met, the GPU will fault with an Invalid Instruction Encoding exception and abort execution.

In Panfrost, we've approached this by formally modeling the constraints to produce a predicate (function returning a boolean) for whether a given instruction may be scheduled in a given position in the program. Then it's simple enough to schedule greedily by choosing instructions with this predicate according to some selection heuristic. Wait, selection heuristic?
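To make the predicate idea concrete, here is a minimal Python sketch. It is not Panfrost's code: the Instr and Clause types, the can_schedule and place helpers, and the handful of toy rules are all assumptions made for illustration. Only the two units per tuple and the limit of 8 tuples per clause come from the description above, and the real compiler checks far more invariants than this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MAX_TUPLES_PER_CLAUSE = 8   # from the text: up to 8 tuples per clause
UNITS = ("fma", "add")      # hypothetical names for the multiplier and adder slots


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str                     # which slot of a tuple the instruction needs
    reads: Tuple[str, ...] = ()   # names of instructions whose results it reads


@dataclass
class Clause:
    # Each tuple maps a unit name to the instruction occupying that slot.
    tuples: List[Dict[str, Instr]] = field(default_factory=list)

    def written_before(self, index: int) -> set:
        """Names of instructions placed in tuples strictly before `index`."""
        return {i.name for t in self.tuples[:index] for i in t.values()}


def can_schedule(instr: Instr, clause: Clause, index: int) -> bool:
    """Toy predicate: may `instr` be placed in tuple `index` of `clause`?"""
    if instr.unit not in UNITS:
        return False
    if index >= MAX_TUPLES_PER_CLAUSE:
        return False              # the clause cannot hold another tuple
    if index > len(clause.tuples):
        return False              # tuples are filled in order, without gaps
    if index < len(clause.tuples) and instr.unit in clause.tuples[index]:
        return False              # that slot is already occupied
    # Toy rule (an assumption): operands must come from earlier tuples.
    return set(instr.reads) <= clause.written_before(index)


def place(instr: Instr, clause: Clause, index: int) -> None:
    """Place `instr`, refusing anything the predicate rejects."""
    if not can_schedule(instr, clause, index):
        raise ValueError(f"{instr.name} may not go in tuple {index}")
    if index == len(clause.tuples):
        clause.tuples.append({})
    clause.tuples[index][instr.unit] = instr


clause = Clause()
mul = Instr("mul", "fma")
add = Instr("add", "add", reads=("mul",))
place(mul, clause, 0)
print(can_schedule(add, clause, 0))  # False: it reads "mul" from the same tuple
print(can_schedule(add, clause, 1))  # True
```

Keeping every rule behind one boolean function means the scheduler itself never has to know why a position is illegal, only that it is.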

An algorithm is "greedy" if it makes a locally optimal choice at every step. On the surface, it seems like that strategy would produce a globally optimal result. In special cases, this is true and the best algorithms to solve a problem are greedy. Unfortunately, greedy algorithms produce suboptimal results on many other problems, sometimes spectacularly so. Instruction scheduling is one of those cases: when the predicate shows two different instructions can be scheduled next, which one should be picked? It's not enough to always pick the first or pick one at random; while both strategies are locally optimal, both have poor global performance. We need a heuristic to choose the best instruction from the set of candidate instructions at each point, taking into account how it will affect our ability to schedule in the future. Coming up with good heuristics is tricky, and we have a great deal of room to grow in Panfrost, but the basic model is serving us well so far.
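The effect of the heuristic can be seen on a toy machine. The sketch below is an illustration under stated assumptions, not Panfrost's scheduler: each tuple has one "fma" and one "add" slot, an instruction may only read results from earlier tuples, and the program, the schedule function, and the "longest remaining dependency chain" heuristic are all hypothetical choices. Both runs are greedy; they differ only in which candidate they pick.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    unit: str                     # "fma" or "add"
    reads: Tuple[str, ...] = ()   # names of instructions whose results it reads


def chain_heights(program: List[Instr]) -> Dict[str, int]:
    """Length of the longest dependency chain starting at each instruction.

    Assumes every instruction appears after the instructions it reads.
    """
    height: Dict[str, int] = {}
    for instr in reversed(program):
        users = [u for u in program if instr.name in u.reads]
        height[instr.name] = 1 + max((height[u.name] for u in users), default=0)
    return height


def schedule(program: List[Instr],
             choose: Callable[[List[Instr]], Instr]) -> List[Dict[str, str]]:
    """Greedy scheduler: fill one tuple at a time, using `choose` to pick."""
    pending = list(program)
    done = set()                  # instructions placed in earlier tuples
    tuples = []
    while pending:
        current = {}
        for unit in ("fma", "add"):
            candidates = [i for i in pending
                          if i.unit == unit and set(i.reads) <= done]
            if candidates:
                picked = choose(candidates)
                current[unit] = picked.name
                pending.remove(picked)
        if not current:
            raise ValueError("nothing can be scheduled: cyclic or missing dependency")
        done.update(current.values())
        tuples.append(current)
    return tuples


PROGRAM = [
    Instr("x1", "add"), Instr("x2", "add"),
    Instr("y1", "fma"), Instr("y2", "fma"),
    Instr("s", "add"),
    Instr("t", "fma", ("s",)),
    Instr("u", "add", ("t",)),
    Instr("v", "fma", ("u",)),
]
HEIGHT = chain_heights(PROGRAM)

first_fit = schedule(PROGRAM, lambda c: c[0])
critical = schedule(PROGRAM, lambda c: max(c, key=lambda i: HEIGHT[i.name]))
print(len(first_fit), len(critical))  # 6 4
```

Picking the first candidate spends the early adder slots on x1 and x2, so the chain s, t, u, v starts late and the later tuples run half empty. Preferring the instruction with the longest chain behind it starts s immediately and fits the same program into four tuples.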

Towards zero overhead

Another large change to the driver since our last blog was the addition of dirty tracking, a common graphics driver optimization with a twist for Mali. Typical GPUs are stateful, and the driver emits commands to set pieces of graphics state -- for instance, setting uniforms -- before each draw. Mali GPUs present an unusual stateless interface. Instead of commands, the driver prepares descriptors containing large amounts of graphics state bundled together, and each draw has pointers to the different descriptors. In a sense, typical GPUs are programmed with an OpenGL-like state machine, whereas Mali is programmed with Vulkan-like pipelines.

There is conventional graphics wisdom that "state changes are expensive", so graphics programmers try to minimize the number of API calls they make. Drivers can help minimize state changes as well, by tracking which state is "dirty" (modified) and which state is "clean" (unmodified). Then the driver only has to emit commands for the dirty state, reducing its own CPU overhead as well as reducing work for the GPU to process.
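As a rough illustration of dirty tracking on a command-based GPU, consider the Python sketch below. The StatefulContext class, its three state groups, and the ("SET", ...) and ("DRAW",) command tuples are invented for this example and do not correspond to any real driver or hardware interface. Real drivers commonly use a bitmask of dirty flags instead of a set.

```python
class StatefulContext:
    """Toy driver front end for a command-based (stateful) GPU."""

    def __init__(self, track_dirty=True):
        self.track_dirty = track_dirty
        self.state = {
            "blend": "opaque",
            "uniforms": (0.0,),
            "viewport": (0, 0, 1, 1),
        }
        self.dirty = set(self.state)   # nothing has reached the GPU yet
        self.commands = []             # stand-in for the command stream

    def set_state(self, key, value):
        if self.state[key] != value:   # redundant sets do not dirty anything
            self.state[key] = value
            self.dirty.add(key)

    def draw(self):
        # Without tracking, every piece of state is re-emitted for every draw.
        to_emit = self.dirty if self.track_dirty else set(self.state)
        for key in sorted(to_emit):
            self.commands.append(("SET", key, self.state[key]))
        self.dirty.clear()
        self.commands.append(("DRAW",))


def run(ctx):
    ctx.draw()
    ctx.draw()                         # no state change in between
    ctx.set_state("uniforms", (2.0,))
    ctx.draw()
    return len(ctx.commands)


print(run(StatefulContext(track_dirty=False)))  # 12
print(run(StatefulContext(track_dirty=True)))   # 7
```

The second draw emits only the draw command, and the third re-emits only the uniforms.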

Surprisingly, the same idea generalizes to Mali. It is inefficient for the driver to upload every Mali descriptor for every application draw call. Ideally, we could reuse the same descriptor for subsequent draw calls if we know the state hasn't changed. Dirty tracking lets us know exactly when the state has changed, allowing us to only upload new descriptors when required, and reusing the descriptor otherwise. On the surface, the purpose is simply reducing CPU overhead, since the GPU's programming model is stateless and therefore must redo work anyway. However, the GPU has several layers of caches, so reusing these descriptors can enable the GPU to use cached descriptors as opposed to invalidating its descriptor caches after every draw call. Implementing dirty tracking in Panfrost improved draws per second in one synthetic benchmark by about 400%. Not bad for a week's work.
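The same bookkeeping applied to a descriptor-based interface might look like the following Python sketch. It is only a model of the idea: DescriptorPool, StatelessContext, the three descriptor groups, and the fake addresses are assumptions for illustration and say nothing about how Panfrost or the Mali hardware lays out descriptors.

```python
class DescriptorPool:
    """Stand-in for GPU-visible memory; upload() returns a fake address."""

    def __init__(self):
        self.uploads = 0
        self._next_address = 0x1000

    def upload(self, payload):
        self.uploads += 1
        address = self._next_address
        self._next_address += 64
        return address


class StatelessContext:
    """Toy driver front end where each draw carries pointers to descriptors."""

    GROUPS = ("blend", "uniforms", "viewport")

    def __init__(self, pool):
        self.pool = pool
        self.state = {group: None for group in self.GROUPS}
        self.dirty = set(self.GROUPS)  # no descriptor has been uploaded yet
        self.descriptors = {}          # group -> address of the last upload

    def set_state(self, group, value):
        if self.state[group] != value:
            self.state[group] = value
            self.dirty.add(group)

    def draw(self):
        # Upload fresh descriptors only for state that changed since the
        # previous draw; clean groups keep pointing at the old descriptor.
        for group in sorted(self.dirty):
            self.descriptors[group] = self.pool.upload(self.state[group])
        self.dirty.clear()
        return dict(self.descriptors)  # the pointers this draw would carry


pool = DescriptorPool()
ctx = StatelessContext(pool)
ctx.set_state("viewport", (0, 0, 640, 480))
first = ctx.draw()
second = ctx.draw()                    # identical state: same pointers
ctx.set_state("viewport", (0, 0, 1280, 720))
third = ctx.draw()                     # only the viewport descriptor is new

print(first == second)                 # True
print(pool.uploads)                    # 4
```

Because the second draw carries the same addresses as the first, a GPU that caches descriptors by address has a chance to hit in its cache instead of fetching fresh copies.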

Looking forward

We're aiming to polish the OpenGL ES 3.1 support in time for the Mesa 21.2 release next month. Next stop after that: Bifrost performance improvements and introducing support for the modern Valhall (Mali G77 and newer) architecture family.

Comments (11)

  1. José Manuel:
    Jun 13, 2021 at 09:42 AM

    Thank you all for your huge contribution to the open source world!

  2. debiangamer:
    Jun 15, 2021 at 08:15 AM

    " many new features on Bifrost trickle down to Midgard allowing the older architecture still in wide use to improve long after the vendor has dropped support. On Mali T860, we are passing about 99.5% of tests required for conformant OpenGL ES 3.1."

    The Panfrost driver is unusable with Mali T820 while you are working on this. The Panfrost driver is slow, spams kernel messages to syslog, and crashes from time to time. The fbdev driver with llvm is faster and stable. Will this long-standing bug ever get fixed?

    https://gitlab.freedesktop.org/mesa/mesa/-/issues/3143

    1. Daniel Stone:
      Jun 15, 2021 at 02:17 PM

      As you were told by Arm developers on dri-devel@, the message in dmesg is likely to be removed. However, the fact you are seeing those messages indicates that your system is already under severe memory pressure which is the root cause of your problems. Diagnosing and fixing this is your first step.

      T860 is the best-supported GPU for Midgard. T820 has some rough edges, and these will be fixed, however with 18 different GPU revisions - all with their own differences and quirks to support - not every GPU will be perfect all the time.

      1. debiangamer:
        Jun 15, 2021 at 03:58 PM

        "severe memory pressure which is the root cause of your problems. Diagnosing and fixing this is your first step."

        Yes, the Sunvell T95Z Plus TV box has 2GB RAM and Firefox is using all of it. fbdev and llvmpipe handle the memory pressure fine, but Panfrost does not. But the Xfce desktop also runs slowly after boot, when there is RAM available.

        1. Daniel Stone:
          Jun 15, 2021 at 06:28 PM

          llvmpipe can swap all its pages out to disk because it's a CPU renderer. We can't do that with GPU acceleration, because the GPU needs to be able to access those pages in memory, not on disk. Anyway, as discussed in the thread on dri-devel@, the message is going to be removed, but the error-path handling cannot be removed.

        2. debiangamer:
          Jun 15, 2021 at 06:39 PM

          Disabling Xfce desktop compositing and running this command makes the Xfce desktop and Firefox work better with Panfrost:
          sudo xfconf-query -c xfwm4 -p /general/vblank_mode -t string -s "xpresent" --create
          However, the Panfrost kernel driver spams to dmesg:
          [ 921.575208] panfrost d00c0000.gpu: AS_ACTIVE bit stuck
          [ 921.579233] panfrost d00c0000.gpu: AS_ACTIVE bit stuck
          [ 921.590248] panfrost d00c0000.gpu: AS_ACTIVE bit stuck
          [ 921.922100] panfrost d00c0000.gpu: AS_ACTIVE bit stuck

          The spam message in panfrost_gem_shrinker_scan is disabled.
          // if (freed > 0)
          // pr_info_ratelimited("Purging %lu bytes\n", freed

  3. Walter ZAMBOTTI:
    Jul 16, 2021 at 08:05 AM

    I would just like to thank everyone working on the Panfrost project for a fantastic effort and result. The recent July 2021 updates have completely changed my ARM desktop experience (for the better).

    Many thanks.

    Walter ZAMBOTTI
    Independent developer

    System tested. ODROID N2 - Ubuntu 21.04 MATE - Kernel 5.13 - MESA 5:21.2.0

  4. mctom:
    Jul 16, 2021 at 08:48 AM

    Just wanted to say big Thank You from Odroid community, people on the official forum are super excited with a significant boost of performance on their N2/N2+ computers. :)

  5. christian ponzoni:
    Sep 20, 2021 at 08:40 AM

    thank you for all the effort!

  6. ericek111:
    Oct 01, 2021 at 12:42 PM

    This sounds exciting! Are you planning on adding support for Mali G76, too? Or is it too different from your main targets?

