Boris Brezillon
February 23, 2023
As fate would have it, a new DRM driver for recent Mali GPUs was submitted earlier this month. This is a bit of an oddity in the DRM subsystem world, where support for new hardware is usually added to GPU drivers supporting previous hardware generations. So let's have a look at why things were done differently this time, and at the challenges ahead before this new driver can be merged.
Version 10 of the Mali architecture (the second iteration of Valhall GPUs) introduced a major change in how jobs and their parameters are passed to the GPU: Arm replaced the Job Manager (JM) block with a Command Stream Frontend (CSF). As the new name implies, CSF hardware introduces a command-stream-based mechanism to update the pipeline state and submit GPU jobs, thus avoiding re-allocation of relatively big pipeline state structures when only a single parameter of the pipeline changes between two job submissions. Nothing really new here: other GPU vendors have been using command-stream-based submission for years, and Arm is simply catching up.
We won't give a detailed overview of how CSF works, but it is worth noting that the CSF frontend has a dedicated instruction set and a bunch of registers to pass data around or keep internal state. There are instructions to submit jobs (compute, tiling and fragment jobs), and others to do more trivial things, like reading/writing memory, waiting for job completion, waiting for fences, jumping/branching... That alone means we have to provide CSF-specific handling in Mesa to deal with command stream emission and submission. If CSF were just about moving away from a descriptor-based job submission approach, we could get away with a minimal amount of kernel changes and squash CSF support into the existing kernel driver.
But here comes the second major change brought by CSF hardware: firmware-assisted scheduling. The GPU not only embeds its unified shader cores (used to execute shader code) and the Command Execution Unit (the block processing the CSF instructions), it also has a Cortex-M7 microcontroller in front that takes care of some high-level queue scheduling. Before we get into that, let's take a step back and explain how job scheduling is done in the Panfrost driver.
Panfrost uses the drm_sched framework to deal with job scheduling. This framework is based on the concept of hardware queues (represented by drm_gpu_scheduler), which process jobs in order and have a predefined number of job slots available. These hardware queues are fed by a software scheduler taking jobs from higher-level scheduling entities represented by drm_sched_entity. To keep things simple, let's assume these scheduling entities back VkQueue or GL context objects, which end up being passed render/compute jobs to execute¹.
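To make the hardware-queue model a bit more concrete, here is a minimal sketch of what a drm_sched backend looks like for a JM-style driver. This is illustrative only: the my_*() helpers are hypothetical, and the exact drm_sched structure layouts and callback signatures vary between kernel versions.

```c
/*
 * Minimal sketch of the hardware-queue model (illustrative, not actual
 * Panfrost code). The my_*() helpers are hypothetical, and exact drm_sched
 * signatures vary between kernel versions.
 */
#include <drm/gpu_scheduler.h>

struct my_job {
	struct drm_sched_job base;	/* drm_sched tracks dependencies here */
	u64 jc;				/* GPU address of the job descriptor */
};

/* Hypothetical driver-provided helpers. */
struct dma_fence *my_hw_submit(struct my_job *job);
enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job);
void my_free_job(struct drm_sched_job *sched_job);

/*
 * Called by drm_sched once all dependencies of a job have signaled: this is
 * where a JM driver writes the job chain to one of the hardware job slots.
 */
static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	return my_hw_submit(job);	/* fence signals on job completion */
}

static const struct drm_sched_backend_ops my_sched_ops = {
	.run_job	= my_run_job,
	.timedout_job	= my_timedout_job,	/* GPU hang handling */
	.free_job	= my_free_job,
};
```

Jobs are created against a drm_sched_entity and pushed with drm_sched_entity_push_job(); once their dependencies have signaled, the drm_gpu_scheduler backing the hardware queue calls run_job() to hand them to the hardware.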
Unlike the hardware queue model, where operations are submitted to hardware queues at the job granularity, modern GPUs have been moving to firmware-assisted scheduling. In this new model, an intermediate microcontroller takes high-level queue objects containing a stream of instructions to execute (job submissions being encoded in the command stream) and schedules these high-level queues. The following diagram describes the Mali CSF scheduling model, but other GPU vendors have pretty similar scheduling schemes, with different names for their scheduling entities and probably different ways of passing those entities around.
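As a rough mental model, a firmware-scheduled queue boils down to a ring buffer of command-stream instructions plus a pair of pointers and a doorbell. The sketch below is purely conceptual (it is not the actual CSF interface):

```c
/*
 * Conceptual model of a firmware-scheduled queue (illustrative only, this is
 * not the actual CSF interface): userspace appends command-stream
 * instructions to a ring buffer and bumps an insert pointer; the firmware
 * picks runnable queues, assigns them to the hardware, and advances an
 * extract pointer as instructions are consumed.
 */
#include <stdint.h>

struct fw_queue {
	void *ringbuf;		/* command-stream instructions live here */
	uint64_t insert;	/* bumped by userspace after appending work */
	uint64_t extract;	/* advanced by the firmware as it executes */
	int doorbell_id;	/* rung to tell the firmware there is new work */
};
```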
We initially tried to re-use drm_sched, but quickly realized it would be challenging to reconcile the hardware queue and firmware-assisted scheduling schemes. Eventually we gave up on this idea and went for our own scheduler implementation, duplicating the drm_sched job dependency tracking logic in our scheduler code. This change alone made us reconsider the viability of having the CSF and JM backends implemented in the same driver. But there is still quite a bit of common code to be shared, even if the scheduling logic is diverging: MMU handling, driver initialization boilerplate, device frequency scaling, power management, and probably other stuff I forgot about.
On a side note, Intel has been working on making drm_sched ready for the firmware-assisted scheduling case, so we will likely go back to a drm_sched-based implementation, thus reducing the potential friction between the CSF and JM scheduling logic. But there are still two crucial reasons we would rather have a separate driver: the CSF hardware/firmware interface has almost nothing in common with the JM one, and we want to take this opportunity to design a new, Vulkan-friendly uAPI.
The first aspect is pretty obvious, but let's go over the second one and try to detail what a Vulkan-friendly uAPI looks like, and how it differs from the Panfrost uAPI.
For those who are unfamiliar with graphics APIs, it is worth remembering that Vulkan is all about giving control back to the user by making a lot of the graphics pipeline management explicit, whereas OpenGL tries to hide things from its users to make their lives easier. We won't go over the pros and cons of each API here, but this design decision has an impact on the uAPI needed to get a performant Vulkan driver. We will detail some of these impacts below.
Whilst executing a Vulkan command buffer, fences and semaphore objects passed to vkQueueSubmit can be waited on before the queued work begins, or signaled after it finishes. That means the waits on buffer object idleness that were required when dealing with GL-like submissions can go away. The only places where implicit fencing is still needed are the Window System Integration layers. Luckily, this has been recently addressed with the addition of two dma-buf ioctls allowing one to import a sync-file into a dma-buf, or export all fences attached to a dma-buf to a sync-file. With these new ioctls, we can reconcile the implicit and explicit fencing worlds and allow kernel drivers to be explicit-synchronization centric (no code to deal with the implicit synchronization case).
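For reference, here is roughly what using those two dma-buf ioctls looks like from userspace. This is a minimal sketch with no error handling, and it assumes a kernel and uapi headers recent enough to expose DMA_BUF_IOCTL_{IMPORT,EXPORT}_SYNC_FILE.

```c
/*
 * Bridging implicit and explicit synchronization with the dma-buf sync_file
 * ioctls. Minimal sketch, no error handling; requires kernel/uapi headers
 * exposing DMA_BUF_IOCTL_{IMPORT,EXPORT}_SYNC_FILE.
 */
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Export the fences currently attached to a dma-buf as a sync_file, e.g. to
 * wait explicitly on a buffer handed over by the compositor. */
int export_dmabuf_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_RW,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
		return -1;

	return args.fd;	/* sync_file fd, usable as an explicit wait */
}

/* Attach an explicit-sync fence back to a dma-buf so implicit-sync users
 * (e.g. the window system) wait for our rendering to complete. */
int import_dmabuf_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}
```

A WSI implementation would typically export the fences of a dma-buf it receives before rendering to it, and import its own render-complete fence before handing the buffer back to the compositor.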
With explicit synchronization, we get rid of the step that was iterating over all buffer objects passed to a submit ioctl to extract implicit fences to wait on, and then adding the job-done fence back to these buffer objects so other users could wait on buffer idleness. We do, however, still have to pass this list of buffer objects in order to make sure the GPU mappings on these buffers are preserved while the GPU is potentially accessing them.
Again, Vulkan is pretty explicit about object lifecycles and when things can and can't be freed. One such case is about VkMemory objects and the bind/unbind operations that are used to attach memory to a VkImage or VkBuffer object. That means the user is responsible for keeping the memory objects live in the GPU virtual address space while jobs are still in flight. This in turn means we don't need to pass all buffer objects the GPU jobs are accessing when we submit a batch.
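As an illustration of how explicit this is on the application side, here is a minimal Vulkan snippet binding memory to a buffer. Error handling is omitted, and the memory type index is assumed to have been picked by the caller; once bound, it is the application's responsibility to keep the memory object alive until all GPU work using it has completed.

```c
/*
 * Minimal Vulkan sketch: binding memory to a buffer. Error handling is
 * omitted and mem_type_index is assumed to have been selected by the caller.
 * Once bound, the application must keep the memory object alive (and thus
 * mapped in the GPU VA space) until all submitted work using it completes.
 */
#include <vulkan/vulkan.h>

VkDeviceMemory bind_buffer_memory(VkDevice dev, VkBuffer buf,
				  uint32_t mem_type_index)
{
	VkMemoryRequirements reqs;
	VkDeviceMemory mem;

	vkGetBufferMemoryRequirements(dev, buf, &reqs);

	VkMemoryAllocateInfo alloc_info = {
		.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
		.allocationSize = reqs.size,
		.memoryTypeIndex = mem_type_index,
	};
	vkAllocateMemory(dev, &alloc_info, NULL, &mem);

	/* The kernel driver can rely on userspace keeping this mapping alive
	 * while jobs are in flight: no per-submit buffer object list needed. */
	vkBindBufferMemory(dev, buf, mem, 0);

	return mem;
}
```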
This new paradigm requires quite a few changes compared to what the Panfrost uAPI provides. In Panfrost, GPU virtual address mappings were implicitly created at buffer object creation time. We now want explicit VM_{MAP,UNMAP} ioctls so these mappings can be created and destroyed on demand. While we're at it, and since other drivers already allow it, we can also provide extra ioctls to create/destroy virtual address spaces (VM instances), so a single DRM file descriptor can deal with multiple independent contexts. And if we go further and envision support for sparse memory binding, we also need a way to queue binding/unbinding operations on a VkQueue. This generally implies adding some sort of VM_BIND ioctl providing asynchronous/queue-based VM_{MAP,UNMAP} operations, as sketched below.
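To give an idea of what such a uAPI could look like, here is a purely hypothetical VM_BIND-style structure layout. All names and fields below are invented for illustration; they do not describe the actual PanCSF/Panthor uAPI.

```c
/*
 * Purely hypothetical VM_BIND-style uAPI sketch. All names and fields are
 * invented for illustration and do not describe the actual PanCSF/Panthor
 * uAPI.
 */
#include <linux/types.h>

#define HYPO_VM_BIND_OP_MAP	0
#define HYPO_VM_BIND_OP_UNMAP	1

struct hypo_vm_bind_op {
	__u32 op;		/* MAP or UNMAP */
	__u32 bo_handle;	/* GEM handle backing the mapping (MAP only) */
	__u64 bo_offset;	/* offset inside the buffer object */
	__u64 va;		/* GPU virtual address to (un)map */
	__u64 size;		/* size of the (un)mapping */
};

struct hypo_vm_bind {
	__u32 vm_id;		/* VM instance to operate on */
	__u32 op_count;		/* number of entries in @ops */
	__u64 ops;		/* userspace pointer to hypo_vm_bind_op[] */
	__u64 in_syncs;		/* sync objects to wait on before binding */
	__u64 out_syncs;	/* sync objects signaled once bindings land */
	__u32 in_sync_count;
	__u32 out_sync_count;
};
```

The interesting part is the sync object arrays: they are what allows map/unmap operations to be ordered against GPU jobs, which is exactly what sparse binding on a VkQueue requires.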
This is quite a major shift in how we deal with the GPU virtual address space; retrofitting that into the Panfrost driver would be both painful and error-prone. So we just lost one more common denominator between Panfrost and the new driver: the MMU/GPU-VA-management logic.
To sum up, we have a completely new uAPI (almost nothing shared with the old one), a new scheduling logic, and a new MMU/GPU-VA-management logic. This leaves us with some driver initialization boilerplate, the device frequency scaling implementation, and the power management code, which is likely to differ too, because some of the power management is now done by the firmware. So the only sane decision here was to fork Panfrost and make PanCSF an independent driver. We might end up sharing some code at some point if it makes sense, but it sounds a bit premature to try to do that now.
The first thing to note is that this RFC, while at least partly functional (only tested on basic GLES2 workloads so far), is far from ready. There are things we need to address, like trying to use drm_sched instead of implementing our own timesharing-based scheduler, having a proper buffer object eviction mechanism to gracefully handle situations where the system is under memory pressure (and implementing the VM fencing mechanism that goes with it, so we don't end up with GPU faults when such evictions happen), and, of course, making sure we are robust to all kinds of failures. It also lacks support for power management, device frequency scaling, and probably other useful features like performance counters, but those should be relatively straightforward to implement compared to the scheduling and memory management logic.
At any rate, this is still an important step in our effort to provide a fully upstream open-source graphics stack for Mali CSF GPUs. And with this RFC being posted early, we hope to get the discussion started and sort out some important implementation details before we get too far and risk a major rewrite of the code when others start reviewing what we have done.
Note that the Mesa changes needed to support CSF hardware and interface with this PanCSF driver should be posted soon, so stay tuned!
Special thanks to Faith Ekstrand, Alyssa Rosenzweig, Daniel Stone, and Daniel Vetter for supporting/advising me when I was working on the various iterations of this driver.
1. In practice, the graphics API queue object might require more than one scheduling entity.
Comments (28)
Nikos:
Feb 23, 2023 at 10:11 PM
Thank you for this work, it sounds very promising.
Googulator:
Mar 02, 2023 at 11:37 AM
Any timeline for the Mesa counterpart?
bbrezillon:
Mar 07, 2023 at 10:49 AM
We pushed it here [1] a few hours ago, and here [2] is a branch containing the latest kernel driver version. Please keep in mind that this is still work-in-progress, so don't expect a stable or performant driver.
[1]https://gitlab.freedesktop.org/panfrost/mesa/-/tree/panfrost/v10-wip
[2]https://gitlab.freedesktop.org/bbrezillon/linux/-/tree/pancsf
Fredrum:
Mar 03, 2023 at 09:00 PM
Would this driver help improve general GLES performance on Mali-G610 using panfrost drivers?
Currently I understand that Panfrost GLES2 is only running at 25-40% of capacity, and I also read someone mention that some sort of scheduling was part of that problem.
Would this improve that situation, or does it have nothing to do with it?
Cheers!
bbrezillon:
Mar 07, 2023 at 12:00 PM
We haven't benchmarked this driver yet, so I'm not sure where you get these numbers from. We do intend to work on the performance aspect further down the road, but that's not our main priority right now.
Fredrum:
Mar 07, 2023 at 05:06 PM
The number estimates were not about your driver, just Panfrost in general on Mali-G610 (RK3588S).
They were based on GL benchmark scores, vendor blob driver vs Panfrost.
bbrezillon:
Mar 07, 2023 at 05:17 PM
I'm pretty sure there's a confusion between the official mesa project [1] and panfork [2] (which is a fork of mesa with Mali-G610 support on top). We don't support Mali-G610 in mesa yet.
[1]https://gitlab.freedesktop.org/mesa/mesa
[2]https://gitlab.com/panfork/mesa
Stuart Naylor:
Apr 24, 2023 at 08:44 AM
From memory, if you poke around in /sys you can find the current load of the GPU, and with Panfork running it's only managing 40% at best.
It will be great to get off Panfork for the Mali-G610 as it does seem very hacky.
Thanks for the great work and fingers crossed for the Mali-G610, as there are a number of great RK3588(x) boards in the wild now.
Stuart Naylor:
Jun 27, 2023 at 11:47 PM
Any updates on this now kernel 6.4 is released?
Boris Brezillon:
Jun 28, 2023 at 07:28 AM
After a long period of inactivity, we recently resumed working on the kernel driver and hope to have a new version posted in the coming weeks. Once this new version is posted, we plan to finish the cleanup of the mesa GL driver and finally get an MR posted.
Stuart Naylor:
Jun 29, 2023 at 08:27 AM
I have a few RK3588 boards and was curious if the kernel-to-firmware CSF would provide any perf boost; also, getting off the RK BSP would be great.
I wondered if it would go in 6.4, but thanks for the update.
Alan Macdonald:
Oct 03, 2023 at 07:31 PM
Any news on this? I've been holding off buying an Orange Pi 5 Plus because of sketchy-sounding hw acceleration support for Mali G610. I've bitten the bullet and put the order in. Though I'm not sure if the Orange Pi 5 official images are somehow providing support a different way.
Stuart Naylor:
Oct 04, 2023 at 07:21 AM
The Opi Ubuntu desktop versions include panfork, which uses userspace hacks for the CSF; it does work but is a bit hacky.
With the announcement that Arm is going to collaborate with Collabora, I wonder if there is anything you can announce, even if it's just to say things are or are not in the pipeline?
Alan Macdonald:
Oct 11, 2023 at 06:00 PM
Thanks. Just knowing Panfork is shipped with the official Orange Pi images is useful. I think I'd prefer working with the official images even if the solution is hacky, since getting SBCs to work on vanilla kernels etc. seems to be non-trivial in my experience.
I really hope they release with the mesa drivers in the future, assuming they do actually work well.
Boris Brezillon:
Oct 04, 2023 at 08:26 AM
A [second version of the kernel driver](https://lore.kernel.org/dri-devel/20230809165330.2451699-1-boris.brezillon@collabora.com/) has been posted a couple of months back, and we are about to post a v3. Development branches can be found [here](https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor?ref_type=heads) and [here](https://gitlab.freedesktop.org/panfrost/mesa/-/tree/v10+panthor?ref_type=heads) if you want to play with it. They are updated on a regular basis, but I doubt the Orange Pi 5 official images are shipping these drivers.
Stuart Naylor:
Oct 11, 2023 at 06:30 AM
The gitlab repos all seem to need elevated permissions, so I will wait until something public is available.
I have a Radxa Rock 5B & Opi5 and https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.26
My aim is mainline, but HDMI still needs to be implemented, which has also been WIP for a while.
I did a nasty hack of the ArmNN delegate for tensorflow https://github.com/StuartIanNaylor/rock5b-wav2letter-bench which uses OpenCL.
https://github.com/Tencent/ncnn has a great benchmark for ML where you can easily switch between CPU and GPU, and it uses Vulkan.
So the HDMI doesn't matter that much, but it might be a benefit to give the likes of Joshua Riek read rights to the repo.
Boris Brezillon:
Oct 11, 2023 at 03:34 PM
Unless I missed something, the mesa and linux repos I pointed to are both public.
Stuart Naylor:
Oct 12, 2023 at 10:49 AM
Apols, I should read what is in front of me: "Admin message: Due to an influx of spam, we have had to impose restrictions on new accounts. Please see this wiki page for instructions on how to get full permissions. Sorry for the inconvenience."
Dunno why I just cannot read and get the above.
Boris Brezillon:
Oct 12, 2023 at 02:30 PM
Thanks for clarifying! So are you all set now?
archetech:
Oct 12, 2023 at 09:33 PM
Where can I find a kernel that has the latest bits to test panthor? Sources or prebuilt.
Alan Macdonald:
Jan 05, 2024 at 08:50 PM
How is this driver getting on? I still can't even play a video inside Firefox on an Orange Pi 5 Plus. It seems to fail to detect VAAPI stuff on startup. Maybe that is more a Firefox thing though, as Chromium seems to fare better. This is on Joshua Riek's Ubuntu, which I think has all the latest Rockchip hw support patched in.
sodo:
Jan 15, 2024 at 09:03 AM
I don't understand much; I used panfork on Android.
I can't change its kernel driver since it's not rooted, but panfork just worked without needing to change the kernel driver.
I want to ask if that's still the case for the upcoming userspace driver.
Boris Brezillon:
Jan 15, 2024 at 03:10 PM
No, unlike panfork, you will have to use a new kernel driver for this new userspace driver.
Freedom:
Feb 17, 2024 at 08:38 PM
I like free software. In patch v4 you write 'The CSF firmware binary can be found here[3]'.
Did I understand right that there won't be any working output without this closed-source binary blob?
Panfrost, because the GPU is hardware-based, has the benefit of not requiring any closed-source software to fully work.
Would this also be possible here at some point? Could this firmware be fully reverse engineered?
Is there any digital signature the GPU checks when loading the binary file, or would a complete reverse engineering in theory work?
Boris Brezillon:
Feb 26, 2024 at 08:23 AM
> Did I understand right that there won't be any working output without this closed-source binary blob?
That's correct, you'll need this closed source firmware binary to get things working.
> Panfrost, because the GPU is hardware-based, has the benefit of not requiring any closed-source software to fully work.
Indeed.
> Would this also be possible here at some point? Could this firmware be fully reverse engineered?
Of course. Actually, the CSF work started with a reverse engineering effort on the HW interface used by the FW. You can find the outcome of this effort [here](https://gitlab.freedesktop.org/panfrost/csf.rs). We just decided to focus on the userspace/kernel driver.
> Is there any digital signature the GPU checks when loading the binary file, or would a complete reverse engineering in theory work?
None of the GPUs we had access to had a signature-check mechanism, so working on an open source FW is a perfectly valid option. If you want to play with the Cortex-M7 embedded in G610, I started a [rust project](https://gitlab.freedesktop.org/bbrezillon/open-csf-fw) which defines all the IOMEM interfaces the MCU has access to, but right now, this code base is just an empty shell.
stuartiannaylor@outlook.com:
Feb 28, 2024 at 12:21 AM
Thanks Boris, the Panthor userspace/kernel driver will be a great addition, and a later open-source CSF firmware would be too.
PS: is it keeping the Panthor moniker or just Panfrost?
Also, does anyone know the state of play of Vulkan with GPUs such as the G710/G610?
Boris Brezillon:
Feb 28, 2024 at 03:50 PM
> PS: is it keeping the Panthor moniker or just Panfrost?
Panthor is the name of the new kernel driver. The usermode driver is still called Panfrost.
> Also, does anyone know the state of play of Vulkan with GPUs such as the G710/G610?
That's something we are actively working on, but don't expect anything usable before the end of the year.
stuartiannaylor@outlook.com:
Feb 28, 2024 at 06:21 PM
Vulkan as an ML API, especially for LLMs, is of much interest.
Likely many of the retro gamers would be interested too.
If not Vulkan, is OpenCL another thing in the pipeline? Though I'd likely prefer Vulkan anyway for ML.
I get confused by Zink: having Vulkan is great, but building it on top of OpenGL just seems a strange concept.
I thought Vulkan was in part a lower-level API that hardware architectures such as the G710/G610 were, if not designed for, then optimised for.