André Almeida
February 08, 2022
Reading time:
Over the past 18 months, we have been on a roller-coaster ride developing futex2, a new set of system calls. As part of this prolonged effort, the futex_waitv() syscall has now successfully landed in Linux 5.16.
A followup of the initial futex syscall, this new interface aims to overcome long term issues that have been limiting the way applications use the Linux kernel. But what exactly is futex? This series of blog posts will help answer that and other questions around this tricky function.
If you've ever run strace
in a multithread program, chances are that the trace was filled with futex()
calls. If you are a Linux gamer trying to increase the performance of your setup, you have probably heard of futex as well.
Read on as I take a deep dive into this important system call and how it is used to process synchronization functions.
Before getting into futex, we need to know which problem it solves: race condition. Since the arrival of multitasking operating systems, different tasks can now work at the same time and compete for the same resources. Using the same data at the same moment can lead to serious bugs as we will see next.
Imagine that your bank has a multithreading software to manage your account running in a single core, but failed to correctly use synchronization mechanisms. Let's say that you have 500 moneys in the bank. You want to withdraw 300 moneys and at the same time your friend wants to send you 200 moneys. Then, in the fast multithread bank system, this happens:
aux = account_balance // 500 aux += receive_money // 500 + 200 account_balance = aux // 700
aux = account_balance // 700 aux -= withdraw_money // 700 - 300 account_balance = aux // 400
All good! Everything went as expected. The kernel can order the threads in different manners. In this case, if Thread 2 went before Thread 1 the final result would be the same anyway:
aux = account_balance // 500 aux -= withdraw_money // 500 - 300 account_balance = aux // 200
aux = account_balance // 200 aux += receive_money // 200 + 200 account_balance = aux // 400
However, the scheduler preemption can interleave the execution of both threads, resulting in something like this:
aux = account_balance // 500 aux += receive_money // 500 + 200
aux = account_balance // 500 aux -= withdraw_money // 500 - 300 account_balance = aux // 200
account_balance = aux // 700
Given that both threads were sharing the same resource (account_balance
) and aren't using synchronization mechanisms, they failed to keep data consistence. Thread 2 writes 200 to account_balance
, but just after Thread 1 overwrites it with 700.
If we were in a multiprocessor machine, this issue would be even worse given that these threads can run in concurrency.
But it's not that bad, right? You got more moneys! However, the opposite scheduling could happen as well, keeping you with only 200 moneys in the bank. So how do we solve this?
Mutex is a sync primitive that helps you make sure only one thread can access a critical section (hence the name, mutual exclusion).
Imagine that there's a room that can only be accessed by one person at a time. You get the key for the room, access it, and lock the door. When exiting, you leave the key for the next person. This is how a mutex works: you lock to use access the critical section and then unlock after using it. If when you want to lock it's already locked, then you need to patiently wait for another thread to open the door and give you the key.
In our bank example, the code could be fixed like this:
mutex_lock() aux = account_balance [...] account_balance = aux mutex_unlock()
This would then make any thread changing the account balance do so exclusively, without racing to other threads. Even if the thread gets scheduled with the lock, any other thread that tries to change the account_balance
would face the mutex_lock()
and would have to wait until someone calls mutex_unlock()
.
There are several other sync mechanisms out there, like spinlocks, semaphores and barriers, but in this article we are going to focus just on mutexes.
The goal of a mutex is to enable multiple threads to do their work, so ideally we should not impose high overheads with our mutexes. They should be super fast when the lock is not taken and ensure that tasks waiting to get the lock don't waste CPU time.
So how is a userspace mutex implemented? There are different approaches to it, as operating systems and programming libraries can take different directions. But if I would implement my own, I would make sure to use the kernel. After all, only the kernel can put a thread to sleep and awaken it when asked.
If we use the kernel, we can figure out a nice interface that takes care of a value for us. That would then allow us to:
However, doing syscalls requires expensive context switches (and costs even more in the face of Meltdown). Don't forget that our goal is to be as fast as possible in the common case, which is when the lock is not taken. This is where futex design comes in.
Instead of asking for the kernel to take care of the mutex value for us, we can keep track of it in userspace. Then, depending on the value, we can use the kernel only if we need to sleep or to wake other threads.
This is the main design feature of the futex. Most of the time, it doesn't need the kernel, so we have less context switch and can have a lightweight mutex implementation. However, to make this work, the platform needs to support atomic operations in order to ensure the mutex key is atomically modified. Let's see what the code would look like for such a situation.
First, let's define values for our mutex state. Futex can only operate with unsigned int of 32bits:
#define UNLOCKED 0 #define LOCKED 1
Now, here's an implementation for the mutex_lock()
. Note, however, that a correct mutex is slightly more complex than this example. A complete mutex would enable us to avoid going into the kernel when possible, and would be able to prevent mutual starvation between two threads acquiring the lock. The example below is therefore only for didactic purposes, to help demonstrate how a mutex works.
mutex_lock(unsigned int *mutex) { while (!cmpxchg(mutex, UNLOCKED, LOCKED)) futex(mutex, FUTEX_WAIT, LOCKED); }
Breaking down the code:
Here, our mutex is just a simple uint
. cmpxchg()
is an atomic compare-and-exchange operation, where if the mutex value is UNLOCKED, it replaces it by LOCKED. It returns true if the operation succeeded or false otherwise. If it returns true, it means that the lock was free and we took it. Otherwise, the lock was already taken by someone else and we need to wait for it to get freed. Now we call the futex() syscall. We use mutex
as the first argument, since the memory address of our value is the identifier.
The second argument is FUTEX_WAIT
, that's the opcode to wait on this address until something wakes us (you can optionally use a timeout argument as well). The third argument is the value that we expect to find as the value of the mutex variable. If the value at mutex is different from that, the syscall returns immediately with an -EAGAIN
error. We need to set that because between the cmpxchg()
and futex()
the lock could be freed and if we wait in a free lock, we will probably wait forever.
We need to do this in a "while loop" because when waking up we might race with another thread trying to acquire the lock, and might need to wait again.
On the unlock side, we have this:
mutex_unlock(unsigned int *mutex) { atomic_set(mutex, UNLOCKED); futex(mutex, FUTEX_WAKE, 1); }
We atomically set the value as UNLOCKED
, making it free to anyone that wants to take it and issue a futex()
call; using the FUTEX_WAKE
operation with 1 as an argument. That means that we will wake someone that is waiting in the mutex
address. We could optimize this code to have more than two states and have a way to specify that the mutex is locked and that there is someone waiting for it. If there is no one waiting for it, then we don't need to call futex()
in the unlock function. That way we achieve a very fast mutex with no syscalls needed for the common case, that's where the lock is not taken.
As you can see, futexes are tricky. Hopefully, if everything works as expected, software developers won't need to know that it exists, even if it's constantly being used by multithread applications.
At a high level, the futex syscall can be described as providing a kernel side wait queue indexed by a userspace address, to which the userspace can add or remove new threads from.
In the next post, we are going to dive into the internal implementation of the futex syscall, its limitations, and why a new interface is being proposed.
15/01/2025
With VirGL, Venus, and vDRM, virglrenderer offers three different approaches to obtain access to accelerated GFX in a virtual machine. Here…
19/12/2024
In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
Comments (6)
Joao Pereira:
Feb 09, 2022 at 02:12 PM
Typo on the data race code example: thread 2 should be 500-300=200
Reply to this comment
Reply to this comment
Mark Filion:
Feb 09, 2022 at 03:15 PM
Good catch, thanks for letting us know. Typo has been fixed!
Reply to this comment
Reply to this comment
Ian Taylor:
Feb 10, 2022 at 01:00 AM
What if there is more than one thread waiting for the unlock. How do ensure they are given priority? Is there a "take a number and be served in turn" mechanism?
Reply to this comment
Reply to this comment
André Almeida:
Feb 10, 2022 at 04:59 PM
For the normal FUTEX_WAKE operation there are no priority, as per man page "No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority)."
However, for some use cases FUTEX_WAKE_OP and FUTEX_PI can be helpful.
Reply to this comment
Reply to this comment
Chance Ginger:
Feb 18, 2023 at 11:04 AM
You are wrong about needing the kernel to implement a mutex. I suggest you read up on things like test-and-set instructions or technology like atomic memory operations.
Reply to this comment
Reply to this comment
Muhammad Usama Anjum:
Feb 20, 2023 at 05:09 PM
It depends on the requirements. IMHO mutexes are necessary for protecting critical sections sometimes. There can be several use cases where the mutexes are essentially needed for correct implementation and atomic memory operations cannot help. Atomic operations can also be useful in specific scenarios. For example, the atomic memory operations are enough if only algebraic operations are needed to be performed on a number that can be manipulated from more than one thread. In our above simple bank account example, atomic memory operations can also be used in place of the mutex for synchronization as we have chosen a simple example for the purpose of ease of understanding.
Reply to this comment
Reply to this comment
Add a Comment