Adrian Ratiu
April 15, 2019
Reading time:
In our first article we introduced the eBPF VM, its intentional design limitations and how to interact with it from a userspace process. If you haven't yet read it, you should probably do so before continuing because starting directly with machine and bytecode specifics can be hard without a proper introduction. If in doubt, see the flowchart at the beginning of part 1.
The second part of this series takes a more in-depth look at the eBPF VM and program studied in the first part. Having this low level knowledge is not mandatory but can be a very useful foundation for the rest of the series where we examine higher level tools built on top of these mechanisms.
eBPF is a RISC register machine with a total of 11 64-bit registers, a program counter and a 512 byte fixed-size stack. 9 registers are general purpouse read-write, one is a read-only stack pointer and the program counter is implicit, i.e. we can only jump to a certain offset from it. The VM registers are always 64-bit wide (even when running inside a 32-bit ARM processor kernel!) and support 32-bit subregister addressing if the most significant 32 bits are zeroed - this will be very useful in part 4 when cross-compiling and running eBPF programs on embedded devices.
The registers are:
r0: |
stores return values, both for function calls and the current program exit code |
r1-r5: |
used as function call arguments, upon program start r1 contains the "context" argument pointer |
r6-r9: |
these get preserved between kernel function calls |
r10: |
read-only pointer to the per-eBPF program 512 byte stack |
The eBPF program type supplied at load-time determines exactly what subset of kernel functions are available for calling, as well as what "context" argument gets supplied via r1 at program startup. The meaning of the program exit value stored in r0 is also determined by the program type.
Each function call can have at most 5 arguments in registers r1-r5; this applies to both ebpf-to-ebpf calls and to kernel function calls. Registers r1-r5 can only store numbers or pointers to the stack (to be pased as arguments to functions), never direct pointers to arbitrary memory. All memory accesses must be done by first loading data to the eBPF stack before using it in the eBPF program. This restriction helps the eBPF verifier, it simplifies the memory model to enable easier corectness checking.
The BPF-accesible kernel "helper" functions are defined by the kernel core (not extensible through modules) via an API similar to defining syscalls, using BPF_CALL_* macros. bpf.h tries to provide a reference for all BPF-accesible kernel function helpers. For example the definition of bpf_trace_printk uses BPF_CALL_5 with 5 pairs of type/arg-names. Defining the argument data types is important because at each eBPF program load the eBPF verifier makes sure the register data types match the callee argument types.
eBPF instructions are also fixed-size 64-bit encoded with around 100 instructions (for now...) grouped into 8 classes. The VM supports 1-8 byte load/store from generic memory (maps, the stack, "contexts" like packet buffers and so on), forward/backward (un)conditional jumps, arithmetic/logic operations and function calls. For a good in-depth opcode format study refer to the Cilium project instruction set documentation. The IOVisor project also maintains a useful instruction spec.
In the example studied in part 1 of the series, we used some helpful kernel macros to create an eBPF bytecode instruction array using the following structure (all instructions are encoded this way):
struct bpf_insn { __u8 code; /* opcode */ __u8 dst_reg:4; /* dest register */ __u8 src_reg:4; /* source register */ __s16 off; /* signed offset */ __s32 imm; /* signed immediate constant */ }; msb lsb +------------------------+----------------+----+----+--------+ |immediate |offset |src |dst |opcode | +------------------------+----------------+----+----+--------+
Let's look at the BPF_JMP_IMM instruction which encodes a conditional jump against an immediate value. The macro comment below should be self-explanatory on the instruction logic. The opcode encodes the instruction class BPF_JMP, the operation (which is passed through an BPF_OP bitfield to ensure corectness) and a flag which denotes that it is an operation on an immediate/constant value, BPF_K.
#define BPF_OP(code) ((code) & 0xf0) #define BPF_K 0x00 /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ #define BPF_JMP_IMM(OP, DST, IMM, OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = OFF, \ .imm = IMM })
If we do the legwork to calculate the value or disassemble an eBPF bytecode binary containing BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), we'll find out it is 0x020015. This specific bytecode is very frequently used to test the return value of a function call, stored in r0; it jumps over the next 2 instructions if r0 == 0.
Now that we have the necessary knowledge to fully understand the bytecode eBPF example used in part 1 of this series, we'll go step by step and explain it. Remember, sock_example.c is a simple userspace program using eBPF to count how many TCP, UDP and ICMP protocol packets are received on the loopback interface.
At a high level, what the code does is reads the protocol number from the received packet, then pushes it on the eBPF stack to be used as index for a map_lookup_elem call which gets the respective protocol's packet count. The map_lookup_elem function takes an index (or key) pointer in r0 and a map file descriptor in r1. If the lookup call succeeds, r0 will contain a pointer to the map value stored at the protocol index. We then atomically increment the map value and exit.
BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
When an eBPF program is started, the context (in this case the packet buffer) is pointed at by the address in r1. r1 will be used for arguments during function calls so we also store it in r6 as backup.
BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
This instruction loads a byte (BPF_B) into r0 from an offset in the context buffer, which is a network packet buffer in this case, so we supply the offset of the protocol byte from an iphdr structure to be loaded into r0.
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
Push the word (BPF_W) containing the previously read protocol on the stack (pointed by r10 starting with offset -4 bytes).
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
Move the stack address pointer to r2 and substract 4 so now r2 points to the protocol value, to be used as argument for the next map key lookup.
BPF_LD_MAP_FD(BPF_REG_1, map_fd),
Load the local in-process file descriptor referencing the map containing protocol packet counts into r1.
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
Execute the map lookup call with the protocol value from the stack, pointed at by r2, as key. The result is stored in r0: a pointer address to the value indexed by the key.
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
Remember 0x020015? This is the same bytecode from the first section. If the map lookup did not suceed, r0 == 0 so we skip the next two instructions.
BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
Increment the map value at the address pointed to by r0.
BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ BPF_EXIT_INSN(),
Set the eBPF retcode to 0 and exit.
Even though this sock_example logic is quite trivial (it just increments some numbers in a map), implementing or understanding it is difficult to do in raw bytecode. More complex tasks become extremely difficult when done just in an assembler like this. Going forward we'll start using higher level languages and tools to enable more powerful eBPF use-cases with less effort.
In this part we took a closer look at the eBPF VM registers and instruction set, we learned how eBPF-accessible kernel functions are called from bytecode and how they are defined via a syscall-like special purpouse API by the core kernel. We also fully understood the bytecode used in the part 1 example. There are still unexplored areas like creating multiple eBPF program functions or daisy-chaining eBPF programs to get around the 4096 instruction limit imposed by Linux distributions. Maybe we'll explore these in later articles.
For now, the main lesson is that writing raw bytecode is hard, very much like writing processor assembly and not very productive. In part 3 we'll start going up the software stack and compile higher level languages to eBPF bytecode now that we know the low level essentials of how the VM works.
Continue reading (An eBPF overview, part 3: Walking up the software stack)…
19/12/2024
In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
26/06/2024
WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…
Comments (0)
Add a Comment