Adrian Ratiu
April 05, 2019
Reading time:
Interested in learning more about low-level specifics of the eBPF stack? Read on as we take a deep dive, from its VM mechanisms and tools, to running traces on remote, resource-constrained embedded devices.
Note: These blog posts will be focusing on eBPF so for our purpuses BPF and eBPF are the same thing and can be used interchangeably. The name/acronym deasn't mean much anymore because the project evolved way beyond its initial scope. BPF and eBPF are used interchangeably in the series.
When in doubt, use this flowchart:
eBPF is a register-based Virtual Machine using a custom 64 bit RISC instruction set capable of running Just-in-Time native-compiled "BPF programs" inside the Linux kernel with access to a subset of kernel functions and memory. It is a full VM implementation, not to be confused with the Kernel-based VM (KVM) which is a module enabling Linux to act as hypervisor for other VMs. It is also part of the mainline kernel, so it doesn't require any 3rd party modules like other frameworks (*cough* LTTng or SystemTap *cough*), and comes enabled by default in almost all Linux distributions. Readers familiar with DTrace might find this Dtrace/BPFtrace comparison useful.
Reasons for running a full VM inside the kernel revolve around convenience and safety: While all operations done by eBPF programs can be handled via normal kernel modules, direct kernel programming is a really dangerous endeavour - it can cause lock-ups, memory corruption, crash processes, cause security vulnerabilities and other unwanted effects especially on production devices (eBPF is very often used to inspect systems in production), so running native-fast JiT-compiled kernel code via a safe VM becomes valuable for security monitoring and sandboxing, network filtering, program tracing, profiling, debugging and so on. Some quick examples can be found in this excellent eBPF reference.
By design the eBPF VM and its programs are intentionally not Turing complete: no loops are allowed (there is work-in-progress to support bounded loops) so each eBPF program is guaranteed to finish and not hang, all memory access is bounded and type-checked (including registers, a MOV instruction can change the type of a register), there are no null dereferences, a program must have at most BPF_MAXINSNS instructions (default 4096), the "main" function takes a single argument (the context) and so on. These kinds of restrictions enable simple and fast correctness verification when an eBPF program is loaded in the kernel and its instructions get parsed into a directed-acyclic-graph by a verifier module.
Historically the eBPF VM was only available from within the kernel and was used to filter network packets without userspace interference, hence the now-nonsense name of "Berkley Packet Filter". Starting with kernel v3.18 (2014), the VM was also exposed to userspace via the bpf() syscall and uapi/linux/bpf.h which resulted in a freeze of its instruction set present at the time which became a public ABI, though new instructions can still be (and have been) added later.
Because the in-kernel eBPF implementation is licensed under GPLv2, it cannot be easily redistributed by non-GPL users so there is also an alternative Apache-licensed userspace eBPF VM implementation called "uBPF". Legalese aside, this userspace implementation can probably be useful for tracing performance-critical applications which need to avoid the cost of kernel-userspace context switches.
eBPF programs are run by the kernel when events occur, so they can be viewed as a form of function hooking or event driven programming. There is very little use in running on-demand eBPF programs from userspace as all on-demand user calls are already handled via the normal non-VM kernel API calls ("syscalls") where VM bytecode brings little value. Events can be generated from kprobes/uprobes, tracepoints, dtrace probes, sockets and more. This allows to hook and inspect memory in any function at any instruction in both kernel and user processes, to intercept file operations, inspect specific network packets and so on. A good reference is the BPF Features by Linux Kernel Version.
As mentioned before, an event triggers the execution of an attached eBPF program which then can store information in maps, print to ringbuffers or call a subset of kernel functions defined by a special API. An eBPF program can be attached to multiple events and different eBPF programs can also access the same map to share data. A special read/write map called "program array" stores references to other eBPF programs loaded via the bpf() syscall and a succesful lookup in this map triggers a jump with no return to the respective eBPF program. This eBPF nesting is also bounded to avoid infinite recursion loops.
Steps for running an eBPF program:
Userspace sends bytecode to the kernel together with a program type which determines what kernel areas can be accessed.
The kernel runs a verifier on the bytecode to make sure the program is safe to run (kernel/bpf/verifier.c).
The kernel JiT-compiles the bytecode to native code and inserts it in (or attaches to) the specified code location.
The inserted code writes data to ringbuffers or generic key-value maps.
Userspace reads the result values from the shared maps or ringbuffers.
The map and ringbuffer structures are managed by the kernel (like pipes and FIFOs are) independently of the hooked eBPF or user programs accessing them. Accesses are asynchronous via file descriptors and reference counting ensures structures exist as long as there is at least one program accessing them. The hooked JiT-compiled native code usually gets removed when the user process which loaded it terminates, though in some cases it can still continue beyond the loading-process' lifespan.
To ease writing eBPF programs and avoid doing raw bpf() syscalls, the kernel provides a convenient libbpf library containing syscall function wrappers like bpf_load_program and structure definitions like bpf_map which can be linked statically or as a DSO, dual-licensed under the LGPL 2.1 and BSD 2-Clause. The kernel tree also provides some neat examples which use libbpf in samples/bpf/.
Kernel developers are poor creatures because the kernel is a self-contained project: there are no userspace goodies in the kernel - no Glibc, no LLVM, no JavaScript, not even WebAssembly! - this is why the kernel-tree eBPF examples contain raw bytecode or load pre-assembled bytecode files via libbpf. We can see this in sock_example.c, a simple userspace program using eBPF to count how many TCP, UDP and ICMP protocol packets are received on the loopback interface.
We skip the main and open_raw_sock functions because they're trivial and instead focus on test_sock because that's where the magic happens:
static int test_sock(void) { int sock = -1, map_fd, prog_fd, i, key; long long value = 0, tcp_cnt, udp_cnt, icmp_cnt; map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256, 0); if (map_fd < 0) { printf("failed to create map '%s'\n", strerror(errno)); goto cleanup; } struct bpf_insn prog[] = { BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */), BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */ BPF_LD_MAP_FD(BPF_REG_1, map_fd), BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */ BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ BPF_EXIT_INSN(), }; size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn); prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt, "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE); if (prog_fd < 0) { printf("failed to load prog '%s'\n", strerror(errno)); goto cleanup; } sock = open_raw_sock("lo"); if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0) { printf("setsockopt %s\n", strerror(errno)); goto cleanup; }
First a BPF map is created which behaves like a fixed-size array of max 256 elements via the libbpf API. The keys index network protocols as defined by IPROTO_* (2 byte words) and the values represent the respective packet counts (4 bytes size). Beside arrays, eBPF maps also implement other data structure types like stacks or queues.
Next the eBPF bytecode instruction array is defined using the conveniently provided kernel macros. We won't go into details on the bytecode here (that will be done in part 2 after the machine is described some more). At a high level the bytecode reads the protocol word from the packet buffer then looks it up in the map and increments its specific packet count.
Then the BPF bytecode is loaded into the kernel and verified for corectness/safety via libbpf's bpf_load_program which returns a fd for reference. The call specifies what type of program this eBPF is which determines what kernel subset it will have access to. Because this is a SOCKET_FILTER, an argument pointing to the current network packet is provided. Finally the eBPF bytecode is attached via the socket layer to a specific raw socket after which it will be run for every received packet regardless of protocol.
All that remains is for the user process to start polling the shared map for the data:
for (i = 0; i < 10; i++) { key = IPPROTO_TCP; assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0); key = IPPROTO_UDP; assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0); key = IPPROTO_ICMP; assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0); printf("TCP %lld UDP %lld ICMP %lld packets\n", tcp_cnt, udp_cnt, icmp_cnt); sleep(1); } }
The basics of eBPF were introduced in this first part and we got our feet wet with an example of how to load bytecode and communicate with the eBPF VM. Compiling and running the example is left as an exercise for the reader due to space constraints. We also intentionally stopped short of looking at specific eBPF bytecode instructions as this will be the focus of part 2. In the example we studied, userspace reads eBPF map values from the kernel VM via libbpf directly in C (using 10 x 1 second sleeps of all things!) which is clunky, error-prone and can get complex fast, so in part 3 we'll be looking at higher level tools which automate these VM interactions via scripting or domain-specific languages.
Continue reading (An eBPF overview, part 2: Machine & bytecode)…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
26/06/2024
WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…
12/06/2024
Part 3 of the cmtp-responder series with a focus on USB gadgets explores several new elements including a unified build environment with…
Comments (9)
annonymos:
Apr 16, 2019 at 01:57 PM
What a well written article, thank you so much for the explaining the basics of eBPF.
Reply to this comment
Reply to this comment
annonymous:
Apr 19, 2019 at 07:45 PM
Thanks for the article!
s/definitins/definitions/
s/becuase/because/
Reply to this comment
Reply to this comment
Adrian Ratiu:
Apr 23, 2019 at 03:38 PM
Thank you for spotting the errors, they have been corrected!
Reply to this comment
Reply to this comment
Tiago Katcipis:
Apr 28, 2019 at 07:15 PM
Great stuff Adrian, thanks =).
There is a small typo: s/corectness/correctness
Reply to this comment
Reply to this comment
Editor:
Apr 29, 2019 at 06:41 PM
Typo fixed, thank you!
Reply to this comment
Reply to this comment
anon:
Apr 29, 2019 at 06:29 PM
The link to "other data structure types" is wrong, it goes to the IP_PROTO enum.
Reply to this comment
Reply to this comment
Editor:
Apr 29, 2019 at 06:42 PM
Fixed, thank you!
Reply to this comment
Reply to this comment
Vijay:
May 23, 2021 at 03:10 AM
Wonderfully written article... thanks
Reply to this comment
Reply to this comment
Long Nguyen-Vu:
Sep 05, 2021 at 10:34 AM
In my opinion, eBPF is no longer a VM, code is JIT-complied and executed directly on registers. Therefore the term "full VM implementation" is misleading.
Reply to this comment
Reply to this comment
Add a Comment