An eBPF overview, part 1: Introduction

An eBPF overview, part 1: Introduction

Adrian Ratiu
April 05, 2019

Share this post:

Reading time:

Interested in learning more about low-level specifics of the eBPF stack? Read on as we take a deep dive, from its VM mechanisms and tools, to running traces on remote, resource-constrained embedded devices.

Note: These blog posts will be focusing on eBPF so for our purpuses BPF and eBPF are the same thing and can be used interchangeably. The name/acronym deasn't mean much anymore because the project evolved way beyond its initial scope. BPF and eBPF are used interchangeably in the series.

Part 1 and Part 2 give a condensed in-depth introduction to eBPF for newcommers or those looking to further their knowledge by taking a deep dive into the lower part of the eBPF stack.
Part 3 is an overview of the userspace tooling meant to increase productivity, built on top of the low level VM mechanisms presented in parts 1 and 2.
Part 4 focuses on running eBPF programs on embedded systems which are resource constrained, where a full toolchain stack (BCC/LLVM/python/etc) is undesirable. We'll cross-compile and run eBPF programs on 32 bit ARM with smaller embedded-focused tools. Readers interested only in this part can probably skip the rest.
Part 5 is about tracing userspace: Up until now our efforts were focused on tracing the kernel, so it's about time we also examine other user processes.

When in doubt, use this flowchart:

eBPF flowchart

What is it?

eBPF is a register-based Virtual Machine using a custom 64 bit RISC instruction set capable of running Just-in-Time native-compiled "BPF programs" inside the Linux kernel with access to a subset of kernel functions and memory. It is a full VM implementation, not to be confused with the Kernel-based VM (KVM) which is a module enabling Linux to act as hypervisor for other VMs. It is also part of the mainline kernel, so it doesn't require any 3rd party modules like other frameworks (*cough* LTTng or SystemTap *cough*), and comes enabled by default in almost all Linux distributions. Readers familiar with DTrace might find this Dtrace/BPFtrace comparison useful.

Reasons for running a full VM inside the kernel revolve around convenience and safety: While all operations done by eBPF programs can be handled via normal kernel modules, direct kernel programming is a really dangerous endeavour - it can cause lock-ups, memory corruption, crash processes, cause security vulnerabilities and other unwanted effects especially on production devices (eBPF is very often used to inspect systems in production), so running native-fast JiT-compiled kernel code via a safe VM becomes valuable for security monitoring and sandboxing, network filtering, program tracing, profiling, debugging and so on. Some quick examples can be found in this excellent eBPF reference.

By design the eBPF VM and its programs are intentionally not Turing complete: no loops are allowed (there is work-in-progress to support bounded loops) so each eBPF program is guaranteed to finish and not hang, all memory access is bounded and type-checked (including registers, a MOV instruction can change the type of a register), there are no null dereferences, a program must have at most BPF_MAXINSNS instructions (default 4096), the "main" function takes a single argument (the context) and so on. These kinds of restrictions enable simple and fast correctness verification when an eBPF program is loaded in the kernel and its instructions get parsed into a directed-acyclic-graph by a verifier module.

Historically the eBPF VM was only available from within the kernel and was used to filter network packets without userspace interference, hence the now-nonsense name of "Berkley Packet Filter". Starting with kernel v3.18 (2014), the VM was also exposed to userspace via the bpf() syscall and uapi/linux/bpf.h which resulted in a freeze of its instruction set present at the time which became a public ABI, though new instructions can still be (and have been) added later.

Because the in-kernel eBPF implementation is licensed under GPLv2, it cannot be easily redistributed by non-GPL users so there is also an alternative Apache-licensed userspace eBPF VM implementation called "uBPF". Legalese aside, this userspace implementation can probably be useful for tracing performance-critical applications which need to avoid the cost of kernel-userspace context switches.

How does it work?

eBPF programs are run by the kernel when events occur, so they can be viewed as a form of function hooking or event driven programming. There is very little use in running on-demand eBPF programs from userspace as all on-demand user calls are already handled via the normal non-VM kernel API calls ("syscalls") where VM bytecode brings little value. Events can be generated from kprobes/uprobes, tracepoints, dtrace probes, sockets and more. This allows to hook and inspect memory in any function at any instruction in both kernel and user processes, to intercept file operations, inspect specific network packets and so on. A good reference is the BPF Features by Linux Kernel Version.

As mentioned before, an event triggers the execution of an attached eBPF program which then can store information in maps, print to ringbuffers or call a subset of kernel functions defined by a special API. An eBPF program can be attached to multiple events and different eBPF programs can also access the same map to share data. A special read/write map called "program array" stores references to other eBPF programs loaded via the bpf() syscall and a succesful lookup in this map triggers a jump with no return to the respective eBPF program. This eBPF nesting is also bounded to avoid infinite recursion loops.

Steps for running an eBPF program:

Userspace sends bytecode to the kernel together with a program type which determines what kernel areas can be accessed.
The kernel runs a verifier on the bytecode to make sure the program is safe to run (kernel/bpf/verifier.c).
The kernel JiT-compiles the bytecode to native code and inserts it in (or attaches to) the specified code location.
The inserted code writes data to ringbuffers or generic key-value maps.
Userspace reads the result values from the shared maps or ringbuffers.

The map and ringbuffer structures are managed by the kernel (like pipes and FIFOs are) independently of the hooked eBPF or user programs accessing them. Accesses are asynchronous via file descriptors and reference counting ensures structures exist as long as there is at least one program accessing them. The hooked JiT-compiled native code usually gets removed when the user process which loaded it terminates, though in some cases it can still continue beyond the loading-process' lifespan.

To ease writing eBPF programs and avoid doing raw bpf() syscalls, the kernel provides a convenient libbpf library containing syscall function wrappers like bpf_load_program and structure definitions like bpf_map which can be linked statically or as a DSO, dual-licensed under the LGPL 2.1 and BSD 2-Clause. The kernel tree also provides some neat examples which use libbpf in samples/bpf/.

An example study

Kernel developers are poor creatures because the kernel is a self-contained project: there are no userspace goodies in the kernel - no Glibc, no LLVM, no JavaScript, not even WebAssembly! - this is why the kernel-tree eBPF examples contain raw bytecode or load pre-assembled bytecode files via libbpf. We can see this in sock_example.c, a simple userspace program using eBPF to count how many TCP, UDP and ICMP protocol packets are received on the loopback interface.

We skip the main and open_raw_sock functions because they're trivial and instead focus on test_sock because that's where the magic happens:

static int test_sock(void)
{
	int sock = -1, map_fd, prog_fd, i, key;
	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;

	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),	256, 0);
	if (map_fd < 0) {
		printf("failed to create map '%s'\n", strerror(errno));
		goto cleanup;
	}

	struct bpf_insn prog[] = {
		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
		BPF_EXIT_INSN(),
	};
	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
				   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
	if (prog_fd < 0) {
		printf("failed to load prog '%s'\n", strerror(errno));
		goto cleanup;
	}

	sock = open_raw_sock("lo");

	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0) {
		printf("setsockopt %s\n", strerror(errno));
		goto cleanup;
	}

First a BPF map is created which behaves like a fixed-size array of max 256 elements via the libbpf API. The keys index network protocols as defined by IPROTO_* (2 byte words) and the values represent the respective packet counts (4 bytes size). Beside arrays, eBPF maps also implement other data structure types like stacks or queues.

Next the eBPF bytecode instruction array is defined using the conveniently provided kernel macros. We won't go into details on the bytecode here (that will be done in part 2 after the machine is described some more). At a high level the bytecode reads the protocol word from the packet buffer then looks it up in the map and increments its specific packet count.

Then the BPF bytecode is loaded into the kernel and verified for corectness/safety via libbpf's bpf_load_program which returns a fd for reference. The call specifies what type of program this eBPF is which determines what kernel subset it will have access to. Because this is a SOCKET_FILTER, an argument pointing to the current network packet is provided. Finally the eBPF bytecode is attached via the socket layer to a specific raw socket after which it will be run for every received packet regardless of protocol.

All that remains is for the user process to start polling the shared map for the data:

	for (i = 0; i < 10; i++) {
		key = IPPROTO_TCP;
		assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);

		key = IPPROTO_UDP;
		assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);

		key = IPPROTO_ICMP;
		assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);

		printf("TCP %lld UDP %lld ICMP %lld packets\n",
		       tcp_cnt, udp_cnt, icmp_cnt);
		sleep(1);
	}
}

Summary

The basics of eBPF were introduced in this first part and we got our feet wet with an example of how to load bytecode and communicate with the eBPF VM. Compiling and running the example is left as an exercise for the reader due to space constraints. We also intentionally stopped short of looking at specific eBPF bytecode instructions as this will be the focus of part 2. In the example we studied, userspace reads eBPF map values from the kernel VM via libbpf directly in C (using 10 x 1 second sleeps of all things!) which is clunky, error-prone and can get complex fast, so in part 3 we'll be looking at higher level tools which automate these VM interactions via scripting or domain-specific languages.

Continue reading (An eBPF overview, part 2: Machine & bytecode)…

Gaining eBPF vision: A new way to trace Linux filesystem disk requests

Linux block I/O tracing

Performance analysis in Linux

Gaining eBPF vision: A new way to trace Linux filesystem disk requests

Linux block I/O tracing

Performance analysis in Linux

Comments (9)

annonymos:
Apr 16, 2019 at 01:57 PM

What a well written article, thank you so much for the explaining the basics of eBPF.

Reply to this comment

Reply to this comment
annonymous:
Apr 19, 2019 at 07:45 PM

Thanks for the article!

s/definitins/definitions/
s/becuase/because/

Reply to this comment

Reply to this comment
1. Adrian Ratiu:
  Apr 23, 2019 at 03:38 PM
  
  Thank you for spotting the errors, they have been corrected!
  
  Reply to this comment
  
  Reply to this comment
Tiago Katcipis:
Apr 28, 2019 at 07:15 PM

Great stuff Adrian, thanks =).

There is a small typo: s/corectness/correctness

Reply to this comment

Reply to this comment
1. Editor:
  Apr 29, 2019 at 06:41 PM
  
  Typo fixed, thank you!
  
  Reply to this comment
  
  Reply to this comment
anon:
Apr 29, 2019 at 06:29 PM

The link to "other data structure types" is wrong, it goes to the IP_PROTO enum.

Reply to this comment

Reply to this comment
1. Editor:
  Apr 29, 2019 at 06:42 PM
  
  Fixed, thank you!
  
  Reply to this comment
  
  Reply to this comment
Vijay:
May 23, 2021 at 03:10 AM

Wonderfully written article... thanks

Reply to this comment

Reply to this comment
Long Nguyen-Vu:
Sep 05, 2021 at 10:34 AM

In my opinion, eBPF is no longer a VM, code is JIT-complied and executed directly on registers. Therefore the term "full VM implementation" is misleading.

Reply to this comment

Reply to this comment

Add a Comment

Search the newsroom

Latest Blog Posts

PipeWire workshop 2025: Updates on video transport, Rust efforts, TSN networking, and Bluetooth support

03/07/2025

As part of the activities Embedded Recipes in Nice, France, Collabora hosted a PipeWire workshop/hackfest, an opportunity for attendees…

Coccinelle for Rust progress report

25/06/2025

In collaboration with Inria, the French Institute for Research in Computer Science and Automation, Tathagata Roy shares the progress made…

Linux Media Summit 2025 recap

23/06/2025

Last month in Nice, active media developers came together for the annual Linux Media Summit to exchange insights and tackle ongoing challenges…

Constructor acquires, destructor releases

09/06/2025

In this final article based on Matt Godbolt's talk on making APIs easy to use and hard to misuse, I will discuss locking, an area where…

What if C++ had decades to learn?

21/05/2025

In this second article of a three-part series, I look at how Matt Godbolt uses modern C++ features to try to protect against misusing an…

Unleashing gst-python-ml: Python-powered ML analytics for GStreamer pipelines

12/05/2025

Powerful video analytics pipelines are easy to make when you're well-equipped. Combining GStreamer and Machine Learning frameworks are the…

About Collabora

Whether writing a line of code or shaping a longer-term strategic software development plan, we'll help you navigate the ever-evolving world of Open Source.

한국의 국기 한국어 버전의 Collabora.com 보기