Adrian Ratiu
May 14, 2019
Reading time:
In our previous parts we focused on tracing the Linux kernel, for which the eBPF-based projects are, in our humble opinion, the most safe, widely available and useful methods (eBPF is fully upstreamed in Linux, guarantees a stable ABI, comes enabled by default in almost all distributions and integrates with all other tracing mechanisms). It has really become a no-brainer choice for kernel work. However, up until now, talking in-depth about userspace tracing was deliberately avoided because it merits special treatment, hence this full part 5 article devoted to it.
First we'll look at the why of it, then we'll group and examine eBPF user tracing in two categories: static and dynamic.
The most important user question is why use eBPF at all to trace userspace processes when there are so many other debuggers/profilers/tracers, so many language-specific or OS-portable tools developed for the same task? The answer is not simple and, depending on the use case, eBPF might not be the best solution; there is no one-size-fits-all debug/tracing project for the vast userspace ecosystem.
eBPF has some advantages:
Some disadvantages:
Static tracepoints, also known in userspace as USDT (User Statically Defined Trace) probes, are specific locations of interest in an application where tracers can be hooked to inspect code execution and data. They are explicitly defined by developers in source code and usually enabled at compile time with flags like "--enable-dtrace". The allure of static tracepoints is that they don't frequently change: developers usually keep stable static trace ABIs, so tracing tools work between different application versions which is useful, for example, when upgrading a PostgreSQL installation and encountering a performance regression.
BCC-tools contains a lot of useful well-tested tools to interact with tracepoints defined by specific applications or language runtimes. For our example we'll want to trace python applications. Make sure you have python3 built with "--enable-dtrace" and run tplist on either the python binary or libpython (depending how you've built it) to confirm tracepoints are enabled:
$ tplist -l /usr/lib/libpython3.7m.so b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__start' b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__done' b'/usr/lib/libpython3.7m.so' b'python':b'gc__start' b'/usr/lib/libpython3.7m.so' b'python':b'gc__done' b'/usr/lib/libpython3.7m.so' b'python':b'line' b'/usr/lib/libpython3.7m.so' b'python':b'function__return' b'/usr/lib/libpython3.7m.so' b'python':b'function__entry'
First let's use one of the cool tracer tools provided by BCC, uflow, to trace the execution flow of python's simple http server. The trace should be self explanatory, the arrows and indentation indicate function enter/exit. What we see in this trace is how a worker thread exits on CPU 3 and the main thread is ready to service other incoming http requests on CPU 0.
$ python -m http.server >/dev/null & sudo ./uflow -l python $! [4] 11727 Tracing method calls in python process 11727... Ctrl-C to quit. CPU PID TID TIME(us) METHOD 3 11740 11757 7.034 /usr/lib/python3.7/_weakrefset.py._remove 3 11740 11757 7.034 /usr/lib/python3.7/threading.py._acquire_restore 0 11740 11740 7.034 /usr/lib/python3.7/threading.py.__exit__ 0 11740 11740 7.034 /usr/lib/python3.7/socketserver.py.service_actions 0 11740 11740 7.034 /usr/lib/python3.7/selectors.py.select 0 11740 11740 7.532 /usr/lib/python3.7/socketserver.py.service_actions 0 11740 11740 7.532
Next we want to run our own custom code when a tracepoint is hit, so we don't fully rely on whatever tools BCC provides. The following example hooks itself to python's function__entry tracepoint (see the python instrumentation documentation) and notifies us when someone is downloading a file:
#!/usr/bin/env python from bcc import BPF, USDT import sys bpf = """ #include <uapi/linux/ptrace.h> static int strncmp(char *s1, char *s2, int size) { for (int i = 0; i < size; ++i) if (s1[i] != s2[i]) return 1; return 0; } int trace_file_transfers(struct pt_regs *ctx) { uint64_t fnameptr; char fname[128]={0}, searchname[9]="copyfile"; bpf_usdt_readarg(2, ctx, &fnameptr); bpf_probe_read(&fname, sizeof(fname), (void *)fnameptr); if (!strncmp(fname, searchname, sizeof(searchname))) bpf_trace_printk("Someone is transferring a file!\\n"); return 0; }; """ u = USDT(pid=int(sys.argv[1])) u.enable_probe(probe="function__entry", fn_name="trace_file_transfers") b = BPF(text=bpf, usdt_contexts=[u]) while 1: try: (_, _, _, _, ts, msg) = b.trace_fields() except ValueError: continue print("%-18.9f %s" % (ts, msg))
We test by attaching again to the simple http-server:
$ python -m http.server >/dev/null & sudo ./trace_simplehttp.py $! [14] 28682 34677.450520000 b'Someone is transferring a file!'
The above example tells us when someone is downloading a file but it can't be any more specific than that, to provide more detailed information about who is downloading, what file and so on. This is because python only comes with a few very generic tracepoints enabled by default (module load, function entry/exit etc.). To get additional info we'll have to define our own tracepoints at places of interest so we can extract the relevant data.
Until now we have only worked with tracepoints defined by others, but if our application doesn't have any tracepoints at all, or we need to expose more information than possible with the existing tracepoints, then we'll have to add our own tracepoints.
There are multiple ways of adding tracepoints, for example python-core uses the systemtap development package "systemtap-sdt-dev" via pydtrace.h and pydtrace.d, but we'll take another route and use libstapsdt because it has a simpler API, is more lightweight (only depends on libelf) and has multiple language bindings. For consistency we focus again on python, but tracepoints can be added in other languages, here's a C example.
First we patch the simple http server to expose a tracepoint. The code should be self explanatory: notice the tracepoint name file_transfer and its arguments which are big enough to store two string pointers and a 32bit unsigned integer representing the client IP address, the file path and file size.
diff --git a/usr/lib/python3.7/http/server.py b/usr/lib/python3.7/http/server.py index ca2dd50..af08e10 100644 --- a/usr/lib/python3.7/http/server.py +++ b/usr/lib/python3.7/http/server.py @@ -107,6 +107,13 @@ from functools import partial from http import HTTPStatus +import stapsdt +provider = stapsdt.Provider("simplehttp") +probe = provider.add_probe("file_transfer", + stapsdt.ArgTypes.uint64, + stapsdt.ArgTypes.uint64, + stapsdt.ArgTypes.uint32) +provider.load() # Default error message template DEFAULT_ERROR_MESSAGE = """\ @@ -650,6 +657,8 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler): f = self.send_head() if f: try: + path = self.translate_path(self.path) + probe.fire(self.address_string(), path, os.path.getsize(path)) self.copyfile(f, self.wfile) finally: f.close()
Running the patched server, we can verify our file_transfer tracepoint is present at runtime using tplist:
$ python -m http.server >/dev/null 2>&1 & tplist -p $! [1] 13297 b'/tmp/simplehttp-Q6SJDX.so' b'simplehttp':b'file_transfer' b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__start' b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__done'
The most significant modifications we'll do below to our tracer example code from the above example are:
#!/usr/bin/env python2 from bcc import BPF, USDT import sys bpf = """ #include <uapi/linux/ptrace.h> BPF_PERF_OUTPUT(events); struct file_transf { char client_ip_str[20]; char file_path[300]; u32 file_size; u64 timestamp; }; int trace_file_transfers(struct pt_regs *ctx, char *ipstrptr, char *pathptr, u32 file_size) { struct file_transf ft = {0}; ft.file_size = file_size; ft.timestamp = bpf_ktime_get_ns(); bpf_probe_read(&ft.client_ip_str, sizeof(ft.client_ip_str), (void *)ipstrptr); bpf_probe_read(&ft.file_path, sizeof(ft.file_path), (void *)pathptr); events.perf_submit(ctx, &ft, sizeof(ft)); return 0; }; """ def print_event(cpu, data, size): event = b["events"].event(data) print("{0}: {1} is downloding file {2} ({3} bytes)".format( event.timestamp, event.client_ip_str, event.file_path, event.file_size)) u = USDT(pid=int(sys.argv[1])) u.enable_probe(probe="file_transfer", fn_name="trace_file_transfers") b = BPF(text=bpf, usdt_contexts=[u]) b["events"].open_perf_buffer(print_event) while 1: try: b.perf_buffer_poll() except KeyboardInterrupt: exit()
Tracing the patched server yields:
$ python -m http.server >/dev/null 2>&1 & sudo ./trace_stapsdt.py $! [1] 5613 325540469950102: 127.0.0.1 is downloading file /home/adi/ (4096 bytes) 325543319100447: 127.0.0.1 is downloading file /home/adi/.bashrc (827 bytes) 325552448306918: 127.0.0.1 is downloading file /home/adi/workspace/ (4096 bytes) 325563646387008: 127.0.0.1 is downloading file /home/adi/workspace/work.tar (112640 bytes) (...)
The above custom file_transfer tracepoint can look trivial (a direct python print or a logger call could have the same effect), but the mechanism it presents is very powerful: well placed tracepoints guaranteeing ABI stability, giving the ability to dynamically run safe, native-fast, programmable logic can be very helpful in quickly analyzing and fixing all kinds of issues without restarting the offending application (reproducing an issue might take a long time).
The problem with static tracepoints, as exemplified above, is that they need to be explicitly defined in source code and require rebuilding applications when modifying the tracepoints. Guaranteeing ABI stability for existing tracepoints imposes limits on how maintainers can rearchitect/rewrite the code which exposes the tracepoint data. Therefore in some cases, fully runtime dynamic userspace probes (aka uprobes) are preferred: these poke directly at a running application's memory in an ad-hoc manner without requiring any special source code definition. Dynamic probes can easily break between application versions, but even so they can be useful for live debugging running instances.
While static tracepoints are useful for tracing applications written in high level languages like Python or Java, uprobes are less useful for this because they're lower level and have no knowledge of the language runtime implementations (static tracepoints work because developers take it upon themselves to expose the relevant data of the high level application). Dynamic probes are useful however for debugging the language implementations / engines themselves or applications written in languages with no runtimes like C.
Uprobes can be added to stripped binaries, but the user has to manually compute the in-process memory offset location where the uprobe should attach via tools like objdump and /proc//maps (see example) which is painful and unportable. Since most distributions provide debug symbols packages (or a quick way to build with debug symbols) and BCC makes it trivial to use uprobes with symbol name resolution, the vast majority of dynamic probe uses happen this way.
The gethostlatency BCC tool prints DNS lookup latencies by attaching uprobes to gethostbyname and friends in libc. To verify that libc is not stripped so the tool can be run (otherwise it throws a sybol-not-found error):
$ file /usr/lib/libc-2.28.so /usr/lib/libc-2.28.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, (...), not stripped $ nm -na /usr/lib/libc-2.28.so | grep -i -e getaddrinfo 0000000000000000 a getaddrinfo.c
The gethostlatency code is very similar (and in some places identical, it also uses BPF_PERF_OUTPUT) to the tracepoint example we examined above, so we won't be posting it fully here. The most relevant difference is the use of the BCC uprobe API:
b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid) b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return", pid=args.pid)
The key idea to understand and remember here is that with small changes to our BCC eBPF programs we can trace very different applications, libraries or even the kernel via static and dynamic probes. Before we were statically tracing a Python application and now we're dynamically measuring libc hostname resolution latency. With similar modifications to these small (less than 150 LOC, lots of it boilerplate) examples anything can be traced in a running system, very safely with no risk of crashes or problems induced by other tools like debugger app pauses/halts.
In this fifth part we looked at how to use eBPF programs to trace userspace applications. The biggest advantage of using eBPF for this task is that it provides a unified interface to safely and efficiently trace the entire system: a bug can be reproduced in an application, then traced further into a library or the kernel, providing full system visibility via a unified programming framework / interface. eBPF is not a silver bullet though, especially when debugging applications written in higher-level languages, where language specific tools can do a better job of providing insight, or for those running old versions of Linux kernels, or requiring non-Linux systems.
15/01/2025
With VirGL, Venus, and vDRM, virglrenderer offers three different approaches to obtain access to accelerated GFX in a virtual machine. Here…
19/12/2024
In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
Comments (1)
John Xsyourpal:
Mar 02, 2020 at 06:49 AM
Two question:
1) Are all the examples from these articles available on github or another repo ? that would be very useful.
2) Is there any information on how to compile eBPF programs ? In particular compiling the loader program needs bpf.h libbpf and various dependencies that are hard to manage in simple makefile. There are samples/bpf in the kernel however it is not clear how one compiles load_bpf.c load_bpf.h outside the kernel using a simple makefile and clang. Any pointers would be appreciated.
Reply to this comment
Reply to this comment
Add a Comment