Updated: September 30, 2020
Over the years, I've written at great length about how to troubleshoot software-related problems in the IT world in general, and in Linux, in particular. After all, this has been my bread & butter for a long time, and I'm still quite keen on the art of problem solving. One of the topics that I covered profusely is gdb, the quintessential software debugger. The only problem is - you need to innit to winnit.
What I mean by this - gdb is excellent if you can reproduce your problems. But if you run software in a production setup, you might not have the luxury to keep on triggering issues over and over. The ability to capture and then replay bugs is a great asset, and it comes in the form of RR, a tool designed to help debug recorded executions of software in a precise, deterministic fashion. Let's see what gives.
RReady, steady, set
In essence, RR is gdb with a recorder attached. The idea is simple and the implementation elegant. You run your program under rr, you capture the execution (and the failure), and then you replay the recording in a gdb session as many times as you like, away from the production environment. Furthermore, if there are elusive issues, you might be able to grab a repeatable scenario, allowing you to more quickly figure out the root cause and fix the problem.
I installed and configured RR in Fedora 32. Pretty straightforward. Now, the execution does require some attention to detail. If you run the program as a regular user, you may see a warning that RR cannot actually grab privileged kernel events. You can change this, and then you don't need sudo. Similar to what we've seen with perf really. Sweet.
rr record ./seg
rr needs /proc/sys/kernel/perf_event_paranoid <= 1, but it is 2.
Change it to 1, or use 'rr record -n' (slow).
Consider putting 'kernel.perf_event_paranoid = 1' in /etc/sysctl.conf
There are several ways you can change this. You can echo a value into /proc/sys/kernel/perf_event_paranoid, use sysctl -w to write the value, or manually edit the /etc/sysctl.conf file and then reload the configuration. The first two last until the next reboot, while the sysctl.conf change is persistent. Whichever way you choose, you will have better performance, and the ability to trace all the necessary events.
sudo sysctl -w kernel.perf_event_paranoid=1
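The options above can be sketched as a few small shell functions. This is only an illustration: the function names are made up for this example, and the check simply mirrors the "<= 1" requirement quoted in the rr message. Nothing runs until you call one of the functions yourself.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of rr.

# Report whether the current setting satisfies rr's "<= 1" requirement.
# The file path can be overridden (first argument), which helps with testing.
rr_paranoid_ok() {
    file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    value="$(cat "$file")" || return 2
    if [ "$value" -le 1 ]; then
        echo "ok: perf_event_paranoid is $value"
    else
        echo "too strict: perf_event_paranoid is $value, rr wants <= 1"
        return 1
    fi
}

# Option 1 (lasts until reboot): write the value straight into /proc.
set_paranoid_proc() {
    echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
}

# Option 2 (persistent): append to /etc/sysctl.conf, then reload that file.
set_paranoid_persistent() {
    echo 'kernel.perf_event_paranoid = 1' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
}
```

Running rr_paranoid_ok before and after the change is a quick way to confirm that the new value actually took effect.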
To see how practical and useful RR is, I decided to use the same segfault example from the gdb tutorial. Basically, a loop with malloc() that will lead to a segmentation fault:
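pointer = malloc(sizeof(int));   /* room for exactly one int */
for (i = 0; 1; i++)              /* no exit condition: runs until it faults */
    printf("pointer[%d] = %d\n", i, pointer[i]);   /* reads past the allocation */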
gcc -g seg.c -o seg
seg.c:4:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
4 | main()
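If you want to follow along, here is a hedged sketch of a shell helper that writes out a complete seg.c around the three lines quoted above. The #include lines, the declarations, the braces and the return statement are my assumptions, and the helper name write_seg_c is made up. The layout is chosen so that the for statement sits on line 9 and the printf on line 11. Note that this version declares int main(void), so it will not trigger the implicit-int warning.

```shell
#!/bin/sh
# Sketch only: write_seg_c is a hypothetical helper. Everything in the C file
# other than the malloc/for/printf lines is an assumption added so that the
# program compiles on its own.
write_seg_c() {
    dir="${1:-.}"
    cat > "$dir/seg.c" <<'EOF'
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *pointer;
    int i;
    pointer = malloc(sizeof(int));   /* room for exactly one int */
    for (i = 0; 1; i++)              /* no exit condition */
    {
        printf("pointer[%d] = %d\n", i, pointer[i]);   /* reads past the allocation */
    }
    return 0;
}
EOF
}
```

A typical use would be write_seg_c . followed by the same gcc -g seg.c -o seg command as above.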
RR record & RR replay
RR has two main commands: record and replay.
rr record ./seg
pointer = 33621
pointer = 33622
pointer = 33623
Segmentation fault (core dumped)
Please note that the actual execution will be slower than usual. This means that if you have time-dependent issues, RR may not be useful. Quite similar to what we've seen with strace really. You want deterministic issues that can be reliably replicated (under the right conditions, that is).
Anyway, once we have the problem recorded, we can replay it:
rr replay
The first time RR loaded, it warned me that debug symbols were not available. This is quite important if you really want to be able to troubleshoot the issue. It's not specific to RR in any way, but it is something to take into account. You can install the missing packages if you like, and the program even lists the exact command you can use to do that.
Remote debugging using 127.0.0.1:7747
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
0x00007f25ce73e110 in _start () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.31-2.fc32.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
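The record-then-replay cycle shown above can be wrapped in a tiny shell function. This is a minimal sketch, and the name record_then_replay is made up for illustration. It relies only on the two rr commands used in this article, with rr replay called without arguments so that it picks up the most recent recording.

```shell
#!/bin/sh
# Sketch only: record_then_replay is a hypothetical convenience wrapper.
# Arguments are the program to record plus its own arguments, e.g. ./seg
record_then_replay() {
    # The recorded program is expected to crash, so a failure here
    # should not stop us from opening the replay.
    rr record "$@" || true
    # With no arguments, rr replays the most recent recording.
    rr replay
}
```

You would call it as record_then_replay ./seg and land at the debugger prompt once the recording finishes.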
Once the RR interface has loaded, you're in gdb land. The commands are the same. You can set breakpoints, and then attach conditions that decide when those breakpoints should actually trigger and stop the execution of the task.
condition 1 i == 33610
And the debug session will look something like:
pointer = 33611
9 for (i = 0; 1; i++)
Program received signal SIGSEGV, Segmentation fault.
0x000000000040116a in main () at seg.c:11
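If you replay the same recording often, the breakpoint setup can live in a gdb command file instead of being typed each time. The sketch below is an assumption-laden example: the helper name and the file name seg.gdb are made up, the breakpoint location seg.c:11 is my choice (the line where the session above reports the fault), and "condition 1" assumes this is the first breakpoint of the session.

```shell
#!/bin/sh
# Sketch only: write_rr_gdb_script and seg.gdb are hypothetical names.
write_rr_gdb_script() {
    out="${1:-seg.gdb}"
    cat > "$out" <<'EOF'
# Assumed location: the line where the fault is reported.
break seg.c:11
# Only stop once the loop counter reaches the value of interest.
condition 1 i == 33610
continue
print i
EOF
}
```

At the debugger prompt, gdb's own source seg.gdb command will then run these lines in order.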
Now, you can go deeper, and perform additional checks. The main difference is that all this occurs on a recorded instance of your software, so you are not potentially interfering with the actual usage of your services and applications. Ideally, you need a clever setup that can automatically detect problems and record them, but that's a different story altogether.
I've not spent too much time using RR, but I like what I see. The program uses the familiar, robust fundamentals from gdb, which means you don't need to re-learn Linux troubleshooting from scratch. On top of that, it adds a layer of powerful flexibility, allowing you to minimize the time pressure that is often associated with IT problems - like software crashing. You can record and replay at your own convenience. This also means you're more likely to find the issue, especially if you're dealing with complicated, long executions of tasks.
Hopefully, you will find this short tutorial useful. In a world where there are ten chefs to every meal, and fifty redundant Linux tools to every need, it's nice to see software that offers meaningful extra functionality rather than a rehash of the same old. Well, you now have another utility in your arsenal, which also means one less excuse for not being able to resolve those pesky software problems quickly enough. That's how it works, no.