Perf Trace, Private Tux - Solving Linux mysteries

Updated: July 4, 2022

The title of this tutorial is a bit vague, but I couldn't help myself. Also, I don't care if the "AI" search engines somehow miss it in the sea of dross, it doesn't matter much to me, strawberry Tux forever. So, we're here to talk about Linux troubleshooting and debugging, on how to identify problems with application execution and detect possible bottlenecks.

This isn't the first such tutorial I've written - there's an almost endless series, which you ought to discover by perusing my Linux section. But specifically, you want strace, which is an awesome tool, and you want perf, which is also extremely useful in figuring out where and why your system might be misbehaving. Then, as it turns out, you can combine the power of these two functions by using the somewhat less known perf trace option. To wit, this guide. Begin.

Setup, tracing userspace applications

In order for you to enjoy this tutorial fully, I must tell you, there are no shortcuts. You ought to spend some time reading the basics, to get the hang of Linux system internals. Without those, you won't fully appreciate the topic here. As it stands, this article aims mostly at professionals who have already exhausted their existing arsenal, and need additional ways to figure out what may be wrong with their software.

Perf trace is simply an option of the perf toolbox, and it is executed thusly:

perf trace "options" "application binary path"

There are some small caveats to take into consideration here. Perf works by accessing (reading) the various system counters, exposed under the /sys filesystem. In most cases, this means you need sudo or root privileges to run perf. However, this may not be ideal, as it may mask or change the application execution sequence. Now, perf does include the option to run a program under a specific user (with the -u flag), but again, this may not be perfect.

If you run perf trace, under different conditions, you may see the following errors:

Error:  Permission denied.
Hint:   Check /proc/sys/kernel/perf_event_paranoid setting.
Hint:   For your workloads it needs to be <= 1
Hint:   For system wide tracing it needs to be set to -1.
Hint:   Try: 'sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"'
Hint:   The current value is 2.

Error:  No permissions to read /sys/kernel/tracing/events/raw_syscalls/sys_(enter|exit)
Hint:   Try 'sudo mount -o remount,mode=755 /sys/kernel/tracing/'

The tool even offers suggestions to get around the issue without switching to sudo or root. Indeed, to be able to perform system-wide tracing as the user, you need to run the following command - a one-time change that will be removed after a reboot:

echo -1 > /proc/sys/kernel/perf_event_paranoid

Similarly, to trace raw system calls:

mount -o remount,mode=755 /sys/kernel/tracing/

If this options does not work, then you can simply chmod the path above to 755:

chmod 755 -R /sys/kernel/tracing/*

Please note that this will "somewhat" affect the security baseline of your system during the troubleshooting, so you should reboot after you are done. The system will launch with the /sys permissions reset to defaults, as we have not made any permanent changes.

Perf trace options

There are a lot of useful options. You may trace major and minor pagefaults, add timing to the logfile, indeed set a logfile, and more. As a rule of thumb, the simplest run will probably look like this:

perf trace -T -o "logfile" -s -v "application path"

What do we have here?

We run with the following options in place:

You can alternatively narrow down the log info by printing only system calls that failed (-1 exit status), check only specific system calls (-e), hook to a running process and more.

Example run

So, when you run the command with the above options, the logfile will show something like:

992927.525 ( 1.037 ms): 16524  ... [continued]: clone()) = 16525 (matchpathcon)
992928.257 (         ): matchpathcon/16525 execve(filename: 0x2469f0, argv: 0xc0004f7c00, envp: 0xc000184b60) ...
992928.566 ( 0.002 ms): 16524 rt_sigprocmask(how: SETMASK, nset: 0xc00059ef40, sigsetsize: 8) = 0
992928.572 ( 0.003 ms): 16524 close(fd: 13) = 0
992928.579 ( 0.004 ms): 16524 read(fd: 12, buf: 0xc00059f0a8, count: 8) = 0
992928.586 ( 0.002 ms): 16524 close(fd: 12)  = 0
992928.601 ( 0.002 ms): 16524 close(fd: 4</dev/null>)  = 0

With the -T option in place, the first column will be the absolute timestamp for the execution. If you care about the relative sequence of events, you can then remove it. The second column shows the time it took to complete the system call, in milliseconds. This can also help you identify possible performance issues.

Then, you get the process name or process ID, and the actual system call that was executed, including any flags, options, and the exit status (very similar to strace). You can use this information to identify possible problems. Examples can include: files that do not exist, wrong permissions, or maybe memory locking contention. You do need to be familiar with the actual application code, or at least be able to consult someone who understands the expected functionality behind these system calls, so you can see what gives.

992887.070 ( 0.496 ms): 16509  ... [continued]: newfstatat()) = -1 ENOENT (No such file or directory)

992892.003 ( 0.003 ms): 16513 futex(uaddr: 0xc000100148, op: WAIT|PRIVATE_FLAG) = -1 EAGAIN (Resource temporarily unavailable)
992892.009 ( 0.056 ms): 16513 nanosleep(rqtp: 0x7fde8acb4b70)  = 0
992892.068 ( 0.006 ms): 16513 futex(uaddr: 0xc00006e948, op: WAKE|PRIVATE_FLAG, val: 1) = 1
992892.078 ( 0.006 ms): 16513 futex(uaddr: 0x55cbd282e388, op: WAKE|PRIVATE_FLAG, val: 1) = 1
992892.086 ( 0.025 ms): 16513 futex(uaddr: 0xc000100148, op: WAIT|PRIVATE_FLAG) = 0
992892.132 ( 0.004 ms): 16513 clock_gettime(which_clock: MONOTONIC, tp: 0x7fde8acb4cd0) = 0
992892.139 ( 0.005 ms): 16513 futex(uaddr: 0xc00006e948, op: WAKE|PRIVATE_FLAG, val: 1) = 1
992892.147 ( 0.005 ms): 16513 futex(uaddr: 0x55cbd282e388, op: WAKE|PRIVATE_FLAG, val: 1) = 1

992922.412 ( 0.178 ms): 16511 futex(uaddr: 0x55cbd282f9b0, op: WAIT|PRIVATE_FLAG, utime: 0x7fde8bcf6cc0) = -1 ETIMEDOUT (Connection timed out)

With the summary information included, you will also see:

syscall      calls  errors  total       min       avg       max       stddev
                            (msec)    (msec)    (msec)    (msec)        (%)
--------- --------  ------ -------- --------- --------- ---------     ------
futex          319     54  2756.595     0.000     8.641   789.740     42.78%
read           124      0   292.409     0.000     2.358   286.699     98.03%
fsync           10      0    15.245     1.395     1.525     1.914      3.34%
clock_gettime 1078      0     4.836     0.002     0.004     0.037      1.62%
newfstatat     477     58     2.640     0.002     0.006     0.037      2.78%
openat          68      0     1.537     0.005     0.023     0.126     12.48%
access          81     27     0.945     0.004     0.012     0.038      6.89%
unlink          11      0     0.617     0.027     0.056     0.090     11.17%
write           76      0     0.479     0.002     0.006     0.065     14.92%
lseek          165      0     0.401     0.001     0.002     0.008      2.95%
madvise         48      0     0.315     0.004     0.007     0.026      8.48%
getdents64      36      0     0.234     0.002     0.007     0.027     16.35%
close           68      0     0.222     0.002     0.003     0.009      6.09%
...

Finally, if you use the -F option, you can check pagefaults:

 0.361 ( 0.000 ms): 15925 minfault [_dl_cache_libcmp+0xe] => /etc/ld.so.cache@0x11869 (d.)
 0.458 ( 0.000 ms): 15925 minfault [memset+0x3a] => /usr/lib64/libc.so.6@0x1f7888 (d.)
 0.475 ( 0.000 ms): 15925 minfault [_dl_map_object_from_fd+0x9d6] => /usr/lib64/libc.so.6@0x1f5ba0 (d.)
 0.484 ( 0.000 ms): 15925 minfault [_dl_map_object_from_fd+0xab4] => /usr/lib64/libc.so.6@0x1f5be8 (d.)
 0.496 ( 0.000 ms): 15925 minfault [_dl_map_object_from_fd+0xca2] => /usr/lib64/libc.so.6@0x318 (d.)
 0.506 ( 0.000 ms): 15925 minfault [strcmp+0x1a] => /usr/lib64/libc.so.6@0x1e393 (d.)
 0.531 ( 0.000 ms): 15925 minfault [_dl_allocate_tls_storage+0x4a] => //anon@0x7f660f658740 (d.)
 0.537 ( 0.000 ms): 15925 minfault [_dl_allocate_tls_storage+0x51] => //anon@0x7f660f6590f8 (d.)
 0.550 ( 0.000 ms): 15925 minfault [_dl_relocate_object+0x63d] => /usr/lib64/libc.so.6@0x1f28f8 (d.)

Conclusion

Perf trace is a versatile, flexible powerful option in the perf toolbox, allowing to combine the power of strace with additional capabilities, which give you a multi-dimensional view of your system's behavior. For instance, by default, strace is a bit more difficult to follow and analyze, whereas perf trace helps more easily bucket results per thread, filter out unnecessary information, and conversely, zero in on the specific data you need.

Hopefully, this tutorial is a good reference point for getting started with perf trace. You do need some background knowledge, you need to know how your software behaves, but then you can use it to figure out possible performance bottlenecks, and improve and optimize its execution. Perf may also need some tweaking to run well in userspace, and you have tons of flags and options to narrow down the troubleshooting to what you need. And that would be all for today.

Cheers.

You may also like: