Slow system? Perf to the rescue!

Updated: September 26, 2014

In the last several years, we have had a whole bunch of system administration, troubleshooting and hacking articles here on Dedoimedo, mostly related to Linux, and we learned how to fix system issues by combining the power of methodical investigation and analysis with nifty tools and utilities. Among the many, we used strace, lsof, OProfile, and others. Links below.

We even had a thorough, so-called super-duper debugging guide, which combined the power of several programs all at once in order to resolve very complex performance related problems. However, we have mostly dwelled in the domain of userspace, less so in the kernel. In my OProfile article, someone even asked me for a practical example where a kernel profiler could actually yield useful results. Well, now the time has come, oh-oh, to demonstrate just that. Let us learn how to fix a seemingly unsolvable system performance issue using perf.


Perf at a glance

Quoting the original page, perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel. Perf notation is somewhat similar to version control tools and strace combined. For example, perf stat will run a command and gather performance counter statistics. Perf record will run a command and record its profile into a log file, perf.data by default. Then, perf report will read the log file created by the perf record execution and display the profile. You may also want to analyze lock events, define probes, and more. There's even a top mode.
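In practice, that workflow maps onto a handful of subcommands. A minimal sketch, with /bin/ls standing in as an arbitrary command to profile:

```shell
# Run a command and print performance counter statistics when it exits:
perf stat /bin/ls

# Run a command and record its profile; samples land in ./perf.data by default:
perf record /bin/ls

# Read perf.data and display the profile, hottest symbols first:
perf report

# Live, top-like view of the hottest functions on the system:
perf top
```

Note that recording and system-wide views generally require root privileges or relaxed perf_event permissions, depending on your distro.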

All in all, I am bluffing like a pro, because it is impossible to learn perf in five minutes of reading, especially if you do not have significant prior Linux knowledge. You will need both a fairly good understanding of Linux system behavior and some proficiency in the concepts of scheduling, memory management, various interfaces, and CPU architecture. Only then will perf make good sense. Still, let us explore a scenario where this tool proves its awesome merit.

Problem what we have

Say you have a Linux system with multiple users working on it, against a remote file system, e.g. NFS. One of the users reports a certain performance degradation in accessing the network objects, like files and directories. Other users have no such woes. As a smart lad, you immediately start a proper investigation, and you grab your favorite tool, strace.
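As a refresher, a summary run is just strace with the -c flag wrapped around the command under test; the NFS path here is only an example:

```shell
# Count system calls, errors and time spent in each, then print a summary table:
strace -c /usr/bin/stat /nfs/object.txt
```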

You execute a summary run, with the -c flag, and for both users, the quantity of system calls and errors, as well as their order, is identical. The only difference is that a specific system call takes longer for one user than for the others. You narrow down your tests using the -e flag, which lets you trace individual calls. With the so-called problematic user, you get a consistently slow result for the lstat system call:

strace -tt -T -e lstat /usr/bin/stat /nfs/object.txt
14:04:28.659680 lstat("/nfs/object.txt", {st_mode=S_IFREG|0755,
st_size=291, ...}) = 0 <0.011364>

On the other hand, the so-called healthy user has no such worries:

strace -tt -T -e lstat /usr/bin/stat /nfs/object.txt
14:04:54.032616 lstat("/nfs/object.txt", {st_mode=S_IFREG|0755,
st_size=291, ...}) = 0 <0.000069>

There is no difference in the environment settings of the two users. They both use identical shells. Everything is identical except for that one small difference in the system call time. With the usual set of your tools, you have just reached a dead end.

Call the cavalry

Now, let us see what perf can do for us. Let us run perf stat wrapped around the stat command. It is a very good way to start your investigation, because you will get a neat summary of what happens in the kernel, which can then help point out the next lead in your troubleshooting. Indeed, for the healthy user, we get:

Performance counter stats for '/usr/bin/stat /nfs/object.txt':

       3.333125  task-clock-msecs         #      0.455 CPUs
              6  context-switches         #      0.002 M/sec
              0  CPU-migrations           #      0.000 M/sec
            326  page-faults              #      0.098 M/sec
        3947536  cycles                   #   1184.335 M/sec
        2201811  instructions             #      0.558 IPC
          45294  cache-references         #     13.589 M/sec
          11828  cache-misses             #      3.549 M/sec

    0.007327727  seconds time elapsed

For the baddie:

Performance counter stats for '/usr/bin/stat /nfs/object.txt':

      14.167143  task-clock-msecs         #      0.737 CPUs
              7  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            326  page-faults              #      0.023 M/sec
       17699949  cycles                   #   1249.366 M/sec
        4424158  instructions             #      0.250 IPC
         304109  cache-references         #     21.466 M/sec
          60553  cache-misses             #      4.274 M/sec

    0.019216707  seconds time elapsed

There is a marked difference between the two, as you can observe. While the CPU speed is the same, and the numbers of migrations, context switches and page faults are virtually identical, the bad user burns roughly four times as many cycles and twice as many instructions, at half the IPC, resulting in almost three times the total elapsed time for the command to complete. This already shows us there's something wrong afoot.
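You can sanity-check those rough ratios straight from the numbers in the two perf stat outputs above:

```shell
# Task-clock, bad user vs. good user:
awk 'BEGIN { printf "task-clock ratio: %.2f\n", 14.167143 / 3.333125 }'

# CPU cycles, bad user vs. good user:
awk 'BEGIN { printf "cycles ratio:     %.2f\n", 17699949 / 3947536 }'

# Elapsed time, bad user vs. good user:
awk 'BEGIN { printf "elapsed ratio:    %.2f\n", 0.019216707 / 0.007327727 }'
```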

Let us explore a little deeper. Let us now record the run and then analyze the data using the report command. This will give us a far more detailed understanding of what really happened. So here's the report for the good user:

# Samples: 56
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
     5.36%     stat  [kernel]           [k] page_fault
     5.36%     perf  /usr/bin/perf      [.] 0x0000000000d099
     3.57%     stat  [kernel]           [k] flush_tlb_others_ipi
     3.57%     stat  [kernel]           [k] handle_mm_fault
     3.57%     stat  [kernel]           [k] find_vma
     3.57%     stat  /lib64/libc-2...   [.] __GI_strcmp
     3.57%     stat  /lib64/ld-2.11...  [.] _dl_lookup_symbol_x
     3.57%     perf  [kernel]           [k] _spin_lock
     1.79%     stat  [kernel]           [k] flush_tlb_mm
     1.79%     stat  [kernel]           [k] finish_task_switch
     1.79%     stat  [kernel]           [k] ktime_get_ts

And here is the report for our naughty misbehaver, the baddie:

# Samples: 143
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
    57.34%     stat  [kernel]           [k] rpcauth_lookup_credcache  [sunrpc]
     2.80%     stat  [kernel]           [k] generic_match             [sunrpc]
     2.10%     stat  [kernel]           [k] clear_page_c
     1.40%     stat  [kernel]           [k] flush_tlb_others_ipi
     1.40%     stat  [kernel]           [k] __do_fault
     1.40%     stat  [kernel]           [k] zap_pte_range
     1.40%     stat  [kernel]           [k] handle_mm_fault
     1.40%     stat  [kernel]           [k] __link_path_walk
     1.40%     stat  [kernel]           [k] page_fault
     1.40%     stat  /lib64/libc-2...   [.] __GI_strcmp
     1.40%     stat  /lib64/ld-2.11...  [.] _dl_load_cache_lookup

Understanding the report

Before we discuss the results, we need to spend a moment or two to overview the report format. The first column indicates the percentage of the overall samples collected in the corresponding function, similar to what strace -c reports. The second column reports the process from which the samples were collected. In the per-thread/per-process mode, this will always be the name of the monitored command. In the CPU-wide mode, the command name can vary.

The third column shows the name of the ELF image where the samples came from. If a program is dynamically linked, then this may show the name of a shared library. When the samples come from the kernel, then the pseudo ELF image name [kernel.kallsyms] is used. The fourth column indicates the privilege level at which the sample was taken, i.e. when the program was running when it was interrupted. The following levels exist:

[.] : user level
[k] : kernel level
[g] : guest kernel level (virtualization)
[u] : guest os user space
[H] : hypervisor

The final column shows the symbol name. If you're interested in learning more about what a specific symbol does, you can consult the Linux Cross Reference site and search for the specific string entry, or, if you have the kernel sources available and installed for your distro, normally under /usr/src/linux, consult them directly. Don't do this for fun, please. There are better ways to spend the afternoon.
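If you do go digging in the sources, a plain grep is usually enough to locate a symbol; the path below is the conventional location and may well differ on your distro:

```shell
# Find where a kernel symbol is defined or referenced in the sources:
grep -rn "rpcauth_lookup_credcache" /usr/src/linux/net/sunrpc/
```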

Now, the results. We can see the difference immediately. The bad user wastes a lot of time fiddling with rpcauth_lookup_credcache, which lives inside the sunrpc kernel module. At this point, you have all the information you need to go to the Internet and do a very narrow and intelligent search. Just by punching in the name of the symbol, you will find two or three mailing list references pointing to this phenomenon, which, in turn, points to a bug.

Your problem is NOT immediately solved, but you know someone else can handle it now, and you have solid information in your hand. This means reporting the issue to the community, the vendor, or whoever is in charge of your distro, patiently waiting for a fix, or maybe searching for a workaround that might help. But the whole point is that we used perf, which exposed information that we could not have obtained otherwise, and thus allowed us to move forward in our troubleshooting.

More reading

You should spend some time reading all of these:

Linux commands & configurations

Linux hacking guides one two three and four

Linux system debugging super tutorial

Strace and lsof tutorials

GDB mega-tutorial


This is a relatively short tutorial, but then good things are simple and effective. And so is perf. Effective, that is. It sure is not simple. In fact, it is an advanced tool for advanced system debugging, but if you do encounter problems where you draw a complete blank with the conventional software, you may want to consider using perf. You just might uncover some really interesting things.

Anyhow, we are done. Hopefully, as always, you may have learned something new. Namely, how to conduct a thorough analysis of system issues, how to compare results between good and bad manifestations of system behavior, and how to use perf. We also had a brief reminder of what strace can do for us. Now, you are that much more of a geek. Take care.