Get a realistic view of how your program runs on real hardware, so you can find small changes that make a big performance difference.
At some point, all developers of software applications, whether targeted to Linux or not, are likely to spend at least a small amount of time focusing on the performance of their applications. The reason is simple: many potential benefits can be gained from tuning software for improved performance. For example, in the scientific and engineering arenas, performance gains can make the difference between running smaller scale simulations rather than larger and potentially more accurate models that would improve the scientific quality of the results. Applications that are more user-oriented also stand to benefit from improvements that result in faster responsiveness to the user and an improved overall user experience.
Although microprocessor improvements over the past decade or so have made clock speeds well in excess of the gigahertz range commonplace, most developers are aware that a tenfold increase in processor frequency does not guarantee a tenfold reduction in the runtime of your application. Additionally, for those developing software for distribution to others, attention to performance and responsiveness can pay big dividends when you consider that your end user may be running your application on a mid-1990s era 100MHz Pentium processor.
This article is an introduction to a set of open-source software tools called PerfSuite that can help you to understand and possibly improve the performance of your application under Linux. PerfSuite consists of several related tools and libraries targeted at several different activities useful in performance-oriented analysis.
The development of PerfSuite was motivated by my own experiences in working with not only applications that I had developed, but a number of large supercomputer-class applications in both academic and corporate settings. After having worked with several research groups, I realized that developers often take advantage of only a limited set of tools that may be available to them. They typically rely on traditional time-based statistical profiling techniques such as gprof.
Of course, gprof-style profiles are invaluable and should be the mainstay of any developer's performance toolbox. However, the microprocessors of today, such as those on which you probably are using Linux, offer advanced features that can provide alternative insights into characteristics that directly affect the performance of your software. In particular, nearly all microprocessors in common use today incorporate hardware-based performance measurement support in their designs. This support can provide an alternative viewpoint of your software's performance. While time-based profiles tell you where your software spends its time, hardware performance measurements can help you understand what the processor is doing and how effectively the processor is being utilized. Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work.
The first time I encountered the term hardware performance counters, it was in the context of having access to multimillion-dollar supercomputers where every CPU cycle is critical and research teams spend substantial amounts of time tweaking their codes in order to extract maximum performance from the system. Often, software is tailored explicitly for each type of computer on which it is to be run. Research teams sometimes pore over the numbers generated by these performance counters to measure the exact performance of their applications and to ferret out places where they might gain additional speedup. Needless to say, this all sounded exotic to me. But the purpose and function of the counters turned out to be simple: they are extra logic added to the CPU that track low-level operations or events within the processor accurately and with minimal overhead.
For example, even if you're not an expert in computer architecture, you probably already know that nearly all processors in common use are cache-based machines. Caches, which offer much higher-speed access to data and instructions than what is possible with main memory, are based on the principles of temporal and spatial locality. Put another way, cache designs hope to take advantage of many applications' tendency to reuse blocks of data not long after first use (temporal locality) and to also access data items near those already used (spatial locality). If your application follows these patterns, you have a much greater chance of achieving high performance on a cache-based processor. If not, your performance may be disappointing. If you're interested in improving a poorly performing application, your next task is to try to determine why the processor is stalling instead of completing useful work. This is where performance counters may help.
It takes a little research to learn which performance counters are available to you on a particular processor. Each CPU has a different set of available performance counters, usually with different names. In fact, different models in the same processor family can differ substantially in the specific performance counters available. In general, the counters measure similar types of things. For example, they can record the absolute number of cache misses, the number of instructions issued, the number of floating point instructions executed and the number of vector, such as SSE or MMX, instructions. The best reference for available counters on your processor are the vendor's technical reference on the processor, often available on the Web.
Another complication is kernel-level support is needed to access the performance counters. Although the Itanium (IA-64) kernel provides this support through the perfmon driver in the official kernel (authored by Stephane Eranian of HP Research), the standard x86 Linux tree currently does not.
Fortunately, efforts are underway to address these issues. The first is the development of a performance monitoring driver for the x86 kernel called perfctr. This is a very stable kernel patch developed by Mikael Pettersson of Uppsala University in Sweden. The perfctr kernel patch is becoming more widely adopted by the community and continually is improved and maintained. The second is an effort from the Innovative Computing Laboratory at the University of Tennessee-Knoxville called PAPI (Performance Application Programming Interface). PAPI defines a standard set of cross-platform performance monitoring events and a standard API that allows measurement using hardware counters in a portable way. The PAPI Project provides implementations for the library on several current processors and operating systems, including Intel/AMD x86 processors, Itanium systems and, most recently, AMD's x86-64 CPUs. On Linux, PAPI uses the perfmon and perfctr drivers as appropriate. Refer to the on-line Resources for references where you can learn much more about perfctr, perfmon and PAPI.
PerfSuite, discussed in the remainder of this article, builds upon PAPI, perfmon and perfctr to provide developers with an even higher-level user interface as well as additional functionality. A main focus of PerfSuite is ease of use. Based on my experiences in working with developers interested in performance analysis, it became clear that an ideal solution would require little or no extra work from users who simply want to know how well an application is performing on a computer. They want to know this without having to learn many details about how to configure or access the performance data at a low level.
Let's say that you do have an application that isn't cache-friendly—what might happen? In the worst case scenario, rather than loading a line of data into the cache and operating on the data contained in that line repeatedly, it may use only one piece of data and then be done with it. The next piece of data you need may require another cache line to be loaded and so forth. Each of these cache loads are relatively expensive and can result in reduced performance, because the processor is waiting primarily for the data it needs to become available. Each time the next piece of data is required, the processor attempts to load it from data already resident in the cache. If it's not found, a cache miss occurs and a corresponding hardware event is signaled. The higher the ratio of cache misses to hits, the more likely it is that the overall performance of the software degrades.
Listing 1 shows a basic but concrete example of how this might occur. The listing shows a loop that initializes each element of a matrix using the sum of the corresponding element of another matrix and a vector. Because the C language stores data in row-major order, the loop as written does not access neighboring data elements in the two matrices. Fortunately, this problem has a simple solution: interchange the nested loops so the matrices are processed on a row-by-row basis. This pattern of array access also is referred to as stride-one access. Many optimizing compilers perform this type of loop-interchange optimization automatically, depending on the optimization level you select.
Test cases containing these two versions of the loop were compiled with a recent release of Intel's ICC compiler, run on a Pentium III computer and timed. The result of this simple change sped up the loop by a factor of ten. Not unexpectedly, the overall level 2 cache miss count decreased considerably for the optimized version of the loop (212,665,026 versus 25,287,572—see the next section for more information).
Often, it's useful to combine the raw hardware performance counts into a derived metric that can provide a normalized view of performance. For example, one of the most widely used metrics for performance measurement describes the average number of cycles required to complete an instruction (CPI). By counting the total number of cycles and instructions retired (both of which are available as hardware events), we easily can obtain this metric. Similarly, we might be interested in knowing, on average, how often a piece of data was reused once it was resident in the cache. By counting the appropriate cache-related events and combining them into a single metric, we can obtain an approximation of this information as well.
PerfSuite's hardware performance counter tools and libraries provide easy access to both the raw measurement data as well as a large number of derived metrics that you can use to learn about and hopefully improve the performance of your application. In its most basic use, PerfSuite requires nothing more than a slight modification to the command you execute to run your program. If your executable is in the file myprog, then instead of running myprog directly, you instead would enter psrun myprog. If all goes well, the output of psrun is an XML document that contains a standard set of hardware events along with additional information about the CPU. You can translate this XML document into a comprehensive performance report with the command psprocess, supplying it with the name of the XML file.
The current release of PerfSuite includes the following four tools for accessing and working with performance data:
psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.
psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.
psinv: a utility that provides access to information about the characteristics of a machine (for example, processor type, cache information and available performance counters).
psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.
This section demonstrates the two commands psrun and psprocess. Visit the PerfSuite Web site for more information about and examples of the use of psinv and psconfig.
The easiest way to learn to use the basic PerfSuite tools is try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. Also shown are the current contents of the directory after each run with psrun to show that XML documents are created:
1% ls badcache goodcache 2% psrun badcache 3% ls badcache goodcache psrun.22865.xml 4% psrun goodcache 5% ls badcache goodcache psrun.22865.xml psrun.22932.xml 6% psprocess psrun.22865.xml 7% psprocess psrun.22932.xml
Listings 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the course of the measurement and the report includes not only the raw event counts measured using PAPI, but also a series of metrics that can be derived from the counts.
psrun determines the performance events to be measured by consulting a configuration file you can supply, which is an XML document that describes the measurements to be taken. If you don't supply a configuration file, a default is used (the output shown in Listings 2 and 3 used the default). As an XML document, the configuration file is straightforward to modify and read. For example, if you wanted to obtain the raw events required to calculate the CPI metric discussed earlier, you'd need to ask psrun to measure the total number of graduated instructions and the total number of cycles. These events are predefined in PAPI and are called PAPI_TOT_INS and PAPI_TOT_CYC, respectively. Listing 4 shows a PerfSuite XML configuration file that could be used to measure these events. To use this configuration file with psrun, all you need to do is supply the option -c, along with the name of your custom configuration and run as usual.
The measurements described so far have been in aggregate counting mode, where the total count of one or more performance events are measured and reported over the total runtime of your application. PerfSuite provides an additional way of looking at your application's performance. Let's say you are interested in finding out where in your application all the level 2 cache misses occur so that you can focus your optimization work there. In other words, you'd like a profile similar to gprof's time-based profile, but instead have it be based on level 2 cache misses. This can be done rather easily with psrun by specifying a configuration file tailored for profiling rather than aggregate counting. The PerfSuite distribution includes a number of similar alternative configuration files that you can tailor as needed. Here's an example of how you would ask for a profiling experiment rather than the default total count of events:
8% psrun -c \ /usr/local/share/perfsuite/xml/pshwpc/profile.xml \ solver 9% psprocess -e solver psrun.24135.xml
In profiling mode, the psprocess tool also needs the name of your executable (solver) to do its work. This is required in order to extract the symbol information in the executable so the program address can be mapped to source code lines.
Listing 5 shows an example of a profiling run of psrun obtained in this way. Not only is the application (solver) analyzed, but it also lists shared libraries used with the application that consumed CPU time. The combination of overall performance counting and profiling can be a powerful tool for learning about bottlenecks that may exist in your software and can help you to isolate quickly those areas of your application most in need of attention.
This article has touched only the surface of the techniques available to you when using hardware performance counters to measure and improve the performance of your applications. Hopefully, you now have an idea of what hardware performance counters are and how they can help you gain insight into performance bottlenecks. If you would like to get started using PerfSuite or other tools and supporting software mentioned in this article, visit the on-line Resources.
Many different ways exist in which applications can be tuned for higher performance. In fact, the most effective way is not loop-level improvements or tweaking but fundamental changes to the algorithms used in your application that are more computationally efficient. Ideally, your software will use efficient algorithms further tuned to make effective use of your CPU. PerfSuite and other similar tools can go a long way toward making this process easier for you.
I would like to thank Professor Danesh Tafti of the Mechanical Engineering Department at Virginia Tech for providing the program used for the psrun profiling example in Listing 5. This is a computational kernel extracted from a computational fluid dynamics application named GenIDLEST that Tafti and his research team use, maintain and develop. I also would like to express my thanks to all the PAPI team members of the Innovative Computing Laboratory at the University of Tennessee-Knoxville for their support and encouragement during the development of PerfSuite.
Resources for this article: /article/8264.