Research:ChaNGaPerformanceAnalysis

From Astronomy Facility Wiki

Using Projections to Analyse Parallel Performance for ChaNGa

The Charm++ runtime system has tools to help analyze parallel performance. The main tool is Projections. The Charm++ site has the start of a Projections tutorial, but it is very minimal, so I'm including more extensive notes here. As of 4/17/09, the scalability of the visualization tool limits one to fewer than about 8,000 processors. Further scalability is a work in progress.

Compiling for Projections

Actually this is just linking. Uncomment the "-tracemode projections" on the LDFLAGS line in the Makefile and relink to create an executable that will generate performance information. Adding "-tracemode summary" will generate summary performance information.

Running for Projections

When the Projections-capable executable is run, it generates .log files, one for each processor, and a .sts file. By default, these files end up in the directory in which the executable resides. Note that they can get quite large, so keep the run short. Also, to prevent Projections logging from affecting the performance of the program, the option +logsize <number of log entries> can be used to increase the size of the logging buffer. The default is currently 1,000,000 entries, which means approximately 80-90 MB of each core's memory is reserved for Projections buffers. To determine the log size actually needed, first make a run and examine the log files as follows.

  1. Run grep ^8 *.log. If anything shows up, at least one processor was forced to flush its performance logs.
  2. If something does show up, wc -l *.log | sort -n will tell you how large a +logsize to use to prevent log flushing from affecting performance.
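The two checks above can also be scripted. The following is a minimal sketch in Python, not part of Charm++ or Projections: the function names analyze_logs and estimated_buffer_mb are hypothetical, and the memory estimate simply assumes that the 80-90 MB per 1,000,000 entries quoted above scales linearly with +logsize.

```python
from pathlib import Path


def analyze_logs(log_dir="."):
    """Mimic `grep ^8 *.log` and `wc -l *.log | sort -n` for one directory.

    Returns (flushed, max_lines): the names of log files containing at least
    one line that starts with "8", and the largest line count of any log file.
    """
    flushed = []
    max_lines = 0
    for path in sorted(Path(log_dir).glob("*.log")):
        n_lines = 0
        saw_flush = False
        with path.open(errors="replace") as handle:
            for line in handle:
                n_lines += 1
                # Same test as grep ^8: the line begins with the character 8.
                if line.startswith("8"):
                    saw_flush = True
        if saw_flush:
            flushed.append(path.name)
        max_lines = max(max_lines, n_lines)
    return flushed, max_lines


def estimated_buffer_mb(logsize):
    """Rough (low, high) buffer memory per core in MB.

    Assumption: linear scaling of the 80-90 MB per 1,000,000 entries figure.
    """
    return (logsize * 80 / 1_000_000, logsize * 90 / 1_000_000)
```

If analyze_logs reports any flushed files, the returned max_lines is a candidate value for +logsize on the next run, and estimated_buffer_mb(max_lines) gives a rough idea of the memory that choice would reserve on each core.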

For -tracemode summary, the corresponding option is +bincount <number of bins>. The default is 10,000. To determine whether a re-bin will be forced mid-run (and hence affect the processor's performance), it is enough to know how long the application ran. As long as <number of bins> × <bin size> (default 1 ms) is longer than the application's run time, no mid-run interference will occur. So, in the default case, an application can run for up to 10 seconds without a re-bin.
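The bin arithmetic can be written out as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the only inputs are the defaults quoted above (10,000 bins of 1 ms each).

```python
import math


def max_run_time_s(bincount=10_000, bin_size_ms=1.0):
    """Longest run, in seconds, that the given bins cover without a re-bin."""
    return bincount * bin_size_ms / 1000.0


def min_bincount(run_time_s, bin_size_ms=1.0):
    """Smallest bin count whose total span reaches run_time_s seconds."""
    return math.ceil(run_time_s * 1000.0 / bin_size_ms)
```

For example, a run expected to last two minutes with the default 1 ms bins needs min_bincount(120), that is 120,000 bins. Since the covered duration should exceed the run time, it is probably wise to pass a somewhat larger value to +bincount than this minimum.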

Running the Projections GUI

This is a Java program that can be started with charm/tools/projections/bin/projections. The .sts file can be given as an argument.

The menu items under Tools include:

  • Graph: this is very memory intensive. It plots processor usage or messages as a function of either processor or interval. (Not particularly useful)
  • Timelines: This is also very memory intensive. For each processor, this gives a timeline of entry methods that were executed. This is useful to see exactly the sequence of events on each processor.
  • Usage Profile: This gives a profile of the processor utilization over the selected interval. As well as a bar graph, it can give a table of the utilization of each entry point.
  • Overview: This gives a processor utilization overview. As a function of time and processor number, the utilization will be shown as a color. Colors can also designate the entry point being executed. Note that when you switch from utilization to entry points, the log files get reread, and this takes time.
  • Time Profile Graph: This gives an overall time profile of the entry points being executed. As a function of time, the execution time spent in each entry point, summed across all selected processors, is plotted, with each entry point getting a different color.

Getting Load Balancing Information

Dynamic load balancing is a key feature of Charm++ available to ChaNGa. To get information on what the load balancer is doing, use the following option.

  • +LBDebug <number> where the higher the number, the more debugging information you will receive on stderr and stdout.

The type of information probably varies with the choice of load balancer, but for RefineLB, one gets an estimate of the load on each processor, with the background load in parentheses. One also gets a report on which pieces migrate to which processors, and a mapping of pieces to processors.

Using TAU Performance Analysis with ChaNGa

TAU is a parallel performance analysis tool that can also be used with ChaNGa. TAU has a graphical user interface that allows the user to quickly identify performance bottlenecks.

To use it, the TAU source must be downloaded from the TAU site, and Charm++ needs to be built against the TAU libraries. See the build instructions for NAMD on the TAU wiki.
