Research:ChaNGaPerformanceAnalysis
Using Projections to Analyse Parallel Performance for ChaNGa
The Charm++ runtime system has tools to help analyze parallel performance. The main tool is Projections. The start of a Projections tutorial is on the Charm++ site, but it is very minimal, so I'm including more extensive notes here. As of 4/17/09, the scalability of the visualization tool limits one to fewer than about 8,000 processors. Further scalability is a work in progress.
Compiling for Projections
Only the link step needs to change. Uncomment the "-tracemode projections" on the LDFLAGS line in the Makefile and relink to create an executable that will generate performance information. Adding "-tracemode summary" will generate summary performance information.
Running for Projections
When the Projections-capable executable is run, it generates .log files, one for each processor, and a .sts file. By default, these files end up in the directory in which the executable resides. Note that these can get quite large, so keep the run short. Also, to prevent the Projections logging from impacting the performance of the program, the option +logsize <number of log entries> can be used to increase the size of the logging buffer. The default is currently set at 1,000,000, which means approximately 80-90 MB of a core's memory is reserved for Projections buffers. To determine the log size actually needed, first make a run and examine the log files as follows.
- Run grep ^8 *.log. If anything shows up, this means that at least one processor was forced to flush its performance logs.
- If something does show up, then wc -l *.log | sort -n will tell you how big a +logsize to use to prevent log flushing from impacting performance.
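The two checks above can also be scripted. The following is a minimal sketch, not part of ChaNGa or Charm++: the function names are hypothetical, the only assumption made about the log format is the one stated above (a line beginning with 8 marks a forced flush), and the 80-90 bytes per entry figure is simply the quoted 80-90 MB divided by the default 1,000,000 entries.

```python
import glob
import os


def inspect_projections_logs(log_dir):
    """Scan the *.log files in log_dir (hypothetical helper).

    Returns (flushed, max_lines):
      flushed   - names of logs containing a line that starts with "8",
                  the same test as `grep ^8 *.log`
      max_lines - the largest line count among the logs, i.e. the last
                  per-file value that `wc -l *.log | sort -n` would show
                  (0 if there are no logs)
    """
    flushed = []
    max_lines = 0
    for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
        saw_flush = False
        n_lines = 0
        with open(path, "r", errors="replace") as handle:
            for line in handle:
                n_lines += 1
                if line.startswith("8"):
                    saw_flush = True
        if saw_flush:
            flushed.append(os.path.basename(path))
        max_lines = max(max_lines, n_lines)
    return flushed, max_lines


def buffer_megabytes(logsize, bytes_per_entry=(80, 90)):
    """Rough (low, high) per-core buffer memory in MB for a given +logsize.

    Assumes the 80-90 bytes per entry implied by the default figures.
    """
    low, high = bytes_per_entry
    return logsize * low / 1e6, logsize * high / 1e6


if __name__ == "__main__":
    flushed, max_lines = inspect_projections_logs(".")
    if flushed:
        print("Logs with forced flushes:", ", ".join(flushed))
        print("Largest log has", max_lines, "lines; try +logsize", max_lines)
        print("Estimated buffer memory (MB):", buffer_megabytes(max_lines))
    else:
        print("No forced flushes found.")
```

Run from the directory holding the logs, this prints a candidate +logsize only when a flush was detected. The line count is used as a stand-in for the number of log entries, as the wc -l step suggests, so treat the result as a lower bound and round up.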
For -tracemode summary, the option is +bincount <number of bins>. The default is 10,000. To determine whether a re-bin will be forced mid-run (and hence affect the processor's performance), it is enough to know how long the application ran. As long as <number of bins> × <bin size> (default 1 ms) is at least as long as the application's run time, no mid-run interference will occur. So, in the default case, an application can run for as long as 10 seconds without a re-bin.
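The arithmetic behind choosing a bin count can be written out as a short sketch. The function name is hypothetical, and the only inputs are the quantities described above: the expected run time and the bin size (1 ms by default).

```python
import math


def min_bincount(runtime_seconds, bin_size_ms=1.0):
    """Smallest number of bins whose total span covers the whole run.

    bins * bin_size must be at least the run time to avoid a mid-run re-bin.
    """
    return math.ceil(runtime_seconds * 1000.0 / bin_size_ms)


if __name__ == "__main__":
    # A 10 second run fits exactly in the default 10,000 bins of 1 ms.
    print(min_bincount(10))
    # A 2 minute run needs a larger +bincount.
    print(min_bincount(120))
```

For example, a run expected to last two minutes would need +bincount 120000 at the default bin size. Since run times are rarely known exactly, it seems prudent to add some margin on top of the computed value.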
Running the Projections GUI
This is a Java program that can be started with charm/tools/projections/bin/projections. The .sts file can be given as an argument.
The menu items under Tools include:
- Graph: this is very memory intensive. It plots processor usage or messages as a function of either processor or interval. (Not particularly useful)
- Timelines: This is also very memory intensive. For each processor, this gives a timeline of entry methods that were executed. This is useful to see exactly the sequence of events on each processor.
- Usage Profile: This gives a profile of the processor utilization over the selected interval. As well as a bar graph, it can give a table of the utilization of each entry point.
- Overview: This gives a processor utilization overview. As a function of time and processor number, the utilization will be shown as a color. Colors can also designate the entry point being executed. Note that when you switch from utilization to entry points, the log files get reread, and this takes time.
- Time Profile Graph: This gives an overall time profile of entry points being executed. As a function of time, the execution time spent in each entry point, summed over all selected processors, is plotted, with each entry point getting a different color.
Getting Load Balancing Information
Dynamic load balancing is a key feature of Charm++ available to ChaNGa. To get information on what the load balancer is doing use the following options.
- +LBDebug <number>, where the higher the number, the more debugging information you will receive on stderr and stdout.
The type of information probably varies with the choice of load balancer, but for RefineLB, one gets an estimate of the load on each processor, with the background load in parentheses. One also gets a report on which pieces migrate to which processors, and a mapping of pieces to processors.
Using TAU Performance Analysis with ChaNGa
TAU is a parallel performance analysis tool that can also be used with ChaNGa. TAU has a graphical user interface that allows the user to quickly identify performance bottlenecks.
To use it, the TAU source must be downloaded from the TAU site, and Charm++ needs to be built against the TAU libraries. See the build instructions for NAMD on the TAU wiki.
