NAME SpeedShop - An integrated package of performance tools IMPLEMENTATION IRIX systems DESCRIPTION SpeedShop is the generic name for an integrated package of performance tools that run performance experiments on executables and let you examine the results of those experiments. It also supports starting a process in such a way as to permit a debugger to attach to it. SpeedShop also runs Purify on executables. For Purify and for some experiments, instrumentation of the code is necessary. When it is necessary, SpeedShop performs the instrumentation automatically and runs the instrumented executable to generate the data. SUPPORTED EXECUTABLES SpeedShop works under IRIX 6.2 or later, and it supports executables compiled with the IRIX 6.2 compilers (o32, n32 and 64), or with the MIPSPro 7.x compilers (n32 and 64). SpeedShop supports C, C++, Fortran 77, Fortran 90, ADA, and assembler programs. Programs must be built using shared libraries (DSOs); nonshared or stripped executables are not supported. RECORDING EXPERIMENTS Experiments are recorded using the ssrun(1) command, as follows: ssrun [ssrun-options] -exp_type executable_name [executable_args] exp_type One of the experiment names described in the EXPERIMENT TYPES section. The result of an experiment is one or more files that are named by the following convention: executable_name.exp_type.[Rrank.][Tthread.]id rank Rank number of the MPI process that generated this experiment file. This part of the file name is optional and will not be present for non-MPI targets. Ranks are given in terms of the MPI_COMM_WORLD communicator. thread Number of the OpenMP thread that generated this experiment file. This part of the file name is optional and will not be present for non-OpenMP targets. Thread numbers are those given by the function omp_get_thread_num(). 
id One of the following one- or two-letter codes followed by the process identifier (PID): m For the master process created by ssrun; p For a process created by a call to sproc(); f For a process created by a call to fork(); e For a process created by a call to exec(); s For a process created by a call to system(); and fe For the exec'd process created by calls to fork() and exec(), with environment variable _SPEEDSHOP_TRACE_FORK_TO_EXEC set to False. To start the target process running and leave it in a state to attach a debugger, add the -hang flag: ssrun -hang -exp_type executable_name executable_args To get more detailed information about the run, add the -v flag, as in one of the following examples: ssrun -v -exp_type executable_name executable_args ssrun -v -hang -exp_type executable_name executable_args To run Purify on an executable, use the following: ssrun -purify executable_name executable_args Purify and performance experiments are mutually exclusive. ssrun takes additional arguments; see the ssrun(1) man page for further information. EXPERIMENT TYPES The following experiment types are supported on all architectures: usertime Returns CPU time, the time your program is actually running plus the time the operating system is performing services for your program. The display generated by prof breaks the program time down into the time used by each function within the program. Uses statistical callstack profiling, based on CPU time, with a time sample interval of 30 milliseconds. Note: An o32 executable must explicitly link with -lexc for this experiment to work. Program execution may show significant slowdown compared to the original executable. The stack unwind code sometimes fails to completely unwind the stack; consequently, caller attribution cannot be done beyond the point of failure. totaltime Returns wall-clock time in a manner identical to that of the usertime experiment. 
Uses statistical callstack profiling, based on wall-clock time, with a time sample interval of 30 milliseconds. [f]pcsamp[x] Returns the estimated actual CPU time for each source code line, machine code line, and function in your program. Uses statistical PC sampling, using 16-bit bins, based on user and system time, with a sample interval of 10 milliseconds. If the optional f prefix is specified, a sample interval of 1 millisecond will be used. If the optional x suffix is specified, a 32-bit bin size will be used. bbcounts Returns the calculated linear time of executed instructions. This produces a complete call graph, but does not take into consideration time spent in paging or any reduction in time due to processor parallelism. Uses basic-block counting, done by instrumenting the executable. fpe Traces all floating-point exceptions. heap Traces malloc and free calls and also supports various options for debugging heap usage. Use cvperf(1) to display this information; it is not supported with prof(1). io Traces the following I/O system calls: read(2), readv(2), pread(2), write(2), writev(2), pwrite(2), open(2), close(2), dup(2), lseek(2), pipe(2), and creat(2). mpi Traces calls to various MPI routines and generates a file viewable in the prof(1) report generator. For a list of the routines that are traced, see the ssrun(1) man page. mpi_trace Traces calls to various MPI routines and generates a file viewable in the cvperf(1) performance analyzer window. For a list of the routines that are traced, see the ssrun(1) man page. This experiment is deprecated and will be removed in a future release. On machines with hardware performance counters (R10000, R12000, R14000, and R16000 machines), the following additional types are supported: [f|s]gi_hwc Uses statistical PC sampling, based on overflows of the graduated-instruction counter (counter 17), at an overflow interval of 32771. If the optional f prefix is used, the overflow interval will be 6553. 
If the optional s prefix is used, the overflow interval will be 3999971. [f|s]cy_hwc Uses statistical PC sampling, based on overflows of the cycle counter (counter 0), at an overflow interval of 16411. If the optional f prefix is used, the overflow interval will be 3779. If the optional s prefix is used, the overflow interval will be 1999993. [f|s]ic_hwc Uses statistical PC sampling, based on overflows of the primary instruction-cache miss counter (counter 9), at an overflow interval of 2053. If the optional f prefix is used, the overflow interval will be 419. If the optional s prefix is used, the overflow interval will be 524309. [f|s]isc_hwc Uses statistical PC sampling, based on overflows of the secondary instruction-cache miss counter (counter 10), at an overflow interval of 131. If the optional f prefix is used, the overflow interval will be 29. If the optional s prefix is used, the overflow interval will be 65537. [f|s]dc_hwc Uses statistical PC sampling, based on overflows of the primary data-cache miss counter (counter 25), at an overflow interval of 2053. If the optional f prefix is used, the overflow interval will be 419. If the optional s prefix is used, the overflow interval will be 524309. [f|s]dsc_hwc Uses statistical PC sampling, based on overflows of the secondary data-cache miss counter (counter 26), at an overflow interval of 131. If the optional f prefix is used, the overflow interval will be 29. If the optional s prefix is used, the overflow interval will be 65537. [f|s]tlb_hwc Uses statistical PC sampling, based on overflows of the TLB miss counter (counter 23), at an overflow interval of 257. If the optional f prefix is used, the overflow interval will be 53. If the optional s prefix is used, the overflow interval will be 19997. [f|s]gfp_hwc Uses statistical PC sampling, based on overflows of the graduated floating-point instruction counter (counter 21), at an overflow interval of 32771. 
If the optional f prefix is used, the overflow interval will be 6553. If the optional s prefix is used, the overflow interval will be 3999971. [f|s]fsc_hwc Uses statistical PC sampling, based on overflows of the failed store conditionals counter (counter 5), at an overflow interval of 2003. If the optional f prefix is used, the overflow interval will be 401. If the optional s prefix is used, the overflow interval will be 19997. prof_hwc Uses statistical PC sampling, based on overflows of the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER, at an interval given by the environment variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW. Note that these environment variables cannot be used to override the counter number or interval for the other defined experiments. They are examined only when the prof_hwc experiment is specified. The default counter is the primary instruction-cache miss counter and the default overflow interval is 2053. gi_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the graduated-instruction counter (counter 17), at an overflow interval of 1000003. cy_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the cycle counter (counter 16), at an overflow interval of 10000019. ic_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the primary instruction-cache-miss counter (counter 9), at an overflow interval of 8009. isc_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the secondary instruction-cache-miss counter (counter 10), at an overflow interval of 2003. dc_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the primary data-cache-miss counter (counter 25), at an overflow interval of 8009. 
dsc_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the secondary data-cache-miss counter (counter 26), at an overflow interval of 2003. tlb_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the TLB miss counter (counter 23), at an overflow interval of 2521. gfp_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the graduated floating-point instruction counter (counter 21), at an overflow interval of 10007. fsc_hwctime Profiles the cycle counter using statistical call-stack sampling, based on overflows of the failed store conditionals counter (counter 5), at an overflow interval of 5003. prof_hwctime Profiles the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER using statistical call-stack sampling, based on overflows of the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER, at an interval given by the environment variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW. Note that these environment variables cannot be used to override the counter numbers or interval for the other defined experiments. They are examined only when the prof_hwctime experiment is specified. The default overflow and profiling counter is the cycle counter and the default overflow interval is 10000019. On SGI's Origin systems with the ccNUMA architecture, the following additional type is supported: numa Profiles ccNUMA memory access patterns by statistically sampling the memory accesses made by the application. Records information about the memory being accessed and which ccNUMA node is making the access. REPORT GENERATION Report generation is done through the prof(1) command: prof output file . . . output file The prof(1) command adds the data from all of the output files and produces a listing that depends on the particular experiment type. 
For all experiments, it produces a list of functions, annotated with the appropriate metric. For [f]pcsamp[x], and the various _hwc experiments, the function list is annotated with the exclusive metric. For the PC sampling experiments, the metric is exclusive time; for the various hardware counter profiling experiments, the metric is exclusive counts. For bbcounts experiments, the function list is annotated with a cycle count and percentage, a cumulative percentage for that function and all others above it in the list, an estimated linear time, an instruction execution count, and a call count. If the -b[utterfly] flag is added, a list of callers and callees of each function is also produced. For usertime and totaltime and the various _hwctime experiments, the function list is annotated with percentage of time or counts for the function, the time in that function, and the time or counts in that function and its descendants, and a count of the number of callstacks containing that function. If the -b[utterfly] flag is added, a list of callers and callees of each function is also produced. For fpe experiments, the function list is annotated with the percentage of FPEs in that function, and counts for the function and its descendants. If the -b[utterfly] flag is added, a list of callers and callees of each function is also produced. For io experiments, the function list is annotated with the percentage of IO calls in that function, and counts for the function and its descendants. If the -b[utterfly] flag is added, a list of callers and callees of each function is also produced. For mpi experiments, a call site list is produced that is annotated with the number of MPI calls made by that call site and the total amount of time taken by those calls. For numa experiments, the number of memory accesses sampled, the number of remote memory accesses, the percentage of remote memory accesses, and the average ccNUMA routing distance are all reported. 
There are many additional options to prof; see the prof(1) man page for further details and examples of some of the displays. CALIPER SAMPLES In the current releases, caliper samples can be recorded, and the -calipers option to prof lets you see the data for any caliper setting. Caliper samples are supported in three different ways: First, you can explicitly link with the SpeedShop runtime DSO and call its API routine to record a caliper sample. Second, you can define a signal to be used to record a caliper sample by specifying the environment variable _SPEEDSHOP_CALIPER_POINT_SIG and sending the target the specified signal. Third, you can set a caliper-sample trap in either dbx or the WorkShop debugger. In the current debuggers, this is done by planting a stop trap (breakpoint) and, when the process stops, evaluating the expression: ssrt_caliper_point(0, 0) The evaluation of the expression always returns zero, but a side effect of the evaluation is the recording of the appropriate data. After evaluation, process execution may be resumed. See the ssapi(3) man page for further details. USER ENVIRONMENT VARIABLE CONTROLS Various environment variables are normally used to control the operation of SpeedShop. They are as follows: _SPEEDSHOP_VERBOSE Causes a log of each program's operation to be written to stderr. If it is set to an empty string, only major events are logged; if it is set to a non-empty string, more detailed events are logged. _SPEEDSHOP_SILENT If set, suppresses all output, other than fatal error messages from SpeedShop. If both _SPEEDSHOP_VERBOSE and _SPEEDSHOP_SILENT are set, _SPEEDSHOP_SILENT wins. _SPEEDSHOP_CALIPER_POINT_SIG signal-number If specified, gives a signal number to be used for recording a caliper-point in the experiment. _SPEEDSHOP_POLLPOINT_CALIPER_POINT timer_type, timer_interval Sets a caliper point every timer_interval seconds. The timer_type argument is one of the following: 0 Real time, or wall-clock time. 
This is the total time a program spent while executing. It includes both time spent when a program is swapped out waiting for a CPU and the time the operating system is in control, performing some task for the program such as I/O or executing a system call. 1 Process virtual time. This is the time spent when the program is actually running. This does not include either the time spent when a program is swapped out waiting for a CPU or the time the operating system is in control, performing some task for the program such as I/O or executing a system call. 2 User and system time. This is process virtual time plus the time the system is running on behalf of the process. The system time could include performing I/O or executing system calls. _SPEEDSHOP_OUTPUT_DIRECTORY If specified, the output data files will be put in the named directory. _SPEEDSHOP_OUTPUT_FD If specified, gives the number of the file descriptor to be used for writing the output file. Note: this option is not supported in the current release. _SPEEDSHOP_REUSE_FILE_DESCRIPTORS If set, opens and closes the file descriptors for the output files every time performance data is to be written. _SPEEDSHOP_OUTPUT_FILENAME If specified, the given name will be used for the output file; if _SPEEDSHOP_OUTPUT_DIRECTORY is also specified, it will be prepended to the name. _SPEEDSHOP_HWC_COUNTER_NUMBER Specifies the overflow counter to be used for prof_hwc, prof_hwctime, or numa experiments. Counters are numbered between 0 and 31 and are described in the MIPS R10000 Microprocessor User's Manual and the MIPS R12000 Microprocessor User's Manual. Counter 0 counters are numbered 0-15, and counter 1 counters are numbered 16-31. _SPEEDSHOP_HWC_COUNTER_OVERFLOW Specifies the overflow value for the counter to be used in prof_hwc, prof_hwctime, or numa experiments. The value chosen may be any number greater than 0. 
Some choices may produce data that is not statistically random, but rather reflects a correlation between the overflow interval and a cyclic behavior in the application. Users may want to do two or more runs with different overflow values. This is unnecessary with the numa experiment as it randomly varies the real overflow value with every sample. _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER Specifies the profiling counter to be used for prof_hwctime experiments. Counters are numbered between 0 and 31, and are described in the MIPS R10000 Microprocessor User's Manual and the MIPS R12000 Microprocessor User's Manual. Counter 0 counters are numbered 0-15, and counter 1 counters are numbered 16-31. _SPEEDSHOP_OUTPUT_NOCOMPRESS If set, disables the compression of performance data. PROCESS TRACKING ENVIRONMENT VARIABLE CONTROLS The following environment variables are used for controlling the treatment of processes spawned from the original target: _SPEEDSHOP_TRACE_FORK {True|False} If True, specifies that processes spawned by calls to fork() will be monitored, if they do not call exec(). If they do call exec(), and _SPEEDSHOP_TRACE_FORK_TO_EXEC is not set to True, the data covering the time between the fork() and the exec() will be discarded. It is True by default. Note: in the current release, data will be recorded independent of whether the process calls exec() or not. _SPEEDSHOP_TRACE_FORK_TO_EXEC {True|False} If True, specifies that processes spawned by calls to fork() will be monitored, even if they also call exec(). It is False by default. _SPEEDSHOP_TRACE_EXEC {True|False} If True, specifies that processes spawned by calls to any of the various flavors of exec() will be monitored. It is True by default. _SPEEDSHOP_TRACE_SPROC {True|False} If True, specifies that processes spawned by calls to sproc() will be monitored. It is True by default. _SPEEDSHOP_TRACE_SYSTEM {True|False} If True, specifies that processes spawned by calls to system() will be monitored. It is False by default. 
_SPEEDSHOP_TRACE_MPI_RANKS mpi-ranks If specified, only the listed MPI ranks will be monitored. This list is a comma-separated list with optional dash-separated ranges. For example, "1-4,7". Rank numbers are given in terms of the MPI_COMM_WORLD communicator. Data is collected for all MPI ranks by default, and this option is silently ignored for non-MPI executables. EXPERT-MODE ENVIRONMENT VARIABLE CONTROLS The following additional environment variables are used for debugging and finer control of the operation of SpeedShop: _SPEEDSHOP_SAMPLING_MODE For PC-sampling and hardware-counter profiling, if set to 1, will generate data for the base executable only. If it is not set, or set to anything other than 1, data is generated for the executable and all DSOs it uses. _SPEEDSHOP_INIT_DEFERRED_SIG signal-number If specified, initialization of the experiment will not be performed when the target process starts, but rather will be delayed until the specified signal is sent to the process. A handler for the given signal will be installed when the process starts, and it is the user's responsibility to ensure that it is not overridden by the target code. If the process terminates before the signal is received, no data will be recorded. _SPEEDSHOP_INIT_DEFERRED If specified, initialization of the experiment will not be performed when the target process starts, but rather will be delayed until the application calls the SpeedShop API routine ssrt_experiment_init. If the process terminates before that routine is called, no data will be recorded. _SPEEDSHOP_SHUTDOWN_SIG signal-number If specified, termination of the experiment will not be performed when the target process exits, but rather will happen when the specified signal is sent to the process. A handler for the given signal will be installed when the process starts, and it is the user's responsibility to ensure that it is not overridden by the target code. 
If the process terminates before the signal is received, data is recorded normally. _SPEEDSHOP_EXPERIMENT_TYPE Passes the name of the experiment to the runtime. It is normally set by ssrun(1), but may be overwritten. _SPEEDSHOP_MARCHING_ORDERS Passes the marching orders of the experiment to the runtime. It is normally set by ssrun(1) from the experiment type, but may be overwritten. _SPEEDSHOP_EXTRA_MARCHING_ORDERS Specifies additional marching orders. This environment variable is useful when the experiment name is used to specify an experiment, but additional specification is also required via marching orders. _SPEEDSHOP_SBRK_BUFFER_LENGTH Defines the segment grow size for the internal malloc arena used. This arena is completely separate from the user's arena, and it usually grows in default segments of size 0x100000. _SPEEDSHOP_SBRK_BUFFER_ADDR Defines the preferred starting address to be used for the internal malloc arena. This option must be used with extreme care, since it might result in memory region overlap. _SPEEDSHOP_FILE_BUFFER_LENGTH Defines the size of the buffer used for writing the experiment files. The default length is 64 Kbytes. The buffer is only used for writing many small records to the file (as in tracing experiments); large records are written directly, to avoid the buffering overhead. _SPEEDSHOP_DEBUG_NO_SIG_TRAPS If set, disables the normal setting of signal handlers for all fatal and exit signals. _SPEEDSHOP_DEBUG_NO_STACK_UNWIND If set, suppresses the stack unwind as done in usertime or other callstack-based experiments. The option is used as a workaround for various unwind bugs in libexc. _SPEEDSHOP_RLD Defines the full path name to rld and enables rld profiling (for pcsamp and _hwc experiments only). If the path name does not lead to rld, SpeedShop determines the correct path name automatically. For example, if you set _SPEEDSHOP_RLD to 1, SpeedShop will locate rld automatically. 
_SPEEDSHOP_INSTR_ARGS Defines additional instrumentation arguments. INSTRUMENTATION Instrumentation is invoked automatically by ssrun(1) and, if necessary, for DSOs that are opened during a run by the runtime library. By default, instrumented executables and DSOs appear in the current working directory. You can direct them to a directory of your choice by setting the _SPEEDSHOP_OUTPUT_DIRECTORY environment variable. SPEEDSHOP API ROUTINES The SpeedShop API routines are defined in the include file SpeedShop/api.h, which is installed in /usr/include. It defines three entry points, described in the SpeedShop API man page, ssapi(3). SPEEDSHOP CUSTOM DATA CAPTURE ROUTINES The SpeedShop facility for users to add custom data capture routines is not available in the current release. MISCELLANEOUS UTILITY PROGRAMS Several utility programs are provided in addition to the main functionality in SpeedShop. They are: sscord, ssorder, and sswsextr Generate cord feedback files from recorded data. sswsextr is a script to produce the working-set files used for cord computations. See their respective man pages for more information. ssusage A variant of time(1) that prints more information about the resource usage of a program. See the ssusage(1) man page for more information. squeeze Allocates and locks down memory, making the system behave as if it had less physical memory than it really does. See squeeze(1) for more information. thrash Allocates memory and touches all of the pages in order to force other pages out of the system's physical memory. See the thrash(1) man page for more information. CAVEATS The caveats described here may affect the results of SpeedShop experiments. R10000 Hardware Counter 14 Revisions of the R10000 CPUs earlier than 3.1 differ from version 3.1 and later R10000 CPUs. The difference is in the interpretation of counter number 14. Before revision 3.1, counter 14 reflects the Virtual coherency condition. 
With revision 3.1 and later R10000 releases, counter 14 reflects ALU/FPU completion cycles. There are also some subtle differences in the semantics of some of the counters. See r10k_counters(5) for more information. In systems with a homogeneous deployment of CPUs at the same revision, SpeedShop will adjust the reported information accordingly. For systems with a mixed deployment of CPU revisions, including some before 3.1 and some at or after 3.1, the interpretation of counter 14 is undefined, and there may be some slight inaccuracies due to aggregation of counters with different semantics across all CPUs. Use hinv -v to identify the revision levels for all CPUs. Pthreads Performance data for applications that use pthreads for usertime experiments using SIGALRM on IRIX 6.2-6.5 and for _hwctime experiments on IRIX 6.5 may be subject to minor inaccuracies. SEE ALSO perfex(1), prof(1), squeeze(1), sscord(1), ssorder(1), ssrun(1), ssusage(1), sswsextr(1), thrash(1) fpe_ss(3), io_ss(3), malloc_ss(3), ssapi(3) r10k_counters(5), speedshop_restrictions(5) SpeedShop User's Guide
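
The experiment file naming convention (executable_name.exp_type.[Rrank.][Tthread.]id) and the _SPEEDSHOP_TRACE_MPI_RANKS list syntax ("1-4,7") can be illustrated with a short sketch. This is not part of SpeedShop: the Python helpers parse_experiment_filename and expand_rank_list are hypothetical names introduced only for illustration. The sketch assumes that the rank and thread fields appear as a literal R or T followed by decimal digits, and that the id is one of the documented codes (m, p, f, e, s, fe) followed by a decimal PID.

```python
import re

def parse_experiment_filename(name):
    """Split a file name of the form
    executable_name.exp_type.[Rrank.][Tthread.]id into its fields.
    Hypothetical helper; parses from the right so that executable
    names containing dots (for example a.out) are handled."""
    parts = name.split(".")
    # id: a one- or two-letter code followed by the PID.
    id_match = re.fullmatch(r"(fe|[mpfes])(\d+)", parts[-1])
    if len(parts) < 3 or id_match is None:
        raise ValueError("not an experiment file name: %r" % name)
    rest = parts[:-1]
    thread = rank = None
    # Optional OpenMP thread field, assumed to look like T<digits>.
    if re.fullmatch(r"T\d+", rest[-1]):
        thread = int(rest.pop()[1:])
    # Optional MPI rank field, assumed to look like R<digits>.
    if rest and re.fullmatch(r"R\d+", rest[-1]):
        rank = int(rest.pop()[1:])
    if len(rest) < 2:
        raise ValueError("missing executable or experiment type: %r" % name)
    exp_type = rest.pop()
    return {
        "executable": ".".join(rest),
        "exp_type": exp_type,
        "rank": rank,
        "thread": thread,
        "code": id_match.group(1),
        "pid": int(id_match.group(2)),
    }

def expand_rank_list(spec):
    """Expand a comma-separated list with optional dash-separated
    ranges, as described for _SPEEDSHOP_TRACE_MPI_RANKS."""
    ranks = set()
    for item in spec.split(","):
        low, dash, high = item.strip().partition("-")
        ranks.update(range(int(low), int(high if dash else low) + 1))
    return sorted(ranks)

if __name__ == "__main__":
    print(parse_experiment_filename("a.out.mpi.R3.m4242"))
    print(expand_rank_list("1-4,7"))
```

Parsing from the right is a design choice of this sketch: the id, thread, and rank fields have fixed shapes, whereas the executable name may itself contain dots.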