PROF(1)PROF(1) NAME prof - Analyzes and displays SpeedShop performance data SYNOPSIS prof [options] speedshop_data_file ... prof [options] executable_name [speedshop_data_file] ... IMPLEMENTATION IRIX systems DESCRIPTION prof analyzes one or more data files generated by the SpeedShop performance tools, such as ssrun(1) , and produces a report. Most reports are formatted with long lines and are best viewed in a window that is 135 characters wide or printed in wide format. The second form is used to analyze data files generated by the SpeedShop performance tools if the target program is not in the same directory as the data files, in which case executable_name should be the path to the target program. Multiple files can be included only if they are recorded from the same executable with the same experiment type (see ssrun(1) ). The default listing for all experiments lists functions in descending order of the appropriate exclusive performance metric. Exclusive data is derived from the function only; it does not include data from other functions that it calls. Inclusive data is derived from both the function and other functions it calls. (For more information, see the FUNCTION LIST section.) Options allow sorting by calls or by inclusive metrics, for those experiments in which the recorded data supports the two methods of sorting. Where applicable, the -b[utterfly] option also produces a listing of callers and callees for each function, with attribution percentages and time or counts. (See the BUTTERFLY LIST section.) Additional listings are described in the EXPERIMENTS section. OPTIONS The following are general options to the prof(1) command: -archinfo Reports the number of times each register was used as a destination, base (integer registers only) or source, how many times each instruction opcode was used, and some detailed statistics concerning branches jumps, and how many delay slots were filled with no-op instructions. Works only with bbcounts experiments. -basicblocks Prints a list of all the basic blocks executed, ordered by the total number of cycles spent in each basic block. Works only with bbcounts experiments. -b[utterfly] Causes prof to print a report showing the callers and callees of each function, with inclusive time attributed to each. For bbcounts experiments, the attribution is based on a heuristic. For the various call stack sampling and tracing experiments, the attribution is precise. The usertime, totaltime, and some _hwctime experiments are statistical in nature and so are less exact. This option is ignored for experiments in which the data does not support inclusive calculations. -calipers [n1,]n2 Causes prof to compute the data between caliper points n1 and n2, rather than for the entire experiment. If n1 >= n2, an error is reported. If n1 is negative, it is set to the beginning of the experiment. If n2 is greater than the maximum recorded, it is set to the maximum. If n1 is omitted, zero is assumed. -[no]cordfb Disables or enables cord feedback file generation for the executable only. Cord feedback is used to arrange procedures in the binary in an optimal ordering. This improves both paging and instruction cache performance. You can use cord(1) or ld(1) to actually do the procedure ordering. -cordfball Enables cord feedback for the executable and all DSOs. -dis[assemble] Disassembles and annotates the analyzed object code with cycle times (or PC samples). This option can be used when generating reports for bbcounts, pcsamp, _hwc, or numa experiments. -repository directory Will use the SpeedShop DSO information from this directory instead of processing the DSO in memory. This option will reduce the time prof takes to retrieve source file information, procedure name and address from the DSO. If you have multiple SpeedShop DSO repositories, you can use -repository multiple times to define all the repositories you need. It can also be used to save the experiments because saving the DSO information will allow prof to not use the original DSO which may have been modified. The directory is used in Read Only mode. -update_repository directory Prof will process the DSOs and directory is where useful information from the DSO will be put. This option can only be specified once and is required to save the DSO information. SpeedShop DSO information files have the extension _ssInfo<abi> depending on the abi (32, n32 or 64). The directory is used in Read Write mode. -dsolist Lists all the DSOs in the program and their start and end text addresses. -feedback Produces files with information that can be used to (a) tell the compiler how to optimize compilation of the program next time and (b) arrange procedures in the binary in an optimal ordering. Users can invoke the following to use the feedback file for subsequent compilations. % cc -fb cfb-filename The -fb option is also implemented on the CC(1), f77(1), and f90(1) compiler command lines. Procedures are normally ordered by their measured invocation counts; if -butterfly is also specified, procedures are ordered using call graph counts rather than invocation counts. The cord feedback file is named a.out.fb. The -feedback option produces cord feedback for the executable only. To get cord feedback for all the DSOs, use the options -feedback -cordfball together. To disable cord feedback while producing compiler feedback, use the options -feedback and -nocordfb together. This option also produces a file with information that the compiler system can use to recompile, optimizing by using measured branch frequencies, and so on. The feedback file is produced for the executable only. It is named a.out.cfb and is a binary file. It may be dumped using the fbdump(1) routine. This option can only be used bbcounts data files, and the data should be recorded on a binary that was compiled -O0. Recording data on binaries with higher optimization will generate a feedback file that does not have the appropriate correspondence between source lines and machine code. -pthreads pthread_id1,pthread_id2,,pthread_idn For usertime, totaltime, _hwctime, io, fpe, numa, and mpi experiments on applications that use pthreads (on IRIX 6.5 or later), analyzes data only for the specified pthreads. This flag is ignored for other experiments. -ranks mpi-ranks For mpi experiments, analyzes data only for the specified MPI ranks. The ranks are specified in the comma-separated list mpi-ranks. This list may also contain dash-separated ranges of ranks. For example, the list "1-4,7" is the same as the list "1,2,3,4,7". Rank numbers are given in terms of the MPI_COMM_WORLD communicator. This flag is ignored for other experiments. -h[eavy] Reports the most heavily used lines in descending order of use. This option can be used when generating reports for bbcounts, pcsamp, or _hwc experiments. It is ignored for other experiments. -l[ines] Performs like -h[eavy], but groups lines by procedure, with procedures sorted in descending order of use. Within a procedure, lines are listed in source order. This option can be used when generating reports for bbcounts, pcsamp, or _hwc experiments. It is ignored for other experiments. -overhead For parallel programs, produces a high-level report showing the overhead specific to the programming model used, either OpenMP, pthreads, or MPI. It reports a set of categories particular to each programming model and the percentages of metrics attributed to each category. -showss Enables the display of functions from the SpeedShop runtime DSO. Usually those functions are suppressed from the reports and computations. In addition, some statistics for prof's own memory usage will be printed. -S or -source Disassembles and annotates the analyzed object code with cycle times (or PC samples) and interleaves the source code. This option can be used when generating reports for bbcounts, pcsamp, _hwc, or numa experiments. -u[sage] Prints a report on system statistics. -ws Generates, for the executable only, a working-set file for the current caliper setting. -wsall Generates, for the executable and all the non-ignored DSOs, a working-set file for the current caliper setting. -instrdistrib Reports the distribution of executed instructions between the categories: loads, stores, branches (and jumps), integer operations, and floating-point operations. Any instructions which prof cannot categorize are placed into a special `Other' category. Works only with bbcounts experiments. Output Controls The following options control the way prof displays its output: -calls Sorts function list by procedure calls rather than by time. This option can only be used when generating reports for bbcounts experiments. -dislimit n Disassembles only those basic blocks with frequency >= n. -inclusive Sorts function list by inclusive data rather than by exclusive data. This option can only be used when generating reports for those experiments that have inclusive data; it is ignored for others. -nh Suppress various header blocks from the output. -fpe_counts For FPE experiments, this produces a table displaying how many of each FPE type were recorded. -fpe_type n For FPE experiments, this option restricts the output to only show data from FPE events of type n. The default is to display the combined results from all FPE events. -q[uit] n or n% or ncum% Truncates the -h[eavy], -l[ines], and -b[utterfly] listings. It can truncate after n lines (if n is an integer), after the first entry that represents less than n percent of the total (if n is followed immediately by a %), or after enough entries have been printed to account for n percent of the total (if n is followed immediately by cum%). For example, -q 15 truncates each part of the listing after 15 lines of text, -q 15% truncates each part after the first line that represents less than 15 percent of the whole, and -q 15cum% truncates each part after the line that brought the cumulative percentage above 15 percent. For -b[utterfly], -q ncum% behaves the same as -q n%. -rel[ative] Shows percentage attribution in a butterfly report relative to the central function. The default is to show percentages as absolute percentages over the whole run. -by_site Applies to MPI experiments only. It shows, on a per- call-site basis, the number of times an MPI function was called and the total amount of time taken in those calls. The results are sorted in descending order by time. -by_chain Applies to MPI experiments only. It shows, on a per- call-chain basis, the number of times an MPI function was called and the total amount of time taken in those calls. Selectivity Options The following options define the limits of what prof displays: -dso dso_name Reports only on the named DSO. Only the base name, not the full pathname, need be specified to the DSO; the .so suffix is required. Multiple instances of the -dso flag can be given, and the executable is considered a DSO like any other. All the DSOs from an experiment can be listed with the -dsolist flag. See the -xdso option. -e[xclude] procedure_name If you use one or more -e[xclude] options, the profiler omits the specified procedure from the listing. If any option uses an uppercase E for -E[xclude], prof also omits that procedure from the base upon which it calculates percentages. -o[nly] procedure_name If you use one or more -o[nly] options, the profile listing includes only the named procedures, rather than the entire program. If any option uses an uppercase O for -O[nly], prof uses only the named procedures, rather than the entire program, as the base upon which it calculates percentages. -xdso dso_name Excludes the named DSO from any reports. Only the base name, not the full pathname, of the DSO need be specified; the .so suffix is required. Multiple instances of the -xdso flag can be given. See the -dso option. CPU Options prof usually uses the scheduling model for the processor on which it is being run to perform the analysis. You can override the default with any of the following options: -clock megahertz Sets the CPU clock speed to megahertz MHz. It alters the appropriate parts of the listing to reflect the clock speed. The default value is the clock speed of the machine on which the experiment is performed. -cycle nanosecond Sets the cycle time to nanosecond ns. This is the same as -clock 1000/nanosecond. -r16000 | -r14000 | -r12000 | -r10000 | -r8000 | -r5000 | -r4000 Note that these options are only meaningful for bbcounts data. Debugging Options The following prof options apply to debugging: -debug:debug_flags Sets debug_flags to combinations of the following: GPROF_FLAG 0x00000001 COUNTS_FLAG 0x00000002 SAMPLE_FLAG 0x00000004 MISS_FLAG 0x00000008 FEEDBACK_FLAG 0x00000010 CORD_FLAG 0x00000020 USERPC_FLAG 0x00000040 MDEBUG_FLAG 0x00000080 BEAD_FLAG 0x00000100 EXPERIMENTS The following are some of the more popular SpeedShop experiments: usertime (callstack profiling, based on user time plus system time spent on behalf of the program) Interrupts the program every 30 milliseconds during its running time (including when it is in user or system mode, but not including any wait time). It records the callstack at each interrupt. The experiment shows both inclusive and exclusive time. The usertime data is statistical in nature and will show some variance from run to run. totaltime (callstack profiling, based on wall-clock time) Interrupts the program every 30 milliseconds of wall- clock time during the run and records the callstack at each interrupt. It shows both inclusive and exclusive time. The totaltime data is statistical in nature and will show some variance from run to run. [f]pcsamp[x] Directs the kernel to look at the user program counter (PC) every 10 milliseconds and record a histogram of the value of the program counter at each clock tick. pcsamp uses 16-bit bins, one for each PC value. It can only show exclusive data; that is, data about the current function, not about functions that called the current function of functions that the current function called itself. The f prefix executes a profile every millisecond, instead of every 10 milliseconds. The x suffix specifies 32-bit, instead of 16-bit, count bins. The pcsamp data is statistical in nature and will show some variance from run to run. _hwc The experiments that include the string _hwc require the kernel to look at the user PC every time the hardware performance counter specified by the experiment overflows and record a histogram of the value of the program counter at overflow. These experiments can only be run on R10000, R12000, R14000, or R16000 machines; other machines do not have the hardware performance counters. A number of hardware counter experiments are defined; see the ssrun(1) or the speedshop(1) man page for the complete list. They can only show exclusive data, that is data about the current function, not functions that called the current function or functions called by the current function. The particular counter used and its overflow value are specified in the experiment. Some of the hardware performance counter sampling experiments are statistical in nature and will show some variance from run to run; others are exact, returning the same data from run to run, provided the program executes the exact same sequence of instructions. Among the interesting counter experiments are the following: cy_hwc for cycle counting; gi_hwc for graduated instructions; ic_hwc for primary instruction cache misses; sic_hwc for secondary instruction cache misses; dc_hwc for primary data cache misses; sc_hwc for secondary data cache misses; tlb_hwc for TLB misses; gfp_hwc for graduated floating-point instructions; and fsc_hwc for failed store-conditional instructions. _hwctime Experiments that include the string _hwctime (callstack profiling, returns hardware counter overflow information) interrupt the program at every n overflows of a counter and record the callstack at each interrupt. The value of n depends on the counter chosen. It can take the same set of prefixes as above, but since callstacks are recorded, it can show both inclusive and exclusive time. _hwctime data is statistical in nature and will show some variance from run to run. bbcounts Instruments the code to count the number of times each basic block is executed. (A basic block is a region of the program that can be entered only at the beginning and exited only at the end.) The data recorded also contains counts for all function pointer calls. From this data, a machine model is used to compute the exclusive time (cycles) spent in each function. Inclusive time computations, performed when the -b[utterfly] flag is specified to prof, calculate the exclusive time for each routine as above, and then propagates the time to each caller of each routine in proportion to the number of calls. For example, if sin(x) takes 1000 cycles, and its callers (procedures foo() and bar()) call it 25 and 75 times, respectively, 250 cycles are attributed to foo() and 750 to bar(). By propagating cycles this way, __start() (or the o32 equivalent, __istart()) should end up with all the cycles counted in the program. Note that the propagation according to the number of calls may not be reasonable for some routines and may lead to misleading reports. For example, if a matrix multiply routine is substituted for sin(x) in the above example, and bar's calls are for 2X2 matrices, while foo's calls are for 100X100 matrices, the attribution distributes three-fourths of the time to bar, whereas nearly all the time really should be attributed to foo. fpe The floating-point exception library traces all floating-point exceptions with their callstacks. These experiments show listings similar to those for usertime experiments, except that the data reported by the -b option is a count of floating-point exceptions, rather than projected CPU time. io Traces all I/O calls made by the program. These experiments show calls attributed to functions, and, with the -b[utterfly] option specified, can show which functions made which calls. mpi Traces calls to MPI functions. Collects data such as the time taken by the call, which thread and MPI rank made the call, etc. The report for this experiment is similar to that described below in the OUTPUT REPORTS section. The default report shows the total number of times each MPI function was called and the amount of time spent in that function. The results are sorted in descending order by time. Data can also be shown on a per-call-site or per-call-chain basis by using the -by_site or -by-chain options respectively. numa Samples memory accesses on systems using the SGI ccNUMA architecture. Collects data such as the ccNUMA node making the memory access, the ccNUMA node containing the memory being accessed, etc. This experiment gives statistical information on a global, per function, and per-PC basis. Reports both the percentage of memory accesses that were remote and the average ccNUMA routing distance between the nodes. numa data is statistical in nature and will show some variance from run to run. Results will also vary depending on the operating system thread and memory placement policy currently in effect. OUTPUT REPORTS prof writes an analysis of the performance data to stdout. The first thing in the list is a summary of the experiment and description of the environment in which it was recorded. Next comes a header that summarizes the particular data recorded. Following that is the function list; if -b[utterfly] is specified, and the data supports it, a butterfly list is presented. If the -h[eavy] or -l[ines] options are included for pcsamp, _hwc, and bbcounts experiments, a report of data at the source line level is appended. The report is sorted either by the performance metric computed on a line basis, or by functions and then by line numbers within a function. For other experiments, these options are ignored. For bbcounts experiments only, if -basicblocks is specified, a report of data at the basic-block level is appended. If -archinfo is specified, also for bbcounts experiments only, a summary report of register usage, instruction usage, and various other statistics is appended. For other experiments, these options are ignored. If -dsolist is specified, a list of the DSOs used by the program is appended. If -usage is specified, a summary of the resources used by the program is appended. The following is part of an bbcounts output report with no options. The function list is wrapped because of space limitations. ------------------------------------------------------------------------- Summary of bbcounts time data (bbcounts)-- 2062563179: Total number of instructions executed 3929944273: Total computed cycles 22.457: Total computed execution time (secs.) 1.905: Average cycles / instruction ------------------------------------------------------------------------- Function list, in descending order by exclusive bbcounts time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% cycles instructions calls function (dso: file, line) [1] 21.453 95.5% 95.5% 3754320037 1971220024 1 anneal (generic: generic.c, 1573) Function List The default output for any experiment is a function list, sorted in the order of exclusive values of the primary metric: the performance cost of the function, as computed from the data recorded. For many experiments, this metric is a time, printed in the report in seconds; for others, it is a count of events-FPEs, IO calls, or hardware counter overflow counts. The report begins with a legend line, naming each of the columns of data. Each line in the list has an index; if -b[utterfly] is not specified, the index will be in numerical order; if it is specified, the index will be in order that the functions as sorted by the inclusive metric. The index serves as a poor person's hyperlink through the butterfly and between the butterfly and the function list. The primary metric is to the right of the index, followed by its representation as a percentage of the whole program, followed by the cumulative percentage up to that point in the list. The next two columns are the inclusive value of the metric and its representation as a percentage. Additional columns of data, as appropriate to the particular data recorded, may follow. Finally, the name of the function and the DSO and source file basenames appear in the right-most columns. The following is a partial function list. Some columns are wrapped because of space limitations. [index] excl.secs excl.% cum.% incl.secs incl.% samples procedure(dso:file,line) [4] 22.770 93.8% 93.8% 22.770 93.8% 759 anneal (generic: generic.c, 1573) [6] 1.020 4.2% 98.0% 1.020 4.2% 34 slaveusrtime (dlslave.so: dlslave.c, 22) [9] 0.210 0.9% 98.9% 0.210 0.9% 7 cvttrap (generic: generic.c, 317) Butterfly List The butterfly list is a set of records that show the callers and callees of each function. The list is sorted in the order of inclusive values of the primary metric. For each function, its callers are shown above it and its callees below. The center line is for the function itself; the index appears at the beginning and the end of the line. The second and third columns are the inclusive percentages of the primary metric and its absolute value. The next two columns of the center line are the exclusive percentage and the exclusive value of the metric, followed by the function identification. Callees are shown below the function, with attribution percentages and values lined up below the self percentages and values of the center function. In absolute mode (the default), the percentages, including the center node's self value, should add up to the inclusive percentage of the center node; in relative mode, specified with -rel[ative] flag, the percentages should add up to 100 %. In either mode, the attribution values should add up to the inclusive time of the central function. The attribution data for callees is followed by the inclusive value for that caller and callsite, then the callsite identification, with an address, a source file, and a line number. For callees, the source file will be that of the central function. The callers are shown above the central function, also with attribution percentages and values, but these have a different meaning; they are aligned with the central nodes inclusive values to indicate the difference. The attribution percentage and value in a caller's line represents the percentage and value, respectively, of the central function's metric that was attributed to that callsite. In absolute mode, the percentages should add up to the central function's percentage; in relative mode, they should add up to 100%. The attribution values should add to the central function's values. The attribution data for callers is followed by the inclusive value for that caller and callsite, followed by the callsite identification, with an address, a source file, and a line number. For callers, the source file will be that of the calling function. The follow is a butterfly entry for the main function. Changes have been made to the display because of space limitations. ------------------------------------------------------------------------- attrib.% attrib.time incl.time caller (callsite) [index] [index] incl.% incl.time self% self-time procedure [index] attrib.% attrib.time incl.time callee (callsite) [index] ------------------------------------------------------------------------- 100.0% 22.456 22.456 __start [1] [2] 100.0% 22.456 0.0% 0.000 main [2] 100.0% 22.456 22.456 Scriptstring [3] ------------------------------------------------------------------------- Line-Level Lists When invoked with the -h[eavy] or -l[ines] arguments, a list of line- level data is produced. When -h is used, the line list is sorted by the primary metric associated with each line, with lines from the various DSOs and source files intermixed. When -l is used, the lines are sorted based on the function they came from, with all lines with non-zero values from each function printed before any lines from the next most important function. For each line, the leftmost column contains the value for the primary metric computed from the particular experiment, followed by its representation as a percentage, followed by the cumulative percentage over all lines printed thus far. These data are followed by any additional metrics for the particular experiment, and then by the name of the function, its DSO, file, and line number. The following is an example of a partial line list from pcsamp output with the -h option: secs % cum.% samples function (dso: file, line) 21.410 79.1 79.1 2141 anneal (generic: generic.c, 1590) 3.810 14.1 93.2 381 anneal (generic: generic.c, 1589) 0.610 2.3 95.4 61 slaveusrtime (dlslave.so: dlslave.c, 29) Basic Block List If you are running an bbcounts experiment and invoke the -basicblocks argument, a list of all the basic blocks in the program that were executed is generated. The list is preceded by a header and a column header line. Next comes the list of basic blocks, in order of total cycles. Each basic block is printed with its index, the number of cycles per execution of the block, the count of executions, the total cycles, and the total cycles represented as a percentage. Following that on each line is the function and the address of the first line, with the DSO, source file, and source line number at the end. The following is an example of part of the output. The end of each line is wrapped because of space limitations. index cyc./cnt. counts total cycles % function (address; dso: file, line) 0 259 14440000 3739960000 95.2% anneal (@0x100068c4; generic: generic.c, 1590) 1 29 5000000 145000000 3.7% slaveusrtime (@0x5ffe0748; dlslave.so: dlslave.c, 30) 2 30 500000 15000000 0.4% cvttrap (@0x10002834; generic: generic.c, 328) Architectural Information Report If you are running an bbcounts experiment and invoke the -archinfo argument, a list of various metrics concerning execution of the program is printed. The report consists of a header and a number of subsections, each with the appropriate headers. For integer registers, counts are printed of the number of times each register was used and its percentage, the number of times each register was used as a base register and its percentage, and the number of times each register is used as a destination register and its percentage. For floating-point registers, the same data, excluding the base count statistics, is printed. Following the register usage statistics, a number of counts of instruction types or sequences are printed, each with a description, followed by a list of all the different instructions used, sorted by the number of times each was executed. For each instruction, the dynamic count of executions and its representation as a percentage is printed. That is followed by the cumulative execution percentage, the count of the number of distinct instructions of that type that were executed one or more times, and that count represented as a percentage. The following is an example of part of the output: Integer register usage ---------------------- register use count % base count % dest. count % r08 189261367 4.79% 14441739 0.37% 87022419 2.20% r09 104994438 2.66% 380142 0.01% 501366 0.01% r10 190505413 4.82% 14441878 0.37% 86650626 2.19% DSO List When prof is invoked with the -dsolist argument, a report of summary information about the DSOs in the execution of the program is printed. For each DSO, the display gives the name, a count of instructions, functions, source-files, source-lines, the high and low addresses, and the full pathname to the DSO. If the DSO is ignored in computation, either because it belongs to the SpeedShop runtime DSO and the -showss flag is not provided, because the DSO was excluded by being listed as with a -xdso argument, or by virtue of its not being listed with a -dso argument, the DSO is flagged as ignored. The following is an example of the output. The DSO pathname has been wrapped to the next line because of space limitations. DSO name instrs. functions files lines low - high text addrs., DSO pathname generic 5342 39 5 0 0x10001ad8 - 0x10006e50, generic libss.so 571 31 4 0 0x09e210b0 - 0x09e2199c, /usr/lib32/libss.so libssrt.so 113284 1155 268 0 0x09f04380 - 0x09f72d90, /usr/lib32/libssrt.so [ignored] libm.so 40663 159 144 0 0x07eb4a6c - 0x07edc5c8, /usr/lib32/libm.so libc.so.1 237389 2840 959 0 0x0fa331cc - 0x0fb1af00, /usr/lib32/libc.so.1 dlslave.so 112 4 2 0 0x5ffe062c - 0x5ffe07ec, ./dlslave.so Resource Usage When invoked with the -usage argument, a report of summary usage data is produced. The data is as measured by the kernel during the run. The data consists of both per-process (per-file) and per-system metrics. The display includes real-time (wall-clock time), user time, system time, and wait time. The per-process data is shown as a sum of the data from the experiment files; the real-time, and the system-wide statistics are printed as maxima over the data in each file. The accounting timers, giving the time spent in each of the various process states, are included. The sum of the accounting timers should be approximately equal to the elapsed real time, since the process must always be in one of the states. There is some skew in the reading of the data, so some discrepancy should be expected. Other summary statistics include bytes read and written, page faults, context switches, system calls, and process-size statistics. The following is an example of a usage display: Real-time: 29.569613 (max.) Process Statistics (sum) System-wide Statistics (max.) ======================== ============================= user-time: 23.127237 total-user-cpu: 26.970000 sys-time: 0.417994 total-sys-cpu: 0.810000 wait-time: 6.024380 total-idle-cpu: 1.790000 accounting state timers ----------------------- user-run: 23.127232 system-run: 0.417997 interrupt-run: 0.329550 blocked-I/O-wait: 0.065945 memory-wait: 0.000000 select-wait: 0.000000 job-control-wait: 0.000000 run-queue-wait: 3.851447 sleep-wait: 1.718699 stream-monitor-wait: 0.000000 raw-I/O-wait: 0.000000 ----------- SUM: 29.510873 summary statistics ------------------ bytes-read: 7308500 bytes-written: 1152 page-faults: 4 minor-faults: 42 vol.-context-switches: 431 invol.-context-switches: 2985 total-context-switches: 4602 signals: 4 total-interrupts: 66630 user-syscalls: 877 total-syscalls: 8619 read-syscalls: 449 total-read-syscalls: 453 write-syscalls: 28 total-write-syscalls: 30 poll-syscalls: 0 ioctl-syscalls: 0 process-size: 2115 resident-set-size: 520 Disassembly and Source Lists When invoked with the -dis argument, an assembly listing version of the program is generated. If the -source argument is given as well, the source code for the disassembly is intermixed with it. The following is a partial example of the output when using the -dis option: anneal: <0x10006830-0x10006b3c> 2524 total samples(93.24%) [1573] 0x10006830 0x27bdffd0 addiu sp,sp,-48 # 1 [1573] 0x10006834 0xffbc0020 sd gp,32(sp) # 2 [1573] 0x10006838 0xffbf0018 sd ra,24(sp) # 3 [1573] 0x1000683c 0x3c030002 lui v1,0x2 # 4 [1573] 0x10006840 0x246397e8 addiu v1,v1,-26648 # 5 [1573] 0x10006844 0x0323e021 addu gp,t9,v1 # 6 [1575] 0x10006848 0xd7808370 ldc1 $f0,-31888(gp) # 7 <2 cycle stall for following instruction> [1575] 0x1000684c 0xf7a00000 sdc1 $f0,0(sp) # 10 DIAGNOSTICS prof prints warnings and fatal errors on stderr. With inclusive cycle counting, prof prints a list of functions at the end that are called but not defined. Any functions starting with _rld listed here represent normal behavior; they are not errors. They appear because rld is not instrumented. One way to sanity-check inclusive cycle counts is to look at the percentage cycles for __start(). If the value is less than 98-99%, the inclusive report is suspect. Look for other warnings that indicate that prof didn't take into account certain procedures. There are a number of known cases when prof fails to list cycles of a procedure in the inclusive listing. The reasons can be one of the following: * init and fini sections, MIPS.stubs, are not part of any procedure. * Calls to procedures that do not use a jump and link are not recognized as procedure calls. * Executions of global procedures with same name in different DSOs. In this case, only one of them is listed. All these exceptions are listed at the end of the -butterfly listing, under a separate section. LIMITATIONS With both PC-sampling and basic block counting, prof may report times for procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF stands for Dummy Function and indicates cycles spent in parts of text that are not in any function: for example, init and fini sections and MIPS.stubs sections. For PC sampling and hardware counter overflow PC sampling, a dummy function, *DF*_Other is used to report sampling hits in an overflow bin; these hits include time spent in rld and any other text regions not listed in any DSO. If any of the object files linked into the application have been stripped of line-number information (with ld -x, for example), prof warns about the affected procedures. The instruction counts for such procedures are shown as a procedure total, not on a per-basic-block basis. Where a line number would normally appear in a report on a function without line numbers, question marks appear instead. prof does not take into account interactions between basic blocks in bbcounts time runs. prof computes the cycles for one execution of a basic block, assuming all registers are free on entry to the block, and it multiplies this count by the number of times that basic block is executed. In real programs, a block may be entered with a result not yet ready in a register, or with a function unit busy, so the cycle count computed could be either higher or lower than the correct value. Extending the computations to include inter-block state would be prohibitively expensive. When run on a program using shared libraries, prof sometimes combines the times of real but anonymous procedures in a shared library into the preceding (according the library memory layout) externally visible function. The times reported for your procedures are not affected by this attribution error, which is normally minor. prof cannot be run on programs that have been stripped. Compiler optimization level 3 does procedure inlining. This can result in extremely misleading profiles, since the time spent in the inlined procedure shows up in the profile as time spent in the procedure into which it was inlined. It is generally better to use compiler optimization level 2 or less when gathering an execution profile. Fortran alternate entry point times are attributed to the main function or subroutine, since there is no general way for prof to separate the times for the alternate entries. EXAMPLES The first example shows how the data displayed in the RESOURCE USAGE section was generated. The ssrun(1) command picks up data from the executable program generic and builds the file generic.fpe.m12213. The prof command then analyzes that file. The output is directed into the file named usage.report. Example 1: % ssrun -fpe generic % prof -usage generic.fpe.m12213 >usage.report SEE ALSO speedshop(1), ssrun(1), sscord(1), ssorder(1), fbdump(1)