prof(1)



PROF(1)PROF(1)


NAME
     prof - Analyzes and displays SpeedShop performance data

SYNOPSIS
     prof [options] speedshop_data_file  ...
     prof [options] executable_name [speedshop_data_file] ...

IMPLEMENTATION
     IRIX systems

DESCRIPTION
     prof analyzes one or more data files generated by the SpeedShop
     performance tools, such as ssrun(1) , and produces a report. Most
     reports are formatted with long lines and are best viewed in a window
     that is 135 characters wide or printed in wide format.

     The second form is used to analyze data files generated by the
     SpeedShop performance tools if the target program is not in the same
     directory as the data files, in which case executable_name should be
     the path to the target program.

     Multiple files can be included only if they are recorded from the same
     executable with the same experiment type (see ssrun(1) ).

     The default listing for all experiments lists functions in descending
     order of the appropriate exclusive performance metric.  Exclusive data
     is derived from the function only; it does not include data from other
     functions that it calls.  Inclusive data is derived from both the
     function and other functions it calls.  (For more information, see the
     FUNCTION LIST section.)

     Options allow sorting by calls or by inclusive metrics, for those
     experiments in which the recorded data supports the two methods of
     sorting. Where applicable, the -b[utterfly] option also produces a
     listing of callers and callees for each function, with attribution
     percentages and time or counts.  (See the BUTTERFLY LIST section.)

     Additional listings are described in the EXPERIMENTS section.

OPTIONS
     The following are general options to the prof(1) command:

     -archinfo      Reports the number of times each register was used as a
                    destination, base (integer registers only) or source,
                    how many times each instruction opcode was used, and
                    some detailed statistics concerning branches jumps, and
                    how many delay slots were filled with no-op
                    instructions.  Works only with bbcounts experiments.

     -basicblocks   Prints a list of all the basic blocks executed, ordered
                    by the total number of cycles spent in each basic
                    block.  Works only with bbcounts experiments.

     -b[utterfly]   Causes prof to print a report showing the callers and
                    callees of each function, with inclusive time
                    attributed to each.  For bbcounts experiments, the
                    attribution is based on a heuristic.  For the various
                    call stack sampling and tracing experiments, the
                    attribution is precise. The usertime, totaltime, and
                    some _hwctime experiments are statistical in nature and
                    so are less exact.  This option is ignored for
                    experiments in which the data does not support
                    inclusive calculations.

     -calipers  [n1,]n2
                    Causes prof to compute the data between caliper points
                    n1 and n2, rather than for the entire experiment. If n1
                    >= n2, an error is reported. If n1 is negative, it is
                    set to the beginning of the experiment. If n2 is
                    greater than the maximum recorded, it is set to the
                    maximum.  If n1 is omitted, zero is assumed.

     -[no]cordfb    Disables or enables cord feedback file generation for
                    the executable only.  Cord feedback is used to arrange
                    procedures in the binary in an optimal ordering. This
                    improves both paging and instruction cache performance.
                    You can use cord(1) or ld(1) to actually do the
                    procedure ordering.

     -cordfball     Enables cord feedback for the executable and all DSOs.

     -dis[assemble] Disassembles and annotates the analyzed object code
                    with cycle times (or PC samples).  This option can be
                    used when generating reports for bbcounts, pcsamp,
                    _hwc, or numa experiments.

     -repository directory
                    Will use the SpeedShop DSO information from this
                    directory instead of processing the DSO in memory. This
                    option will reduce the time prof takes to retrieve
                    source file information, procedure name and address
                    from the DSO. If you have multiple SpeedShop DSO
                    repositories, you can use -repository multiple times to
                    define all the repositories you need.  It can also be
                    used to save the experiments because saving the DSO
                    information will allow prof to not use the original DSO
                    which may have been modified. The directory is used in
                    Read Only mode.

     -update_repository directory
                    Prof will process the DSOs and directory is where
                    useful information from the DSO will be put.  This
                    option can only be specified once and is required to
                    save the DSO information. SpeedShop DSO information
                    files have the extension _ssInfo<abi> depending on the
                    abi (32, n32 or 64).  The directory is used in Read
                    Write mode.

     -dsolist       Lists all the DSOs in the program and their start and
                    end text addresses.

     -feedback      Produces files with information that can be used to (a)
                    tell the compiler how to optimize compilation of the
                    program next time and (b) arrange procedures in the
                    binary in an optimal ordering.  Users can invoke the
                    following to use the feedback file for subsequent
                    compilations.

                         % cc -fb cfb-filename

                    The -fb option is also implemented on the CC(1),
                    f77(1), and f90(1) compiler command lines.

                    Procedures are normally ordered by their measured
                    invocation counts; if -butterfly is also specified,
                    procedures are ordered using call graph counts rather
                    than invocation counts.  The cord feedback file is
                    named a.out.fb.  The -feedback option produces cord
                    feedback for the executable only. To get cord feedback
                    for all the DSOs, use the options -feedback -cordfball
                    together.  To disable cord feedback while producing
                    compiler feedback, use the options -feedback and
                    -nocordfb together.

                    This option also produces a file with information that
                    the compiler system can use to recompile, optimizing by
                    using measured branch frequencies, and so on.  The
                    feedback file is produced for the executable only.  It
                    is named a.out.cfb and is a binary file.  It may be
                    dumped using the fbdump(1) routine.  This option can
                    only be used bbcounts data files, and the data should
                    be recorded on a binary that was compiled -O0.
                    Recording data on binaries with higher optimization
                    will generate a feedback file that does not have the
                    appropriate correspondence between source lines and
                    machine code.

     -pthreads pthread_id1,pthread_id2,,pthread_idn
                    For usertime, totaltime, _hwctime, io, fpe, numa, and
                    mpi experiments on applications that use pthreads (on
                    IRIX 6.5 or later), analyzes data only for the
                    specified pthreads.  This flag is ignored for other
                    experiments.

     -ranks mpi-ranks
                    For mpi experiments, analyzes data only for the
                    specified MPI ranks. The ranks are specified in the
                    comma-separated list mpi-ranks. This list may also
                    contain dash-separated ranges of ranks. For example,
                    the list "1-4,7" is the same as the list "1,2,3,4,7".
                    Rank numbers are given in terms of the MPI_COMM_WORLD
                    communicator. This flag is ignored for other
                    experiments.

     -h[eavy]       Reports the most heavily used lines in descending order
                    of use.  This option can be used when generating
                    reports for bbcounts, pcsamp, or _hwc experiments.  It
                    is ignored for other experiments.

     -l[ines]       Performs like -h[eavy], but groups lines by procedure,
                    with procedures sorted in descending order of use.
                    Within a procedure, lines are listed in source order.
                    This option can be used when generating reports for
                    bbcounts, pcsamp, or _hwc experiments.  It is ignored
                    for other experiments.

     -overhead      For parallel programs, produces a high-level report
                    showing the overhead specific to the programming model
                    used, either OpenMP, pthreads, or MPI. It reports a set
                    of categories particular to each programming model and
                    the percentages of metrics attributed to each category.

     -showss        Enables the display of functions from the SpeedShop
                    runtime DSO.  Usually those functions are suppressed
                    from the reports and computations.  In addition, some
                    statistics for prof's own memory usage will be printed.

     -S or -source  Disassembles and annotates the analyzed object code
                    with cycle times (or PC samples) and interleaves the
                    source code.  This option can be used when generating
                    reports for bbcounts, pcsamp, _hwc, or numa
                    experiments.

     -u[sage]       Prints a report on system statistics.

     -ws            Generates, for the executable only, a working-set file
                    for the current caliper setting.

     -wsall         Generates, for the executable and all the non-ignored
                    DSOs, a working-set file for the current caliper
                    setting.

     -instrdistrib  Reports the distribution of executed instructions
                    between the categories:  loads, stores, branches (and
                    jumps), integer operations, and floating-point
                    operations. Any instructions which prof cannot
                    categorize are placed into a special `Other' category.
                    Works only with bbcounts experiments.

   Output Controls
     The following options control the way prof displays its output:

     -calls         Sorts function list by procedure calls rather than by
                    time.  This option can only be used when generating
                    reports for bbcounts experiments.

     -dislimit n    Disassembles only those basic blocks with frequency >=
                    n.

     -inclusive     Sorts function list by inclusive data rather than by
                    exclusive data.  This option can only be used when
                    generating reports for those experiments that have
                    inclusive data; it is ignored for others.

     -nh            Suppress various header blocks from the output.

     -fpe_counts    For FPE experiments, this produces a table displaying
                    how many of each FPE type were recorded.

     -fpe_type n    For FPE experiments, this option restricts the output
                    to only show data from FPE events of type n. The
                    default is to display the combined results from all FPE
                    events.

     -q[uit]  n or n% or ncum%
                    Truncates the -h[eavy], -l[ines], and -b[utterfly]
                    listings. It can truncate after n lines (if n is an
                    integer), after the first entry that represents less
                    than n percent of the total (if n is followed
                    immediately by a %), or after enough entries have been
                    printed to account for n percent of the total (if n is
                    followed immediately by cum%).

                    For example, -q 15 truncates each part of the listing
                    after 15 lines of text, -q 15% truncates each part
                    after the first line that represents less than 15
                    percent of the whole, and -q 15cum% truncates each part
                    after the line that brought the cumulative percentage
                    above 15 percent.  For -b[utterfly], -q ncum% behaves
                    the same as -q n%.

     -rel[ative]    Shows percentage attribution in a butterfly report
                    relative to the central function.  The default is to
                    show percentages as absolute percentages over the whole
                    run.

     -by_site       Applies to MPI experiments only. It shows, on a per-
                    call-site basis, the number of times an MPI function
                    was called and the total amount of time taken in those
                    calls. The results are sorted in descending order by
                    time.

     -by_chain      Applies to MPI experiments only. It shows, on a per-
                    call-chain basis, the number of times an MPI function
                    was called and the total amount of time taken in those
                    calls.

   Selectivity Options
     The following options define the limits of what prof displays:

     -dso dso_name  Reports only on the named DSO.  Only the base name, not
                    the full pathname, need be specified to the DSO; the
                    .so suffix is required.  Multiple instances of the -dso
                    flag can be given, and the executable is considered a
                    DSO like any other.  All the DSOs from an experiment
                    can be listed with the -dsolist flag.  See the -xdso
                    option.

     -e[xclude] procedure_name
                    If you use one or more -e[xclude] options, the profiler
                    omits the specified procedure from the listing.  If any
                    option uses an uppercase E for -E[xclude], prof also
                    omits that procedure from the base upon which it
                    calculates percentages.

     -o[nly] procedure_name
                    If you use one or more -o[nly] options, the profile
                    listing includes only the named procedures, rather than
                    the entire program. If any option uses an uppercase O
                    for -O[nly], prof uses only the named procedures,
                    rather than the entire program, as the base upon which
                    it calculates percentages.

     -xdso dso_name Excludes the named DSO from any reports.  Only the base
                    name, not the full pathname, of the DSO need be
                    specified; the .so suffix is required.  Multiple
                    instances of the -xdso flag can be given.    See the
                    -dso option.

   CPU Options
     prof usually uses the scheduling model for the processor on which it
     is being run to perform the analysis.  You can override the default
     with any of the following options:

     -clock megahertz
                    Sets the CPU clock speed to megahertz MHz.  It alters
                    the appropriate parts of the listing to reflect the
                    clock speed.  The default value is the clock speed of
                    the machine on which the experiment is performed.

     -cycle nanosecond
                    Sets the cycle time to nanosecond ns. This is the same
                    as -clock 1000/nanosecond.

     -r16000 | -r14000 | -r12000 | -r10000 | -r8000 | -r5000 | -r4000
                    Note that these options are only meaningful for
                    bbcounts data.

   Debugging Options
     The following prof options apply to debugging:

     -debug:debug_flags
                    Sets debug_flags to combinations of the following:

                   GPROF_FLAG      0x00000001

                   COUNTS_FLAG     0x00000002

                   SAMPLE_FLAG     0x00000004

                   MISS_FLAG       0x00000008

                   FEEDBACK_FLAG   0x00000010

                   CORD_FLAG       0x00000020

                   USERPC_FLAG     0x00000040

                   MDEBUG_FLAG     0x00000080

                   BEAD_FLAG       0x00000100

EXPERIMENTS
     The following are some of the more popular SpeedShop experiments:

     usertime       (callstack profiling, based on user time plus system
                    time spent on behalf of the program) Interrupts the
                    program every 30 milliseconds during its running time
                    (including when it is in user or system mode, but not
                    including any wait time).  It records the callstack at
                    each interrupt.  The experiment shows both inclusive
                    and exclusive time.  The usertime data is statistical
                    in nature and will show some variance from run to run.

     totaltime      (callstack profiling, based on wall-clock time)
                    Interrupts the program every 30 milliseconds of wall-
                    clock time during the run and records the callstack at
                    each interrupt.  It shows both inclusive and exclusive
                    time.  The totaltime data is statistical in nature and
                    will show some variance from run to run.

     [f]pcsamp[x]   Directs the kernel to look at the user program counter
                    (PC) every 10 milliseconds and record a histogram of
                    the value of the program counter at each clock tick.
                    pcsamp uses 16-bit bins, one for each PC value.  It can
                    only show exclusive data; that is, data about the
                    current function, not about functions that called the
                    current function of functions that the current function
                    called itself.  The f prefix executes a profile every
                    millisecond, instead of every 10 milliseconds.  The x
                    suffix specifies 32-bit, instead of 16-bit, count bins.
                    The pcsamp data is statistical in nature and will show
                    some variance from run to run.

     _hwc           The experiments that include the string _hwc require
                    the kernel to look at the user PC every time the
                    hardware performance counter specified by the
                    experiment overflows and record a histogram of the
                    value of the program counter at overflow.  These
                    experiments can only be run on R10000, R12000, R14000,
                    or R16000 machines; other machines do not have the
                    hardware performance counters.

                    A number of hardware counter experiments are defined;
                    see the ssrun(1) or the speedshop(1) man page for the
                    complete list.  They can only show exclusive data, that
                    is data about the current function, not functions that
                    called the current function or functions called by the
                    current function.  The particular counter used and its
                    overflow value are specified in the experiment.  Some
                    of the hardware performance counter sampling
                    experiments are statistical in nature and will show
                    some variance from run to run; others are exact,
                    returning the same data from run to run, provided the
                    program executes the exact same sequence of
                    instructions.

                    Among the interesting counter experiments are the
                    following:

                     cy_hwc for cycle counting;
                     gi_hwc for graduated instructions;
                     ic_hwc for primary instruction cache misses;
                     sic_hwc for secondary instruction cache misses;
                     dc_hwc for primary data cache misses;
                     sc_hwc for secondary data cache misses;
                     tlb_hwc for TLB misses;
                     gfp_hwc for graduated floating-point instructions; and
                     fsc_hwc for failed store-conditional instructions.

     _hwctime       Experiments that include the string _hwctime (callstack
                    profiling, returns hardware counter overflow
                    information) interrupt the program at every n overflows
                    of a counter and record the callstack at each
                    interrupt.  The value of n depends on the counter
                    chosen.  It can take the same set of prefixes as above,
                    but since callstacks are recorded, it can show both
                    inclusive and exclusive time.  _hwctime data is
                    statistical in nature and will show some variance from
                    run to run.

     bbcounts       Instruments the code to count the number of times each
                    basic block is executed.  (A basic block is a region of
                    the program that can be entered only at the beginning
                    and exited only at the end.)  The data recorded also
                    contains counts for all function pointer calls. From
                    this data, a machine model is used to compute the
                    exclusive time (cycles) spent in each function.

                    Inclusive time computations, performed when the
                    -b[utterfly] flag is specified to prof, calculate the
                    exclusive time for each routine as above, and then
                    propagates the time to each caller of each routine in
                    proportion to the number of calls.  For example, if
                    sin(x) takes 1000 cycles, and its callers (procedures
                    foo() and bar()) call it 25 and 75 times, respectively,
                    250 cycles are attributed  to foo() and 750 to bar().
                    By propagating cycles this way, __start() (or the o32
                    equivalent, __istart()) should end up with all the
                    cycles counted in the program.

                    Note that the propagation according to the number of
                    calls may not be reasonable for some routines and may
                    lead to misleading reports.  For example, if a matrix
                    multiply routine is substituted for sin(x) in the above
                    example, and bar's calls are for 2X2 matrices, while
                    foo's calls are for 100X100 matrices, the attribution
                    distributes three-fourths of the time to bar, whereas
                    nearly all the time really should be attributed to foo.

     fpe            The floating-point exception library traces all
                    floating-point exceptions with their callstacks.  These
                    experiments show listings similar to those for usertime
                    experiments, except that the data reported by the -b
                    option is a count of floating-point exceptions, rather
                    than projected CPU time.

     io             Traces all I/O calls made by the program.  These
                    experiments show calls attributed to functions, and,
                    with the -b[utterfly] option specified, can show which
                    functions made which calls.

     mpi            Traces calls to MPI functions.  Collects data such as
                    the time taken by the call, which thread and MPI rank
                    made the call, etc.  The report for this experiment is
                    similar to that described below in the OUTPUT REPORTS
                    section.  The default report shows the total number of
                    times each MPI function was called and the amount of
                    time spent in that function. The results are sorted in
                    descending order by time. Data can also be shown on a
                    per-call-site or per-call-chain basis by using the
                    -by_site or -by-chain options respectively.

     numa           Samples memory accesses on systems using the SGI ccNUMA
                    architecture.  Collects data such as the ccNUMA node
                    making the memory access, the ccNUMA node containing
                    the memory being accessed, etc.  This experiment gives
                    statistical information on a global, per function, and
                    per-PC basis.  Reports both the percentage of memory
                    accesses that were remote and the average ccNUMA
                    routing distance between the nodes.  numa data is
                    statistical in nature and will show some variance from
                    run to run.  Results will also vary depending on the
                    operating system thread and memory placement policy
                    currently in effect.

OUTPUT REPORTS
     prof writes an analysis of the performance data to stdout.  The first
     thing in the list is a summary of the experiment and description of
     the environment in which it was recorded.  Next comes a header that
     summarizes the particular data recorded.  Following that is the
     function list;  if -b[utterfly] is specified, and the data supports
     it, a butterfly list is presented.

     If the -h[eavy] or -l[ines] options are included for pcsamp, _hwc, and
     bbcounts experiments, a report of data at the source line level is
     appended. The report is sorted either by the performance metric
     computed on a line basis, or by functions and then by line numbers
     within a function.  For other experiments, these options are ignored.

     For bbcounts experiments only, if -basicblocks is specified, a report
     of data at the basic-block level is appended.  If -archinfo is
     specified, also for bbcounts experiments only, a summary report of
     register usage, instruction usage, and various other statistics is
     appended.  For other experiments, these options are ignored.

     If -dsolist is specified, a list of the DSOs used by the program is
     appended.

     If -usage is specified, a summary of the resources used by the program
     is appended.

     The following is part of an bbcounts output report with no options.
     The function list is wrapped because of space limitations.

  -------------------------------------------------------------------------
  Summary of bbcounts time data (bbcounts)--
                      2062563179: Total number of instructions executed
                      3929944273: Total computed cycles
                          22.457: Total computed execution time (secs.)
                           1.905: Average cycles / instruction
  -------------------------------------------------------------------------
  Function list, in descending order by exclusive bbcounts time
  -------------------------------------------------------------------------
   [index]   excl.secs   excl.%     cum.%        cycles  instructions
  calls  function  (dso: file, line)

       [1]      21.453    95.5%     95.5%    3754320037    1971220024
     1  anneal (generic: generic.c, 1573)

   Function List
     The default output for any experiment is a function list, sorted in
     the order of exclusive values of the primary metric: the performance
     cost of the function, as computed from the data recorded.  For many
     experiments, this metric is a time, printed in the report in seconds;
     for others, it is a count of events-FPEs, IO calls, or hardware
     counter overflow counts.

     The report begins with a legend line, naming each of the columns of
     data.  Each line in the list has an index; if -b[utterfly] is not
     specified, the index will be in numerical order; if it is specified,
     the index will be in order that the functions as sorted by the
     inclusive metric.  The index serves as a poor person's hyperlink
     through the butterfly and between the butterfly and the function list.

     The primary metric is to the right of the index, followed by its
     representation as a percentage of the whole program, followed by the
     cumulative percentage up to that point in the list.  The next two
     columns are the inclusive value of the metric and its representation
     as a percentage.

     Additional columns of data, as appropriate to the particular data
     recorded, may follow.  Finally, the name of the function and the DSO
     and source file basenames appear in the right-most columns.

     The following is a partial function list.  Some columns are wrapped
     because of space limitations.

          [index]  excl.secs excl.%   cum.%  incl.secs incl.%   samples
          procedure(dso:file,line)

             [4]     22.770  93.8%   93.8%     22.770  93.8%        759
          anneal (generic: generic.c, 1573)
             [6]      1.020   4.2%   98.0%      1.020   4.2%         34
          slaveusrtime (dlslave.so: dlslave.c, 22)
             [9]      0.210   0.9%   98.9%      0.210   0.9%          7
          cvttrap (generic: generic.c, 317)

   Butterfly List
     The butterfly list is a set of records that show the callers and
     callees of each function.  The list is sorted in the order of
     inclusive values of the primary metric.  For each function, its
     callers are shown above it and its callees below.  The center line is
     for the function itself; the index appears at the beginning and the
     end of the line.  The second and third columns are the inclusive
     percentages of the primary metric and its absolute value.  The next
     two columns of the center line are the exclusive percentage and the
     exclusive value of the metric, followed by the function
     identification.

     Callees are shown below the function, with attribution percentages and
     values lined up below the self percentages and values of the center
     function.  In absolute mode (the default), the percentages, including
     the center node's self value, should add up to the inclusive
     percentage of the center node; in relative mode, specified with
     -rel[ative] flag, the percentages should add up to 100 %.  In either
     mode, the attribution values should add up to the inclusive time of
     the central function.  The attribution data for callees is followed by
     the inclusive value for that caller and callsite, then the callsite
     identification, with an address, a source file, and a line number.
     For callees, the source file will be that of the central function.

     The callers are shown above the central function, also with
     attribution percentages and values, but these have a different
     meaning; they are aligned with the central nodes inclusive values to
     indicate the difference.  The attribution percentage and value in a
     caller's line represents the percentage and value, respectively, of
     the central function's metric that was attributed to that callsite.
     In absolute mode, the percentages should add up to the central
     function's percentage; in relative mode, they should add up to 100%.
     The attribution values should add to the central function's values.
     The attribution data for callers is followed by the inclusive value
     for that caller and callsite, followed by the callsite identification,
     with an address, a source file, and a line number.  For callers, the
     source file will be that of the calling function.

     The follow is a butterfly entry for the main function.  Changes have
     been made to the display because of space limitations.

-------------------------------------------------------------------------
 attrib.% attrib.time                         incl.time  caller (callsite) [index]
 [index]   incl.%   incl.time    self%   self-time       procedure [index]
                      attrib.% attrib.time    incl.time  callee (callsite) [index]
-------------------------------------------------------------------------
   100.0%      22.456                            22.456  __start [1]
     [2]   100.0%      22.456     0.0%       0.000         main [2]
                        100.0%      22.456       22.456  Scriptstring [3]
-------------------------------------------------------------------------

   Line-Level Lists
     When invoked with the -h[eavy] or -l[ines] arguments, a list of line-
     level data is produced.  When -h is used, the line list is sorted by
     the primary metric associated with each line, with lines from the
     various DSOs and source files intermixed.  When -l is used, the lines
     are sorted based on the function they came from, with all lines with
     non-zero values from each function printed before any lines from the
     next most important function.

     For each line, the leftmost column contains the value for the primary
     metric computed from the particular experiment, followed by its
     representation as a percentage, followed by the cumulative percentage
     over all lines printed thus far.  These data are followed by any
     additional metrics for the particular experiment, and then by the name
     of the function, its DSO, file, and line number.

     The following is an example of a partial line list from pcsamp output
     with the -h option:

     secs     %   cum.%   samples  function (dso: file, line)

   21.410   79.1   79.1      2141  anneal (generic: generic.c, 1590)
    3.810   14.1   93.2       381  anneal (generic: generic.c, 1589)
    0.610    2.3   95.4        61  slaveusrtime (dlslave.so: dlslave.c, 29)


   Basic Block List
     If you are running an bbcounts experiment and invoke the -basicblocks
     argument, a list of all the basic blocks in the program that were
     executed is generated.  The list is preceded by a header and a column
     header line. Next comes the list of basic blocks, in order of total
     cycles.

     Each basic block is printed with its index, the number of cycles per
     execution of the block, the count of executions, the total cycles, and
     the total cycles represented as a percentage.  Following that on each
     line is the function and the address of the first line, with the DSO,
     source file, and source line number at the end.

     The following is an example of part of the output.  The end of each
     line is wrapped because of space limitations.

            index  cyc./cnt.     counts   total cycles     %   function
      (address; dso: file, line)

                0        259   14440000     3739960000  95.2%  anneal
      (@0x100068c4; generic: generic.c, 1590)
                1         29    5000000      145000000   3.7%  slaveusrtime
      (@0x5ffe0748; dlslave.so: dlslave.c, 30)
                2         30     500000       15000000   0.4%  cvttrap
      (@0x10002834; generic: generic.c, 328)

   Architectural Information Report
     If you are running an bbcounts experiment and invoke the -archinfo
     argument, a list of various metrics concerning execution of the
     program is printed.  The report consists of a header and a number of
     subsections, each with the appropriate headers.

     For integer registers, counts are printed of the number of times each
     register was used and its percentage, the number of times each
     register was used as a base register and its percentage, and the
     number of times each register is used as a destination register and
     its percentage.  For floating-point registers, the same data,
     excluding the base count statistics, is printed.

     Following the register usage statistics, a number of counts of
     instruction types or sequences are printed, each with a description,
     followed by a list of all the different instructions used, sorted by
     the number of times each was executed.  For each instruction, the
     dynamic count of executions and its representation as a percentage is
     printed.  That is followed by the cumulative execution percentage, the
     count of the number of distinct instructions of that type that were
     executed one or more times, and that count represented as a
     percentage.

     The following is an example of part of the output:

  Integer register usage
  ----------------------
   register      use count      %    base count      %   dest. count   %
        r08      189261367   4.79%     14441739   0.37%     87022419  2.20%
        r09      104994438   2.66%       380142   0.01%       501366  0.01%
        r10      190505413   4.82%     14441878   0.37%     86650626  2.19%

   DSO List
     When prof is invoked with the -dsolist argument, a report of summary
     information about the DSOs in the execution of the program is printed.
     For each DSO, the display gives the name, a count of instructions,
     functions, source-files, source-lines, the high and low addresses, and
     the full pathname to the DSO.  If the DSO is ignored in computation,
     either because it belongs to the SpeedShop runtime DSO and the -showss
     flag is not provided, because the DSO was excluded by being listed as
     with a -xdso argument, or by virtue of its not being listed with a
     -dso argument, the DSO is flagged as ignored.

     The following is an example of the output. The DSO pathname has been
     wrapped to the next line because of space limitations.

 DSO name    instrs.  functions      files      lines   low - high text addrs.,
   DSO pathname

generic        5342         39          5          0 0x10001ad8 - 0x10006e50,
   generic
libss.so        571         31          4          0 0x09e210b0 - 0x09e2199c,
   /usr/lib32/libss.so
libssrt.so   113284       1155        268          0 0x09f04380 - 0x09f72d90,
   /usr/lib32/libssrt.so [ignored]
libm.so       40663        159        144          0 0x07eb4a6c - 0x07edc5c8,
   /usr/lib32/libm.so
libc.so.1    237389       2840        959          0 0x0fa331cc - 0x0fb1af00,
   /usr/lib32/libc.so.1
dlslave.so      112          4          2          0 0x5ffe062c - 0x5ffe07ec,
   ./dlslave.so

   Resource Usage
     When invoked with the -usage argument, a report of summary usage data
     is produced. The data is as measured by the kernel during the run.
     The data consists of both per-process (per-file) and per-system
     metrics.  The display includes real-time (wall-clock time), user time,
     system time, and wait time.  The per-process data is shown as a sum of
     the data from the experiment files; the real-time, and the system-wide
     statistics are printed as maxima over the data in each file.

     The accounting timers, giving the time spent in each of the various
     process states, are included.  The sum of the accounting timers should
     be approximately equal to the elapsed real time, since the process
     must always be in one of the states.  There is some skew in the
     reading of the data, so some discrepancy should be expected.  Other
     summary statistics include bytes read and written, page faults,
     context switches, system calls, and process-size statistics.

     The following is an example of a usage display:

                             Real-time:    29.569613 (max.)

               Process Statistics (sum)       System-wide Statistics (max.)
               ========================       =============================
                user-time:    23.127237       total-user-cpu:    26.970000
                 sys-time:     0.417994        total-sys-cpu:     0.810000
                wait-time:     6.024380       total-idle-cpu:     1.790000

               accounting state timers
               -----------------------
                 user-run:    23.127232
               system-run:     0.417997
            interrupt-run:     0.329550
         blocked-I/O-wait:     0.065945
              memory-wait:     0.000000
              select-wait:     0.000000
         job-control-wait:     0.000000
           run-queue-wait:     3.851447
               sleep-wait:     1.718699
      stream-monitor-wait:     0.000000
             raw-I/O-wait:     0.000000
                            -----------
                      SUM:    29.510873

                   summary statistics
                   ------------------
                 bytes-read:    7308500
              bytes-written:       1152
                page-faults:          4
               minor-faults:         42
      vol.-context-switches:        431
    invol.-context-switches:       2985   total-context-switches: 4602
                    signals:          4         total-interrupts: 66630
              user-syscalls:        877           total-syscalls: 8619
              read-syscalls:        449      total-read-syscalls: 453
             write-syscalls:         28     total-write-syscalls: 30
              poll-syscalls:          0
             ioctl-syscalls:          0
               process-size:       2115
          resident-set-size:        520

   Disassembly and Source Lists
     When invoked with the -dis argument, an assembly listing version of
     the program is generated.  If the -source argument is given as well,
     the source code for the disassembly is intermixed with it.

     The following is a partial example of the output when using the -dis
     option:

          anneal: &lt;0x10006830-0x10006b3c>    2524 total samples(93.24%)
            [1573] 0x10006830 0x27bdffd0     addiu     sp,sp,-48      # 1
            [1573] 0x10006834 0xffbc0020     sd   gp,32(sp)      # 2
            [1573] 0x10006838 0xffbf0018     sd   ra,24(sp)      # 3
            [1573] 0x1000683c 0x3c030002     lui  v1,0x2    # 4
            [1573] 0x10006840 0x246397e8     addiu     v1,v1,-26648   # 5
            [1573] 0x10006844 0x0323e021     addu gp,t9,v1  # 6
            [1575] 0x10006848 0xd7808370     ldc1 $f0,-31888(gp)      # 7
                    &lt;2 cycle stall for following instruction>
            [1575] 0x1000684c 0xf7a00000     sdc1 $f0,0(sp)      # 10

DIAGNOSTICS
     prof prints warnings and fatal errors on stderr.  With inclusive cycle
     counting, prof prints a list of functions at the end that are called
     but not defined. Any functions starting with _rld listed here
     represent normal behavior; they are not errors. They appear because
     rld is not instrumented.

     One way to sanity-check inclusive cycle counts is to look at the
     percentage cycles for __start().  If the value is less than 98-99%,
     the inclusive report is suspect.  Look for other warnings that
     indicate that prof didn't take into account certain procedures.

     There are a number of known cases when prof fails to list cycles of a
     procedure in the inclusive listing.  The reasons can be one of the
     following:

     * init and fini sections, MIPS.stubs, are not part of any procedure.

     * Calls to procedures that do not use a jump and link are not
       recognized as procedure calls.

     * Executions of global procedures with same name in different DSOs.
       In this case, only one of them is listed.

     All these exceptions are listed at the end of the -butterfly listing,
     under a separate section.

LIMITATIONS
     With both PC-sampling and basic block counting, prof may report times
     for procedures named with a prefix of *DF*, for example
     *DF*_hello.init_2. DF stands for Dummy Function and indicates cycles
     spent in parts of text that are not in any function: for example, init
     and fini sections and MIPS.stubs sections.  For PC sampling and
     hardware counter overflow PC sampling, a dummy function, *DF*_Other is
     used to report sampling hits in an overflow bin; these hits include
     time spent in rld and any other text regions not listed in any DSO.

     If any of the object files linked into the application have been
     stripped of line-number information (with ld -x, for example), prof
     warns about the affected procedures.  The instruction counts for such
     procedures are shown as a procedure total, not on a per-basic-block
     basis.  Where a line number would normally appear in a report on a
     function without line numbers, question marks appear instead.

     prof does not take into account interactions between basic blocks in
     bbcounts time runs. prof computes the cycles for one execution of a
     basic block, assuming all registers are free on entry to the block,
     and it multiplies this count by the number of times that basic block
     is executed.  In real programs, a block may be entered with a result
     not yet ready in a register, or with a function unit busy, so the
     cycle count computed could be either higher or lower than the correct
     value.  Extending the computations to include inter-block state would
     be prohibitively expensive.

     When run on a program using shared libraries, prof sometimes combines
     the times of real but anonymous procedures in a shared library into
     the preceding (according the library memory layout) externally visible
     function.  The times reported for your procedures are not affected by
     this attribution error, which is normally minor.

     prof cannot be run on programs that have been stripped.

     Compiler optimization level 3 does procedure inlining.  This can
     result in extremely misleading profiles, since the time spent in the
     inlined procedure shows up in the profile as time spent in the
     procedure into which it was inlined.  It is generally better to use
     compiler optimization level 2 or less when gathering an execution
     profile.

     Fortran alternate entry point times are attributed to the main
     function or subroutine, since there is no general way for prof to
     separate the times for the alternate entries.

EXAMPLES
     The first example shows how the data displayed in the RESOURCE USAGE
     section was generated.  The ssrun(1) command picks up data from the
     executable program generic and builds the file generic.fpe.m12213.
     The prof command then analyzes that file.  The output is directed into
     the file named usage.report.

     Example 1:

          % ssrun -fpe generic
          % prof -usage generic.fpe.m12213 >usage.report

SEE ALSO
     speedshop(1), ssrun(1), sscord(1), ssorder(1), fbdump(1)