LNO(5)                                                                  LNO(5)


NAME
     LNO - Compiler loop nest optimization option group

SYNOPSIS
     -LNO: ...

DESCRIPTION
     This man page describes the loop nest optimization options accepted by
     the f90(1), f77(1), CC(1), and cc(1) commands.

     The -LNO: option group specifies options and transformations performed
     on loop nests.  The -LNO: option group is enabled only if the -O3
     option is also specified on the compiler command line.

     For information on the LNO options that are in effect during a
     compilation, use the -LIST:options=ON option.

     You can specify more than one suboption to the -LNO: option either by
     using colons to separate each suboption or by specifying multiple
     options on the command line.  For example, the following command lines
     are equivalent:

          f90 -LNO:auto_dist=ON:outer=OFF b.f
          f90 -LNO:auto_dist=ON -LNO:outer=OFF b.f

     Some -LNO: suboptions are specified with a setting that either enables
     or disables the feature.  To enable a feature, specify the suboption
     either alone or with =1, =ON, or =TRUE.  To disable a feature, specify
     the suboption with =0, =OFF, or =FALSE.  For example, the following
     command lines are equivalent:

          f90 -LNO:auto_dist:blocking=OFF:oinvar=FALSE a.f
          f90 -LNO:auto_dist=1:blocking=0:oinvar=OFF a.f

     For brevity, this man page shows only the ON and OFF settings for
     suboptions, but 0, 1, TRUE, and FALSE are also allowed.  In addition,
     this man page uses the abbreviated form of some suboption names.  You
     can specify either the abbreviation or the complete suboption name.
     The following list shows the complete suboption names and their
     abbreviations:

          Complete name                 Abbreviation

          outer_unroll                  ou

          associativity                 assoc

          clean_miss_penalty            cmp

          dirty_miss_penalty            dmp

          cache_size                    cs

          is_memory_level               is_mem

          line_size                     ls

          tlb_entries                   tlb

          tlb_clean_miss_penalty        tlbcmp

          prefetch_level                pf

     See "F77 LNO Directives" at the end of this man page for a summary of
     the F77 directives for LNO.  See the MIPSpro Fortran 90 Commands and
     Directives Reference Manual for a discussion of the Fortran 90 LNO
     directives.  See MIPSpro C and C++ Pragmas for descriptions of the C
     and C++ LNO #pragma directives.

     The descriptions of the -LNO: suboptions are divided into the
     following categories:

               * General options

               * Transformation options

               * Cache memory management options

               * TLB options

               * Prefetch options

     The -LNO: option group accepts the following general suboptions:

     Suboption   Action

     auto_dist[ = ( ON|OFF )]
                 Distributes local arrays in common blocks that are
                 accessed in parallel.  The default is OFF.

                 This optimization works with either automatic parallelism
                 or parallelism using directives.  It is always safe, does
                 not affect the layout of arrays in virtual space, and does
                 not incur addressing overhead.

     fission=n   Controls loop fission.  n can be one of the following:

                 0   Disables loop fission.

                 1   Performs normal fission as necessary.  This is the
                     default.

                 2   Specifies that fission be tried before fusion.

                 If -LNO:fission=n and -LNO:fusion=n are both set to 1 or
                 to 2, fusion is performed.
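
                 As a hand-written illustration of what loop fission means
                 (not actual compiler output), the following C sketch
                 splits one loop that updates two unrelated arrays into two
                 loops.  The function and array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: one loop body updates two unrelated arrays. */
void scale_and_shift(double *a, double *b, const double *x, size_t n)
{
    assert(a && b && x);              /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * x[i];
        b[i] = x[i] + 1.0;
    }
}

/* After fission: each statement gets its own loop over the same range. */
void scale_and_shift_fissioned(double *a, double *b, const double *x,
                               size_t n)
{
    assert(a && b && x);
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; i++)
        b[i] = x[i] + 1.0;
}
```

                 Both functions produce the same arrays; only the loop
                 structure differs.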

     fusion=n    Controls loop fusion.  n can be one of the following:

                 0   Disables loop fusion.

                 1   Performs standard outer loop fusion.  This is the
                     default.

                 2   Specifies that outer loops should be fused, even if it
                     means partial fusion.

                 The compiler attempts fusion before fission.  The compiler
                 performs partial fusion if not all levels can be fused in
                 the multiple-level fusion.

                 If -LNO:fission=n and -LNO:fusion=n are both set to 1 or
                 to 2, fusion is performed.
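
                 The idea behind loop fusion can be sketched by hand in C
                 as follows.  This is an illustration under assumed names,
                 not a reproduction of what the compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two adjacent loops over the same iteration range. */
void sum_then_square(double *s, double *q, const double *x,
                     const double *y, size_t n)
{
    assert(s && q && x && y);         /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++)
        s[i] = x[i] + y[i];
    for (size_t i = 0; i < n; i++)
        q[i] = s[i] * s[i];
}

/* After fusion: s[i] is consumed right after it is produced. */
void sum_then_square_fused(double *s, double *q, const double *x,
                           const double *y, size_t n)
{
    assert(s && q && x && y);
    for (size_t i = 0; i < n; i++) {
        s[i] = x[i] + y[i];
        q[i] = s[i] * s[i];
    }
}
```

                 The fused form reads s[i] while it is likely still in a
                 register or in cache, which is the usual motivation for
                 fusing.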

     fusion_peeling_limit=n
                 Sets the limit for the number of iterations allowed to be
                 peeled in fusion, where n >= 0.  By default, n=5.
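
                 Peeling matters when two loops have slightly different
                 iteration ranges.  The following hand-written C sketch
                 (hypothetical names, not compiler output) peels one
                 iteration so that the remaining ranges match and the
                 loops can be fused.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the loops cover 0..n-1 and 1..n-1, so their ranges differ. */
void init_and_diff(double *a, double *d, const double *x, size_t n)
{
    assert(a && d && x);              /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++)
        a[i] = 3.0 * x[i];
    for (size_t i = 1; i < n; i++)
        d[i] = a[i] - a[i - 1];
}

/* After: iteration 0 of the first loop is peeled, then both loops
   run over 1..n-1 and are fused into one. */
void init_and_diff_peeled(double *a, double *d, const double *x, size_t n)
{
    assert(a && d && x);
    if (n > 0)
        a[0] = 3.0 * x[0];            /* the single peeled iteration */
    for (size_t i = 1; i < n; i++) {
        a[i] = 3.0 * x[i];
        d[i] = a[i] - a[i - 1];
    }
}
```

                 Here one iteration is peeled, which is within the default
                 limit of 5.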

     gather_scatter=n
                 Performs gather-scatter optimizations.  n can be one of
                 the following:

                 0   Disables all gather-scatter optimization.

                 1   Performs gather-scatter optimizations on non-nested IF
                     statements.  This is the default.

                 2   Performs multi-level gather-scatter optimizations.
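
                 This man page does not spell out the transformation.  The
                 following C sketch shows one common meaning of
                 gather-scatter for a loop that contains a non-nested IF
                 statement.  The names and the caller-supplied scratch
                 array idx are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original: the IF inside the loop makes the assignment conditional. */
void halve_positive(double *y, const double *x, size_t n)
{
    assert(y && x);                   /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++)
        if (x[i] > 0.0)
            y[i] = 0.5 * x[i];
}

/* Gather-scatter form: idx is scratch space with room for n entries. */
void halve_positive_gs(double *y, const double *x, size_t n, size_t *idx)
{
    assert(y && x && idx);
    size_t m = 0;
    for (size_t i = 0; i < n; i++)    /* gather the qualifying indices */
        if (x[i] > 0.0)
            idx[m++] = i;
    for (size_t k = 0; k < m; k++)    /* branch-free work, scattered back */
        y[idx[k]] = 0.5 * x[idx[k]];
}
```

                 The second loop of the gather-scatter form has no branch
                 in its body, which makes it easier to pipeline.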

     ignore_pragmas[ = ( ON|OFF )]
                 Specifies that the command line options override
                 directives in the source file.  The default is OFF.

     local_pad_size=n
                 Specifies the amount by which to pad local array
                 dimensions.  By default, the compiler automatically
                 chooses the amount of padding to improve cache behavior
                 for local array accesses.

     non_blocking_loads[ = ( ON|OFF )]
                 (C/C++ and F77 only) Specifies whether the processor
                 blocks on loads.  If not set, the default of the current
                 processor is used.

     oinvar[ = ( ON|OFF )]
                 Controls outer loop hoisting.  The default is ON.
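
                 The man page gives no further definition.  Assuming that
                 outer loop hoisting means moving a computation that does
                 not change in an inner loop out to an enclosing loop, a
                 hand-written C sketch (hypothetical names) looks like
                 this:

```c
#include <assert.h>
#include <stddef.h>

/* Before: x[i] * c is recomputed on every inner iteration. */
void fill(double *a, const double *x, const double *y, double c,
          size_t n, size_t m)
{
    assert(a && x && y);              /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i * m + j] = x[i] * c + y[j];
}

/* After: the product depends only on i, so it moves to the outer loop. */
void fill_hoisted(double *a, const double *x, const double *y, double c,
                  size_t n, size_t m)
{
    assert(a && x && y);
    for (size_t i = 0; i < n; i++) {
        const double xi = x[i] * c;   /* hoisted out of the j loop */
        for (size_t j = 0; j < m; j++)
            a[i * m + j] = xi + y[j];
    }
}
```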

     opt=n       Controls the LNO optimization level.  n can be one of the
                 following:

                 0   Disables nearly all loop nest optimization.

                 1   Performs full loop nest transformations.  This is the
                     default.

     outer[ = ( ON|OFF )]
                 Enables or disables outer loop fusion.  The default is ON.

     parallel_overhead=num_cycles
                 Overrides internal compiler estimates concerning the
                 efficiency to be gained by executing certain loops in
                 parallel rather than serially.  num_cycles specifies the
                 number of processor cycles.  Specify an integer for
                 num_cycles.  The default is 2600.

     pure=n (MIPSpro C/C++)
                 Tells the compiler how to use the #pragma pure and #pragma
                 no side effects directives when performing parallel
                 analysis.  A function declared with #pragma no side effects
                 may read its arguments and unspecified global data; a
                 function declared with #pragma pure can read only its
                 arguments; neither kind of function can modify its
                 arguments or global data.  Specify 0, 1, or 2 for n, as
                 follows:

                 n Value   Description

                 0         The compiler ignores the #pragma pure and
                           #pragma no side effects directives when
                           gathering information for parallelization
                           analysis.

                 1         The compiler interprets the #pragma pure and
                           #pragma no side effects directives per their
                           definitions when gathering information for
                           parallelization analysis.

                 2         The compiler interprets #pragma no side effects
                           as #pragma pure when gathering information for
                           parallelization analysis.  This option is
                           provided because you may declare a function to
                           have no side effects, when in fact, it is pure,
                           except for references to system variables such
                           as errno.  In these cases, you can treat no side
                           effects functions as if they were pure for the
                           purposes of parallelization.

     pure=n (MIPSpro 7 Fortran 90)
                 Specifies the extent to which the compiler should consider
                 the effect of a PURE procedure or a !DIR$ NOSIDEEFFECTS
                 directive when performing parallel analysis.  Specify 0,
                 1, or 2 for n, as follows:

                 n Value   Description

                 0         Directs the compiler to ignore a PURE attribute
                           and the !DIR$ NOSIDEEFFECTS directive.

                 1         Directs the compiler to consider the fact that
                           PURE procedures and procedures preceded by a
                           !DIR$ NOSIDEEFFECTS directive do not modify
                           global data or procedure arguments when
                           performing parallel analysis.  Default.

                 2         Asserts to the compiler that PURE
                           procedures and procedures preceded by a
                           !DIR$ NOSIDEEFFECTS directive do not modify
                           global data, do not modify procedure dummy
                           arguments, and do not access global data.

                           This setting asserts that the only non-local
                           data items referenced by the procedure are the
                           dummy arguments to the procedure.  This is an
                           extension of the Fortran standard meaning of
                           PURE and of the meaning of !DIR$ NOSIDEEFFECTS.
                           At this setting, more aggressive parallelization
                           can occur if procedures are known not to access
                           global data.

     vintr[ = ( ON|OFF )]
                 Specifies that vectorizable versions of the math intrinsic
                 functions should be used.  The default is ON.

                 For information on the math intrinsic functions, see
                 math(3M).

     The loop transformation arguments allow you to control cache blocking,
     loop unrolling, and loop interchange.  They are as follows:

     blocking[ = ( ON|OFF )]
               Specify blocking=OFF to disable the cache blocking
               transformation.  The default is ON.

     blocking_size=n
               Specifies a block size that the compiler must use when
               performing any blocking.  Specify a positive integer number
               that represents the number of iterations.
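
               To make the meaning of a block size in iterations concrete,
               the following C sketch blocks a matrix transpose by hand.
               It is only an illustration under assumed names; the
               parameter bs stands in for the role that blocking_size
               plays, and the compiler's own blocking may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Plain transpose: b (n x n, row-major) receives the transpose of a. */
void transpose(double *b, const double *a, size_t n)
{
    assert(a && b);                   /* precondition: valid arrays */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Blocked transpose: works on bs x bs tiles so that each tile of a
   and b can stay in cache while it is being used. */
void transpose_blocked(double *b, const double *a, size_t n, size_t bs)
{
    assert(a && b && bs > 0);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t jj = 0; jj < n; jj += bs)
            for (size_t i = ii; i < ii + bs && i < n; i++)
                for (size_t j = jj; j < jj + bs && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```

               The inner bounds also test against n, so the sketch handles
               sizes that are not a multiple of bs.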

     interchange[ = ( ON|OFF )]
               Specifies whether or not loop interchange optimizations are
               performed.  The default is ON.
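
               The following hand-written C sketch (hypothetical names,
               not compiler output) shows what interchanging two loops
               means.  Column sums of a row-major matrix are computed
               first with a strided inner loop and then with the loops
               interchanged so that the inner loop walks memory
               contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the inner loop steps through a with a stride of cols. */
void col_sums(double *sum, const double *a, size_t rows, size_t cols)
{
    assert(sum && a);                 /* precondition: valid arrays */
    for (size_t j = 0; j < cols; j++)
        sum[j] = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum[j] += a[i * cols + j];
}

/* After interchange: the inner loop accesses a with unit stride. */
void col_sums_interchanged(double *sum, const double *a, size_t rows,
                           size_t cols)
{
    assert(sum && a);
    for (size_t j = 0; j < cols; j++)
        sum[j] = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum[j] += a[i * cols + j];
}
```

               Each sum[j] still accumulates its terms in the same order,
               so the results are identical.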

     ou=n      Indicates that all outer loops for which unrolling is legal
               should be unrolled by n, where n is a positive integer.  The
               compiler unrolls loops by this amount or not at all.
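
               Outer loop unrolling can be pictured with the following
               hand-written C sketch, which unrolls the outer loop of a
               matrix-vector product by 2 and leaves a wind-down loop for
               the leftover row.  The names are hypothetical and the code
               is an illustration, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form: y = A * x, with A stored row-major (rows x cols). */
void matvec(double *y, const double *a, const double *x,
            size_t rows, size_t cols)
{
    assert(y && a && x);              /* precondition: valid arrays */
    for (size_t i = 0; i < rows; i++) {
        double s = 0.0;
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j] * x[j];
        y[i] = s;
    }
}

/* Outer loop unrolled by 2: each x[j] is loaded once for two rows. */
void matvec_ou2(double *y, const double *a, const double *x,
                size_t rows, size_t cols)
{
    assert(y && a && x);
    size_t i = 0;
    for (; i + 1 < rows; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t j = 0; j < cols; j++) {
            s0 += a[i * cols + j] * x[j];
            s1 += a[(i + 1) * cols + j] * x[j];
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
    for (; i < rows; i++) {           /* wind-down loop for an odd row */
        double s = 0.0;
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j] * x[j];
        y[i] = s;
    }
}
```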

     ou_deep[ = ( ON|OFF )]
               Specifies that for loop nests that are 3 or more levels deep,
               the compiler should outer unroll the wind-down loops that
               result from outer unrolling loops further out.  This results
               in larger code size, but it generates much faster code
               whenever wind-down loop execution costs are important.  The
               default is ON.

     ou_further=n
               Specifies whether or not the compiler performs outer loop
               unrolling on wind-down loops.  Specify an integer for n.

               You can disable additional unrolling by specifying
               -LNO:ou_further=999999.  Unrolling is enabled as much as is
               sensible by specifying -LNO:ou_further=3.

     ou_max=n  Indicates that the compiler can unroll as many as n copies
               per loop, but no more.

     ou_prod_max=n
               Indicates that the product of unrolling of the various outer
               loops in a given loop nest is not to exceed n, where n is a
               positive integer.  The default is 16.

     pwr2[ = ( ON|OFF )]
               (C/C++ and F77 only) Specifies whether to ignore the leading
               dimension (set to OFF to ignore).

     Certain arguments allow you to describe the target cache memory
     system.  The numbering in the following arguments starts with the
     cache level closest to the processor and works outward:

     assoc1=n, assoc2=n, assoc3=n, assoc4=n
               Specifies the cache set associativity.  For a fully
               associative cache, such as main memory, set n to any
               sufficiently large number, such as 128.  Specify a positive
               integer for n.  Specifying n=0 indicates that there is no
               cache at that level.

     cmp1=n, cmp2=n, cmp3=n, cmp4=n
     dmp1=n, dmp2=n, dmp3=n, dmp4=n
               Specifies, in processor cycles, the time for a clean miss
               (cmpx=) or dirty miss (dmpx=) to the next outer level of the
               memory hierarchy.  This number is approximate because it
               depends upon a clean or dirty line, read or write miss, etc.
               Specify a positive integer for n.  Specifying n=0 indicates
               that there is no cache at that level.

     cs1=n, cs2=n, cs3=n, cs4=n
               Specifies the cache size.  The value n can be 0, or it can
               be a positive integer followed by one of the following
               letters:  k, K, m, or M.  This specifies the cache size in
               Kbytes or Mbytes.  Specifying 0 indicates that there is no
               cache at that level.

               cs1 refers to the primary cache.  cs2 refers to the
               secondary cache.  cs3 refers to memory.  cs4 refers to disk.
               The default cache size for each type of cache depends on
               your system.  You can use the -LIST:options=ON option to see
               the default cache sizes used during your compilation.  In
               addition you can enter the following command to see the
               secondary cache size(s) on your system:

                    hinv -c memory | grep Secondary

     is_mem1[ = ( ON|OFF )]
     is_mem2[ = ( ON|OFF )]
     is_mem3[ = ( ON|OFF )]
     is_mem4[ = ( ON|OFF )]
               Specifies that certain memory hierarchies should be modeled
               as memory, not cache.  The default is OFF for each option.

               Blocking can be attempted for this memory hierarchy level,
               and blocking appropriate for memory, rather than cache, is
               applied.  No prefetching is performed, and any prefetching
               options are ignored.  If an -LNO:is_memx[ = ( ON|OFF )]
               option is specified, the corresponding assocx=n
               specification and any cmpx=n and dmpx=n options on the
               command line are ignored.

     ls1=n, ls2=n, ls3=n, ls4=n
               Specifies the line size, in bytes.  This is the number of
               bytes, specified in the form of a positive integer number,
               n, that are moved from the memory hierarchy level further
               out to this level on a miss.  Specifying n=0 indicates that
               there is no cache at that level.

     Certain arguments control the TLB.  The TLB is a cache for the page
     table, and it is assumed to be fully associative.  The TLB control
     arguments are as follows:

     ps1=n, ps2=n, ps3=n, ps4=n
               Specifies the number of bytes in a page.  Specify a positive
               integer for n.  The default n depends on your system
               hardware.

     tlb1=n, tlb2=n, tlb3=n, tlb4=n
               Specifies the number of entries in the TLB for this cache
               level.  Specify a positive integer for n.  The default n
               depends on your system hardware.

     tlbcmp1=n, tlbcmp2=n, tlbcmp3=n, tlbcmp4=n
     tlbdmp1=n, tlbdmp2=n, tlbdmp3=n, tlbdmp4=n
               Specifies the number of processor cycles it takes to service
               a clean TLB miss (the tlbcmpx= options) or dirty TLB miss
               (the tlbdmpx= options).  Specify a positive integer for n.
               The default n depends on your system hardware.

     The following arguments control the prefetch operation:

     pf1[ = ( ON|OFF )]
     pf2[ = ( ON|OFF )]
     pf3[ = ( ON|OFF )]
     pf4[ = ( ON|OFF )]
               Selectively enables or disables prefetching for cache level
               x, where x is the digit in pfx[ = ( ON|OFF )].

               When -r10000 or -r12000 is in effect, pf1=ON and pf2=ON by
               default.  At any other -rn setting, OFF is in effect for all
               cache levels.

     prefetch=n
               Specifies levels of prefetching.  prefetch is only supported
               when the MIPS4 ISA and R10000 (or above) processors are
               used.  The default is 1 when it is supported and the default
               is 0 when not supported.

               n can be one of the following:

               0   Disables all prefetching.

               1   Enables conservative prefetching.

               2   Enables aggressive prefetching.

     prefetch_ahead=n
               Prefetches the specified number of cache lines ahead of the
               reference.  Specify a positive integer for n.  The default
               is 2.
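
               The notion of prefetching a number of cache lines ahead of
               a reference can be sketched by hand in C as follows.  The
               sketch makes several assumptions: __builtin_prefetch is a
               GCC and Clang builtin used only as a stand-in and is not a
               MIPSpro interface, the 128-byte line size is an assumed
               value, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 128   /* assumed cache line size, for illustration */

/* Sum an array while requesting data lines_ahead cache lines ahead of
   the element currently being read. */
double sum_with_prefetch(const double *x, size_t n, size_t lines_ahead)
{
    assert(x);                        /* precondition: valid array */
    const size_t per_line = LINE_BYTES / sizeof(double);
    const size_t dist = lines_ahead * per_line;
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Issue one prefetch per cache line, dist elements ahead. */
        if (i % per_line == 0 && i + dist < n)
            __builtin_prefetch(&x[i + dist], 0, 3);
#endif
        s += x[i];
    }
    return s;
}
```

               Calling sum_with_prefetch(x, n, 2) corresponds to the
               default distance of 2 lines.  The prefetch is only a hint,
               so the result is the same with or without it.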

     prefetch_manual[ = ( ON|OFF )]
               Specifies whether manual prefetches (through directives)
               should be respected or ignored.

               prefetch_manual=OFF ignores manual prefetches.  This is the
               default when -r8000, -r5000, or -r4000 is in effect.

               prefetch_manual=ON respects manual prefetches.  This is the
               default when -r10000 or -r12000 is in effect.

F77 LNO Directives
     Directives within a program unit apply only to that program unit,
     reverting to the default values at the end of the program unit.
     Directives that occur outside of a program unit alter the default
     value, and therefore apply to the rest of the file from that point on,
     until overridden by a subsequent directive.

     Directives within a file override the command line options by default.
     To have the command line options override directives, use the command
     line option:

          -LNO:ignore_pragmas

   Fission and Fusion Directives
     * C*$* AGGRESSIVE INNER LOOP FISSION: Fission this loop in the
       inner_fission phase into as many loops as possible.  This directive
       must be followed by an inner loop and has no effect if that loop is
       no longer an inner loop after the SNL phase.

     * C*$* FISSION [(n)] or C*$* FISSIONABLE:  Fission the enclosing n
       levels of loops after this directive.  A legality test is performed
       unless a FISSIONABLE directive is also specified.  Statements are
       not reordered.

     * C*$* FUSE [(n [,level] )] or C*$* FUSABLE:  Fuse the following n
       immediately adjacent loops.  Fusion is attempted on each pair of
       adjacent loops and the level, by default, is determined by the
       maximal SNL levels of the fused loops, although partial fusion is
       allowed.  Iterations may be peeled as needed during fusion; the
       peeling limit is 5 or the number specified by the
       -LNO:fusion_peeling_limit flag.  When the FUSABLE directive is
       present, no legality test is done and the fusion is done up to the
       maximal SNL levels where the iteration numbers match for each pair
       of loops to be fused.  The default value for n is 2.

     * C*$* NO FISSION:  The loop following this directive should not be
       fissioned in either the fiz_fuse phase or the inner_fission phase.
       Its inner loops, however, are allowed to be fissioned.

     * C*$* NO FUSION:  The loop following this directive should not be
       fused with other loops.

   SNL Transformation Directives
     The parallelizing preprocessor may perform some transformations for
     parallelism that violate some of these directives.

     * C*$* INTERCHANGE (I, J [,K ...] ):  Loops I, J and K (in any order)
       must directly follow this directive and be perfectly nested inside
       each other. If they are not perfectly nested, the compiler may
       perform loop distribution to make them so, or may ignore the
       annotation, or may apply imperfect interchange (this is not likely).
       The compiler attempts to reorder loops so that I is outermost, then
       J, then K.  The compiler may ignore this directive.  There must be a
       minimum of 2 indexes in the directive.

     * C*$* NO INTERCHANGE:  Prevents the compiler from involving the loop
       directly following this directive in a permutation, or any loop
       nested within this loop.

     * C*$* BLOCKING SIZE (n1,n2) or C*$* BLOCKING SIZE (n1) or C*$*
       BLOCKING SIZE (,n2): If the specified loop is involved in a blocking
       for the primary or secondary cache, it will have a block size of n1
       or n2, respectively.  The compiler will try to include this loop
       within such a block.  If a blocking size is specified as 0, the loop
       is not actually stripped, but the entire loop is inside the block.

     * C*$* NO BLOCKING: Prevents the compiler from involving this loop in
       a cache blocking.

     * C*$* UNROLL (n [,n2] ): This directive suggests that n-1 copies of
       the loop body be added to the inner loop. If the loop that this
       directive directly precedes is an inner loop, then it indicates
       standard unrolling. If the loop that this directive directly
       precedes is not innermost, then outer loop unrolling is performed.
       n must be at least 1.  If n=1 then no unrolling will be performed.
       If n=0, then the default unrolling should be applied.  n2 is
       ignored.

     * C*$* BLOCKABLE (I,J [,K ...] ): The I, J and K loops must be
       adjacent and nested within each other, although not necessarily
       perfectly nested.  This directive informs the compiler that these
       loops may legally be involved in a blocking with each other, even if
       the compiler would consider such a transformation illegal.  The
       loops are also interchangeable and unrollable.  This directive does
       not instruct the compiler which of these transformations to apply.
       You must specify at least 2 loop indexes in the directive.

   Prefetch Directives
     * C*$* PREFETCH (n[,n]): Specify prefetching for each level of the
       cache. The scope is the entire function containing the directive.  n
       can be one of the following values:

       0    prefetching off (default for all processors except R10000)

       1    prefetching on, but conservative

       2    prefetching on, and aggressive (default when prefetch is on)

     * C*$* PREFETCH_MANUAL (n):  Specify if manual prefetches (through
       directives) should be respected or ignored.  Scope: Entire function
       containing the directive.  n can be one of the following values:

       0    ignore manual prefetches (default for mips3 and earlier)

       1    respect manual prefetches (default for mips4)

     * C*$* PREFETCH_REF_DISABLE=A [, size=num]:  This directive explicitly
       disables prefetching all references to array A in the current
       function. The auto-prefetcher runs (if enabled) ignoring array A.
       The size is used for volume analysis.  Scope: Entire function
       containing the directive.  size=num is the size of the array
       references in this loop, in Kbytes.  This is an optional argument
       and must be a constant.

     * C*$* PREFETCH_REF=array-ref,[stride=[str] [,str]], [level=[lev]
       [,lev]], [kind=[rd/wr]], [size=[sz]]: This directive generates a
       single prefetch instruction to the specified memory location. It
       searches for array references that match the supplied reference in
       the current loop-nest.  If such a reference is found, that reference
       is connected to this prefetch node with the specified latency. If no
       such reference is found, this prefetch node stays free-floating and
       is scheduled "loosely".

       All references to this array in this loop-nest are ignored by the
       automatic prefetcher (if enabled).

       If the size is supplied, then the auto-prefetcher (if enabled)
       reduces the effective cache size by that amount in its calculations.

       The compiler tries to issue one prefetch per stride iteration, but
       cannot guarantee it. Redundant prefetches are preferred to
       transformations (such as inserting conditionals) which incur other
       overhead.

       Scope: No scope. Just generates a prefetch instruction.

       The following arguments are used with this option:

       array-ref Required.  The reference itself, for example, A(i, j).

       str       Optional. Prefetch every str iterations of this loop.  The
                 default is 1.

       lev       Optional.  The level in memory hierarchy to prefetch. The
                 default is 2.  If lev=1, prefetch from L2 to L1 cache. If
                 lev=2, prefetch from memory to L1 cache.

       rd/wr     Optional.  The default is read/write.

       sz        Optional.  The size (in Kbytes) of the array referenced in
                 this loop. This must be a constant.

   Dependence Analysis Directives
     * CDIR$ IVDEP: This applies only to inner loops. Liberalize dependence
       analysis.  Given two memory references, where at least one is loop
       variant, ignore any loop-carried dependences between the two
       references. The following are examples of this directive.

          do i = 1,n
            b(k) = b(k) + a(i)
          enddo

     IVDEP does not break the dependence because b(k) is not loop-variant.

          do i=1,n
             a(i) = a(i-1) + 3
          enddo

     IVDEP does break the dependence but the compiler warns the user that
     it is breaking an obvious dependence.

          do i=1,n
          enddo

     IVDEP does break the dependence.

          do i = 1,n
             a(i) = b(i)
             c(i) = a(i) + 3.
          enddo

     IVDEP does not break the dependence on a(i) because it is within an
     iteration.

     If -OPT:cray_ivdep=ON, Cray semantics are used and all lexically
     backwards dependences are broken. The following are examples:

          do i=1,n
             a(i) = a(i-1) + 3.
          enddo

     IVDEP does break the dependence but the compiler warns the user that
     it is breaking an obvious dependence.

          do i=1,n
             a(i) = a(i+1) + 3.
          enddo

     IVDEP does not break the dependence because the dependence is from the
     load to the store, and the load comes lexically before the store.

     If -OPT:liberal_ivdep=ON, all dependences are broken.

SEE ALSO
     cc(1), CC(1), cord(1), dso(1), f77(1), f90(1), fpmode(1), hinv(1),
     ld(1), make(1), pixie(1), pmake(1), prof(1), rld(1), smake(1).

     math(3M).

     auto_p(5), gp_overflow(5), ipa(5), opt(5), pe_environ(5).

     MIPSpro C and C++ Pragmas, publication 007-3587-001

     C Language Reference Manual, publication 007-0701-120

     Compiler Information File (CIF) Reference Manual

     MIPSpro Fortran 77 Programmer's Guide

     MIPSpro Fortran 90 Commands and Directives Reference Manual

     MIPSpro 64-Bit Porting and Transition Guide