FLAG DESCRIPTIONS
SUN C AND FORTRAN
Sun Studio 9 
          08/12/04

Compiler Flags

Flag                               Description

-autopar			   Perform automatic loop parallelization.

-D                                 Set definition for preprocessor.

-dalign                            Assume double-type data is double
                                   aligned.

-dn                                Specify static binding.

-e                                 Accept extended (132 character) input
                                   source lines (FORTRAN).

-fast                              A convenience option that selects a set
                                   of optimizations for performance. It
                                   chooses the following switches, which are
                                   defined elsewhere on this page:

				   (C)
	                             -D__MATHERR_ERRNO_DONTCARE
				     -dalign
				     -fns
				     -fsimple=2
				     -fsingle
				     -ftrap=%none
				     -xalias_level=basic
				     -xbuiltin=%all
				     -xdepend
				     -xlibmil
				     -xO5
				     -xprefetch=auto,explicit
				     -xtarget=native

			      	   (C++)
				     -dalign
				     -fns
				     -fsimple=2
				     -ftrap=%none
				     -xbuiltin=%all
				     -xlibmil
				     -xlibmopt
				     -xO5
				     -xtarget=native

			      	   (Fortran)
				     -dalign
				     -depend
				     -fns
				     -fsimple=2
				     -ftrap=common
				     -xlibmil
				     -xlibmopt
				     -xO5
				     -xpad=local
				     -xprefetch=auto,explicit
				     -xtarget=native
				     -xvector=yes

-fixed                             Accept fixed-format input source files
                                   (FORTRAN).

-fns                               Select non-standard floating point
                                   mode.

                                   This flag causes the nonstandard
                                   floating point mode to be enabled when
                                   a program begins execution. By default,
                                   the nonstandard floating point mode
                                   will not be enabled automatically.

                                   On some SPARC systems, the nonstandard
                                   floating point mode disables "gradual
                                   underflow", causing tiny results to be
                                   flushed to zero rather than producing
                                   subnormal numbers. It also causes
                                   subnormal operands to be silently
                                   replaced by zero. On those SPARC
                                   systems that do not support gradual
                                   underflow and subnormal numbers in
                                   hardware, use of this option can
                                   significantly improve the performance
                                   of some programs.

                                   Warning: When nonstandard mode is
                                   enabled, floating point arithmetic may
                                   produce results that do not conform
                                   to the requirements of the IEEE 754
                                   standard. See the Numerical Computation
                                   Guide for more information.
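
                                   As a minimal sketch of what gradual
                                   underflow means in C (assuming the
                                   default IEEE 754 mode, without -fns;
                                   the function name is hypothetical),
                                   halving the smallest normal double
                                   gives a nonzero subnormal value. In
                                   nonstandard mode on hardware that
                                   flushes to zero, the same expression
                                   would be expected to yield 0:

```c
#include <float.h>

/* Returns 1 if halving the smallest normal double yields a nonzero
 * subnormal value (gradual underflow active), 0 if the result was
 * flushed to zero. The function name is illustrative only. */
int gradual_underflow_active(void)
{
    /* volatile keeps the compiler from folding the division away */
    volatile double smallest_normal = DBL_MIN;
    volatile double half = smallest_normal / 2.0;
    return half > 0.0 && half < DBL_MIN;
}
```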

-fsimple=0                         Permits no simplifying assumptions.
                                   Preserves strict IEEE 754 conformance.

-fsimple=1                         With -fsimple=1, the optimizer can
                                   assume the following:

                                   o The IEEE 754 default
                                   rounding/trapping modes do not change
                                   after process initialization.

                                   o Computations producing no visible
                                   result other than potential
                                   floating-point exceptions may be
                                   deleted.

                                   o Computations with Infinity or NaNs as
                                   operands need not propagate NaNs to
                                   their results. For example, x*0 may be
                                   replaced by 0.

                                   o Computations do not depend on sign of
                                   zero.
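
                                   The last two assumptions can be seen
                                   in a short C sketch (function names
                                   are hypothetical). Under strict
                                   IEEE 754 arithmetic, as with
                                   -fsimple=0, Infinity*0 is NaN and the
                                   sign of zero is observable, so both
                                   functions return 1. With -fsimple=1
                                   the optimizer is allowed to ignore
                                   these distinctions:

```c
#include <math.h>

/* Under strict IEEE 754 rules, Infinity * 0 is NaN, not 0. */
int inf_times_zero_is_nan(void)
{
    volatile double inf = INFINITY;
    volatile double zero = 0.0;
    return isnan(inf * zero) != 0;
}

/* Under strict IEEE 754 rules the sign of zero is observable:
 * 1 / -0.0 is -Infinity, while 1 / +0.0 is +Infinity. */
int sign_of_zero_is_visible(void)
{
    volatile double neg_zero = -0.0;
    volatile double pos_zero = 0.0;
    return (1.0 / neg_zero) < 0.0 && (1.0 / pos_zero) > 0.0;
}
```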

-fsimple=2                         Permits aggressive floating point
                                   optimizations that may cause programs
                                   to produce different numeric results
                                   due to changes in rounding. Even with
                                   -fsimple=2, the optimizer still is not
                                   permitted to introduce a floating point
                                   exception in a program that otherwise
                                   produces none.
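
                                   One way such rounding changes arise
                                   is reassociation. The C sketch below
                                   (hypothetical function names) adds
                                   the same three values with two
                                   groupings; in IEEE 754 double
                                   precision the results can differ.
                                   Whether a given compiler performs
                                   this particular rewrite at
                                   -fsimple=2 is not stated here:

```c
/* Sums three values left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;  /* volatile: round to double at this point */
    return t + c;
}

/* Sums the same values with the other grouping: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

                                   For a = 1e20, b = -1e20, c = 1.0,
                                   sum_left returns 1.0 and sum_right
                                   returns 0.0, because 1.0 is lost
                                   when it is added to -1e20.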

-fsimple[=n]                       Allows the compiler to make simplifying
                                   assumptions concerning floating-point
                                   arithmetic.

-ftrap=t                           Sets the IEEE 754 trapping mode in
                                   effect at startup.

                                   t is a comma-separated list that
                                   consists of one or more of the
                                   following: %all, %none, common,
                                   [no%]invalid, [no%]overflow,
                                   [no%]underflow, [no%]division,
                                   [no%]inexact.

                                   The default is -ftrap=%none.

                                   This option sets the IEEE 754 trapping
                                   modes that are established at program
                                   initialization. Processing is
                                   left-to-right. The common exceptions,
                                   by definition, are invalid, division by
                                   zero, and overflow.

                                   o %none, the default, turns off all
                                   trapping modes.

                                   Do not use this option for programs
                                   that depend on IEEE standard exception
                                   handling; you can get different
                                   numerical results, premature program
                                   termination, or unexpected SIGFPE
                                   signals.

-libmil                            Use inline expansion templates for
                                   libm.

-lm                                Link with the math library.

-lmopt                             This chooses the math library that is
                                   optimized for speed.

-lmtmalloc                         Fast concurrent malloc library suitable for
                                   multi-threaded applications.

-native                            Select native machine characteristics
                                   for optimization.

-openmp                            Enable explicit parallelization with
                                   Fortran 90 OpenMP directives.

-pad				   Synonymous with -xpad (see -xpad below)

-xunroll=<n>                       Enable unrolling of loops where possible.
                                   <n> is a positive integer. The choices are:
                                   - <n> = 1 inhibits all loop unrolling.
                                   - <n> > 1 suggests that the optimizer
                                     attempt to unroll loops <n> times.
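
                                   What unrolling means can be sketched
                                   by hand in C. The function names are
                                   hypothetical, and the code the
                                   optimizer actually generates for
                                   -xunroll=4 may differ:

```c
#include <stddef.h>

/* Straightforward loop: one addition per iteration. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop unrolled by hand four times. A cleanup loop handles
 * the n % 4 leftover elements. The additions happen in the same order
 * as in sum_rolled, so the result is identical. */
double sum_unrolled4(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)
        s += a[i];
    return s;
}
```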

-Qoption <phase> <flags>           Pass <flags> along to compiler <phase>
				   (Fortran, C++):

                                   f90comp Fortran first pass

                                   iropt Global optimizer

                                   cg Code generator

-Qoption iropt <flags>             See -W2,<flags> below for C.

-Qoption iropt -Ainline[:cp=<n>]   Control the optimizer's inliner:
[:cs=<n>][:inc=<n>][:irs=<n>]      cp=<n>  The minimum call site frequency
[:mi][:recursion=<n>]                      counter required for a routine to
                                           be considered for inlining.
                                   cs=<n>  Set the inline callee size limit to
                                           n. The unit roughly corresponds to
                                           the number of instructions.
                                   inc=<n> The inliner is allowed to increase
                                           the size of the program by up to
                                           n%.
                                   irs=<n> Allow routines to increase by up to
                                           n. The unit roughly corresponds to
                                           the number of instructions.
                                   mi      Perform maximum inlining (without
                                           considering code size increase).
                                   recursion=<n>
                                           Allow a recursive call to be inlined
                                           up to n levels.

-Qoption iropt -Apf:const	   Mark prefetch candidates with detailed
				   analysis of constants in array subscripts.

-Qoption iropt -Apf:largedim       Mark prefetch candidates by assuming a large
				   first-dimension size for all arrays with
				   unknown sizes at compile time.

-Qoption iropt -Apf:outer=<n>      Turn marking of prefetch candidates in
                                   the outer loop on (1) or off (0).

-Qoption iropt -Apf:pdl=1	   Do prefetching for one-level indirect
				   memory references.

-Qoption iropt -Atile:skewp[:b<n>] Perform loop tiling which is enabled by
				   loop skewing. Loop skewing transforms
				   a non-fully interchangeable loop nest to
				   a fully interchangeable loop nest.
				   The optional b<n> sets the tiling block
				   size to n.

-Qoption iropt -Aujam:inner=g      Increase the probability that
				   small-trip-count inner loops will be fully
				   unrolled.

-Qoption iropt -Athr		   Perform tree height reduction optimizations.

-Qoption iropt -Addint:sf=<n>      When considering whether to interchange 
                                   loops, set memory store operation weight
                                   to n. A higher value of n indicates 
                                   a greater performance cost for stores.

-Qoption cg <flags>                See -Wc,<flags> below for C.

-Qoption cg -Qlp[=<n>][-av=<n>]    Control prefetching for loops with 
 [-t=<n>][-fa=<n>][-fl=<n>]        control flow:
 [-ip=<n>][-ol=<n>]     

                                   lp=<n>  Turns the module on (1) or off (0).
                                           The default is on for F90 and off
                                           for C/C++.

                                   lp      In Fortran, equivalent to -Qlp=1
                                           and used as a means for setting
                                           the sub-options listed below.

                                           In C/C++, equivalent to -Qlp=0.
                                           However, when used with the options
                                           -xprefetch=auto or
                                           -xprefetch_level=[2|3], it is
                                           equivalent to -Qlp=1 and used as a
                                           means for setting the sub-options
                                           listed below.

                                   -av=<n> Sets the prefetch look-ahead
                                           distance, in bytes. Default is 256.
                                   -t=<n>  Sets the number of attempts at
                                           prefetching. If not specified, t=2
                                           if -xprefetch_level=3 has been set;
                                           otherwise, defaults to t=1.
                                   -fa=<n> 1=Force user settings to override
                                           internally computed values.
                                   -fl=<n> 1=Force the optimization to be
                                           turned on for all languages.
                                   -ip=<n> Turns on (1) prefetching for
                                           one-level indirect memory accesses.
                                   -ol=<n> Turns on (1) prefetching for outer
                                           loops.

-Qoption cg -Qms_pipe+prefolim=<n> Set the prefetch-ahead distance assuming
                                   that the number of outstanding prefetches
                                   is <n>. With larger <n>, the ahead
                                   distance gets larger. The default value
                                   for <n> is 8 on UltraSPARC-III.

-stackvar                          Allocate routine local variables on
                                   stack (FORTRAN).

-W<phase>,<flags>                  Pass <flags> along to compiler <phase> (C):

                                   2 Global optimizer

				   c Code generator

-W2,<flags>                        See -Qoption iropt <flags> above for 
				   Fortran and C++.

-W2,-Ainline:call_in_pragma	   Consider functions called in parallel
                                   regions and loops as candidates
                                   for inlining (default ON).

-Xa                                Assume ANSI C conformance, allow K & R
                                   extensions. (default mode)

-xalias_level=<a>                  Allows compiler to perform type-based
                                   alias analysis at the given alias
                                   level (C).

                                   basic assume ISO C9X aliasing rules for
                                   basic types only.

                                   std assume ISO C9X aliasing rules.

                                   strong assume all pointers are type
                                   safe (strongly typed).
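
                                   As a hedged illustration of why these
                                   levels matter (the function name is
                                   hypothetical, and a 32-bit IEEE 754
                                   float is assumed): code that reads an
                                   object through a pointer of an
                                   unrelated type breaks the aliasing
                                   rules the compiler is told to assume.
                                   Copying the bytes with memcpy avoids
                                   the problem:

```c
#include <stdint.h>
#include <string.h>

/* Reads the bit pattern of a float through memcpy rather than through
 * a cast such as *(uint32_t *)&f. The cast would access a float object
 * through an incompatible pointer type, which type-based alias
 * analysis is allowed to assume never happens. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```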

-xarch=<a>                         Limit the set of instructions the
                                   compiler may use.

-xbuiltin=%all (C, C++)            Substitute intrinsic functions or inline
				   system functions where profitable for
				   performance.

-Xc                                Assume strict ANSI C conformance.

-xcache=<c>                        Defines the cache properties for use by
                                   the optimizer.

                                   c must be one of the following:

                                   o native (set parameters for the host
                                   environment)

                                   o s1/l1/a1

                                   o s1/l1/a1:s2/l2/a2

                                   o s1/l1/a1:s2/l2/a2:s3/l3/a3

                                   The si/li/ai are defined as follows:

                                   si The size of the data cache at level
                                   i, in kilobytes.

                                   li The line size of the data cache at
                                   level i, in bytes.

                                   ai The associativity of the data cache
                                   at level i.
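
                                   The format can be sketched in C as
                                   follows. The struct, the function
                                   name, and the sample cache values in
                                   the note after the code are
                                   illustrative assumptions, not values
                                   taken from this page:

```c
#include <stdio.h>

/* One cache level as described by -xcache: size in kilobytes,
 * line size in bytes, and associativity. */
struct cache_level {
    int size_kb;
    int line_bytes;
    int assoc;
};

/* Writes "s1/l1/a1[:s2/l2/a2[:s3/l3/a3]]" into buf.
 * Returns the number of characters written, or -1 if buf is too small
 * or the level count is outside 1..3. */
int format_xcache(char *buf, size_t len,
                  const struct cache_level *lv, int levels)
{
    size_t used = 0;
    if (levels < 1 || levels > 3)
        return -1;
    for (int i = 0; i < levels; i++) {
        int n = snprintf(buf + used, len - used, "%s%d/%d/%d",
                         i ? ":" : "",
                         lv[i].size_kb, lv[i].line_bytes, lv[i].assoc);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

                                   For example, the levels {64, 32, 4}
                                   and {8192, 512, 1} give the string
                                   64/32/4:8192/512/1, which would be
                                   passed as -xcache=64/32/4:8192/512/1.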

-xchip=<c>                         Specifies the target processor for use
                                   by the optimizer. The value ultra3
                                   (C, C++, Fortran) selects
                                   UltraSPARC-III based machines.

-xcode=<code>                      Specify code address space on the SPARC
                                   platform. The values for <code> are:

                                   - abs32: Generates 32-bit absolute
                                            addresses. Code+data+bss size is
                                            limited to 2**32 bytes. This is the
                                            default on 32-bit platforms:
                                            -xarch=generic,v7,v8,v8a,v8plus,
                                            v8plusa

                                   - abs44: Generates 44-bit absolute
                                            addresses. Code+data+bss size is
                                            limited to 2**44 bytes. Available
                                            only on 64-bit platforms:
                                            -xarch=v9,v9a

                                   - abs64: Generates 64-bit absolute
                                            addresses. Available only on 64-bit
                                            platforms: -xarch=v9,v9a

                                   - pic13: Generate position-independent code
                                            (small model). Equivalent to -pic.
                                            Permits references to at most 2**11
                                            unique external symbols on 32-bit
                                            platforms, 2**10 on 64-bit
                                            platforms.

                                   - pic32: Generate position-independent code
                                            (large model). Equivalent to -PIC.
                                            Permits references to at most 2**30
                                            unique external symbols on 32-bit
                                            platforms, 2**29 on 64-bit
                                            platforms.
		
-xdepend                           Analyze loops for data dependencies.

-xipo=n                            Performs optimizations across all
                                   object files in the link step: 0=off,
                                   1=on, 2=performs whole-program
                                   detection and analysis.

-xlibmopt                          This chooses the math library that is
                                   optimized for speed.

-xlic_lib=sunperf		   Link with Sun Performance library (this
				   library implements optimized BLAS 1,2,3,
				   LAPACK, FFTPACK, Sparse linear algebra and
				   other mathematical functions).

-xO1                               Does basic local optimization
                                   (peephole).

-xO2                               xO1 and more local and global
                                   optimizations.

-xO3                               Besides what xO2 does, it optimizes
                                   references or definitions for external
                                   variables. Loop unrolling and software
                                   pipelining are also performed.

-xO4                               xO3 plus function inlining.

-xO5                               Besides what xO4 does, it enables
                                   speculative code motion.

-xopenmp			   Enable explicit parallelization with
				   C OpenMP directives.

-xpad=common[:<n>]                 Pad common block variables, for better
                                   use of cache. n specifies the amount of
                                   padding to apply, in units that are the
				   same size as the array elements. If no
				   parameter is specified then the compiler
				   selects one automatically.

-xpad=local[:<n>]                  Pad local variables only, for better
                                   use of cache. n specifies the amount of
                                   padding to apply, in units that are the
				   same size as the array elements. If no
				   parameter is specified then the compiler
				   selects one automatically.

-xpagesize=<n> 			   Set the preferred page size for running
				   the program.

-xprefetch[=value]                 Enable prefetch instructions on those
                                   architectures that support prefetch,
                                   such as UltraSPARC II (-xarch=v8plus,
                                   v8plusa, v8plusb, v9, v9a, or v9b).

                                   auto

                                   Enable automatic generation of prefetch
                                   instructions

                                   no%auto

                                   Disable automatic generation of
                                   prefetch instructions

                                   explicit

                                   Enable explicit prefetch macros

                                   no%explicit

                                   Disable explicit prefetch macros

                                   yes

                                   -xprefetch=yes is the same as
                                   -xprefetch=auto,explicit

                                   no

                                   -xprefetch=no is the same as
                                   -xprefetch=no%auto,no%explicit

                                   Defaults

                                   If -xprefetch is not specified,
                                   -xprefetch=no%auto,explicit is assumed.

                                   If only -xprefetch is specified,
                                   -xprefetch=auto,explicit is assumed.
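
                                   To show what an explicit prefetch
                                   looks like in source, the sketch
                                   below uses __builtin_prefetch, a
                                   GCC/Clang builtin, as a stand-in. It
                                   is not the Sun Studio prefetch macro
                                   interface, and the look-ahead of 32
                                   elements is an arbitrary assumption:

```c
#include <stddef.h>

/* Sums an array while requesting data a fixed distance ahead.
 * __builtin_prefetch is a GCC/Clang builtin used here as a stand-in;
 * it is NOT the Sun Studio prefetch macro interface. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t ahead = 32;  /* assumed look-ahead, in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 0 /* read */, 0 /* low reuse */);
        s += a[i];
    }
    return s;
}
```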

-xprefetch=latx:<n> 		   Adjust the compiler's assumptions about
				   prefetch latency by the specified factor.
				   Typically values in the range of 0.5 to 2.0
				   will be useful. A lower number might
				   indicate that data will usually be cache
				   resident; a higher number might indicate
				   a relatively larger gap between the
				   processor speed and the memory speed
				   (compared to the assumptions built into
				   the compiler).

-xprefetch_level		   Insert prefetches in loops with control flow
				   -xprefetch_level=1	compiler inserts
							prefetches only in
							loops with no control
							flow
				   -xprefetch_level=2	compiler inserts
							prefetches in loops
							with control flow
				   -xprefetch_level=3	compiler aggressively
							inserts prefetches in
							loops with control
							flow

-xprofile=collect                  Collect profile data for feedback
                                   directed optimizations.

-xprofile=use                      Use data collected for profile
                                   feedback.

-xreduction			   Recognize reduction operations in loops.

-xrestrict[=f1,...,f2,%all,        Treat pointer-valued function
%none]                             parameters as restricted pointers. The
                                   default is %none. Specifying -xrestrict
                                   is equivalent to specifying
                                   -xrestrict=%all.
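
                                   The C99 restrict qualifier expresses
                                   the same promise per parameter in the
                                   source. A minimal sketch (the
                                   function name is hypothetical):

```c
#include <stddef.h>

/* y[i] += alpha * x[i]. The restrict qualifiers promise that x and y
 * do not overlap, so stores to y cannot change values read from x. */
void axpy(size_t n, double alpha,
          const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

                                   The caller must ensure that x and y
                                   do not overlap. With -xrestrict=%all
                                   the compiler makes this assumption
                                   for all pointer-valued parameters,
                                   including those not marked restrict.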

-Xt                                Assume K & R conformance, allow ANSI C.

-xtarget=native                    Same as -native

-xvector			   Enable automatic calls to SPARC vector
				   math library functions.


------------------------------------------------------------------

Kernel Parameters                  Description

shmsys:shminfo_shmmin              Minimum size of system V shared memory
                                   segment that can be created.

shmsys:shminfo_shmmax              Maximum size of system V shared memory
                                   segment that can be created. This
                                   parameter is an upper limit that is
                                   checked before the system sees if it
                                   actually has the physical resources to
                                   create the requested memory segment.

shmsys:shminfo_shmmni              System wide limit on number of shared
                                   memory segments that can be created.

shmsys:shminfo_shmseg              Limit on the number of shared memory
                                   segments that any one process can
                                   create.

tune_t_fsflushr			   Specifies the number of seconds between
				   fsflush (system daemon for file system
				   flushing) invocations.

autoup                             Along with tune_t_fsflushr, autoup controls
                                   the amount of memory examined for dirty
                                   pages in each invocation and the frequency
                                   of file system sync operations.

segvn_comb_thrshld                 Specifies the threshold at which two
                                   adjacent segvn (vnode segment driver)
                                   segments should be concatenated.

------------------------------------------------------------------

Environment Variables              Description

OMP_DYNAMIC			   Enables or disables dynamic adjustment of
				   the number of threads available for
				   execution of parallel regions.

OMP_NUM_THREADS			   Sets the number of threads to use
				   during execution, unless that number is
				   explicitly changed by calling the
				   OMP_SET_NUM_THREADS subroutine.

SUNW_MP_PROCBIND		   This environment variable can be used to
(MT_BIND_PROCESSOR)		   bind the LWPs (lightweight processes) 
				   managed by the microtasking library, 
				   libmtsk, to processors.
				   Performance can be enhanced with processor
				   binding, but performance degradation
				   will occur if multiple LWPs are bound to
				   the same processor.
				   
                                   The value for SUNW_MP_PROCBIND can be:
                                   - the string TRUE or FALSE (in any case).
                                   - a non-negative integer.
                                   - a list of two or more non-negative
                                     integers separated by one or more
                                     spaces (" ").
                                   - two non-negative integers, n1 and n2,
                                     separated by a minus ("-");
                                     n1 must be less than or equal to n2.

				   Integers in the above denote the "logical"
				   processor IDs to which the LWPs are to be 
				   bound. Logical processor IDs are 
				   consecutive integers that start with 0, 
				   and may or may not be identical to the 
                                   actual processor IDs. If n processors are
				   available online, then their logical 
				   processor IDs are 0, 1, ..., n-1.

                                   For multi-core systems, core 0 of each
                                   processor has ID between 0 and n-1.  Core 1 
                                   is given an ID between n and 2n-1.  Cores 
 				   which share a processor have IDs which 
				   differ by n.
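
                                   The multi-core numbering above can be
                                   sketched in C. The function name is
                                   hypothetical, and it is an assumption
                                   of this sketch that core 0 of
                                   processor p has logical ID exactly p:

```c
#include <assert.h>

/* Logical ID of a core under the numbering described above:
 * with n processors, core c of processor p gets ID c * n + p.
 * Processor index p is assumed to run from 0 to n-1. */
int logical_core_id(int n_processors, int processor, int core)
{
    assert(n_processors > 0);
    assert(processor >= 0 && processor < n_processors);
    assert(core >= 0);
    return core * n_processors + processor;
}
```

                                   With n = 4, the two cores of
                                   processor 3 get IDs 3 and 7, which
                                   differ by n.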

				   By default, LWPs are not bound to 
				   processors. It is left up to the operating 
				   system, Solaris, to schedule LWPs onto 
				   processors.

                                   The variable MT_BIND_PROCESSOR is the old
                                   name and is retained for compatibility.

SUNW_MP_GUIDED_SCHED_WEIGHT        In loops with the GUIDED schedule, 
				   the chunk size is computed as follows:

					chunk_size = num_iterations_left / 
						     (num_threads * weight) 

				   where weight can be set via the environment
				   variable SUNW_MP_GUIDED_SCHED_WEIGHT.

				   The weight specified for 
				   SUNW_MP_GUIDED_SCHED_WEIGHT should be 
				   a positive real number. The weight will 
				   apply to all loops with the GUIDED schedule
				   in the program. 

				   If the environment variable 
				   SUNW_MP_GUIDED_SCHED_WEIGHT is not set, 
				   then the default weight is 1.0.
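
                                   The formula can be sketched in C. The
                                   function name is hypothetical;
                                   rounding up and the minimum of one
                                   iteration are assumptions of this
                                   sketch, since the exact rounding used
                                   by the runtime is not stated above:

```c
#include <assert.h>

/* Size of the next chunk for a GUIDED loop, following
 *   chunk_size = num_iterations_left / (num_threads * weight). */
long guided_chunk(long iterations_left, int num_threads, double weight)
{
    assert(iterations_left >= 0 && num_threads > 0 && weight > 0.0);
    double exact = (double)iterations_left / ((double)num_threads * weight);
    long chunk = (long)exact;      /* truncate toward zero */
    if ((double)chunk < exact)
        chunk++;                   /* round up (assumption) */
    if (chunk < 1)
        chunk = 1;                 /* always make progress (assumption) */
    if (chunk > iterations_left)
        chunk = iterations_left;   /* never exceed the remaining work */
    return chunk;
}
```

                                   With 1000 iterations left and 4
                                   threads, the next chunk is 250 at the
                                   default weight of 1.0 and 125 at a
                                   weight of 2.0: a larger weight gives
                                   smaller chunks.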

STACKSIZE                          A default stacksize of 4 MB (for 32-bit
                                   programs) and 8 MB (for 64-bit programs) is
                                   used for additional threads created in
                                   an OpenMP program. The environment variable
                                   STACKSIZE can be used to set it to
                                   a different value, given in kilobytes. For
                                   example, setenv STACKSIZE 2048 creates
                                   threads with a stacksize of 2 MB each.

MPSSHEAP=<n> 			   Specify the preferred page size for heap.
				   The specified page size is applied to all
				   created processes.

MPSSSTACK=<n> 			   Specify the preferred page size for stack.
				   The specified page size is applied to all
				   created processes.

LD_PRELOAD=mpss.so.1 (Unix)        Allow use of the mpss.so.1 shared object,
				   which provides a means by which preferred
				   stack and/or heap page sizes can be 
				   selected.

-------------------------------------------------------------------------------

SPEC tools, Unix 

submit = ppgsz -o heap=<x>,stack=<y> $cmd 
                                   The SPEC config file feature submit is
                                   used to cause the ppgsz command (Unix) to
                                   set page size <x> for the heap and page
                                   size <y> for the stack.

ppgsz                              Set preferred stack and/or heap page size.


-------------------------------------------------------------------------------

src.alt modification for 325.apsi_l peak runs (ompl2001-dd-20040128.tar.gz)

The src.alt for 325.apsi_l is a modified version of the file apsi.f,
replacing lines 344 through 383 with the following: 

!$OMP PARALLEL 
!$OMP+PRIVATE(I)
!DISTRIBUTE ARRAYS AROUND MACHINE with 4MB chunk
!$OMP DO SCHEDULE(STATIC,524288)
	DO I = 1, MWORK
		WORK(I) = 0.0D0
        ENDDO
!$OMP END DO
!$OMP END PARALLEL

It is similar to apsi.f in 324.apsi_m (OMPM).
The only difference is the chunk size of the static schedule loop:
it is 32768 in 324.apsi_m and is changed to 524288 in 325.apsi_l.
Thus a 4 MB (524288*8 bytes) chunk of the WORK array is distributed to
each processor in round-robin fashion.
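
The arithmetic above can be checked with a short C sketch. The function
names are hypothetical; the round-robin mapping follows the usual OpenMP
rule for SCHEDULE(STATIC,chunk), with threads numbered from 0:

```c
#include <assert.h>

/* Bytes covered by one chunk of a static schedule, for elements of
 * elem_bytes bytes each (8 for the double-precision WORK array). */
long chunk_bytes(long chunk_iterations, long elem_bytes)
{
    return chunk_iterations * elem_bytes;
}

/* Thread that executes 1-based iteration i under SCHEDULE(STATIC,chunk):
 * chunks are handed to threads 0..num_threads-1 in round-robin order. */
int static_owner(long i, long chunk, int num_threads)
{
    assert(i >= 1 && chunk > 0 && num_threads > 0);
    return (int)(((i - 1) / chunk) % num_threads);
}
```

chunk_bytes(524288, 8) is 4194304 (4 MB) and chunk_bytes(32768, 8) is
262144 (256 KB). With 4 threads, iteration 524289 starts the second chunk
and belongs to thread 1.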