                         FLAG DESCRIPTIONS
                         SUN C AND FORTRAN
                       Sun Studio 9 08/12/04

Compiler Flags

Flag                  Description

-autopar              Perform automatic loop parallelization.

-D                    Set definition for preprocessor.

-dalign               Assume double-type data is double aligned.

-dn                   Specify static binding.

-e                    Accept extended (132 character) input source
                      lines (FORTRAN).

-fast                 This is a convenience option for selecting a set
                      of optimizations for performance. It chooses the
                      following switches, which are defined elsewhere
                      on this page:

                      (C) -D__MATHERR_ERRNO_DONTCARE -dalign -fns
                          -fsimple=2 -fsingle -ftrap=%none
                          -xalias_level=basic -xbuiltin=%all -xdepend
                          -xlibmil -xO5 -xprefetch=auto,explicit
                          -xtarget=native

                      (C++) -dalign -fns -fsimple=2 -ftrap=%none
                          -xbuiltin=%all -xlibmil -xlibmopt -xO5
                          -xtarget=native

                      (Fortran) -dalign -depend -fns -fsimple=2
                          -ftrap=common -xlibmil -xlibmopt -xO5
                          -xpad=local -xprefetch=auto,explicit
                          -xtarget=native -xvector=yes

-fixed                Accept fixed-format input source files (FORTRAN).

-fns                  Select non-standard floating point mode. This
                      flag causes the nonstandard floating point mode
                      to be enabled when a program begins execution.
                      By default, the nonstandard floating point mode
                      will not be enabled automatically.

                      On some SPARC systems, the nonstandard floating
                      point mode disables "gradual underflow", causing
                      tiny results to be flushed to zero rather than
                      producing subnormal numbers. It also causes
                      subnormal operands to be silently replaced by
                      zero. On those SPARC systems that do not support
                      gradual underflow and subnormal numbers in
                      hardware, use of this option can significantly
                      improve the performance of some programs.

                      Warning: When nonstandard mode is enabled,
                      floating point arithmetic may produce results
                      that do not conform to the requirements of the
                      IEEE 754 standard. See the Numerical Computation
                      Guide for more information.

-fsimple=0            Permits no simplifying assumptions. Preserves
                      strict IEEE 754 conformance.
-fsimple=1            With -fsimple=1, the optimizer can assume the
                      following:
                      o The IEEE 754 default rounding/trapping modes do
                        not change after process initialization.
                      o Computations producing no visible result other
                        than potential floating-point exceptions may be
                        deleted.
                      o Computations with Infinity or NaNs as operands
                        need not propagate NaNs to their results. For
                        example, x*0 may be replaced by 0.
                      o Computations do not depend on sign of zero.

-fsimple=2            Permits aggressive floating point optimizations
                      that may cause programs to produce different
                      numeric results due to changes in rounding. Even
                      with -fsimple=2, the optimizer still is not
                      permitted to introduce a floating point exception
                      in a program that otherwise produces none.

-fsimple[=n]          Allows the compiler to make simplifying
                      assumptions concerning floating-point arithmetic.

-ftrap=t              Sets the IEEE 754 trapping modes that are
                      established at program initialization. t is a
                      comma-separated list that consists of one or more
                      of the following: %all, %none, common,
                      [no%]invalid, [no%]overflow, [no%]underflow,
                      [no%]division, [no%]inexact. The default is
                      -ftrap=%none. Processing is left-to-right. The
                      common exceptions, by definition, are invalid,
                      division by zero, and overflow.
                      o %none, the default, turns off all trapping
                        modes.
                      Do not use this option for programs that depend
                      on IEEE standard exception handling; you can get
                      different numerical results, premature program
                      termination, or unexpected SIGFPE signals.

-libmil               Use inline expansion templates for libm.

-lm                   Link with the math library.

-lmopt                Chooses the math library that is optimized for
                      speed.

-lmtmalloc            Fast concurrent malloc library suitable for
                      multi-threaded applications.

-native               Select native machine characteristics for
                      optimization.

-openmp               Enable explicit parallelization with Fortran 90
                      OpenMP directives.
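
The -fsimple=1 assumptions about x*0 and the sign of zero can be seen with a small experiment. The following is only an illustrative sketch in Python (whose floats are IEEE 754 doubles on common platforms), not compiler output; the function names are hypothetical and were chosen for this example.

```python
import math


def ieee_times_zero(x: float) -> float:
    """Multiply by zero the way strict IEEE 754 arithmetic does (-fsimple=0)."""
    return x * 0.0


def simplified_times_zero(x: float) -> float:
    """Model of the rewrite 'x*0 may be replaced by 0' allowed by -fsimple=1.

    This is a hypothetical stand-in for what an optimizer might emit.
    """
    return 0.0


def sign_of(z: float) -> float:
    """Return +1.0 or -1.0 according to the sign bit, so -0.0 is visible."""
    return math.copysign(1.0, z)


if __name__ == "__main__":
    for x in (float("inf"), float("nan"), -3.0):
        strict = ieee_times_zero(x)
        simple = simplified_times_zero(x)
        print(x, "strict:", strict, "sign", sign_of(strict),
              "| simplified:", simple, "sign", sign_of(simple))
```

Under strict arithmetic, Infinity*0 and NaN*0 yield NaN and -3.0*0 yields a negative zero; the simplified form returns positive zero in all three cases, which is why a program that relies on NaN propagation or on the sign of zero should not be built with -fsimple=1 or higher.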
-pad                  Synonymous with -xpad (see -xpad below).

-xunroll=n            Enables unrolling of loops where possible. n is a
                      positive integer. The choices are:
                      - n = 1 inhibits all loop unrolling.
                      - n > 1 suggests to the optimizer that it attempt
                        to unroll loops n times.

-Qoption              Pass options along to a compiler phase
                      (Fortran, C++):
                      f90comp  Fortran first pass
                      iropt    Global optimizer
                      cg       Code generator

-Qoption iropt        See -W2, below for C.

-Qoption iropt -Ainline[:cp=n][:cs=n][:inc=n][:irs=n][:mi][:recursion=n]
                      Control the optimizer's inliner:
                      cp=n         The minimum call site frequency
                                   counter in order to consider a
                                   routine for inlining.
                      cs=n         Set inline callee size limit to n.
                                   The unit roughly corresponds to the
                                   number of instructions.
                      inc=n        The inliner is allowed to increase
                                   the size of the program by up to n%.
                      irs=n        Allow routines to increase by up to
                                   n. The unit roughly corresponds to
                                   the number of instructions.
                      mi           Perform maximum inlining (without
                                   considering code size increase).
                      recursion=n  Allow a recursive call to be inlined
                                   up to n levels.

-Qoption iropt -Apf:const
                      Mark prefetch candidates with detailed analysis
                      of constants in array subscripts.

-Qoption iropt -Apf:largedim
                      Mark prefetch candidates by assuming a large
                      first-dimension size for all arrays with unknown
                      sizes at compile time.

-Qoption iropt -Apf:outer=n
                      n=1 turns on prefetch candidate marking in the
                      outer loop. n=0 turns it off.

-Qoption iropt -Apf:pdl=1
                      Do prefetching for one-level indirect memory
                      references.

-Qoption iropt -Atile:skewp[:bn]
                      Perform loop tiling which is enabled by loop
                      skewing. Loop skewing transforms a non-fully
                      interchangeable loop nest to a fully
                      interchangeable loop nest. The optional bn sets
                      the tiling block size to n.

-Qoption iropt -Aujam:inner=g
                      Increase the probability that small-trip-count
                      inner loops will be fully unrolled.

-Qoption iropt -Athr  Perform tree height reduction optimizations.

-Qoption iropt -Addint:sf=n
                      When considering whether to interchange loops,
                      set memory store operation weight to n. A higher
                      value of n indicates a greater performance cost
                      for stores.

-Qoption cg           See -Wc, below for C.

-Qoption cg -Qlp[=n][-av=n][-t=n][-fa=n][-fl=n][-ip=n][-ol=n]
                      Control prefetching for loops with control flow:
                      lp=n   Turns the module on (1) or off (0)
                             (default is on for F90; off for C/C++).
                      lp     In Fortran, equivalent to -Qlp=1 and is
                             used as a means for setting the
                             sub-options listed below. In C/C++,
                             equivalent to -Qlp=0. However, when used
                             with the options -xprefetch=auto or
                             -xprefetch_level=[2|3], it is equivalent
                             to -Qlp=1, and is used as a means for
                             setting the sub-options listed below.
                      -av=n  Sets the prefetch look ahead distance, in
                             bytes. Default is 256.
                      -t=n   Sets the number of attempts at
                             prefetching. If not specified, t=2 if
                             -xprefetch_level=3 has been set;
                             otherwise, defaults to t=1.
                      -fa=n  1 = Force user settings to override
                             internally computed values.
                      -fl=n  1 = Force the optimization to be turned on
                             for all languages.
                      -ip=n  Turns on (1) prefetching for one-level
                             indirect memory accesses.
                      -ol=n  Turns on (1) prefetching for outer loop.

-Qoption cg -Qms_pipe+prefolim=n
                      Set prefetch ahead distance assuming that the
                      number of outstanding prefetches is n. With
                      larger n, the ahead distance gets larger. The
                      default value for n is 8 on UltraSPARC-III.

-stackvar             Allocate routine local variables on stack
                      (FORTRAN).

-W,                   Pass options along to a compiler phase (C):
                      2  Global optimizer
                      c  Code generator

-W2,                  See -Qoption iropt above for Fortran and C++.

-W2,-Ainline:call_in_pragma
                      Consider functions called in parallel regions
                      and loops as candidates for inlining
                      (default ON).

-Xa                   Assume ANSI C conformance, allow K & R
                      extensions. (default mode)

-xalias_level=        Allows compiler to perform type-based alias
                      analysis at the given alias level (C).
                      basic   assume ISO C9X aliasing rules for basic
                              types only.
                      std     assume ISO C9X aliasing rules.
                      strong  assume all pointers are type safe
                              (strongly typed).

-xarch=               Limit the set of instructions the compiler may
                      use.
-xbuiltin=%all        (C, C++) Substitute intrinsic functions or inline
                      system functions where profitable for
                      performance.

-Xc                   Assume strict ANSI C conformance.

-xcache=c             Defines the cache properties for use by the
                      optimizer. c must be one of the following:
                      o native (set parameters for the host
                        environment)
                      o s1/l1/a1
                      o s1/l1/a1:s2/l2/a2
                      o s1/l1/a1:s2/l2/a2:s3/l3/a3
                      The si/li/ai are defined as follows:
                      si  The size of the data cache at level i, in
                          kilobytes.
                      li  The line size of the data cache at level i,
                          in bytes.
                      ai  The associativity of the data cache at
                          level i.

-xchip=               Specifies the target processor for use by the
                      optimizer.
                      ultra3  (C, C++, Fortran) for UltraSPARC-III
                              based machines.

-xcode=               Specify code address space on SPARC platform.
                      The values are:
                      - abs32: Generates 32-bit absolute addresses.
                        Code+data+bss size is limited to 2**32 bytes.
                        This is the default on 32-bit platforms:
                        -xarch=generic,v7,v8,v8a,v8plus,v8plusa
                      - abs44: Generates 44-bit absolute addresses.
                        Code+data+bss size is limited to 2**44 bytes.
                        Available only on 64-bit platforms:
                        -xarch=v9,v9a
                      - abs64: Generates 64-bit absolute addresses.
                        Available only on 64-bit platforms:
                        -xarch=v9,v9a
                      - pic13: Generate position-independent code
                        (small model). Equivalent to -pic. Permits
                        references to at most 2**11 unique external
                        symbols on 32-bit platforms, 2**10 on 64-bit
                        platforms.
                      - pic32: Generate position-independent code
                        (large model). Equivalent to -PIC. Permits
                        references to at most 2**30 unique external
                        symbols on 32-bit platforms, 2**29 on 64-bit
                        platforms.

-xdepend              Analyze loops for data dependencies.

-xipo=n               Performs optimizations across all object files in
                      the link step: 0=off, 1=on, 2=performs
                      whole-program detection and analysis.

-xlibmopt             Chooses the math library that is optimized for
                      speed.

-xlic_lib=sunperf     Link with Sun Performance library (this library
                      implements optimized BLAS 1,2,3, LAPACK, FFTPACK,
                      sparse linear algebra and other mathematical
                      functions).
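
The s/l/a notation accepted by -xcache can be read mechanically. The following Python sketch only illustrates the format described above; it is not part of any Sun tool. The names CacheLevel and parse_xcache are hypothetical, and the sample value in the usage note is made up for illustration rather than taken from a real machine.

```python
from typing import List, NamedTuple


class CacheLevel(NamedTuple):
    size_kb: int      # si: data cache size at level i, in kilobytes
    line_bytes: int   # li: line size at level i, in bytes
    assoc: int        # ai: associativity at level i


def parse_xcache(value: str) -> List[CacheLevel]:
    """Split an -xcache value of the form s1/l1/a1[:s2/l2/a2[:s3/l3/a3]].

    'native' is rejected here because only the compiler can resolve it
    for the host environment.
    """
    if value == "native":
        raise ValueError("'native' is resolved by the compiler on the host")
    levels = []
    for spec in value.split(":"):
        parts = spec.split("/")
        if len(parts) != 3:
            raise ValueError("expected size/line/assoc, got %r" % spec)
        levels.append(CacheLevel(*(int(p) for p in parts)))
    if not 1 <= len(levels) <= 3:
        raise ValueError("between one and three cache levels are allowed")
    return levels


if __name__ == "__main__":
    # Illustrative value only: a two-level data cache description.
    for level, info in enumerate(parse_xcache("64/32/4:8192/512/1"), start=1):
        print("level", level, info)
```

For the illustrative value 64/32/4:8192/512/1, the sketch reports a 64 KB, 32-byte-line, 4-way level 1 cache and an 8192 KB, 512-byte-line, direct-mapped level 2 cache.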
-xO1                  Does basic local optimization (peephole).

-xO2                  -xO1 plus more local and global optimizations.

-xO3                  Besides what -xO2 does, it optimizes references
                      or definitions for external variables. Loop
                      unrolling and software pipelining are also
                      performed.

-xO4                  -xO3 plus function inlining.

-xO5                  Besides what -xO4 does, it enables speculative
                      code motion.

-xopenmp              Enable explicit parallelization with C OpenMP
                      directives.

-xpad=common[:n]      Pad common block variables, for better use of
                      cache. n specifies the amount of padding to
                      apply, in units that are the same size as the
                      array elements. If no parameter is specified
                      then the compiler selects one automatically.

-xpad=local[:n]       Pad local variables only, for better use of
                      cache. n specifies the amount of padding to
                      apply, in units that are the same size as the
                      array elements. If no parameter is specified
                      then the compiler selects one automatically.

-xpagesize=           Set the preferred page size for running the
                      program.

-xprefetch[=value]    Enable prefetch instructions on those
                      architectures that support prefetch, such as
                      UltraSPARC II (-xarch=v8plus, v8plusa, v8plusb,
                      v9, v9a, or v9b).
                      auto         Enable automatic generation of
                                   prefetch instructions.
                      no%auto      Disable automatic generation of
                                   prefetch instructions.
                      explicit     Enable explicit prefetch macros.
                      no%explicit  Disable explicit prefetch macros.
                      yes          -xprefetch=yes is the same as
                                   -xprefetch=auto,explicit
                      no           -xprefetch=no is the same as
                                   -xprefetch=no%auto,no%explicit
                      Defaults: If -xprefetch is not specified,
                      -xprefetch=no%auto,explicit is assumed. If only
                      -xprefetch is specified,
                      -xprefetch=auto,explicit is assumed.

-xprefetch=latx:factor
                      Adjust the compiler's assumptions about prefetch
                      latency by the specified factor. Typically
                      values in the range of 0.5 to 2.0 will be
                      useful. A lower number might indicate that data
                      will usually be cache resident; a higher number
                      might indicate a relatively larger gap between
                      the processor speed and the memory speed
                      (compared to the assumptions built into the
                      compiler).
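
The -xprefetch defaults and shorthands above can be summarized as a small resolver. This is a Python sketch for illustration only, with a hypothetical function name. It assumes that listed values are applied left to right on top of the documented default (no%auto,explicit); that ordering rule is an assumption of this sketch, not something stated above.

```python
from typing import Dict, Optional


def resolve_xprefetch(flag: Optional[str] = None) -> Dict[str, object]:
    """Return the effective auto/explicit settings for an -xprefetch flag.

    flag is None when the option is absent, '-xprefetch' when given bare,
    or '-xprefetch=v1,v2,...' when values are listed.
    """
    # Documented default when -xprefetch is not specified at all.
    state: Dict[str, object] = {"auto": False, "explicit": True}
    if flag is None:
        return state
    if flag == "-xprefetch":
        # Bare -xprefetch is documented as auto,explicit.
        return {"auto": True, "explicit": True}
    prefix = "-xprefetch="
    if not flag.startswith(prefix):
        raise ValueError("not an -xprefetch flag: %r" % flag)
    # ASSUMPTION: values modify the default state, processed left to right.
    for item in flag[len(prefix):].split(","):
        if item == "yes":
            state.update(auto=True, explicit=True)
        elif item == "no":
            state.update(auto=False, explicit=False)
        elif item in ("auto", "explicit"):
            state[item] = True
        elif item in ("no%auto", "no%explicit"):
            state[item[3:]] = False
        elif item.startswith("latx:"):
            state["latx"] = float(item[len("latx:"):])
        else:
            raise ValueError("unknown -xprefetch value: %r" % item)
    return state


if __name__ == "__main__":
    for f in (None, "-xprefetch", "-xprefetch=yes", "-xprefetch=no",
              "-xprefetch=auto,explicit"):
        print(f, "->", resolve_xprefetch(f))
```

With this reading, the -xprefetch=auto,explicit setting chosen by -fast resolves to the same state as a bare -xprefetch.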
-xprefetch_level      Insert prefetches in loops with control flow.
                      -xprefetch_level=1  compiler inserts prefetches
                                          only in loops with no
                                          control flow
                      -xprefetch_level=2  compiler inserts prefetches
                                          in loops with control flow
                      -xprefetch_level=3  compiler aggressively inserts
                                          prefetches in loops with
                                          control flow

-xprofile=collect     Collect profile data for feedback directed
                      optimizations.

-xprofile=use         Use data collected for profile feedback.

-xreduction           Recognize reduction operations in loops.

-xrestrict[=f1,...,f2,%all,%none]
                      Treat pointer-valued function parameters as
                      restricted pointers. The default is %none.
                      Specifying -xrestrict is equivalent to
                      specifying -xrestrict=%all.

-Xt                   Assume K & R conformance, allow ANSI C.

-xtarget=native       Same as -native.

-xvector              Enable automatic calls to SPARC vector math
                      library functions.

------------------------------------------------------------------

Kernel Parameters     Description

shmsys:shminfo_shmmin
                      Minimum size of system V shared memory segment
                      that can be created.

shmsys:shminfo_shmmax
                      Maximum size of system V shared memory segment
                      that can be created. This parameter is an upper
                      limit that is checked before the system sees if
                      it actually has the physical resources to create
                      the requested memory segment.

shmsys:shminfo_shmmni
                      System wide limit on number of shared memory
                      segments that can be created.

shmsys:shminfo_shmseg
                      Limit on the number of shared memory segments
                      that any one process can create.

tune_t_fsflushr       Specifies the number of seconds between fsflush
                      (system daemon for file system flushing)
                      invocations.

autoup                Along with tune_t_fsflushr, autoup controls the
                      amount of memory examined for dirty pages in
                      each invocation and the frequency of file system
                      sync operations.

segvn_comb_thrshld    Specifies a threshold at which two adjacent
                      segvn (vnode segment driver) segments should be
                      concatenated.
------------------------------------------------------------------

Environment Variables Description

OMP_DYNAMIC           Enables or disables dynamic adjustment of the
                      number of threads available for execution of
                      parallel regions.

OMP_NUM_THREADS       Sets the number of threads to use during
                      execution, unless that number is explicitly
                      changed by calling the OMP_SET_NUM_THREADS
                      subroutine.

SUNW_MP_PROCBIND (MT_BIND_PROCESSOR)
                      This environment variable can be used to bind
                      the LWPs (lightweight processes) managed by the
                      microtasking library, libmtsk, to processors.
                      Performance can be enhanced with processor
                      binding, but performance degradation will occur
                      if multiple LWPs are bound to the same
                      processor.

                      The value for SUNW_MP_PROCBIND can be
                      - The string TRUE or FALSE (in any case).
                      - a non-negative integer.
                      - a list of two or more non-negative integers
                        separated by one or more spaces (" ").
                      - two non-negative integers, n1 and n2,
                        separated by a minus ("-"); n1 must be less
                        than or equal to n2.

                      Integers in the above denote the "logical"
                      processor IDs to which the LWPs are to be bound.
                      Logical processor IDs are consecutive integers
                      that start with 0, and may or may not be
                      identical to the actual processor IDs. If n
                      processors are available online, then their
                      logical processor IDs are 0, 1, ..., n-1. For
                      multi-core systems, core 0 of each processor has
                      an ID between 0 and n-1. Core 1 is given an ID
                      between n and 2n-1. Cores which share a
                      processor have IDs which differ by n.

                      By default, LWPs are not bound to processors. It
                      is left up to the operating system, Solaris, to
                      schedule LWPs onto processors.

                      The variable MT_BIND_PROCESSOR is the old name
                      and is provided for historical value.

SUNW_MP_GUIDED_SCHED_WEIGHT
                      In loops with the GUIDED schedule, the chunk
                      size is computed as follows:

                        chunk_size = num_iterations_left /
                                     (num_threads * weight)

                      where weight can be set via the environment
                      variable SUNW_MP_GUIDED_SCHED_WEIGHT. The weight
                      specified for SUNW_MP_GUIDED_SCHED_WEIGHT should
                      be a positive real number. The weight will apply
                      to all loops with the GUIDED schedule in the
                      program. If the environment variable
                      SUNW_MP_GUIDED_SCHED_WEIGHT is not set, then the
                      default weight is 1.0.

STACKSIZE             A default stacksize of 4 MB (for 32-bit
                      programs) and 8 MB (for 64-bit programs) is used
                      for additional threads created in an OpenMP
                      program. The environment variable STACKSIZE can
                      be used to set it to a different value. For
                      example,
                        setenv STACKSIZE 2048
                      creates threads with a stacksize of 2 MB each.

MPSSHEAP=             Specify the preferred page size for heap. The
                      specified page size is applied to all created
                      processes.

MPSSSTACK=            Specify the preferred page size for stack. The
                      specified page size is applied to all created
                      processes.

LD_PRELOAD=mpss.so.1  (Unix) Allow use of the mpss.so.1 shared object,
                      which provides a means by which preferred stack
                      and/or heap page sizes can be selected.

-------------------------------------------------------------------------------

SPEC tools, Unix

submit = ppgsz -o heap=,stack= $cmd
                      The SPEC config file feature submit is used to
                      cause the ppgsz command (Unix) to set the page
                      size for the heap and the page size for the
                      stack.

ppgsz                 Sets the preferred stack and/or heap page size.

-------------------------------------------------------------------------------

src.alt modification for 325.apsi_l peak runs
(ompl2001-dd-20040128.tar.gz)

The src.alt for 325.apsi_l is a modified version of the file apsi.f,
replacing lines 344 through 383 with the following:

    !$OMP PARALLEL
    !$OMP+PRIVATE(I)
    !DISTRIBUTE ARRAYS AROUND MACHINE with 4MB chunk
    !$OMP DO SCHEDULE(STATIC,524288)
          DO I = 1, MWORK
             WORK(I) = 0.0D0
          ENDDO
    !$OMP END DO
    !$OMP END PARALLEL

It is similar to apsi.f in 324.apsi_m (OMPM). The only difference is
the chunk size of the static schedule loop: it is 32768 in 324.apsi_m
and is changed to 524288 in 325.apsi_l. Thus a 4 MB (524288*8 bytes)
chunk of the WORK array is distributed to each processor in a
round-robin fashion.
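
The two OpenMP scheduling rules described in this file, the round-robin STATIC chunks used in 325.apsi_l and the GUIDED chunk formula controlled by SUNW_MP_GUIDED_SCHED_WEIGHT, can be sketched in a few lines of Python. This is an illustration with hypothetical function names, not the libmtsk implementation. The 8-byte element size follows from the 524288*8 arithmetic above; rounding the guided chunk down and never letting it drop below one iteration are assumptions of this sketch.

```python
from typing import List


def chunk_bytes(chunk: int, elem_bytes: int = 8) -> int:
    """Bytes of WORK covered by one static chunk (8-byte elements)."""
    return chunk * elem_bytes


def static_chunk_owner(i: int, chunk: int, num_threads: int) -> int:
    """0-based thread that runs 1-based iteration i under
    SCHEDULE(STATIC, chunk): chunks are dealt out round robin."""
    return ((i - 1) // chunk) % num_threads


def guided_chunks(num_iterations: int, num_threads: int,
                  weight: float = 1.0) -> List[int]:
    """Successive chunk sizes from
    chunk_size = num_iterations_left / (num_threads * weight).

    ASSUMPTION: the quotient is rounded down and clamped to at least 1.
    """
    chunks = []
    left = num_iterations
    while left > 0:
        size = max(1, int(left / (num_threads * weight)))
        size = min(size, left)
        chunks.append(size)
        left -= size
    return chunks


if __name__ == "__main__":
    print("325.apsi_l chunk bytes:", chunk_bytes(524288))
    print("324.apsi_m chunk bytes:", chunk_bytes(32768))
    print("owner of iteration 524289 on 4 threads:",
          static_chunk_owner(524289, 524288, 4))
    print("guided, weight 1.0:", guided_chunks(100, 4))
    print("guided, weight 2.0:", guided_chunks(100, 4, 2.0))
```

A larger weight gives smaller initial chunks and therefore more scheduling events, while the larger static chunk in 325.apsi_l places 4 MB of WORK per thread at a time instead of the 256 KB used in 324.apsi_m.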