PGI 6.2 Compilers for Linux
Optimization, Compiler, and Other flags for use by SPEC OMP2001

Compilers: PGI 6.2
Operating systems: Linux
Last updated: 11-Nov-2006 mec

Portability Flags:

-Mfixed
    Assume fixed-format source
-Mextend
    Allow source lines up to 132 characters
-DINTS_PER_CACHELINE=16 (330.art_m)
-DDBLS_PER_CACHELINE=8 (330.art_m)

Optimization Flags:

-O
    Set the optimization level to -O2
-O0
    A basic block is generated for each C statement. No scheduling is done between statements. No global optimizations are performed.
-O1
    Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is, it contains many short statements containing IF statements and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2), although this case rarely occurs.
    The PGI compilers perform many different types of local optimizations, including but not limited to:
        Algebraic identity removal
        Constant folding
        Common subexpression elimination
        Local register optimization
        Peephole optimizations
        Redundant load and store elimination
        Strength reductions
-O2
    Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch will vary from release to release depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.
    The PGI compilers perform many different types of global optimizations, including but not limited to:
        Branch to branch elimination
        Constant propagation
        Copy propagation
        Dead store elimination
        Global register allocation
        Invariant code motion
        Induction variable elimination
-O3
    All level 1 and 2 optimizations are performed. In addition, this level enables more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable.
-O4 or greater
    Same as "-O3".
-fastsse
    Chooses generally optimal flags for a processor that supports SSE capability. Includes: -fast -Mvect=sse,altcode -Mcache_align -Mscalarsse.
-fast
    Chooses generally optimal flags for the target platform. Includes: -O2 -Munroll=c:1 -Msmart -Mlre -Mnoframe -Mflushz
-lacml
    Link with the AMD Core Math Library. Available from www.amd.com
-Mcache_align
    Align "unconstrained" data objects of size greater than or equal to 16 bytes on cache-line boundaries. An "unconstrained" object is a variable or array that is not a member of an aggregate structure or common block, is not allocatable, and is not an automatic array. On by default on 64-bit Linux systems.
-Mflushz
    Set SSE to flush-to-zero mode; if a floating-point underflow occurs, the value is set to zero.
-Mdaz
    Treat denormalized numbers as zero. Included with "-fastsse" on Intel based systems. For AMD based systems, "-Mdaz" is not included by default with "-fastsse".
-Mframe
    Generate code to set up a stack frame.
-Mnoframe
    Eliminates operations that set up a true stack frame pointer for every function. With this option enabled, you cannot perform a traceback on the generated code and you cannot access local variables.
-Mfprelaxed
    Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=rsqrt".
    The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div"
-Mfprelaxed=rsqrt
    Instructs the compiler to use relaxed precision in the calculation of reciprocal square root (1/sqrt). Can result in improved performance at the expense of numerical accuracy.
-Mfprelaxed=sqrt
    Instructs the compiler to use relaxed precision in the calculation of square root. Can result in improved performance at the expense of numerical accuracy.
-Mfprelaxed=div
    Instructs the compiler to use relaxed precision in the calculation of divides. Can result in improved performance at the expense of numerical accuracy.
-Mprefetch
    Enable generation of prefetch instructions on processors where they are supported.
-Mprefetch=distance:m
    Set the fetch-ahead distance for prefetch instructions to m cache lines
-Mprefetch=n:p
    Set maximum number of prefetch instructions to generate for a given loop to p.
-Mprefetch=nta
    Use the prefetchnta instruction.
-Mprefetch=plain
    Use the prefetch instruction.
-Mprefetch=t0
    Use the prefetcht0 instruction.
-Mprefetch=w
    Use the AMD-specific prefetchw instruction.
-Mnoprefetch
    Disable generation of prefetch instructions.
-Mscalarsse
    Use SSE/SSE2 instructions to perform scalar floating-point arithmetic on targets where these instructions are supported.
-Mnoscalarsse
    Do not use SSE/SSE2 instructions to perform scalar floating-point arithmetic; use x87 operations instead.
-Msignextend
    Instructs the compiler to extend the sign bit that is set as a result of an object's conversion from one data type to an object of a larger signed data type.
-Mlre
    Enables loop-carried redundancy elimination, an optimization that can reduce the number of arithmetic operations and memory references in loops.
-Mlre=array
    Treat individual array element references as candidates for possible loop-carried redundancy elimination. The default is to eliminate only redundant expressions involving two or more operands.
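As a hedged illustration of what loop-carried redundancy elimination does, the sketch below applies the transformation by hand to a small loop. It is written in Python purely for readability; the function names are hypothetical, and this is not output from the PGI compilers, which apply the optimization to compiled loops. In the first version the sum a[i+1] + b[i+1] computed in one iteration is recomputed as a[i] + b[i] in the next; the second version carries that value across iterations instead.

```python
def products_naive(a, b):
    """Recomputes a[i] + b[i], which the previous iteration already formed."""
    out = []
    for i in range(len(a) - 1):
        out.append((a[i] + b[i]) * (a[i + 1] + b[i + 1]))
    return out


def products_with_lre(a, b):
    """Hand-applied loop-carried redundancy elimination (illustrative only)."""
    out = []
    if len(a) < 2:
        return out
    carried = a[0] + b[0]              # computed once, before the loop
    for i in range(len(a) - 1):
        nxt = a[i + 1] + b[i + 1]      # the only addition per iteration
        out.append(carried * nxt)
        carried = nxt                  # reuse in the next iteration
    return out
```

Both functions return the same list; the second performs one addition per iteration rather than two.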
-Mlre=assoc
    Allow expression re-association; specifying this sub-option can increase opportunities for loop-carried redundancy elimination.
-Mlre=noassoc
    Disable expression re-association.
-Mnolre
    Disable loop-carried redundancy elimination.
-Mnovintr
    Instructs the compiler not to perform idiom recognition or introduce calls to hand-optimized vector functions.
-Mpfi
    Generate profile-feedback instrumentation (PFI); this includes extra code to collect run-time statistics and dump them to a trace file for use in a subsequent compilation. PFI gathers information about a program's execution and data values but does not gather information from hardware performance counters. PFI does gather data for optimizations which are unique to profile-feedback optimization.
-Mpfo
    Enable profile-feedback optimizations.
-Mipa
    Enable Interprocedural Analysis. Without a sub-option, IPA defaults to -Mipa=const
-Mipa=fast
    Instructs the compiler to perform interprocedural analysis. Equivalent to -Mipa=align,arg,const,f90ptr,shape,globals,libc,localarg,ptr,pure.
-Mipa=align
    Recognize when targets of pointer dummy are aligned.
-Mipa=noalign
    Disable recognition when targets of pointer dummy are aligned.
-Mipa=arg
    Remove arguments replaced by -Mipa=ptr,const
-Mipa=noarg
    Do not remove arguments replaced by -Mipa=ptr,const
-Mipa=cg
    Generate call graph information for pgicg tool.
-Mipa=nocg
    Do not generate call graph information for pgicg tool.
-Mipa=const
    Enable interprocedural constant propagation.
-Mipa=noconst
    Disable interprocedural constant propagation.
-Mipa=except:func
    Used with -Mipa=inline to specify functions which should not be inlined.
-Mipa=force
    Force all objects to recompile regardless of whether IPA information has changed.
-Mipa=globals
    Optimize references to global values.
-Mipa=noglobals
    Do not optimize references to global values.
-Mipa=inline:n
    Automatically determine which functions to inline, limit to n levels. IPA-based function inlining is performed from leaf routines upward.
-Mipa=inline
    Automatically determine which functions to inline. IPA-based function inlining is performed from leaf routines upward.
-Mipa=libinline
    Allow inlining of routines from libraries.
-Mipa=nolibinline
    Do not inline routines from libraries.
-Mipa=libc
    Used to optimize calls to certain functions in the system standard C library, libc.
-Mipa=libopt
    Allow recompiling and optimization of routines from libraries using IPA information.
-Mipa=nolibopt
    Don't optimize routines in libraries.
-Mipa=local[arg]
    Same as -Mipa=arg, plus externalizes local pointer targets.
-Mipa=nolocal[arg]
    Do not externalize local pointer targets.
-Mipa=ptr
    Enable pointer disambiguation across procedure calls.
-Mipa=noptr
    Disable pointer disambiguation.
-Mipa=f90ptr
    Fortran 90/95 pointer disambiguation across calls.
-Mipa=nof90ptr
    Disable Fortran 90/95 pointer disambiguation.
-Mipa=pure
    Pure function detection.
-Mipa=nopure
    Disable pure function detection.
-Mipa=shape
    Perform Fortran 90 array shape propagation.
-Mipa=noshape
    Disable Fortran 90 array shape propagation.
-Mipa=vestigial
    Remove functions that are never called.
-Mipa=novestigial
    Do not remove functions that are never called.
-Mconcur
    Instructs the compiler to enable auto-concurrentization of loops. If -Mconcur is specified, multiple processors will be used to execute loops that the compiler determines to be parallelizable.
-Mconcur=altcode
    Instructs the parallelizer to generate alternate serial code for parallelized loops. Without arguments, the parallelizer determines an appropriate cutoff length and generates serial code to be executed whenever the loop count is less than or equal to that length.
-Mconcur=altcode:n
    Instructs the parallelizer to generate alternate serial code for parallelized loops. With arguments, the serial altcode is executed whenever the loop count is less than or equal to n.
-Mconcur=noaltcode
    Always execute the parallelized version of a loop regardless of the loop count.
-Mconcur=noassoc
    Disables parallelization of loops with reductions.
-Mconcur=cncall
    Assume loops containing calls are safe to parallelize and allow loops containing calls to be candidates for parallelization. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe.
-Mconcur=nocncall
    Do not assume loops containing calls are safe to parallelize.
-Mconcur=dist:block
    Parallelize with block distribution. Contiguous blocks of iterations of a parallelizable loop are assigned to the available processors.
-Mconcur=dist:cyclic
    Parallelize with cyclic distribution. The outermost parallelizable loop in any loop nest is parallelized. If a parallelized loop is innermost, its iterations are allocated to processors cyclically. For example, if there are 3 processors executing a loop, processor 0 performs iterations 0, 3, 6, etc.; processor 1 performs iterations 1, 4, 7, etc.; and processor 2 performs iterations 2, 5, 8, etc.
-Mconcur=innermost
    Enable parallelization of innermost loops.
-Mconcur=noinnermost
    Disable parallelization of innermost loops.
-Minline
    Instructs the inliner to perform 1 level of inlining.
-Minline=lib:filename.ext
    Instructs the inliner to inline the functions within the library filename.ext.
-Minline=except:func
    Instructs the inliner to inline all eligible functions except func, a function in the source text. Multiple functions can be listed, comma-separated.
-Minline=name:func
    Instructs the inliner to inline function func.
-Minline=size:n
    Instructs the inliner to inline functions with n or fewer statements.
-Minline=levels:n
    Instructs the inliner to perform n levels of inlining.
-Mnomain
    Don't include Fortran main program object module.
-Msmartalloc
    Adds a call to the routine mallopt in the main routine. To be effective, this switch must be specified when compiling the file containing the Fortran, C, or C++ main program. The default is -Mnosmartalloc.
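The difference between the block and cyclic distributions described under -Mconcur=dist can be sketched as a mapping from a processor number to the loop iterations it executes. The Python below is only an illustration of that arithmetic, not the PGI runtime's scheduler; the function names are hypothetical, and the block size of ceil(iterations / processors) is an assumption, since the description above says only that blocks are contiguous.

```python
def cyclic_iterations(num_iterations, num_procs, proc):
    """Iterations run by `proc` when iterations are dealt out round-robin."""
    return list(range(proc, num_iterations, num_procs))


def block_iterations(num_iterations, num_procs, proc):
    """Iterations run by `proc` when each processor gets one contiguous block.

    Assumption: every block holds ceil(num_iterations / num_procs) iterations,
    so trailing processors may receive a shorter (or empty) block.
    """
    chunk = -(-num_iterations // num_procs)   # ceiling division
    start = proc * chunk
    stop = min(start + chunk, num_iterations)
    return list(range(start, stop))
```

With 9 iterations and 3 processors, cyclic_iterations(9, 3, 0) gives [0, 3, 6], matching the example in the -Mconcur=dist:cyclic description, while block_iterations(9, 3, 0) gives [0, 1, 2].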
-Msafeptr
    Instructs the C/C++ compiler to override data dependencies between pointers of a given storage class.
-Msafeptr=all
    Assume all pointers and arrays are independent and safe for aggressive optimizations, and in particular that no pointers or arrays overlap or conflict with each other.
-Msafeptr=arg
    Instructs the compiler that arrays and pointers are treated with the same copyin and copyout semantics as Fortran dummy arguments.
-Msafeptr=auto
    Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.
-Msafeptr=local
    Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.
-Msafeptr=static
    Instructs the compiler that static pointers and arrays do not overlap or conflict with each other and are independent.
-Msafeptr=global
    Instructs the compiler that global or external pointers and arrays do not overlap or conflict with each other and are independent.
-Munroll
    Invokes the loop unroller.
-Munroll=c:m
    Instructs the compiler to completely unroll loops with a constant loop count of less than or equal to m.
-Munroll=n:u
    Instructs the compiler to unroll u times a loop that is not completely unrolled or that has a non-constant loop count.
-Mnounroll
    Disable loop unrolling.
-Msmart
    Enable an optional post-pass instruction scheduling.
-Mnosmart
    Disable an optional post-pass instruction scheduling.
-Mnovect
    Disable automatic vector pipelining.
-Mvect
    Enable automatic vector pipelining.
-Mvect=altcode
    Instructs the vectorizer to generate alternate code for vectorized loops when appropriate. For each vectorized loop the compiler decides whether to generate altcode and what type or types to generate, which may be any or all of:
        Altcode without iteration peeling
        Altcode with non-temporal stores and other data cache optimizations
        Altcode based on array alignments calculated dynamically at runtime
    The compiler also determines suitable loop count and array alignment conditions for executing the altcode.
-Mvect=noaltcode
    Disables alternate code generation for vectorized loops.
-Mvect=assoc
    Instructs the vectorizer to enable certain associativity conversions that can change the results of a computation due to round-off error. A typical optimization is to change an arithmetic operation to an arithmetic operation that is mathematically correct, but can be computationally different, due to round-off error.
-Mvect=noassoc
    Instructs the vectorizer to disable associativity conversions.
-Mvect=cachesize:n
    Instructs the vectorizer, when performing cache tiling optimizations, to assume a cache size of n. The default size is n=262144.
-Mvect=fuse
    Instructs the vectorizer to enable loop fusion.
-Mvect=idiom
    Instructs the vectorizer to enable idiom recognition.
-Mvect=nosizelimit
    Generate vector loops for all loops where possible regardless of the number of statements in the loop. This overrides a heuristic in the vectorizer that ordinarily prevents vectorization of loops with a number of statements that exceeds a certain threshold.
-Mvect=prefetch
    Instructs the vectorizer to generate prefetch instructions.
-Mvect=sse
    Instructs the vectorizer to search for vectorizable loops and, where possible, make use of SSE, SSE2, and prefetch instructions.
-Mnofptrap
    Disables -Ktrap=fp.
-Ktrap=
    -Ktrap is only processed by the compilers when compiling main functions or programs. The options inv, denorm, divz, ovf, unf, and inexact correspond to the processor's exception mask bits invalid operation, denormalized operand, divide-by-zero, overflow, underflow, and precision, respectively. Normally, the processor's exception mask bits are on (floating-point exceptions are masked; the processor recovers from the exceptions and continues).
    If a floating-point exception occurs and its corresponding mask bit is off (or unmasked), execution terminates with an arithmetic exception (C's SIGFPE signal). -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.
-Mlongbranch
    Enable long branches.
-mp
    Use the -mp option to instruct the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and generate an executable file which will utilize multiple processors in a shared-memory parallel system. When used strictly as a linker flag, the PGI OpenMP runtime will be linked and users can use the environment variables MP_BIND and MP_BLIST to bind a serial program to a CPU.
-mp=align
    The align sub-option to -mp forces loop iterations to be allocated to OpenMP processes using an algorithm that maximizes alignment of vector sub-sections in loops that are both parallelized and vectorized for SSE. This can improve performance in program units that include many such loops. It can result in load-balancing problems that significantly decrease performance in program units with relatively short loops that contain a large amount of work in each iteration.
-mp=numa
    The numa sub-option to -mp uses libnuma on systems where it is available.
-mp=nonuma
    The nonuma sub-option to -mp tells the driver not to link with libnuma.
-mcmodel=medium
    (For use only on 64-bit Linux targets) Generate code for the medium memory model in the linux86-64 execution environment. The default small memory model of the linux86-64 environment limits the combined area for a user's object or executable to 1GB, with the Linux kernel managing usage of the second 1GB of address space for system routines, shared libraries, stacks, etc. Programs are started at a fixed address, and the program can use a single instruction to make most memory references. The medium memory model allows for larger than 2GB data areas, or .bss sections. Program units compiled using either -mcmodel=medium or -fpic require additional instructions to reference memory.
    The effect on performance is a function of the data-use of the application. The -mcmodel=medium switch must be used at both compile time and link time to create 64-bit executables. Program units compiled for the default small memory model can be linked into medium memory model executables as long as they are compiled -fpic, or position-independent.
-Mlarge_arrays
    Enable support for 64-bit indexing and single static data objects larger than 2GB in size. This option is the default in the presence of -mcmodel=medium. Can be used separately, together with the default small memory model, for certain 64-bit applications that manage their own memory space.
-tp[=]k8-32
    Specify the type of the target processor as AMD64 Processor 32-bit mode.
-tp[=]k8-64
    Specify the type of the target processor as AMD64 Processor 64-bit mode.
-tp[=]p7-64
    Specify the type of the target processor as Intel P7 Architecture with EM64T, 64-bit mode.
-tp[=]p7
    Specify the type of the target processor as Intel P7 Architecture (Pentium 4, Xeon, Centrino).
-tp[=]core2-64
    Specify the type of the target processor as Intel Core 2 EM64T or compatible architecture using 64-bit mode.
-tp[=]core2
    Specify the type of the target processor as Intel Core 2 or compatible architecture using 32-bit mode.
-tp[=]x64
    Use the unified AMD/Intel 64-bit mode.
--no_exceptions
    Disable C++ exception handling support.
--no_rtti
    Disable C++ run time type information support.

Environment Variables:

LD_LIBRARY_PATH
    Sets the path to an executable's runtime shared libraries.
MP_BIND
    Instructs the runtime to bind a thread to a core when set to "yes".
MP_BLIST
    Defines the thread-core relationship.
NCPUS
    Sets the number of threads to use.
OMP_NUM_THREADS
    Sets the number of threads to use.
OMP_DYNAMIC
    Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions.
PGI
    Sets the base PGI installation directory.