SPEC CPU2017 software OS and BIOS Settings Descriptions for Cisco UCS Intel-based systems

Operating System Tuning Parameters

Operating System and Software Tuning Parameters

ulimit -s <n>

Sets the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.
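As a minimal sketch, the stack limit can be inspected and raised in the shell that launches the benchmark. The value "unlimited" is only an example, and raising the soft limit can fail if the hard limit is lower; the variable name current_stack is illustrative.

```shell
# Show the current soft stack limit (in kbytes, or "unlimited").
current_stack=$(ulimit -s)
echo "current stack limit: ${current_stack}"

# Try to raise the limit in a subshell so the calling shell is unaffected.
# This fails harmlessly if the hard limit does not permit it.
(
  if ulimit -s unlimited 2>/dev/null; then
    echo "stack limit inside subshell: $(ulimit -s)"
  else
    echo "could not raise stack limit; hard limit is $(ulimit -H -s)"
  fi
)
```

The limit is inherited by child processes, so it must be set in the same shell (or script) that starts the run.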

numactl --interleave=all "runspec command"

Launching a process with numactl --interleave=all sets the memory interleave policy so that memory is allocated round-robin across all nodes. When memory cannot be allocated on the current interleave target, allocation falls back to other nodes.
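One way to check that the interleave policy is really inherited by the launched command is to run "numactl --show" under it. This is a sketch that assumes numactl is installed and that the environment permits changing the memory policy; the variable interleave_checked is illustrative.

```shell
if command -v numactl >/dev/null 2>&1; then
  # The inner "numactl --show" runs under the interleave policy and
  # prints the policy it inherited from the outer numactl.
  numactl --interleave=all numactl --show \
    || echo "interleave policy could not be applied in this environment"
  interleave_checked=yes
else
  echo "numactl is not installed"
  interleave_checked=no
fi
```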

Free the file system page cache

The command "echo 3 > /proc/sys/vm/drop_caches" clears the page cache, dentries, and inodes.
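A hedged sketch of how the cache drop might be scripted before a run. Writing to drop_caches requires root, so the snippet records whether the write succeeded instead of assuming it did; the variable name dropped is illustrative.

```shell
# Flush dirty pages to disk first so that more of the cache can be freed.
sync

# Note the space between "3" and ">": "echo 3> file" would redirect
# file descriptor 3 and write only an empty line.
if { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null; then
  dropped=yes
else
  echo "could not write /proc/sys/vm/drop_caches (root is required)"
  dropped=no
fi
echo "caches dropped: ${dropped}"
```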

Using numactl to bind processes and memory to cores

For multi-copy runs or single copy runs on systems with multiple sockets, it is advantageous to bind a process to a particular core. Otherwise, the OS may arbitrarily move your process from one core to another. This can affect performance. To help, SPEC allows the use of a "submit" command where users can specify a utility to use to bind processes. We have found the utility 'numactl' to be the best choice.

numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for a command and inherited by all of its children. The numactl flag "--physcpubind" specifies the core(s) to which the process is bound. "-l" instructs numactl to keep a process's memory on the local node, while "-m" specifies the node(s) on which to place a process's memory. For full details on using numactl, please refer to your Linux documentation, 'man numactl'.
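The binding described above can be sketched as a small helper that builds the numactl prefix for one benchmark copy. The helper name bind_prefix, the one-copy-per-CPU mapping, and the placeholder ./benchmark_binary are assumptions for illustration; the snippet only prints the command lines and does not run numactl.

```shell
# Build the numactl prefix that pins one benchmark copy to one logical CPU
# and keeps its memory on the local NUMA node.
# $1 = logical CPU number for this copy (assumed mapping: copy N -> CPU N)
bind_prefix() {
  printf 'numactl --physcpubind=%s -l' "$1"
}

# Print the command lines that would be used for copies 0..3.
for copy in 0 1 2 3; do
  echo "$(bind_prefix "$copy") ./benchmark_binary"
done
```

In a SPEC config file the same idea is usually expressed through the submit feature, for example a line of the form "submit = numactl --physcpubind=$SPECCOPYNUM -l $command" (an illustrative form, not a tested setting from this document).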

Linux Huge Page settings

In order to take advantage of large pages, your system must be configured to use large pages. To configure your system for huge pages perform the following steps:

Create a mount point for the huge pages: "mkdir /mnt/hugepages"

The huge page file system needs to be mounted when the system reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"

Set vm/nr_hugepages=N in your /etc/sysctl.conf file, where N is the maximum number of pages the system may allocate.

Reboot to have the changes take effect. (Not necessary on some operating systems, such as Red Hat Enterprise Linux 5.5.)

Note that further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt
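To choose N, it can help to work out how many huge pages cover a given amount of memory. The sketch below reads the huge page size from /proc/meminfo and falls back to 2048 kB if it is not reported; the helper name hugepages_needed and the 4096 MB target are illustrative assumptions.

```shell
# Number of huge pages needed to cover a given amount of memory (rounded up).
# $1 = memory to back with huge pages, in MB
# $2 = huge page size, in kB
hugepages_needed() {
  echo $(( ($1 * 1024 + $2 - 1) / $2 ))
}

# Use the page size reported by the kernel, falling back to 2048 kB.
page_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null)
page_kb=${page_kb:-2048}

# Print the value that would go into /etc/sysctl.conf for 4096 MB.
echo "vm.nr_hugepages = $(hugepages_needed 4096 "$page_kb")"
```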

Transparent Huge Pages

On RedHat EL 6 and later, Transparent Hugepages increases the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provides significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented, which prevents hugepages from being allocated, the kernel will assign smaller 4k pages instead. Hugepages are used by default if /sys/kernel/mm/redhat_transparent_hugepage/enabled is set to always.
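A small sketch for checking the current Transparent Hugepages mode. It tries the mainline kernel path first and then the Red Hat specific path named above; the active mode is the value shown in square brackets in that file. The variable name thp_status is illustrative.

```shell
thp_status="unknown"
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
  if [ -r "$f" ]; then
    # The file looks like "always [madvise] never"; keep the bracketed word.
    thp_status=$(sed 's/.*\[\(.*\)\].*/\1/' "$f")
    break
  fi
done
echo "transparent huge pages: ${thp_status}"
```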

HUGETLB_MORECORE

Set this environment variable to "yes" to enable applications to use large pages.

KMP_STACKSIZE

Specify stack size to be allocated for each thread.

KMP_AFFINITY

KMP_AFFINITY = < physical | logical >, starting-core-id

Specifies the static mapping of user threads to physical cores. For example, if you have a system configured with 8 cores, OMP_NUM_THREADS=8 and KMP_AFFINITY=physical,0, then thread 0 will be mapped to core 0, thread 1 will be mapped to core 1, and so on in a round-robin fashion.

KMP_AFFINITY = granularity=fine,scatter

The value for the environment variable KMP_AFFINITY affects how the threads from an auto-parallelized program are scheduled across processors. Specifying granularity=fine selects the finest granularity level and causes each OpenMP thread to be bound to a single thread context. This ensures that there is only one thread per core on cores supporting HyperThreading Technology. Specifying scatter distributes the threads as evenly as possible across the entire system. Hence a combination of these two options will spread the threads evenly across sockets, with one thread per physical core.

OMP_NUM_THREADS

Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel (Linux and Mac OS X) or /Qopenmp and /Qparallel (Windows). Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
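The OpenMP-related variables above are typically exported together before the run. The values below are examples only (8 threads to match the 8-core example, and an arbitrary 200M stack size), not recommended settings.

```shell
# Example values only; choose them to match the system under test.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="granularity=fine,scatter"
export KMP_STACKSIZE=200M   # per-thread stack size; illustrative value

# Adding the "verbose" modifier makes the Intel OpenMP runtime print the
# thread-to-CPU bindings at startup, which helps to check the mapping:
# export KMP_AFFINITY="verbose,granularity=fine,scatter"

# Show what the benchmark process will inherit.
env | grep -E '^(OMP_NUM_THREADS|KMP_AFFINITY|KMP_STACKSIZE)='
```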


Firmware / BIOS / Microcode Settings

Intel Turbo Boost Technology:

Enabling this option allows the processor cores to automatically increase their frequency when they are running below power and temperature limits, thereby increasing performance. By default, this option is enabled.

Intel Hyper-Threading Technology:

Enabling this option allows processor resources to be used more efficiently, enabling multiple threads to run on each core and increasing processor throughput, improving overall performance on threaded software. By default, this option is enabled.

Enhanced Intel SpeedStep:

Enabling this option allows the system to dynamically adjust processor voltage and core frequency. This technology can result in decreased average power consumption and decreased average heat production. By default, this option is enabled.

Core Multi Processing:

This option specifies the number of logical processor cores that can run on the server. This option sets the state of logical processor cores in a package. If you disable this setting, Hyper-Threading is also disabled.

Virtualization Technology:

This option allows the user to specify whether the processor uses Intel Virtualization Technology, which allows a platform to run multiple operating systems and applications in independent partitions. This can be one of the following: Disabled - The processor does not permit virtualization. Enabled - The processor allows multiple operating systems in independent partitions. Platform Default - The BIOS option uses the value for this attribute contained in the BIOS defaults for the server type and vendor. By default, this BIOS option is enabled.

Direct Cache Access:

Enabling this option allows processors to increase I/O performance by placing data from I/O devices directly into the processor cache. This setting helps to reduce cache misses. By default, this option is enabled.

Power Technology:

This BIOS option enables the user to configure the CPU power management settings such as Enhanced Intel SpeedStep, Intel Turbo Boost Technology, and Processor Power State C6. The Custom setting allows the user to change individual settings for the BIOS parameters in the preceding list. You must select this option if you want to change any of these BIOS parameters. The Energy Efficient setting determines the best settings for the BIOS parameters in the preceding list and ignores the individual settings for these parameters. The Disabled setting does not perform any CPU power management, and any settings for the BIOS parameters in the preceding list are ignored.

Processor C1E:

Enabling this option allows the processor to transition to its minimum frequency upon entering C1. This setting does not take effect until after you have rebooted the server. In disabled state, the CPU continues to run at its maximum frequency in C1 state. In enabled state, the CPU transitions to its minimum frequency. This option saves the maximum amount of power in the C1 state. By default, Processor C1E is disabled.

Processor C6 Report:

Enabling this option allows the processor to send the C6 report to the operating system. By default, Processor C6 Report is disabled. When the OS receives the report, it can transition the processor into the lower C6 power state to decrease energy use while maintaining optimal processor performance.

Energy Efficient Turbo:

This BIOS option controls whether the processor uses an energy-efficiency based policy. In this mode of operation, a processor's core frequency is adjusted within the turbo range based on workload. By default, this option is disabled.

Energy Performance:

This BIOS option allows you to determine whether system performance or energy efficiency is more important on the server. This can be one of the following: Balanced Energy, Balanced Performance, Energy Efficient, and Performance. Balanced Performance is optimized for maximum power savings with minimal impact on performance, and it is enabled by default. Performance disables all power management options that have any impact on performance. Balanced Energy is optimized for power efficiency and Energy Efficient for power savings. This BIOS option is only selectable if "Power Technology" is set to "Custom".

CPU Performance:

This BIOS option allows the enabling/disabling of a processor mechanism in 4 modes: Enterprise, High-Throughput, HPC and Custom.

Hardware prefetcher (Layer 2): The hardware prefetcher prefetches additional streams of instructions and data into the Layer 2 cache upon detection of an access stride. This behavior is more likely to occur during operations that sort through sequential data, such as database table scans and clustered index scans, or that run a tight loop in code. You can specify whether the processor allows the Intel hardware prefetcher to fetch streams of data and instructions from memory into the unified second-level cache when necessary.

The setting can be one of the following:

Adjacent Cache Line Prefetcher: The adjacent-cache-line prefetcher always prefetches the next cache line. Although this approach works well when data is accessed sequentially in memory, it can quickly litter the small Layer 2 cache with unneeded instructions and data if the system is not accessing data sequentially, causing frequently accessed instructions and code to leave the cache to make room for the adjacent-line data or instructions. You can specify whether the processor fetches cache lines in even or odd pairs instead of fetching just the required line.

The setting can be one of the following:

DCU Streamer Prefetch: Like the hardware prefetcher, the DCU streamer prefetcher prefetches additional streams of instructions or data upon detection of an access stride; however, it stores the streams in the tiny Layer 1 cache instead of the Layer 2 cache. This prefetcher is a Layer 1 data cache prefetcher. It detects multiple loads from the same cache line that occur within a time limit. Making the assumption that the next cache line is also required, the prefetcher loads the next line in advance to the Layer 1 cache from the Layer 2 cache or the main memory.

The setting can be one of the following:

DCU-IP prefetcher (Layer 1): The DCU-IP prefetcher predictably prefetches data into the Layer 1 cache on the basis of the recent instruction pointer load instruction history. You can specify whether the processor uses the DCU-IP prefetch mechanism to analyze historical cache access patterns and preload the most relevant lines in the Layer 1 cache.

The setting can be one of the following:

Data Reuse Technology: When enabled, Data Reuse reduces the frequency of L3 cache updates from the L1 cache and may improve performance by reducing the internal bandwidth consumed by constantly updating L1 cache lines in L3. This option is enabled by default.

LLC Prefetch:

This BIOS option configures the processor last level cache (LLC) prefetch feature as a result of the non-inclusive cache architecture. The LLC prefetcher exists on top of other prefetchers that can prefetch data into the core data cache unit (DCU) and mid-level cache (MLC). In some cases, setting this option to disabled can improve performance. Values for this BIOS option can be: Disabled: Disables the LLC prefetcher and forces data to fill in the MLC. The other core prefetchers are unaffected. Enabled: Gives the core prefetcher the ability to prefetch data directly to the LLC. By default, LLC prefetch option is disabled.

Power Performance Tuning:

This BIOS option determines how aggressively the CPU will be power managed and placed into turbo. With "BIOS Controls", the system controls the setting. Selecting "OS Controls" allows the operating system to control it. By default, "OS Controls" is enabled.

Memory Power Saving Mode:

This BIOS option controls the DIMM power savings mode policy. When set to Disabled, DIMMs do not enter power saving mode. When set to Slow, DIMMs can enter power saving mode, but the requirements are higher, so DIMMs enter power saving mode less frequently. When set to Fast, DIMMs enter power saving mode as often as possible. When set to Auto, the BIOS controls when a DIMM enters power saving mode based on the DIMM configuration. By default, Memory Power Saving Mode is set to Disabled.

Memory Refresh Rate:

This BIOS option controls the refresh rate of the memory controller and might affect the performance and resiliency of the server memory. This option sets the memory refresh rate to either 1x Refresh or 2x Refresh. By default, 2X Refresh is enabled.

Partial Cache Line Sparing:

This BIOS option provides an error-prevention mechanism in memory controllers. Partial Cache Line Sparing (PCLS) statically encodes the locations of the faulty nibbles of bits into a sparing directory along with the corresponding data content for replacement during memory accesses. By default, this option is Enabled.

ADDDC Sparing:

This BIOS option tracks correctable memory errors and dynamically maps out failing regions by putting those banks or ranks into virtual lockstep mode. This can prevent correctable errors from accumulating and becoming uncorrectable. When initiated, ADDDC allows the system to continue operation, at a reduced performance, until maintenance can be scheduled to repair the DIMM. ADDDC Sparing incurs a marginal performance impact when enabled. By default, this option is Enabled.

LV DDR Mode or Low Voltage DDR Mode and DRAM Clock Throttling:

This BIOS option controls the prioritization of memory operations. Setting this BIOS option in Power-saving-mode will prioritize low voltage memory operations over high frequency memory operations. This mode may lower memory frequency in order to keep the voltage low. By default, Power-saving-mode is enabled. Setting this BIOS option in Performance-mode will prioritize high frequency operations over low voltage operations.

Closed Loop Thermal Throttling:

This BIOS option allows the user to enable/disable temperature-based memory throttling. By default, this BIOS option is enabled. When it is enabled, the system BIOS will initiate memory throttling to manage memory performance by limiting bandwidth to the DIMMs, thereby capping the power consumption and preventing the DIMMs from overheating.

Memory RAS Configuration:

This BIOS option allows the user to configure memory reliability, availability and serviceability (RAS). When set to Maximum Performance, system performance is optimized; this is the default. When set to Mirroring, system reliability is optimized by using half the system memory as backup. When set to Lockstep, if the DIMM pairs in the server have an identical type, size, and organization and are populated across the SMI channels, lockstep mode can be used to minimize memory access latency and provide better performance. When set to Sparing, system reliability is enhanced with a degree of memory redundancy while making more memory available to the operating system than mirroring.

DRAM Refresh Rate:

This option controls the refresh interval rate for internal memory. By default, the refresh interval rate is set to Auto, which is 2X DRAM refresh (DRAM cells are refreshed every 32 ms). When this BIOS option is set to 1X, DRAM cells are refreshed every 64 ms.

Patrol Scrub:

This BIOS option is a memory RAS feature that runs a background memory scrub against all DIMMs, which can negatively impact performance. By default, this option is enabled. Disabling this option can improve performance.

QPI Snoop Configuration:

There are 4 snoop mode options for how to maintain cache coherency across the Intel QPI fabric, each with varying memory latency and bandwidth characteristics depending on how the snoop traffic is generated.

Cluster on Die (COD) mode logically splits a socket into 2 NUMA domains that are exposed to the OS with half the amount of cores and LLC assigned to each NUMA domain in a socket. This mode utilizes an on-die directory cache and in memory directory bits to determine whether a snoop needs to be sent. Use this mode for highly NUMA optimized workloads to get the lowest local memory latency and highest local memory bandwidth for NUMA workloads.

Home Directory Snoop with OSB is the Opportunistic Snoop Broadcast (OSB) directory mode. In this mode, the home agent (HA) can choose to do a speculative home snoop broadcast under very lightly loaded conditions, even before the directory information has been collected and checked.

In Home Snoop and Early Snoop modes, snoops are always sent; they just originate from different places: the caching agent (earlier) in Early Snoop mode and the home agent (later) in Home Snoop mode.

UPI Link Enablement:

This BIOS option allows the number of UPI links to be changed. Use this option to configure the UPI topology to use fewer links between processors, when available. Changing this option from the default can reduce UPI bandwidth performance in exchange for less power consumption. Values for this BIOS setting can be: 1, 2, 3, and Auto. By default, this BIOS option is set to Auto.

UPI Power Management:

Use this option to place the Ultra Path Interconnect (UPI) links into a low power state when the links are not being used. This lowers power usage with minimal effect on performance. Values for this BIOS setting can be: Enabled and Disabled. By default, this BIOS option is set to Disabled.

UPI Link Frequency Select:

This BIOS option sets the UPI link speed. Running the UPI links at a lower speed (frequency) can reduce power consumption, but can also affect system performance. By default, this BIOS option is set to Auto.

Sub NUMA Clustering:

This BIOS option provides similar localization benefits as cluster-on-die (COD), without some of COD’s downsides. Sub-NUMA clustering (SNC) breaks up the LLC into two disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. SNC improves average latency to the LLC (last level cache) and memory. Values for this BIOS option can be:

IMC Interleaving:

This BIOS option controls the interleaving between the Integrated Memory Controllers (IMCs). There are two IMCs per socket in Skylake Server. If IMC Interleaving is set to 2-way, addresses will be interleaved between the two IMCs. If IMC Interleaving is set to 1-way, there will be no interleaving. If SNC is disabled, IMC Interleaving should be set to 2-way. If SNC is enabled, IMC Interleaving should be set to 1-way. By default, IMC Interleaving is set to Auto, which is 2-way Interleaving.

LLC Dead Line:

With the Intel Xeon Scalable processors' non-inclusive cache scheme, mid-level cache (MLC) evictions are filled into the last level cache (LLC) if the data is shared across processor cores. When cache lines are evicted from the MLC, the processor core can flag them as "dead", meaning they are not likely to be read again. With this option, the LLC can be configured to drop dead lines and not fill them in the LLC. Values for this BIOS option can be: Disabled: By disabling this option, the dead lines will be dropped from the LLC. This provides better utilization in the LLC and prevents the LLC from evicting useful data. Enabled: By enabling this option, the processor will determine whether to keep or drop dead lines. By default, this option is Enabled.

Enhanced CPU performance:

This BIOS option helps users modify the Enhanced CPU Performance settings. When it is enabled, this option adjusts the processor settings and enables the processor to run aggressively, which can result in improved performance but may also result in higher power consumption. Values for this BIOS option can be Auto, Enabled, or Disabled. By default, Enhanced CPU Performance is set to Disabled.

XPT Remote Prefetch:

Use this option to configure the XPT Remote Prefetcher processor performance option. When enabled, this feature can improve remote read request latency from a processor core by directly accessing the UPI. Values for this BIOS setting can be auto, enabled, or disabled.

High Bandwidth:

Enabling this option allows the chipset to defer memory transactions and process them out of order for optimal performance.

CPUfreq governor:

The CPUfreq subsystem offers several tuning options for P-states: you can switch between the different governors, influence the minimum or maximum CPU frequency to be used, or change individual governor parameters. To switch to another governor at runtime, use "cpupower frequency-set" with the "-g" option.

Possible settings:

Performance: The CPU frequency is statically set to the highest possible for maximum performance. Consequently, saving power is not the focus of this governor.

On-demand: The kernel implementation of a dynamic CPU frequency policy: The governor monitors the processor usage. When it exceeds a certain threshold, the governor will set the frequency to the highest available. If the usage is less than the threshold, the next lowest frequency is used. If the system continues to be underutilized, the frequency is again reduced until the lowest available frequency is set.

Powersave: The CPU frequency is statically set to the lowest possible. This can have severe impact on the performance, as the system will never rise above this frequency no matter how busy the processors are.

Schedutil: The "schedutil" governor aims at better integration with the Linux kernel scheduler. Load estimation is achieved through the scheduler's Per-Entity Load Tracking (PELT) mechanism, which also provides information about the recent load.
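A short sketch for checking which governor is active before a run. It reads the standard sysfs file for cpu0 and reports "unavailable" when the file does not exist (for example in a virtual machine without cpufreq support); the variable names are illustrative.

```shell
gov_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

if [ -r "$gov_file" ]; then
  governor=$(cat "$gov_file")
else
  governor="unavailable"
fi
echo "cpu0 governor: ${governor}"

# Switching governors requires root, for example:
#   cpupower frequency-set -g performance
```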

submit= MYMASK=`printf '0x%x' \$((1<< \$SPECCOPYNUM))`; /usr/bin/taskset \$MYMASK $command

When running multiple copies of benchmarks, the SPEC config file feature submit is sometimes used to cause individual jobs to be bound to specific processors. This specific submit command is used for Linux. The elements of the command are described below:

/usr/bin/taskset [options] [mask] [pid | command [arg] ... ]:

taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. When taskset returns, it is guaranteed that the given program has been scheduled to a legal CPU.

The default behaviour of taskset is to run a new command with a given affinity mask:

taskset [mask] [command] [arguments]
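The MYMASK expression in the submit line computes a mask with a single bit set, at the position given by the copy number. The sketch below reproduces that arithmetic outside of a SPEC config file; the helper name copy_mask is an assumption for illustration, and the approach is limited to CPU numbers that fit in the shell's integer width.

```shell
# Affinity mask with only the bit for logical CPU $1 set,
# formatted the same way as MYMASK in the submit line.
copy_mask() {
  printf '0x%x' $((1 << $1))
}

# Copy 0 maps to CPU 0 (mask 0x1), copy 3 to CPU 3 (mask 0x8), and so on.
for copy in 0 1 2 3 8; do
  echo "copy ${copy} -> mask $(copy_mask "$copy")"
done
```

Inside the config file the dollar signs are escaped (\$SPECCOPYNUM) so that the expression is evaluated by the shell at submit time rather than when the config file is read.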