Saturday, July 27, 2013

Server performance improvement tweaks


Tickless System

Previously, the Linux kernel periodically interrupted each CPU on a system at a predetermined frequency (100 Hz, 250 Hz, or 1000 Hz, depending on the platform). The kernel queried the CPU about the processes it was executing and used the results for process accounting and load balancing. This interrupt, known as the timer tick, fired regardless of the power state of the CPU, so even an idle CPU was responding to up to 1000 of these requests every second. On systems that implemented power-saving measures for idle CPUs, the timer tick prevented the CPU from remaining idle long enough for the system to benefit from those savings.

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y
To set this option, enable CONFIG_NO_HZ in the kernel configuration and rebuild the kernel; the configuration of the currently running kernel is recorded in /boot/config-x.x.x.x-generic.
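A quick way to check whether the running kernel was built tickless is to grep its recorded config (this assumes your distribution keeps the config under /boot, as the Ubuntu -generic kernels do):
grep CONFIG_NO_HZ /boot/config-$(uname -r)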

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.
Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100
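Similarly, to see which timer frequency the running kernel was built with:
grep "^CONFIG_HZ" /boot/config-$(uname -r)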


Process Events Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system to watch mission-critical processes: if a process dies due to a signal (like SIGSEGV or SIGBUS) or exits unexpectedly, you get an asynchronous notification from the kernel. The process can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.
Applications that may find these events useful include accounting/auditing (for example, ELSA), system activity monitoring (for example, top), security, and resource management (for example, CKRM).
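The process events interface depends on a couple of kernel options; you can check the running kernel's recorded config for them (again assuming it is available under /boot):
grep -E "CONFIG_CONNECTOR|CONFIG_PROC_EVENTS" /boot/config-$(uname -r)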


TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead passed along the bus. TCP offload engine (TOE) is a related technology used in network interface cards (NICs) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where the processing overhead of the network stack becomes significant. If your NIC supports TSO, you can enable it with ethtool:
sudo ethtool -K eth1 tso on
Data corruption on NFS file systems has been encountered on network adapters that lack error-correcting code (ECC) memory and have TCP segmentation offloading (TSO) enabled in the driver. Note: data corrupted by the sender can still pass the checksum performed by the IP stack of the receiving machine. A possible workaround is to disable TSO on network adapters that do not support ECC memory.
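To apply that workaround, TSO can be switched off with the same ethtool flag used above (eth1 is just the example interface name):
sudo ethtool -K eth1 tso off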
You can check whether it is enabled using:
sudo ethtool -k eth1
and looking for the tcp-segmentation-offload line in the output. (The netstat -nt | findstr /i offloaded check, which lists connections in an Offloaded state, is the equivalent check for TCP offload on Windows, not Linux.)

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.
To check whether it is enabled:
dmesg | grep ioat
There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.
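For example, on kernels that expose the dmaengine statistics in sysfs, each channel directory carries a few simple counters (the channel name dma0chan0 below is only an example and will differ between machines):
ls /sys/class/dma/
cat /sys/class/dma/dma0chan0/memcpy_count
cat /sys/class/dma/dma0chan0/bytes_transferred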


Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NICs support DCA, the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA.
If that is the case, see blog post for how to enable DCA manually.
To check whether DCA is enabled:
dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:
ioatdma 0000:00:08.0: DCA is disabled in BIOS
Which means you’ll need to enable it in the BIOS or manually.
Kernel option: CONFIG_DCA=y
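You can also confirm that the relevant modules are loaded; on Intel platforms these are usually dca and ioatdma, though this depends on your kernel build:
lsmod | grep -E "^dca|^ioatdma"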


NAPI

New API (also referred to as NAPI) is an interface for using interrupt mitigation techniques with networking devices in the Linux kernel. The approach is intended to reduce the overhead of packet receiving: incoming message handling is deferred until enough messages are waiting that it is worth handling them all at once.
High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all. 
NAPI was first incorporated in the 2.5/2.6 kernel but was also backported to the 2.4.20 kernel. Note that use of NAPI is entirely optional; drivers will work just fine (though perhaps a little more slowly) without it. A driver may continue using the old 2.4 technique for interfacing to the network stack and not benefit from the NAPI changes. NAPI additions to the kernel do not break backward compatibility.
Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver.
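Whichever way your driver gets NAPI, one rough way to observe interrupt mitigation is to watch the NIC's interrupt counters under load and compare the rate with NAPI enabled and disabled (eth0 here is just an example interface name):
watch -n 1 'grep eth0 /proc/interrupts'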

To check your driver
# ethtool -i eth0
driver: e1000e
version: 2.1.4-k
firmware-version: 0.13-3
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no


Enable NAPI

Download the latest driver version for your NIC:
  1. Linux kernel driver for the Intel(R) PRO/100 Ethernet devices, Intel(R) PRO/1000 gigabit Ethernet devices, and Intel(R) PRO/10GbE devices.
To enable NAPI, compile the driver module, passing in a configuration option:
make CFLAGS_EXTRA=-DE1000_NAPI install
Once the build finishes, the NAPI-enabled driver is installed in place of the old one.
See the Intel e1000 documentation for more information.
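Reloading the module picks up the NAPI-enabled build; a minimal sketch, assuming the module is named e1000 and that you can tolerate the interface going down briefly:
sudo rmmod e1000
sudo modprobe e1000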

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass an InterruptThrottleRate option on the command line when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm classifies traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the "Lowest Latency" traffic class: dynamic (1) has a much more aggressive interrupt rate for that class.
insmod e1000e.ko InterruptThrottleRate=1
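If you want to see which throttle-related parameters your installed e1000e module actually accepts (names vary between driver versions), modinfo can list them:
modinfo -p e1000e | grep -i throttle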


OProfile

OProfile is a profiling system for Linux 2.6 and higher systems on a number of architectures. It is capable of profiling all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries and binaries. OProfile can profile the whole system in the background, collecting information at low overhead. These features make it ideal for profiling entire systems to determine bottlenecks in real-world systems.
Many CPUs provide "performance counters", hardware registers that can count "events"; for example, cache misses, or CPU cycles. OProfile provides profiles of code based on the number of these occurring events: repeatedly, every time a certain (configurable) number of events has occurred, the PC value is recorded. This information is aggregated into profiles for each binary image.
OProfile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for OProfile which collects data from the x86 Model Specific Registers (MSRs) to give very detailed information about the performance of running code. OProfile can also annotate source code with performance information to make fixing bottlenecks easy. See OProfile's homepage for more information.
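As a quick start, recent OProfile releases (0.9.8 and later) ship the operf front end; a minimal sketch, where ./myapp is a placeholder for whatever you want to profile:
operf ./myapp
opreport --symbols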


epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to scale easily to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.
Kernel option: CONFIG_EPOLL=y
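As with the other kernel options above, you can check the recorded config of the running kernel (assuming it lives under /boot):
grep CONFIG_EPOLL /boot/config-$(uname -r)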