June 4, 2013

High CPU usage on DB host?

It's a Unix 101 statement, but I have heard it stated wrong so many times that I decided to write a blog post about it.

In an OLTP environment, if %wa (iowait) in ‘top’ is the major contributor to the CPU looking busy, adding CPUs will not help. There is no need for more CPU. Period.

************** %wa IS A PERCENTAGE THAT COUNTS TOWARD CPU IDLE ******************
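
iowait is just another flavor of idle time in the kernel's CPU accounting. If you want to look at the raw counters that top derives these percentages from, /proc/stat exposes them directly (the fields after "cpu" are user, nice, system, idle, iowait, irq, softirq, steal, in clock ticks); a quick check:

user1@myhost:~$ grep '^cpu ' /proc/stat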

Simple test.

Push some I/O:

user1@myhost:~$ dd if=/dev/zero of=/tmp/file1 conv=notrunc bs=1000 count=3000000 & 
[1] 31240 
user1@myhost:~$ dd if=/dev/zero of=/tmp/file2 conv=notrunc bs=1000 count=3000000 & 
[2] 31241 
user1@myhost:~$ dd if=/dev/zero of=/tmp/file3 conv=notrunc bs=1000 count=3000000 & 
[3] 31242 
user1@myhost:~$ dd if=/dev/zero of=/tmp/file4 conv=notrunc bs=1000 count=3000000 & 
[4] 31243  

The top output looks like this:

user1@myhost:~$ top -b -i  

top - 23:05:42 up 8:37, 12 users, load average: 4.36, 3.91, 6.28 
Tasks: 239 total, 5 running, 230 sleeping, 0 stopped, 4 zombie 
Cpu(s): 3.1%us, 20.5%sy, 0.0%ni, 12.9%id, 63.3%wa, 0.0%hi, 0.3%si, 0.0%st 
Mem: 4080460k total, 3809420k used, 271040k free, 1580k buffers 
Swap: 4145148k total, 104240k used, 4040908k free, 1824928k cached 

PID   USER     PR NI VIRT RES SHR S %CPU %MEM TIME+   COMMAND
31240 amoseyev 20 0  4376 588 496 D 2    0.0  0:12.29 dd
31241 amoseyev 20 0  4376 588 500 D 2    0.0  0:12.32 dd
31242 amoseyev 20 0  4376 592 500 D 2    0.0  0:12.38 dd
31243 amoseyev 20 0  4376 592 500 D 1    0.0  0:11.50 dd

%wa is high, and iostat consistently shows write throughput of about 44 MB/sec:

user1@myhost:~$ iostat 1 1000  

avg-cpu: %user 	%nice 	%system %iowait %steal 	%idle 
         1.76 	0.00 	12.09 	50.13 	0.00 	36.02 
		 
Device: tps 	kB_read/s kB_wrtn/s kB_read kB_wrtn 
sda     2111.00 8352.00   45668.00  8352    45668  

As shown above, the dd processes that generate the I/O load are almost always in the “D” state, which is uninterruptible sleep.
“Uninterruptible” comes from the fact that they cannot be killed while the process is in kernel mode (an I/O call has to be done in kernel mode).
It is uninterruptible, but it is still a SLEEP. The process is idle; it does not block the CPU. If any other thread needs the CPU (either for number crunching or for another I/O call), the scheduler will put it on the CPU while dd sleeps.
But if there is no other CPU load available, top counts those otherwise-idle CPU cycles as %wa.
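
If you want to confirm which processes are sitting in uninterruptible sleep and roughly what kernel routine they are blocked in, ps can print the state and wchan fields; a quick sketch (run it while the dd jobs are active):

user1@myhost:~$ ps -eo pid,stat,wchan:25,comm | awk '$2 ~ /^D/'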

Now push some real CPU load:

user1@myhost:~$ cat /dev/urandom > /dev/null & 
[1] 31224 
user1@myhost:~$ cat /dev/urandom > /dev/null & 
[2] 31225 
user1@myhost:~$ cat /dev/urandom > /dev/null & 
[3] 31229 
user1@myhost:~$ cat /dev/urandom > /dev/null & 
[4] 31231 

user1@myhost:~$ 
user1@myhost:~$ top -b -i 

top - 23:19:16 up 8:50, 12 users, load average: 7.84, 7.15, 7.10 
Tasks: 239 total, 6 running, 229 sleeping, 0 stopped, 4 zombie 
Cpu(s): 0.8%us, 98.1%sy, 0.0%ni, 0.0%id, 0.5%wa, 0.0%hi, 0.6%si, 0.0%st 
Mem: 4080460k total, 3838860k used, 241600k free, 2168k buffers 
Swap: 4145148k total, 104240k used, 4040908k free, 2264144k cached 

PID   USER     PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
31279 amoseyev 20 0  4220 544 456 R 96   0.0  0:16.22 cat 
31281 amoseyev 20 0  4220 544 456 R 87   0.0  0:13.96 cat 
31280 amoseyev 20 0  4220 544 456 R 83   0.0  0:15.20 cat 
31278 amoseyev 20 0  4220 540 456 R 80   0.0  0:16.01 cat 
31241 amoseyev 20 0  4376 588 500 D 2    0.0  0:14.29 dd 
31242 amoseyev 20 0  4376 592 500 D 2    0.0  0:14.26 dd 
31240 amoseyev 20 0  4376 588 496 D 1    0.0  0:14.17 dd 
31243 amoseyev 20 0  4376 592 500 D 1    0.0  0:13.34 dd  

%wa dropped to almost 0, and %sy is close to 100%.
So when the CPU spends its cycles on real load, the waiting I/O no longer adds up to %wa. At the same time, I/O throughput did not change even with the CPU 100% busy:

user1@myhost:~$ iostat sda 1 1000 
avg-cpu: %user %nice %system %iowait %steal %idle 
         1.50  0.00  98.50   0.00    0.00   0.00 

Device: tps      kB_read/s kB_wrtn/s kB_read kB_wrtn 
sda     1415.00  5376.00   44500.00  5376    44500  

Which again shows that OLTP I/O does not need that much CPU.
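
A per-process view shows the same picture. As a sketch (pidstat comes with the sysstat package), the following prints CPU usage and disk writes side by side; the dd processes should show large kB_wr/s numbers but only a few percent of CPU:

user1@myhost:~$ pidstat -u -d 1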

Processes in the “D” state also add to the load average, so the load average is also not the best value for judging how busy the CPU is.
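
If you want to separate runnable work from processes blocked on I/O, vmstat reports them in different columns (r = runnable, b = blocked in uninterruptible sleep), which is often more telling than the load average alone; a quick sketch:

user1@myhost:~$ vmstat 1 5
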
Everything said above applies to OLTP: lots of small I/Os, whether over NAS, SAN, raw/block devices, or a file system. All of them.
If we are talking about some crazy 1+ GB/sec full table scans in the OLAP/DW world, the CPU probably would be affected, especially with NFS (and not Direct NFS). But that shows up mostly in %sy and %si (not %wa), since Ethernet traffic is handled through soft interrupts, and at high throughput that is CPU intensive. Context switches may also add to CPU usage on some platforms where they are still used for switching between user and kernel modes.
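
If you suspect that scenario, a per-CPU breakdown makes it visible. As a sketch (assuming the sysstat package is installed), mpstat shows the %soft column per CPU, and /proc/softirqs shows which softirq types are firing:

user1@myhost:~$ mpstat -P ALL 1
user1@myhost:~$ watch -d -n1 cat /proc/softirqs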