Prox3.4 load / iowait issues

dik23 · Apr 22, 2015

I've been running Proxmox @ OVH for years with very little problem. I've now hit something that I'm stuck on.

Whenever I move a large-ish volume of data across folders on a single Prox node I end up with iowait issues. The cpu will be next to idle but the load will shoot up to ridiculous levels, if I'm not careful it can reach over 80 !

I've tried multiple ways of moving data. For example using cp to move a directory of ~66GB containing ~5300 files. With rsync moving a directory of ~21GB containing ~12000 files. With tar backing up a directory of ~21GB containing ~12000 files. Restoring data to a Zimbra server running in an OpenVZ container.

The high load doesn't appear straight way and seems to occur at a random point in the move. Once it's occurred when I've missed it and then fallen back to normal levels again while the process is still active and before it's finished. I am now having to run a script while doing larger data moves to monitor / kill the process if the load goes too high, but this shouldn't be necessary.

I have thought that perhaps it was a hardware issue but the OVH control panel reports nothing. A couple of weeks ago I booted into rescue mode and again saw nothing reported. I have contacted OVH support who agree, after a long email chain, that there's no hardware problems.

I have also performed multiple tests on the mdadm raid array and can find nothing there.

I've attached some files for extra View attachment fdisk.txt View attachment lspci.txt View attachment mdstat.txt View attachment parted.txt View attachment pveperf-root.txtinformation.

Code:

# pveversion -v
proxmox-ve-2.6.32: 3.4-150 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

dik23 · Apr 22, 2015

Here's a couple of other attachments.

Thanks in advance for any help / comments

mir · Apr 22, 2015

How much free space does your disks have?

dik23 · Apr 22, 2015

Lots

Code:

# df -h
Filesystem                Size  Used Avail Use% Mounted on
udev                       10M     0   10M   0% /dev
tmpfs                     3.2G   61M  3.1G   2% /run
/dev/md2                   15G  3.9G   10G  28% /
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     6.7G   31M  6.7G   1% /run/shm
/dev/mapper/pve-data      1.6T  874G  662G  57% /var/lib/vz
/dev/mapper/pve-backup    197G   89G   99G  48% /root/backup
/dev/fuse                  30M   32K   30M   1% /etc/pve

mir · Apr 22, 2015

Your pveperf test

Code:

pveperf /
CPU BOGOMIPS:      54400.48
REGEX/SECOND:      1528862
HD SIZE:           14.53 GB (/dev/md2)
BUFFERED READS:    17.03 MB/sec
AVERAGE SEEK TIME: 12.79 ms
FSYNCS/SECOND:     392.23

1) Buffered reads are extremely slow. It should normally be in the range 100 - 160 MB/sec given SATA3
2) fsyncs/second is also extremely slow. Recommended minimum is 600.

What is boils down to: I think you have an alignment problem given your disks are 4k disks.

Read this for inspiration: http://dennisfleurbaaij.blogspot.com/2013/01/setting-up-linux-mdadm-raid-array-with.html

dik23 · Apr 22, 2015

Are you suggesting that I should reformat the drives and start again ?

These pveperf results are with CTs running, will that make much difference ?

dik23 · Apr 23, 2015

I've now stopped all CTs and run pveperf again, please see attached

Are these results closer to what they should be ?

View attachment pveperf-vz2.txt

View attachment pveperf-root2.txt

dik23 · Apr 23, 2015

Also, how it it possible that a CT with ioprio set to minimum can cause such high iowait ?

Surely this setting should prevent this, or have I misunderstood ?

Code:

[COLOR=#000000][FONT=tahoma]vzctl set XXX --ioprio 0 --save[/FONT][/COLOR]

Search

Search

Prox3.4 load / iowait issues

dik23

Well-Known Member

dik23

Well-Known Member

Attachments

mir

Famous Member

dik23

Well-Known Member

mir

Famous Member

dik23

Well-Known Member

dik23

Well-Known Member

dik23

Well-Known Member