Proxmox 5: Bad disk I/O on host and KVM but not LXC

fkh

Renowned Member
Hi guys

A couple of months ago I switched from Proxmox 4 to Proxmox 5 (fresh installation, no upgrade, OVH Proxmox pre-built template with software RAID 5, same host) and am now experiencing poor disk performance with KVM (I hadn't used KVM instances on Proxmox 5 until recently, only LXC containers). Maybe someone has experienced something similar or has some advice on how I could proceed from here.

I ran some disk I/O benchmarks as described at https://www.binarylane.com.au/support/solutions/articles/1000055889-how-to-benchmark-disk-i-o (basically fio and ioping). The weird thing is that both the host itself and the KVM guests have horrible disk I/O, while the LXC containers run just fine. Running an apt upgrade/yum update/zypper update on multiple KVM instances at the same time brings the system almost to a halt, and it takes forever to finish. Doing the same in LXC works just fine.

Could this be related to LVM or the software RAID 5? Is there anything I can do to test and tweak settings? Any help is much appreciated.
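For what it's worth, checking whether the software RAID is healthy or currently resyncing could be done with something along these lines (a sketch; md2 and md4 are the arrays that show up in the fstab and pvdisplay output below):
Code:
cat /proc/mdstat            # lists all md arrays and any resync/rebuild progress
mdadm --detail /dev/md2     # root array
mdadm --detail /dev/md4     # array backing the pve volume group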

Code:
cat /etc/pve/qemu-server/900.conf
balloon: 512
bootdisk: virtio0
cores: 2
ide2: local:iso/ubuntu-16.04-server.iso,media=cdrom
memory: 2048
name: ubuntu1
net0: virtio=0E:36:91:CB:D3:37,bridge=vmbr1
numa: 0
ostype: l26
parent: python2
scsihw: virtio-scsi-pci
smbios1: uuid=135b3d73-6b52-4591-8156-0b3fbf968dca
sockets: 1
virtio0: local:900/vm-900-disk-1.qcow2,size=32G

[initial]
balloon: 512
bootdisk: virtio0
cores: 2
ide2: local:iso/ubuntu-16.04-server.iso,media=cdrom
machine: pc-i440fx-2.9
memory: 2048
name: ubuntu1
net0: virtio=0E:36:91:CB:D3:37,bridge=vmbr1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=135b3d73-6b52-4591-8156-0b3fbf968dca
snaptime: 1504904522
sockets: 1
virtio0: local:900/vm-900-disk-1.qcow2,size=32G
vmstate: local:900/vm-900-state-initial.raw

[python2]
balloon: 512
bootdisk: virtio0
cores: 2
ide2: local:iso/ubuntu-16.04-server.iso,media=cdrom
machine: pc-i440fx-2.9
memory: 2048
name: ubuntu1
net0: virtio=0E:36:91:CB:D3:37,bridge=vmbr1
numa: 0
ostype: l26
parent: initial
scsihw: virtio-scsi-pci
smbios1: uuid=135b3d73-6b52-4591-8156-0b3fbf968dca
snaptime: 1504905988
sockets: 1
virtio0: local:900/vm-900-disk-1.qcow2,size=32G
vmstate: local:900/vm-900-state-python2.raw
https://www.ovh.de/dedicated_server/bare-metal-servers/sp-64-d.xml
Code:
CPU: Intel Xeon E5-1620v2
4 cores / 8 threads, 3.7 GHz / 3.9 GHz
RAM: 64 GB DDR3
Disks: 4 x 2 TB HDD (SATA)
Code:
pveversion -v
proxmox-ve: 5.0-21 (running kernel: 4.10.15-1-pve)
pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.17-3-pve: 4.10.17-21
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-5
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.6-pve500
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
Code:
cat /etc/fstab
# <file system>    <mount point>    <type>    <options>    <dump>    <pass>
/dev/md2    /    ext4    errors=remount-ro    0    1
/dev/pve/data    /var/lib/vz    xfs    defaults    1    2
/dev/pve/backup    /backup    xfs    defaults    1    2
/dev/pve/log    /var/log    xfs    defaults    1    2
/dev/sda3    swap    swap    defaults    0    0
/dev/sdb3    swap    swap    defaults    0    0
/dev/sdc3    swap    swap    defaults    0    0
/dev/sdd3    swap    swap    defaults    0    0
proc            /proc   proc    defaults        0       0
sysfs           /sys    sysfs   defaults        0       0

pvdisplay
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  --- Physical volume ---
  PV Name               /dev/md4
  VG Name               pve
  PV Size               5.38 TiB / not usable 2.50 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              1409648
  Free PE               1024
  Allocated PE          1408624
  PV UUID               t54g78-AwW8-1R48-QhEy-O08E-BYg4-XxRJyQ

vgdisplay
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.38 TiB
  PE Size               4.00 MiB
  Total PE              1409648
  Alloc PE / Size       1408624 / 5.37 TiB
  Free  PE / Size       1024 / 4.00 GiB
  VG UUID               LSkTtU-L4Rd-XiPU-YYM9-c6RW-r3dh-hDMzVo

lvdisplay
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  --- Logical volume ---
  LV Path                /dev/pve/data
  LV Name                data
  VG Name                pve
  LV UUID                1UCYSt-Ny0T-BsVY-hIGt-Ru2N-RkYJ-JK0737
  LV Write Access        read/write
  LV Creation host, time rescue.ovh.net, 2017-07-05 19:46:17 +0000
  LV Status              available
  # open                 1
  LV Size                1.90 TiB
  Current LE             498975
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:0

  --- Logical volume ---
  LV Path                /dev/pve/backup
  LV Name                backup
  VG Name                pve
  LV UUID                9vKzTk-fGIY-igA5-DcXf-4Bya-bDp0-Nsd6eA
  LV Write Access        read/write
  LV Creation host, time rescue.ovh.net, 2017-07-05 19:46:17 +0000
  LV Status              available
  # open                 1
  LV Size                3.47 TiB
  Current LE             908650
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/pve/log
  LV Name                log
  VG Name                pve
  LV UUID                mn3b3j-MziC-lXVr-aeWk-7MDd-IIv7-Zb4omE
  LV Write Access        read/write
  LV Creation host, time rescue.ovh.net, 2017-07-05 19:46:17 +0000
  LV Status              available
  # open                 1
  LV Size                3.90 GiB
  Current LE             999
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:2

systemctl status lvm2-lvmetad
● lvm2-lvmetad.service - LVM2 metadata daemon
   Loaded: loaded (/lib/systemd/system/lvm2-lvmetad.service; disabled; vendor preset: enabled)
   Active: active (running) since Sat 2017-08-12 15:32:34 UTC; 1 months 12 days ago
     Docs: man:lvmetad(8)
 Main PID: 20865 (lvmetad)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/lvm2-lvmetad.service
           └─20865 /sbin/lvmetad -f

Aug 12 15:32:34 phost1 systemd[1]: Started LVM2 metadata daemon.

cat /etc/lvm/lvm.conf | grep locking_type
    locking_type = 1
Host: iops (read/write) = 387/128
Code:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.9
Starting 1 process
Jobs: 1 (f=1): [m] [100.0% done] [2320K/732K /s] [580 /183  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=31847: Sun Sep 24 09:04:41 2017
  read : io=787416KB, bw=1551.1KB/s, iops=387 , runt=507369msec
  write: io=261160KB, bw=527087 B/s, iops=128 , runt=507369msec
  cpu          : usr=0.23%, sys=1.05%, ctx=249719, majf=0, minf=4
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=196854/w=65290/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=787416KB, aggrb=1551KB/s, minb=1551KB/s, maxb=1551KB/s, mint=507369msec, maxt=507369msec
  WRITE: io=261160KB, aggrb=514KB/s, minb=514KB/s, maxb=514KB/s, mint=507369msec, maxt=507369msec

Disk stats (read/write):
    md2: ios=196835/66444, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=49297/66919, aggrmerge=892/3383, aggrticks=5429360/5054483, aggrin_queue=10483838, aggrutil=98.62%
  sdd: ios=48438/66934, merge=719/4886, ticks=4359940/3694648, in_queue=8054572, util=97.34%
  sdc: ios=49712/66852, merge=1161/2831, ticks=4458780/3641252, in_queue=8100008, util=97.41%
  sdb: ios=49184/66868, merge=917/2759, ticks=6445852/6584964, in_queue=13030816, util=98.62%
  sda: ios=49855/67022, merge=772/3059, ticks=6452868/6297068, in_queue=12749956, util=98.56%
Code:
ioping -c 10 /
4 KiB <<< / (ext4 /dev/md2): request=1 time=71.0 us (warmup)
4 KiB <<< / (ext4 /dev/md2): request=2 time=68.1 us
4 KiB <<< / (ext4 /dev/md2): request=3 time=73.4 us
4 KiB <<< / (ext4 /dev/md2): request=4 time=16.5 ms
4 KiB <<< / (ext4 /dev/md2): request=5 time=83.2 us
4 KiB <<< / (ext4 /dev/md2): request=6 time=6.53 ms
4 KiB <<< / (ext4 /dev/md2): request=7 time=67.1 us (fast)
4 KiB <<< / (ext4 /dev/md2): request=8 time=14.4 ms
4 KiB <<< / (ext4 /dev/md2): request=9 time=111.4 us (fast)
4 KiB <<< / (ext4 /dev/md2): request=10 time=84.3 us (fast)

--- / (ext4 /dev/md2) ioping statistics ---
9 requests completed in 37.9 ms, 36 KiB read, 237 iops, 950.1 KiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 67.1 us / 4.21 ms / 16.5 ms / 6.34 ms

ioping -c 10 /dev/pve/data
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=1 time=12.8 ms (warmup)
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=2 time=25.9 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=3 time=48.8 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=4 time=8.64 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=5 time=5.59 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=6 time=16.8 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=7 time=27.9 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=8 time=26.1 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=9 time=10.4 ms
4 KiB <<< /dev/pve/data (block device 1.90 TiB): request=10 time=10.8 ms

--- /dev/pve/data (block device 1.90 TiB) ioping statistics ---
9 requests completed in 180.8 ms, 36 KiB read, 49 iops, 199.1 KiB/s
generated 10 requests in 9.01 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 5.59 ms / 20.1 ms / 48.8 ms / 12.9 ms

ioping -c 10 /var/lib/vz/
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=1 time=75.0 us (warmup)
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=2 time=75.6 us
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=3 time=71.5 us
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=4 time=65.9 us
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=5 time=5.19 ms
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=6 time=67.8 us
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=7 time=70.7 us (fast)
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=8 time=31.3 ms (slow)
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=9 time=92.9 us (fast)
4 KiB <<< /var/lib/vz/ (xfs /dev/dm-0): request=10 time=96.2 us (fast)

--- /var/lib/vz/ (xfs /dev/dm-0) ioping statistics ---
9 requests completed in 37.0 ms, 36 KiB read, 243 iops, 972.0 KiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 65.9 us / 4.12 ms / 31.3 ms / 9.74 ms
LXC container: iops (read/write) = 208.2k/69.7k
Code:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.9
Starting 1 process
Jobs: 1 (f=1)
test: (groupid=0, jobs=1): err= 0: pid=24494: Sun Sep 24 11:05:54 2017
  read : io=785468KB, bw=832946KB/s, iops=208236 , runt=   943msec
  write: io=263108KB, bw=279012KB/s, iops=69752 , runt=   943msec
  cpu          : usr=21.23%, sys=78.56%, ctx=14, majf=0, minf=4
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=196367/w=65777/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=785468KB, aggrb=832945KB/s, minb=832945KB/s, maxb=832945KB/s, mint=943msec, maxt=943msec
  WRITE: io=263108KB, aggrb=279011KB/s, minb=279011KB/s, maxb=279011KB/s, mint=943msec, maxt=943msec

Disk stats (read/write):
  loop0: ios=186981/62680, merge=0/0, ticks=568/240, in_queue=784, util=77.26%
Code:
ioping -c 10 .
4 KiB from . (ext4 /dev/loop0): request=1 time=32 us
4 KiB from . (ext4 /dev/loop0): request=2 time=37 us
4 KiB from . (ext4 /dev/loop0): request=3 time=31 us
4 KiB from . (ext4 /dev/loop0): request=4 time=30 us
4 KiB from . (ext4 /dev/loop0): request=5 time=29 us
4 KiB from . (ext4 /dev/loop0): request=6 time=28 us
4 KiB from . (ext4 /dev/loop0): request=7 time=29 us
4 KiB from . (ext4 /dev/loop0): request=8 time=31 us
4 KiB from . (ext4 /dev/loop0): request=9 time=55 us
4 KiB from . (ext4 /dev/loop0): request=10 time=76 us

--- . (ext4 /dev/loop0) ioping statistics ---
10 requests completed in 9.00 s, 26.5 k iops, 103.3 MiB/s
min/avg/max/mdev = 28 us / 37 us / 76 us / 14 us
KVM guest: iops (read/write) = 201/67
Code:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.9
Starting 1 process
Jobs: 1 (f=1): [m] [100.0% done] [1468K/376K /s] [367 /94  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1658: Sun Sep 24 11:23:29 2017
  read : io=786532KB, bw=824913 B/s, iops=201 , runt=976355msec
  write: io=262044KB, bw=274831 B/s, iops=67 , runt=976355msec
  cpu          : usr=0.07%, sys=0.38%, ctx=113244, majf=0, minf=4
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=196633/w=65511/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=786532KB, aggrb=805KB/s, minb=805KB/s, maxb=805KB/s, mint=976355msec, maxt=976355msec
  WRITE: io=262044KB, aggrb=268KB/s, minb=268KB/s, maxb=268KB/s, mint=976355msec, maxt=976355msec

Disk stats (read/write):
    dm-0: ios=196614/66076, merge=0/0, ticks=28239808/33076984, in_queue=61340368, util=100.00%, aggrios=196697/66093, aggrmerge=0/207, aggrticks=28260604/33024900, aggrin_queue=61285444, aggrutil=100.00%
  vda: ios=196697/66093, merge=0/207, ticks=28260604/33024900, in_queue=61285444, util=100.00%
Code:
ioping -c 10 .
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=1 time=140 us
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=2 time=25.5 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=3 time=263 us
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=4 time=18.7 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=5 time=4.97 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=6 time=1.18 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=7 time=6.59 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=8 time=221 us
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=9 time=8.47 ms
4 KiB from . (ext4 /dev/mapper/ubuntu3--vg-root): request=10 time=187 us
 
For your fio tests, you should use at least twice the amount of RAM as the test size; otherwise caching effects can lead to strange results.

ioping times in the microsecond range for hard disks are bogus; every disk I have ever seen has > 2 ms seek time. Your LXC results are completely off the charts and cannot be real.
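A minimal sketch of a run that should avoid the page cache on the 64 GB host above (same command as before, only the size changed; 128G is simply 2x RAM):
Code:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=128G --readwrite=randrw --rwmixread=75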
 
Hi LnxBil

Thanks for the hint, I'll do another run with twice the memory. For LXC it seems I have to use the host memory (64 GB) * 2 instead of the actual container memory * 2, otherwise the results are still very fast. bonnie++ also seems to think that the LXC container has 64 GB of memory. Maybe that's why fio was so fast in LXC.

Cheers
 
Try to change your KVM from virtio to ide:
from:
virtio0: local:900/vm-900-disk-1.qcow2,size=32G

to:
ide0: local:900/vm-900-disk-1.qcow2,size=32G

stop/start the VM and try again.
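The same change could probably also be made from the CLI instead of editing the config file directly (a sketch, assuming VMID 900 as above; detach first so the volume can be re-attached on the IDE bus):
Code:
qm set 900 --delete virtio0                                # disk then shows up as an unused volume
qm set 900 --ide0 local:900/vm-900-disk-1.qcow2,size=32G   # re-attach the same volume on IDE
qm set 900 --bootdisk ide0
qm stop 900 && qm start 900                                # full stop/start, not a guest reboot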
 
IDE? That is the slowest of all possible options. You should stick to the choices the Proxmox VE GUI suggests based on the guest OS, which are proven to be the best performance/stability combination. For Linux this is (on PVE 5) SCSI with the VirtIO-based SCSI controller. The wiki also states that if you need the best performance, you should use the VirtIO SCSI single controller and enable iothread:

https://pve.proxmox.com/wiki/Qemu/KVM_Virtual_Machines#qm_virtual_machines_settings
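Applied to the VM above, that would look roughly like this in the config (a sketch; scsi0 replaces virtio0, and the boot disk setting has to follow):
Code:
scsihw: virtio-scsi-single
scsi0: local:900/vm-900-disk-1.qcow2,size=32G,iothread=1
bootdisk: scsi0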
 
Hi guys

I ran fio with 128 GB in LXC, but it didn't finish after more than two days, so I cancelled it. I ran dd and bonnie++ instead. It seems that KVM is not much slower than LXC, so I think I was wrong in assuming that disk I/O is the reason for the difference. What I noticed is that pveperf reports different fsync/s values from run to run (sometimes the value is even double that of a previous run), so maybe that is what I was seeing when the KVM instances felt slow. I guess I have to wait and see whether I find some other reason for the differences. Luckily it's only a dev box, and most of the time I use LXC anyway.
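For anyone who wants to reproduce that: pveperf can also be pointed at a specific mount point, e.g. the directory storage holding the images (paths below are just examples):
Code:
pveperf                # tests / by default
pveperf /var/lib/vz    # test the storage that holds the VM images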

@micro I did test IDE, but it didn't speed things up for me. Thank you for the suggestion anyway.
 
I ran fio with 128 GB in LXC, but it didn't finish after more than two days, so I cancelled it.

I normally use time-bound tests; that makes the runtime predictable. I also use block devices for tests, not files: the device is already there and the test can start immediately. But keep in mind that write tests will write directly to the block device and destroy whatever is on it.
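A time-bound variant of the fio command used above would look roughly like this (runtime and size are just example values; pointing --filename at a block device instead will overwrite its contents):
Code:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --time_based --runtime=300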


Oh, very interesting. I was not aware of that bug.
 
