KVM slow I/O with DL380 G8 + 1 GB cache

erfus

Member
Sep 17, 2020
Hi all!

I've got an HP DL380 G8 with 2x 300 GB 15k SAS in RAID 1 and 4x 300 GB 15k SAS in RAID 10. The controller has 1 GB of cache, write cache is enabled, and a BBU is present. Controller status is OK.

I've just set it up (yes, the RAID resync is long done) and moved my VMs over from my other server. LXC containers are fine; 1 GB of write cache is amazing.
But I've noticed bad performance in KVM VMs. At first it was just a feeling: apt-get upgrades were way slower, and docker image unpacking took far too long. So I spun up a test VM with GRML live and a 40 GB image on the RAID 10 array. Settings are VirtIO SCSI with writeback enabled (which should be OK with a BBU, right?).

I did a simple dd test:
First, on the PVE Host (Raid1 Array):

root@neto:~# dd if=/dev/zero of=test.img bs=512 count=10000 oflag=direct
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.553076 s, 9.3 MB/s


then in the KVM VM (Raid10 Array):

root@grml:~# dd if=/dev/zero of=/mnt/test.img bs=512 count=10000 oflag=direct
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 2.22463 s, 2.3 MB/s


As you can see, KVM is a LOT slower even though it's on the faster RAID array (RAID 10 vs. RAID 1). Both arrays are idle; no other VMs or services are running.
I'm aware that dd is not the best benchmark, but real-world performance is also way too low.
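For what it's worth, dd with bs=512 and oflag=direct is really a synchronous latency test: each 512 B write must complete before the next one is issued, so throughput is just block size divided by per-request latency. A quick back-of-the-envelope from the two runs above (numbers taken from the dd output):

```python
# dd with bs=512 oflag=direct issues one synchronous 512 B write at a time,
# so elapsed time / request count gives the per-write round-trip latency.
COUNT = 10000             # requests issued by dd

host_seconds = 0.553076   # PVE host (RAID 1) run from above
vm_seconds = 2.22463      # KVM guest (RAID 10) run from above

host_lat_us = host_seconds / COUNT * 1e6
vm_lat_us = vm_seconds / COUNT * 1e6

print(f"host: {host_lat_us:.0f} us/write, vm: {vm_lat_us:.0f} us/write")
print(f"per-write overhead in the VM: {vm_lat_us - host_lat_us:.0f} us")
```

So the VM adds roughly 170 µs per request on top of ~55 µs on the host, which a cache-backed controller hides for bandwidth but not for this kind of serialized small I/O.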

Any ideas on that topic?

Thanks!

Greetings,
erfus
 
Hi all,

OK, maybe I'm on the wrong track here.
I've done one more test: a KVM VM with a 40 GB disk on the RAID 10 lvm-thin array, and an LXC container on the same array. I ran the fio test from https://pve.proxmox.com/wiki/Iscsi/tests and after that the dd test again.
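For readers without the wiki page at hand: judging from the output below (randrw, 512 B to 64 KiB blocks, iodepth 64, 4 GiB file), the job file is an IOmeter "file server" style pattern. A rough sketch of such a config (the exact split and mix here are assumptions; check against the wiki page):

```ini
; Sketch of an IOmeter file-server style fio job (parameters assumed,
; not copied from the wiki page).
[global]
ioengine=libaio
direct=1
iodepth=64
size=4g

[iometer]
rw=randrw
rwmixread=80
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
```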

Results:
KVM:
Code:
root@grml /mnt # fio /root/fio.cfg
iometer: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=64
fio-3.20
Starting 1 process
iometer: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=20.7MiB/s,w=5404KiB/s][r=7304,w=1864 IOPS][eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=2472: Thu Sep 17 13:28:29 2020
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read: IOPS=7680, BW=35.2MiB/s (36.9MB/s)(3279MiB/93080msec)
    slat (usec): min=6, max=6892, avg=43.36, stdev=25.95
    clat (usec): min=158, max=24633, avg=6572.08, stdev=1343.58
     lat (usec): min=193, max=24734, avg=6617.40, stdev=1350.43
    clat percentiles (usec):
     |  1.00th=[ 2769],  5.00th=[ 3490], 10.00th=[ 5276], 20.00th=[ 5997],
     | 30.00th=[ 6325], 40.00th=[ 6521], 50.00th=[ 6718], 60.00th=[ 6915],
     | 70.00th=[ 7111], 80.00th=[ 7373], 90.00th=[ 7701], 95.00th=[ 8029],
     | 99.00th=[ 9110], 99.50th=[10421], 99.90th=[18220], 99.95th=[21365],
     | 99.99th=[23200]
   bw (  KiB/s): min=20576, max=115171, per=100.00%, avg=36159.55, stdev=17728.37, samples=185
   iops        : min= 6650, max=14820, avg=7685.37, stdev=1254.97, samples=185
  write: IOPS=1927, BW=8984KiB/s (9199kB/s)(817MiB/93080msec); 0 zone resets
    slat (usec): min=7, max=14810, avg=259.52, stdev=175.38
    clat (usec): min=595, max=23775, avg=6532.17, stdev=1328.83
     lat (usec): min=1160, max=24674, avg=6793.41, stdev=1366.77
    clat percentiles (usec):
     |  1.00th=[ 2769],  5.00th=[ 3490], 10.00th=[ 5211], 20.00th=[ 5997],
     | 30.00th=[ 6259], 40.00th=[ 6456], 50.00th=[ 6652], 60.00th=[ 6849],
     | 70.00th=[ 7046], 80.00th=[ 7308], 90.00th=[ 7701], 95.00th=[ 8029],
     | 99.00th=[ 9110], 99.50th=[10421], 99.90th=[17695], 99.95th=[21103],
     | 99.99th=[22938]
   bw (  KiB/s): min= 4992, max=29278, per=100.00%, avg=9004.47, stdev=4332.77, samples=185
   iops        : min= 1678, max= 3630, avg=1929.21, stdev=317.73, samples=185
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%, 4=6.76%, 10=92.62%, 20=0.54%, 50=0.08%
  cpu          : usr=15.32%, sys=51.20%, ctx=317484, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=714864,179431,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=35.2MiB/s (36.9MB/s), 35.2MiB/s-35.2MiB/s (36.9MB/s-36.9MB/s), io=3279MiB (3439MB), run=93080-93080msec
  WRITE: bw=8984KiB/s (9199kB/s), 8984KiB/s-8984KiB/s (9199kB/s-9199kB/s), io=817MiB (856MB), run=93080-93080msec

Disk stats (read/write):
  sda: ios=713199/179059, merge=0/49, ticks=87086/49596, in_queue=24460, util=99.90%
fio /root/fio.cfg  15,54s user 52,89s system 50% cpu 2:16,37 total



root@grml /mnt # dd if=/dev/zero of=/mnt/test.img bs=512 count=10000 oflag=direct
10000+0 records in
10000+0 records out
5120000 bytes (5,1 MB, 4,9 MiB) copied, 2,23596 s, 2,3 MB/s

LXC:
Code:
root@testIO2:~# fio fio.cfg
iometer: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=64
fio-3.12
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
Starting 1 process
iometer: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=9.77MiB/s,w=2497KiB/s][r=3437,w=871 IOPS][eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=896: Thu Sep 17 13:37:02 2020
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read: IOPS=2523, BW=11.6MiB/s (12.1MB/s)(3279MiB/283324msec)
    slat (usec): min=9, max=43488, avg=43.13, stdev=112.37
    clat (usec): min=3, max=427620, avg=13665.87, stdev=17356.77
     lat (usec): min=34, max=427664, avg=13709.68, stdev=17357.20
    clat percentiles (usec):
     |  1.00th=[    38],  5.00th=[   955], 10.00th=[  1893], 20.00th=[  3032],
     | 30.00th=[  4228], 40.00th=[  5538], 50.00th=[  7701], 60.00th=[ 10290],
     | 70.00th=[ 14222], 80.00th=[ 20579], 90.00th=[ 32637], 95.00th=[ 46400],
     | 99.00th=[ 83362], 99.50th=[102237], 99.90th=[156238], 99.95th=[175113],
     | 99.99th=[229639]
   bw (  KiB/s): min= 5274, max=40016, per=100.00%, avg=11852.18, stdev=6322.35, samples=566
   iops        : min= 1494, max= 4540, avg=2521.46, stdev=411.07, samples=566
  write: IOPS=633, BW=2951KiB/s (3022kB/s)(817MiB/283324msec); 0 zone resets
    slat (usec): min=11, max=45328, avg=50.47, stdev=318.95
    clat (usec): min=3, max=235337, avg=46353.89, stdev=38956.60
     lat (usec): min=39, max=235384, avg=46405.07, stdev=38955.57
    clat percentiles (usec):
     |  1.00th=[    36],  5.00th=[    38], 10.00th=[    39], 20.00th=[    44],
     | 30.00th=[    59], 40.00th=[ 41681], 50.00th=[ 59507], 60.00th=[ 68682],
     | 70.00th=[ 74974], 80.00th=[ 82314], 90.00th=[ 91751], 95.00th=[ 99091],
     | 99.00th=[114820], 99.50th=[123208], 99.90th=[158335], 99.95th=[175113],
     | 99.99th=[198181]
   bw (  KiB/s): min= 1380, max= 9961, per=100.00%, avg=2951.56, stdev=1546.36, samples=566
   iops        : min=  394, max= 1136, avg=632.93, stdev=104.29, samples=566
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=6.97%, 100=2.96%
  lat (usec)   : 250=0.72%, 500=0.32%, 750=0.17%, 1000=0.38%
  lat (msec)   : 2=4.75%, 4=13.87%, 10=24.70%, 20=16.24%, 50=14.13%
  lat (msec)   : 100=13.45%, 250=1.33%, 500=0.01%
  cpu          : usr=4.07%, sys=16.98%, ctx=697715, majf=0, minf=1350
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=714864,179431,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=11.6MiB/s (12.1MB/s), 11.6MiB/s-11.6MiB/s (12.1MB/s-12.1MB/s), io=3279MiB (3439MB), run=283324-283324msec
  WRITE: bw=2951KiB/s (3022kB/s), 2951KiB/s-2951KiB/s (3022kB/s-3022kB/s), io=817MiB (856MB), run=283324-283324msec

Disk stats (read/write):
    dm-22: ios=761539/191271, merge=0/0, ticks=10495556/8923956, in_queue=19419512, util=100.00%, aggrios=761816/192369, aggrmerge=0/0, aggrticks=10489092/8980028, aggrin_queue=19469120, aggrutil=100.00%
    dm-3: ios=761816/192369, merge=0/0, ticks=10489092/8980028, in_queue=19469120, util=100.00%, aggrios=380908/96184, aggrmerge=0/0, aggrticks=5243502/4489760, aggrin_queue=9733262, aggrutil=100.00%
    dm-1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=714968/180352, aggrmerge=46931/12017, aggrticks=9763447/8332870, aggrin_queue=16502928, aggrutil=100.00%
  sdc: ios=714968/180352, merge=46931/12017, ticks=9763447/8332870, in_queue=16502928, util=100.00%
  dm-2: ios=761816/192369, merge=0/0, ticks=10487004/8979520, in_queue=19466524, util=100.00%




root@testIO2:~# dd if=/dev/zero of=/mnt/test.img bs=512 count=10000 oflag=direct
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.898708 s, 5.7 MB/s

So LXC is a lot faster in the dd test, but fio tells me that KVM seems to be faster. Both are running on the 4x 300 GB 15k SAS RAID 10 with 1 GB write cache, KVM set to VirtIO with writeback, otherwise a fully stock LXC/KVM setup.

Is this result to be expected? I'm a little confused here.

Thanks!

Greetings,
erfus
 
Is this result to be expected? I'm a little confused here.

First, to test the performance, you should use the same benchmark in both environments.

Whether LXC or KVM is faster depends on the backend storage. You have hardware RAID, and I suppose you also have LVM? Then in both scenarios you have a filesystem on a block-level device, so in theory they should be equally fast, though I would expect LXC to be a bit faster.

I would rerun the fio test, but with at least a 4K block size, which is the default for all ordinary filesystems. If you read or write less than that, you get read and write amplification, which can reduce your numbers.
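To put a number on that amplification: a sub-block write forces the filesystem (or thin-LVM layer) to read-modify-write a whole block, so only a fraction of the I/O moves useful data. A small sketch of the arithmetic:

```python
# Write amplification for sub-block writes: the storage stack rewrites a
# whole filesystem block even when only part of it changes.
FS_BLOCK = 4096  # bytes; the usual default for ext4/xfs

def amplification(write_size: int, block: int = FS_BLOCK) -> float:
    """Bytes of I/O actually performed divided by useful bytes written."""
    blocks_touched = -(-write_size // block)  # ceiling division
    return blocks_touched * block / write_size

print(amplification(512))   # 8.0 -> a 512 B write touches a full 4 KiB block
print(amplification(4096))  # 1.0 -> aligned 4 KiB write, no amplification
```

So the bs=512 dd runs above pay an 8x penalty before the controller even sees the request, on top of being fully serialized.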
 
Hi LnxBil,

Yes, I have a hardware RAID controller with 1 GB cache. Both LXC and KVM were on the RAID 10 array with lvm-thin; the KVM test was on ext4 with default parameters.
I've just set up a Debian machine in KVM again, with writeback enabled in Proxmox and cputype=host. Things seem to have improved a lot.

Just to confirm: "writeback" in Proxmox is safe to use with a write-back HW RAID controller with BBU? So it is "real" writeback, and not some kind of software/RAM thing? I didn't know it was possible to toggle writeback for a VM when the controller has it enabled. I always thought the OS has no control over the cache setting if WB is enabled on the controller.

Thanks!
 
I always thought the OS has no control over the cache setting if WB is enabled on the controller.

It does not, but this is an additional layer of write-back cache, so you have two layers. That improves performance, but it can be less safe.

According to the documentation:
cache=writeback

host do read/write cache
guest disk cache mode is writeback
Warning: you can lose data in case of a power failure
you need to use the barrier option in your Linux guest's fstab if the kernel is < 2.6.37 to avoid fs corruption in case of a power failure.

This mode causes qemu-kvm to interact with the disk image file or block device with neither O_DSYNC nor O_DIRECT semantics,
so the host page cache is used and writes are reported to the guest as completed when placed in the host page cache,
and the normal page cache management will handle commitment to the storage device.
Additionally, the guest's virtual storage adapter is informed of the writeback cache,
so the guest would be expected to send down flush commands as needed to manage data integrity.
Analogous to a raid controller with RAM cache.
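As a summary of how QEMU's cache modes map to host-side open(2) semantics (condensed from the QEMU documentation; a sketch, not exhaustive):

```python
# How QEMU's cache= settings translate to host-side open(2) flags on the
# image file, summarized from the QEMU docs. "unsafe" additionally ignores
# flush requests from the guest.
O_DIRECT, O_DSYNC = "O_DIRECT", "O_DSYNC"

CACHE_MODES = {
    "writeback":    set(),                 # host page cache used; guest must flush
    "none":         {O_DIRECT},            # bypass the host page cache
    "writethrough": {O_DSYNC},             # every write synced to storage
    "directsync":   {O_DIRECT, O_DSYNC},   # bypass cache AND sync each write
    "unsafe":       set(),                 # like writeback, but flushes dropped
}

# writeback keeps data in the host page cache until the guest flushes:
print(O_DIRECT in CACHE_MODES["writeback"])  # False
```

That is why writeback adds a second caching layer on top of the controller's BBU-protected cache: the host page cache sits in plain RAM with no battery behind it.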
 
Ah ok!
I got confused by the sentence
Note: The overview below is dependent of the specific hardware used, i.e. a HW Raid with a BBU backed disk cache works just fine with 'writeback' mode, so take it just as an general overview.
because I thought it meant that WB with BBU is okay.

So I guess writethrough is then the recommended setting for a controller with BBU, right? I'll do some tests again with the writethrough setting.
 