Improve VM guest disk performance (Ceph, 10 GBE, Qemu, Virtio)

Mario Minati · Jun 23, 2019

Hello @all,
we are running a Proxmox cluster with five nodes. Three of them are used for Ceph and provide two pools, one on HDDs and the other on SSDs. The other two nodes are used for virtualization with QEMU.
We have redundant 10 GbE storage networks and redundant 10 GbE Ceph networks.
The nodes are equipped with dual CPUs and between 96 and 128 GB of RAM. The three Ceph nodes are completely identical.

We have read a lot of Proxmox docs and this forum and spent hours googling, but we have not found a solution for our performance troubles yet.

We are using the latest Proxmox:

Code:

# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-21
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

We ran rados benchmarks from our virtualization host against our Ceph HDD pool and got the following results.
Write:

Code:

# rados -p pub.hdd.bench bench -b 4M 60 write -t 16 --no-cleanup
[...]
Total time run: 60.571563
Total writes made: 1715
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 113.254
Stddev Bandwidth: 40.2683
Max bandwidth (MB/sec): 176
Min bandwidth (MB/sec): 0
Average IOPS: 28
Stddev IOPS: 10
Max IOPS: 44
Min IOPS: 0
Average Latency(s): 0.564394
Stddev Latency(s): 0.343622
Max latency(s): 2.84305
Min latency(s): 0.0969665

Read:

Code:

# rados -p pub.hdd.bench bench 60 seq -t 16 --no-cleanup
[...]
Total time run: 17.727840
Total reads made: 1715
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 386.962
Average IOPS: 96
Stddev IOPS: 21
Max IOPS: 135
Min IOPS: 48
Average Latency(s): 0.163484
Max latency(s): 1.54406
Min latency(s): 0.0274371

The maximum latency is a little high, but that is not the focus of this thread.

The OSD tree is identical on all nodes:

Code:

# ceph osd tree
ID CLASS WEIGHT   TYPE NAME                 STATUS REWEIGHT PRI-AFF
-1       35.36691 root default                                   
-3       11.78897     host pub-ceph-node-01                       
 0   hdd  5.45789         osd.0                 up  1.00000 1.00000
 1   hdd  5.45789         osd.1                 up  1.00000 1.00000
 8   ssd  0.87320         osd.8                 up  1.00000 1.00000
-5       11.78897     host pub-ceph-node-02                       
 2   hdd  5.45789         osd.2                 up  1.00000 1.00000
 3   hdd  5.45789         osd.3                 up  1.00000 1.00000
 7   ssd  0.87320         osd.7                 up  1.00000 1.00000
-7       11.78897     host pub-ceph-node-03                       
 4   hdd  5.45789         osd.4                 up  1.00000 1.00000
 5   hdd  5.45789         osd.5                 up  1.00000 1.00000
 6   ssd  0.87320         osd.6                 up  1.00000 1.00000

On our first virtualization server we have eight Linux guests and two Windows guests. The QEMU guest agent is activated on all guests. All guest disks are created as VirtIO drives and are stored on our HDD pool.

A Linux guest configuration looks like this:

Code:

# qm config 402
agent: 1
balloon: 0
boot: cdn
bootdisk: virtio0
cores: 2
ide2: none,media=cdrom
memory: 16384
name: hbm-srv-02
net0: virtio=52:54:00:6a:24:0a,bridge=vmbr0
net1: virtio=A2:64:0E:18:02:27,bridge=vmbr1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=c1587fd0-0b8a-4a84-9d4a-b9b1b919d3c5
sockets: 2
virtio0: pub.hdd.vm:vm-402-disk-0,cache=writeback,iothread=1,size=30G
virtio1: pub.hdd.vm:vm-402-disk-1,cache=writeback,iothread=1,size=500G
vmgenid: bbdb6d92-959f-41fc-951e-442c4cdf3626

Running a fio benchmark in the guest with the configuration above, while there was almost no traffic on the other guests, gives the following results:

Code:

# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/var/fio.tmp --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [m(1)] [99.7% done] [56044KB/19020KB/0KB /s] [14.2K/4755/0 iops] [eta 00m:03s]
test: (groupid=0, jobs=1): err= 0: pid=17921: Sun Jun 23 21:21:35 2019
  read : io=6142.3MB, bw=6373.6KB/s, iops=1593, runt=986843msec
  write: io=2049.8MB, bw=2126.1KB/s, iops=531, runt=986843msec
  cpu          : usr=1.45%, sys=4.28%, ctx=1218785, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1572409/w=524743/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=6142.3MB, aggrb=6373KB/s, minb=6373KB/s, maxb=6373KB/s, mint=986843msec, maxt=986843msec
  WRITE: io=2049.8MB, aggrb=2126KB/s, minb=2126KB/s, maxb=2126KB/s, mint=986843msec, maxt=986843msec

Disk stats (read/write):
  vdb: ios=1572293/525175, merge=0/16, ticks=62668876/392852, in_queue=65241904, util=100.00%

This looks like we are losing quite a bit of disk performance. But why?
We tried switching the guests to SCSI disk access, but that did not improve anything compared to VirtIO.
We have activated the extra I/O thread for each disk and set the cache mode to writeback for best performance.

What else can we do to improve the disk performance?

How much of the host's bandwidth should one expect within a guest?

Why is the %util value 100% during the fio test? Is this a hint at the source of the problem?
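On the %util question, a rough sketch rather than a diagnosis: util=100% only says that the disk had at least one request outstanding for the whole run, which is expected with iodepth=64. The numbers in the fio output also allow a latency estimate. The snippet assumes that `ticks` in the disk stats is the total time in milliseconds that requests of each direction spent in the block layer, and it uses Little's law (mean time in the queue = requests in flight / throughput) as a cross-check. All variable names are made up for this illustration.

```python
# Values copied from the fio output above.
IODEPTH = 64
READ_IOPS, WRITE_IOPS = 1593, 531
READ_IOS, WRITE_IOS = 1572293, 525175              # vdb: ios=read/write
READ_TICKS_MS, WRITE_TICKS_MS = 62668876, 392852   # vdb: ticks=read/write (assumed ms)

# Little's law: mean time per request = requests in flight / throughput.
little_ms = IODEPTH / (READ_IOPS + WRITE_IOPS) * 1000

# Mean time per request derived from the disk stats.
read_ms = READ_TICKS_MS / READ_IOS
write_ms = WRITE_TICKS_MS / WRITE_IOS
combined_ms = (READ_TICKS_MS + WRITE_TICKS_MS) / (READ_IOS + WRITE_IOS)

print(f"Little's law estimate: {little_ms:.1f} ms per request")
print(f"disk stats, combined:  {combined_ms:.1f} ms per request")
print(f"disk stats, reads:     {read_ms:.1f} ms")
print(f"disk stats, writes:    {write_ms:.2f} ms")
```

Both estimates land at about 30 ms per request. Split by direction, reads take roughly 40 ms each while writes complete in under 1 ms, which would fit writes being absorbed by the writeback cache and random reads waiting on the HDDs. That interpretation is an assumption, not something the output proves.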

Any help or ideas are welcome.

Best regards,

Mario Minati

sb-jw · Jun 23, 2019

What about your pools, PGs, replication rules, etc.?

Normally I would recommend using only SSDs instead of HDDs. Your results do not look too bad to me; they are about what I would expect from the hardware behind them. A faster network doesn't help you when the disks cannot deliver that performance.
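To put the network point into numbers, here is a minimal sketch that compares the rados bench results from the first post with the raw capacity of a single 10 GbE link. It ignores protocol overhead and replication traffic, and it treats the "MB/sec" reported by rados bench as MiB/s (1715 objects of 4194304 bytes in 60.57 s gives 113.25 only when dividing by 1024 * 1024). The function name is made up for this illustration.

```python
# Raw capacity of one 10 GbE link, ignoring protocol overhead.
LINK_GBIT = 10
MIB = 1024 * 1024
link_bytes_per_s = LINK_GBIT * 1e9 / 8   # 1.25e9 bytes/s


def link_share(mib_per_s: float) -> float:
    """Fraction of the link used by a stream of mib_per_s MiB/s."""
    return mib_per_s * MIB / link_bytes_per_s


# Bandwidth values from the rados bench output in the first post.
write_share = link_share(113.254)
read_share = link_share(386.962)

print(f"write: {write_share:.1%} of one 10 GbE link")
print(f"read:  {read_share:.1%} of one 10 GbE link")
```

Under these assumptions the write benchmark uses a bit under a tenth of one link and the read benchmark about a third, so the network is not what limits these results.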

dcsapak · Jun 24, 2019

Mario Minati said:
This looks like we are losing quite a bit of disk performance. But why?

AFAICS you are comparing apples and oranges

the radosbench tests with 4M blocksize =>

write:

Bandwidth (MB/sec): 113.254
Average IOPS: 28

read:

Bandwidth (MB/sec): 386.962
Average IOPS: 96

while your fio tests with 4k size:

read : io=6142.3MB, bw=6373.6KB/s, iops=1593, runt=986843msec
write: io=2049.8MB, bw=2126.1KB/s, iops=531, runt=986843msec

with a smaller block size you get less bandwidth, but more IOPS
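The relation between the two sets of numbers is simply bandwidth = IOPS x block size. A small sketch with the values quoted above, assuming both tools count in units of 1024 (which matches the totals they report); the helper name is made up for this illustration:

```python
KIB = 1024
MIB = 1024 * KIB


def bandwidth(iops: float, block_bytes: int) -> float:
    """Bytes per second moved at a given IOPS rate and block size."""
    return iops * block_bytes


# rados bench, 4M objects
rados_write_mib = bandwidth(28, 4 * MIB) / MIB
rados_read_mib = bandwidth(96, 4 * MIB) / MIB

# fio, 4k blocks
fio_read_kib = bandwidth(1593, 4 * KIB) / KIB
fio_write_kib = bandwidth(531, 4 * KIB) / KIB

# 4k IOPS that would be needed to match the rados write bandwidth
iops_needed = 113.254 * MIB / (4 * KIB)

print(f"rados write: {rados_write_mib:.0f} MiB/s, read: {rados_read_mib:.0f} MiB/s")
print(f"fio read: {fio_read_kib:.0f} KiB/s, write: {fio_write_kib:.0f} KiB/s")
print(f"4k IOPS needed for 113 MiB/s: {iops_needed:.0f}")
```

The computed values match the reported bandwidths up to rounding of the IOPS figures. Matching the 4M write bandwidth with 4k requests would take roughly 29,000 IOPS, which is why the two benchmarks cannot be compared by bandwidth alone.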

Alwin · Jun 25, 2019

I agree with @sb-jw and @dcsapak, but I want to add our Ceph benchmark paper for reference.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

Another possible optimization (though a very small one) might be to pin the SSD OSDs as the primary OSDs for the PGs.
https://ceph.com/geen-categorie/ceph-get-the-best-of-your-ssd-with-primary-affinity/

Mario Minati · Oct 26, 2019

As we learned yesterday from the nice guys at inett.de (Saarbrücken, Germany), we have chosen the wrong SSD types for the Ceph DB storage. We will report how things evolve.

spirit · Oct 26, 2019

Mario Minati said:
As we learned yesterday from the nice guys at inett.de (Saarbrücken, Germany), we have chosen the wrong SSD types for the Ceph DB storage. We will report how things evolve.

See also this old article (still true: consumer SSDs suck at synchronous writes)
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
