[SOLVED] Ceph: bad performance only from VMs and CTs

Dec 24, 2019
Hello,

We are currently running Proxmox on a 4-node cluster with Ceph storage: 2 OSDs of 420 GiB on each node, except on the last node (2 x 900 GiB).
Our Ceph cluster communicates on a dedicated VLAN over a 1 Gb/s vRack (OVH, French hosting provider).
Since the 10th of December we have been experiencing slowness on the Ceph storage.
We added a 4th node to the Proxmox and Ceph clusters.
Then we upgraded from Proxmox VE 5 to Proxmox VE 6 on the 10th of December.
We finished the Ceph upgrade on the 23rd of December.

The slowness is only present in VMs and CTs, not on Proxmox itself.
In fact, if we move a disk or a volume, or run a backup, the Ceph storage is fine (W > 300 MiB/s).
But as soon as we copy a file inside a VM, the Ceph storage is very, very slow (R: 8 MiB/s, W: 4 MiB/s).
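For what it's worth, the host-side numbers above come from disk moves and backups; a more direct comparison would be to benchmark the pool itself from one of the Proxmox nodes and then run a benchmark from inside a guest. A rough sketch of the node-side part (the pool name is just a placeholder, not our real one):

# on a Proxmox node: raw pool throughput, write then sequential read
rados bench -p vm-pool 60 write --no-cleanup
rados bench -p vm-pool 60 seq
rados -p vm-pool cleanup

If the node-level numbers stay high while the in-guest numbers collapse, the problem would sit somewhere between librbd/KRBD and the guest rather than in the OSDs or the network.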

The question we are asking ourselves is how to explain such a large difference between Proxmox accessing Ceph directly and a VM accessing the same storage.
Would you have any leads on this subject?
Of course, any advice would also be appreciated.
Thanks in advance.
 


Could you try to benchmark this with a tool like fio?
 
Hi,

All my problems are now solved.

I executed the following command on my VM last week:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=2G --readwrite=randrw --rwmixread=75

Normally, I get these numbers (or even higher):
read = 24000 IOPS
write = 8000 IOPS

Since the 10th of December, I have been getting:
read : io=1534.2MB, bw=6505.3KB/s, iops=1626, runt=241507msec
write: io=526144KB, bw=2178.7KB/s, iops=544, runt=241507msec
While fio is running, it can get stuck at 0 IOPS for 15 seconds at a time.

To solve my problems, I took the 2 OSDs of the last node out of the cluster, created a ZFS storage on them, and migrated all VMs from Ceph to that ZFS storage. Then I destroyed the whole Ceph cluster, purged all its configuration, and rebuilt the Ceph storage.
Since reinstalling and reconfiguring the Ceph storage, my performance is back to normal:

read = 22513 IOPS
write = 7540 IOPS
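For anyone curious, the rebuild roughly followed this sequence; the OSD IDs, disk paths, network and storage names below are illustrative examples, not my exact ones:

# take the two OSDs of the last node out of Ceph and destroy them
ceph osd out osd.6 osd.7
systemctl stop ceph-osd@6 ceph-osd@7
pveceph osd destroy 6
pveceph osd destroy 7

# build a temporary ZFS pool on the freed disks and register it as a Proxmox storage
zpool create tank mirror /dev/sdc /dev/sdd
pvesm add zfspool tank-zfs --pool tank

# move every VM disk off Ceph (repeat per VM and per disk)
qm move_disk 101 scsi0 tank-zfs --delete 1

# once nothing uses Ceph anymore: remove the Ceph configuration and rebuild it
pveceph purge
pveceph install
pveceph init --network 10.0.0.0/24
pveceph mon create
pveceph osd create /dev/sdb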

So unfortunately, I never found out why the Ceph performance was so bad; I suspect a misconfiguration during the upgrade from Ceph Luminous to Nautilus.
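In case it helps someone hitting the same thing: I can't say which step was missing on my cluster, but the finalization steps from the official Luminous -> Nautilus upgrade notes would be worth double-checking, for example:

# every mon/mgr/osd should report a nautilus version
ceph versions
# both of these are part of the official post-upgrade steps
ceph osd require-osd-release nautilus
ceph mon enable-msgr2
# look for leftover warnings (e.g. legacy BlueStore stats reporting)
ceph -s
ceph health detail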

My problem is now solved. Thanks.
 
First, I didn't reinstall everything; I only rebuilt the Ceph storage.
Next, all my clients were on an infrastructure where storage was so slow (unusable) that I had to move them to non-shared storage, where HA was no longer available. Unfortunately, my infrastructure is quite busy at the end of the year, so it was the wrong period for this.
I spent 2 weeks with these problems, running many tests, with no solution from the forum or from official Proxmox support.

I consider that my clients and I have suffered enough; when I find a potential solution to my problems, I test it, and if it works, I use it.
Thanks for your useful message.
 
