[SOLVED] Ceph: bad performance only from VMs and CTs

Dec 24, 2019
Hello,

We are currently running Proxmox on a 4-node cluster with Ceph storage: 2 OSDs of 420 GiB on each node, except on the last node (2 x 900 GiB).
Our Ceph cluster communicates on a dedicated VLAN over a 1 Gb/s vRack (OVH, French hosting provider).
Since the 10th of December we have been experiencing slowness on the Ceph storage.
We added a 4th node to the Proxmox and Ceph clusters.
Then we upgraded from Proxmox VE 5 to Proxmox VE 6 on the 10th of December.
We finished the Ceph upgrade on the 23rd of December.

The slowness is only present in VMs and CTs, not on Proxmox itself.
In fact, if we move a disk or a volume, or run a backup, the Ceph storage is fine (W > 300 MiB/s).
But as soon as we copy a file inside a VM, the Ceph storage is very, very slow (R: 8 MiB/s, W: 4 MiB/s).
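For what it's worth, the host-side numbers above come from disk moves and backups; a more direct comparison would be to benchmark the pool itself from one of the Proxmox nodes and then run a benchmark from inside a guest. A rough sketch of the node-side part (the pool name is just a placeholder, not our real one):

# on a Proxmox node: raw pool throughput, write then sequential read
rados bench -p vm-pool 60 write --no-cleanup
rados bench -p vm-pool 60 seq
rados -p vm-pool cleanup

If the node-level numbers stay high while the in-guest numbers collapse, the problem would sit somewhere between librbd/KRBD and the guest rather than in the OSDs or the network.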

The question we are asking ourselves is how to explain such a large difference between Proxmox accessing Ceph directly and a VM accessing the same storage.
Would you have any leads on this subject?
Of course, any advice would also be appreciated.
Thanks in advance.
 


Could you try to benchmark this with a tool like fio?
 
Hi,

All my problems are now solved.

I executed the following command on my VM last week:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=2G --readwrite=randrw --rwmixread=75

Normally, I get these numbers (or even higher):
read = 24000 IOPS
write = 8000 IOPS

Since the 10th of December, I have been getting:
read : io=1534.2MB, bw=6505.3KB/s, iops=1626, runt=241507msec
write: io=526144KB, bw=2178.7KB/s, iops=544, runt=241507msec
While fio is running, it can get stuck at 0 IOPS for 15 seconds at a time.

To solve my problems, I took the 2 OSDs of the last node out of the cluster, created a ZFS storage on them, and migrated all VMs from Ceph to that ZFS storage. Then I destroyed the whole Ceph cluster, purged all its configuration, and rebuilt the Ceph storage.
Since reinstalling and reconfiguring the Ceph storage, my performance is back to normal:

read = 22513 IOPS
write = 7540 IOPS
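For anyone curious, the rebuild roughly followed this sequence; the OSD IDs, disk paths, network and storage names below are illustrative examples, not my exact ones:

# take the two OSDs of the last node out of Ceph and destroy them
ceph osd out osd.6 osd.7
systemctl stop ceph-osd@6 ceph-osd@7
pveceph osd destroy 6
pveceph osd destroy 7

# build a temporary ZFS pool on the freed disks and register it as a Proxmox storage
zpool create tank mirror /dev/sdc /dev/sdd
pvesm add zfspool tank-zfs --pool tank

# move every VM disk off Ceph (repeat per VM and per disk)
qm move_disk 101 scsi0 tank-zfs --delete 1

# once nothing uses Ceph anymore: remove the Ceph configuration and rebuild it
pveceph purge
pveceph install
pveceph init --network 10.0.0.0/24
pveceph mon create
pveceph osd create /dev/sdb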

So unfortunately, I never found out why the Ceph performance was so bad; I suspect a misconfiguration during the upgrade from Ceph Luminous to Nautilus.
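In case it helps someone hitting the same thing: I can't say which step was missing on my cluster, but the finalization steps from the official Luminous -> Nautilus upgrade notes would be worth double-checking, for example:

# every mon/mgr/osd should report a nautilus version
ceph versions
# both of these are part of the official post-upgrade steps
ceph osd require-osd-release nautilus
ceph mon enable-msgr2
# look for leftover warnings (e.g. legacy BlueStore stats reporting)
ceph -s
ceph health detail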

My problem is now solved. Thanks.
 
First, I didn't reinstall everything; I only rebuilt the Ceph storage.
Next, all my clients were on an infrastructure where storage was so slow (unusable) that I had to move them to non-shared storage, where HA was no longer available. Unfortunately, my infrastructure is quite busy at the end of the year, so it was the wrong period for this.
I spent 2 weeks with these problems, running many tests, with no solution from the forum or from official Proxmox support.

I consider that my clients and I have suffered enough; when I find a potential solution to my problems, I test it, and if it works, I use it.
Thanks for your useful message.
 
