[SOLVED] Huge increase in I/O load on NVMe disks (at equal VM load) after upgrade from Ceph 12 to 15

lucaferr

Renowned Member
Jun 21, 2011
Hi! Last night we upgraded our production 9-node cluster from PVE 5.4 to PVE 6.4 and from Ceph 12 to 14 and then to 15 (Octopus), following the official tutorials. Everything went smoothly and all running VMs stayed online during the upgrade, so we're very happy with the operation. Now the cluster and Ceph are HEALTH_OK and stable, with no rebalancing or recovery in progress.
But our monitoring system (which is Zabbix-based) is telling us that the OSDs (all NVMe SSDs, 4 on each of the 9 nodes for a total of 36) frequently spike to 100% I/O activity. Analysing the data and comparing it with the data from a few days ago, we realised that although the read/write bandwidth from the VMs to Ceph and the IOPS are similar (the VMs have the same load they had a few days ago), the individual NVMe SSDs are doing a much higher number of reads and writes (by a factor of 50x!).
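(For reference, the graphs can be cross-checked directly on a node with plain iostat, assuming the sysstat package is installed; the 5-second interval is just an example:

# iostat -x 5

The r/s, w/s and %util columns for the nvme devices should roughly match what Zabbix reports.)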
I'm afraid this will greatly accelerate SSD wear and, under high VM load, also slow down performance (for now client-side performance remains good, but August is not a busy month and the VMs are very underutilised).
Curiously, traffic on the 10 Gb/s network dedicated to Ceph did not increase at all (so what data is Ceph continuously reading and writing on the OSDs? Is it only moving data between the OSDs within each node? Maybe doing some kind of internal format conversion?)
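(One check that might help narrow this down, just as a sketch and again assuming sysstat is available, is per-process disk I/O on a node:

# pidstat -d 5

If the ceph-osd processes dominate the kB_wr/s column while client IOPS stay flat, the writes are being generated internally by the OSDs rather than by VM traffic.)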
Do you have any ideas? Thank you very much!
 
possibly this (https://docs.ceph.com/en/latest/releases/octopus/#v15-2-0-octopus):
5. Upgrade all OSDs by installing the new packages and restarting the ceph-osd daemons on all OSD hosts:

# systemctl restart ceph-osd.target

Note that the first time each OSD starts, it will do a format conversion to improve the accounting for “omap” data. This may take a few minutes to as much as a few hours (for an HDD with lots of omap data).

?
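(To check whether that conversion is still pending, something along these lines should work, assuming default Octopus settings; adjust if you have changed the option:

# ceph health detail | grep -i omap
# ceph config get osd bluestore_fsck_quick_fix_on_mount

If no BLUESTORE_NO_PER_POOL_OMAP warning is reported and the cluster is HEALTH_OK, the omap format conversion should already be finished.)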
 
Hi Fabian, the first OSD restart did take 5 to 10 minutes, during which the upgraded OSDs were down and the CPU on the node was very high... but then, once all the OSDs came back up on all nodes and Ceph returned to HEALTH_OK, I assumed the upgrade process had completed. Did I assume wrong? Also, more than 24 hours have passed since the update and these are fast NVMe drives with 2 TB capacity each, so I would guess that any adjustment process would have finished within 24 hours... unless it had to do some sort of indexing or other optimisation. Has this happened to anyone else?

PS: ceph versions shows everything already upgraded to 15.2.13:
(screenshot of the ceph versions output attached)
 
I apologise, the monitoring system was reporting incorrect I/O data because the iostat output it parses changed between PVE 5 and PVE 6. So false alarm, PVE 6 with Ceph Octopus works perfectly!
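(For anyone with a similar setup: the sysstat version shipped with Debian 10 / PVE 6 changed the column layout of iostat -x compared to Debian 9 / PVE 5, so a Zabbix item that parses fields by position can silently pick up the wrong values. A quick way to compare, assuming sysstat is installed:

# iostat -x 1 3

then check the column headers against the fields the monitoring template expects.)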
I'm marking the topic as [SOLVED].
 
