Hi,
We've been working through upgrading individual nodes to PVE 6.3 and have now extended this to include hyperconverged Ceph cluster nodes. The upgrades themselves went very smoothly, but the recommendation to set all storage pools to autoscale can cause some headaches.
The last paragraph of the Ceph Nautilus to Octopus upgrade notes (https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus) references the following command:
ceph osd pool set POOLNAME pg_autoscale_mode on
or, to apply to all pools:
Code:
for f in $(ceph osd pool ls); do ceph osd pool set $f pg_autoscale_mode on; done
The autoscaler holds off changing placement group counts, only acting once the current value is more than a factor of 3 away from what it recommends. Whilst this sounds great, it does mean that the resulting changes are relatively massive (typically by a factor of 4). The unexpected consequence of such large changes is that the cluster will be in a degraded state for a considerable amount of time and consequently won't trim the monitor database stores.
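If you'd rather see what the autoscaler intends to do before letting it act, 'ceph osd pool autoscale-status' lists the current and proposed PG counts per pool, and setting the mode to 'warn' instead of 'on' only raises a health warning rather than changing anything (the exact output columns vary between releases):
Code:
ceph osd pool autoscale-status
ceph osd pool set POOLNAME pg_autoscale_mode warn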
This may lead to the /var/lib/ceph/mon/*/store.db directory growing until it consumes all available space, subsequently freezing all I/O. Attempting to compact the monitors and simply restarting them didn't help. It's really annoying that Ceph doesn't trim the monitor databases after each incremental change...
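A simple way to keep an eye on the store size on each node whilst the cluster is rebalancing is something along these lines:
Code:
watch -n 60 du -sh /var/lib/ceph/mon/*/store.db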
Is anyone aware of a command to force trimming of unnecessary old revisions from the monitor databases in these situations? We resorted to temporarily attaching SSDs in USB 3 casings, stopping the monitor, copying the content of the 'store.db' directory over and then mounting the drive on that directory (don't forget to change ownership of /var/lib/ceph/mon/*/store.db after mounting).
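For anyone needing to do the same, the procedure looked roughly like this; the monitor ID 'pve1' and device '/dev/sdX1' are placeholders for whatever applies on your node:
Code:
systemctl stop ceph-mon@pve1
mount /dev/sdX1 /mnt                                     # temporary SSD in a USB 3 casing
cp -a /var/lib/ceph/mon/ceph-pve1/store.db/. /mnt/       # copy the existing store content across
umount /mnt
mount /dev/sdX1 /var/lib/ceph/mon/ceph-pve1/store.db     # mount the SSD over the store.db directory
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve1/store.db  # ceph-mon runs as the 'ceph' user
systemctl start ceph-mon@pve1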
The Ceph monitor directory is typically around 1 GiB but grew to over 65 GiB whilst reducing placement groups from 256 to 64 on an 18 OSD cluster.
Whilst I've read that Ceph won't trim the monitor store's osdmap revisions whilst the cluster is unhealthy, the store.db directory shrank from 65 GiB to 3 GiB once we'd transferred it to a larger volume and restarted the monitor service. We had tried restarting the monitor process before moving the content and this hadn't changed anything; it's almost as if running out of space crashed the monitors in a state where they only needed a relatively tiny amount of additional space to trim the old data. Running a manual monitor compaction (ceph tell mon.[ID] compact) typically rewrites the database files, temporarily requiring twice their initial space, but this did not happen when we started the monitors on the larger volume after they had crashed from running out of space.
We subsequently stopped the monitor process again straight away and transferred the directory back to its original volume, as the NVMe volume was substantially faster. We ended up repeating this cycle three times, but no matter how many times we tried to compact the database or restart the monitors (yes, we even added the 'mon compact on start = true' switch to ceph.conf) the storage utilisation would grow to nearly double before reducing to virtually the same starting size. If, however, we waited for a monitor to run out of space and then repeated the above, the store shrank when we subsequently started the monitor with space available.
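For reference, the compact-on-start switch we tried sits in the [mon] section of the cluster's ceph.conf (on Proxmox that's /etc/pve/ceph.conf):
Code:
[mon]
        mon compact on start = true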
Regards
David Herselman