Proxmox Ceph Cluster Optimizations?

ByteArchitect

Hello everyone,

I am running a Proxmox Ceph cluster with three nodes configured as follows:
  • Ceph: 2×10G SFP+ (Active/Backup)
  • VMs: 2×10G copper (Active/Backup)
  • Corosync: 2×1Gbit copper (dedicated ports)
  • WebUI: 2×1Gbit copper (Active/Backup) → these ports were still free, so I use them for the WebUI with redundancy.
As storage, I use two local SSDs in a ZFS RAID1 mirror for the operating system. The OSDs consist of three 3.84TB datacenter PCIe 4.0 NVMe SSDs per node. I am aware that I will never achieve the full speed of the NVMes; however, they were hardly more expensive than regular SATA 6Gb/s datacenter SSDs, and for future upgrades it seemed sensible to go straight for them.

Optimization of my Ceph cluster

I have already read some posts in the forum and considered the following points for optimization:
  • Enable KRBD on the Ceph storage (requires cold reboot of VMs) - I don't currently use this
  • VMs: Use SCSI + virtio-scsi-single - I already use this
  • Enable write-back cache, set SSD option, enable discard, use IO thread - I already use this
  • Change Async IO from default (io_uring) to threads - I don't use this at the moment
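For reference, a minimal CLI sketch of how these options could be applied (the storage name "ceph-vm", the VM ID 100 and the disk name are placeholders, not my actual values):

# enable KRBD on the RBD storage (running VMs need a full stop/start to pick it up)
pvesm set ceph-vm --krbd 1

# use the virtio-scsi-single controller for the VM
qm set 100 --scsihw virtio-scsi-single

# disk options: write-back cache, SSD emulation, discard, dedicated IO thread, async IO via threads
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,cache=writeback,ssd=1,discard=on,iothread=1,aio=threads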

I am unsure whether these changes will actually bring noticeable improvements.

Question about stability during a complete shutdown

In an earlier test cluster I had problems because Ceph could no longer heal itself. I received error messages like:
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon0 (1)10.12.12.101:6789 socket closed (con state OPEN)
libceph: mon2 (1)10.12.12.103:6789 socket closed (con state OPEN)
ceph: No mds Server is up or the cluster is laggy
In the end, I reinstalled the entire Proxmox cluster. To be fair, the switches had a firmware problem back then (they kept restarting on their own due to an RSTP bug; I am now running a pre-release firmware version that fixes it). I had also removed and deleted a disk via Ceph as a test, and I suspect that was the actual trigger for the problem.
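As far as I understand, the clean way to remove an OSD is to take it out first and wait for the rebalance before destroying it - roughly like this, with osd.3 as an example ID:

ceph osd out osd.3
# wait until "ceph -s" shows all PGs active+clean again
systemctl stop ceph-osd@3
pveceph osd destroy 3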

Performance is currently this:
[benchmark screenshot attached]
I am hoping for a performance increase with KRBD, as described in some other posts, e.g. Rocket Fly.

My question is:
Can I shut down a Proxmox Ceph cluster completely without hesitation?
In my tests with the new cluster, this has worked without any problems so far. I stopped all VMs, shut down the nodes cleanly one after the other and then restarted them in the same order - without any noticeable problems.
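From what I have read, the recommended approach for a planned full shutdown is to stop all VMs first and set the Ceph maintenance flags, so the cluster does not try to rebalance while the nodes go down - roughly:

# before shutting down the nodes (run on any node)
ceph osd set noout
ceph osd set norebalance

# ... stop all VMs, shut down the nodes one after the other, then power them back on ...

# after all nodes are back up and "ceph -s" is healthy again
ceph osd unset norebalance
ceph osd unset noout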

Expectations & conclusion
As I would like to move this cluster into a production environment, I would appreciate your feedback on my optimizations and any concerns. I am particularly interested in opinions from experienced users - and of course input from a Proxmox employee would also be very welcome.

Many thanks in advance!
 