Proxmox Ceph Cluster Optimizations?

ByteArchitect

Hello everyone,

I am running a Proxmox Ceph cluster with three nodes configured as follows:
  • Ceph: 2×10G SFP+ (Active/Backup)
  • VMs: 2×10G copper (Active/Backup)
  • Corosync: 2×1Gbit copper (dedicated ports)
  • WebUI: 2×1Gbit copper (Active/Backup) → these ports were still free, so I use them for the WebUI with redundancy.
As storage, I use two local SSDs in a ZFS RAID1 mirror for the operating system. The OSDs consist of three 3.84TB datacenter PCIe 4.0 NVMe SSDs per node. I am aware that I will never achieve the full speed of the NVMes; however, they were hardly more expensive than regular SATA 6Gb/s datacenter SSDs, and for future upgrades it seemed sensible to go straight for them.

Optimization of my Ceph cluster

I have already read some posts in the forum and considered the following points for optimization:
  • Enable KRBD on the Ceph storage (requires cold reboot of VMs) - I don't currently use this
  • VMs: Use SCSI + virtio-scsi-single - I already use this
  • Enable write-back cache, set SSD option, enable discard, use IO thread - I already use this
  • Change Async IO from default (io_uring) to threads - I don't use this at the moment
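For reference, a minimal CLI sketch of how these options could be applied (the storage name "ceph-vm", the VM ID 100 and the disk name are placeholders, not my actual values):

# enable KRBD on the RBD storage (running VMs need a full stop/start to pick it up)
pvesm set ceph-vm --krbd 1

# use the virtio-scsi-single controller for the VM
qm set 100 --scsihw virtio-scsi-single

# disk options: write-back cache, SSD emulation, discard, dedicated IO thread, async IO via threads
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,cache=writeback,ssd=1,discard=on,iothread=1,aio=threads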

I am unsure whether these changes will actually bring noticeable improvements.

Question about stability during a complete shutdown

In an earlier test cluster I had problems because Ceph could no longer heal itself. I received error messages like:
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon1 (1)10.12.12.102:6789 socket closed (con state V1_BANNER)
libceph: mon0 (1)10.12.12.101:6789 socket closed (con state OPEN)
libceph: mon2 (1)10.12.12.103:6789 socket closed (con state OPEN)
ceph: No mds Server is up or the cluster is laggy
In the end, I reinstalled the entire Proxmox cluster. To be fair, the switches had a firmware problem back then (they kept restarting on their own due to an RSTP bug; I am now running a pre-release firmware version that fixes it). I had also removed and deleted a disk via Ceph as a test, and I suspect that was the actual trigger for the problem.
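As far as I understand, the clean way to remove an OSD is to take it out first and wait for the rebalance before destroying it - roughly like this, with osd.3 as an example ID:

ceph osd out osd.3
# wait until "ceph -s" shows all PGs active+clean again
systemctl stop ceph-osd@3
pveceph osd destroy 3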

Performance is currently this:
[benchmark screenshot attached]
I am hoping for a performance increase with KRBD, as described in some other posts, e.g. Rocket Fly.

My question is:
Can I shut down a Proxmox Ceph cluster completely without hesitation?
In my tests with the new cluster, this has worked without any problems so far. I stopped all VMs, shut down the nodes cleanly one after the other and then restarted them in the same order - without any noticeable problems.
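From what I have read, the recommended approach for a planned full shutdown is to stop all VMs first and set the Ceph maintenance flags, so the cluster does not try to rebalance while the nodes go down - roughly:

# before shutting down the nodes (run on any node)
ceph osd set noout
ceph osd set norebalance

# ... stop all VMs, shut down the nodes one after the other, then power them back on ...

# after all nodes are back up and "ceph -s" is healthy again
ceph osd unset norebalance
ceph osd unset noout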

Expectations & conclusion
As I would like to move this cluster into a production environment, I would appreciate your feedback on my optimizations and any concerns. I am particularly interested in opinions from experienced users - and of course input from a Proxmox employee would also be very welcome.

Many thanks in advance!
 