We have been running Proxmox VE since 5.0 (now on 6.4-15) and we have noticed performance degradation whenever there is heavy reading/writing.
We have 9 nodes, 7 of them with Ceph (14.2.22) and 56 OSDs in total (8 per node). The OSDs are hard drives (HDD), WD Gold or better (4-12 TB). Nodes have 64/128 GB of RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s to each OSD, with a ±10 MB/s spread during normal operations. Apply/commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and about a third staying below 20 ms.
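In case the exact invocation matters, this is roughly what we run (the byte counts in the second command are example values, not our exact test):

# default: write 1 GiB in 4 MiB blocks to one OSD, reports throughput
ceph tell osd.0 bench
# smaller 4 KiB blocks; 12288000 bytes is the default cap for this block size
ceph tell osd.0 bench 12288000 4096
# per-OSD apply/commit latency snapshot, which is where we read the numbers above
ceph osd perf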
The front (public) and back (cluster) networks are both 1 Gbps, separated into VLANs. We are trying to move to 10 Gbps, but we ran into problems we are still trying to solve (OSDs disconnecting intermittently).
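For clarity, the VLAN split is just the standard ceph.conf pair; the subnets below are placeholders, not our real ones. The ping line is the jumbo-frame check we use while chasing the 10 Gbps flapping (assuming MTU 9000 on that path):

[global]
    public_network  = 192.168.10.0/24   # front: clients/VMs (placeholder)
    cluster_network = 192.168.20.0/24   # back: OSD replication (placeholder)

# 8972 = 9000 MTU minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation
ping -M do -s 8972 <back-network-IP-of-another-node>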
The pool is defined as replicated with 3 copies (size 3, min_size 2). Total raw capacity is currently 305 TB (72% used); reweighting is in use because some OSDs were getting much more data than others.
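For the reweighting we have mostly used the built-in one-shot command; the upmap balancer is something we are only evaluating, so take these lines as a sketch rather than our production setup:

# show per-OSD utilization spread
ceph osd df tree
# one-shot: lower weights of OSDs above 120% of average utilization
ceph osd reweight-by-utilization 120
# continuous balancing via the mgr (needs all clients at Luminous or newer)
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on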
Virtual machines run on the same 9 nodes; most are not CPU-intensive:
- Avg. VM CPU usage: < 6%
- Avg. node CPU usage: < 4.5%
- Peak VM CPU usage: 40%
- Peak node CPU usage: 30%
- Avg. node IO delay: 11%
- Max. node IO delay: 38%
Does anyone have experience with getting better Proxmox + Ceph performance?
I know there are many guides and we have already tried most of them, but getting that much disk space on SSDs is not within our reach.
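The one SSD option still on our list is moving only the BlueStore DB/WAL to a small SSD per node rather than buying full SSD capacity. If I read the Proxmox docs right, recreating an OSD that way looks roughly like this (device names and the DB size are placeholders):

# one OSD at a time: destroy, then recreate with RocksDB/WAL on a shared SSD
pveceph osd destroy 12 --cleanup
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 60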
Thank you all in advance for taking the time to read,
Ruben.