How to get better performance in ProxmoxVE + CEPH cluster

Neodata
Jul 28, 2022
We have been running Proxmox VE since 5.0 (now on 6.4-15) and we have noticed performance degrading whenever there is heavy reading/writing.

We have 9 nodes, 7 of them with Ceph (14.2.22) and 56 OSDs in total (8 per node). The OSDs are hard drives (HDD), WD Gold or better, 4-12 TB each. The nodes have 64 or 128 GB of RAM and dual-Xeon mainboards (various models).

We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s to each OSD with a ±10 MB/s spread during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and about a third below 20 ms.
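
For reference, these are roughly the commands behind those numbers (nothing exotic):

  # write benchmark on every OSD (reports bytes per second)
  ceph tell osd.* bench
  # per-OSD commit/apply latency, the same values the Proxmox GUI shows
  ceph osd perf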

The front (public) and back (cluster) networks are both 1 Gbps, separated in VLANs. We are trying to move to 10 Gbps, but we ran into trouble we are still trying to solve (OSDs becoming unstable and disconnecting).
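
In case it helps, the split is the usual public/cluster pair in /etc/pve/ceph.conf; the subnets below are made-up examples, not our real ones:

  [global]
      # front / client traffic
      public_network  = 192.168.10.0/24
      # back / OSD replication traffic (separate VLAN)
      cluster_network = 192.168.20.0/24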

The pool is defined as "replicated" with 3 copies (2 needed to keep running, i.e. size 3 / min_size 2). The total amount of disk space is 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
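
Roughly the commands involved, with <pool> standing in for our pool name and the reweight threshold left at its default:

  # 3 replicas, keep serving I/O with 2
  ceph osd pool set <pool> size 3
  ceph osd pool set <pool> min_size 2
  # nudge data away from the most-used OSDs (120% of average utilization is the default threshold)
  ceph osd reweight-by-utilization 120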

Virtual machines run on the same 9 nodes; most are not CPU-intensive:
  • Avg. VM CPU Usage < 6%
  • Avg. Node CPU Usage < 4.5%
  • Peak VM CPU Usage 40%
  • Peak Node CPU Usage 30%
But I/O wait is a different story:
  • Avg. Node IO Delay 11%
  • Max. Node IO Delay 38%
Disk write load is around 4 MB/s on average, with peaks up to 20 MB/s.

Anyone with experience in getting better Proxmox+CEPH performance?

I know there are many guides, and we have already tried most of them, but getting that much disk space on SSDs is not within our reach.

Thank you all in advance for taking the time to read,

Ruben.
 
I do not think that you can achieve much better with HDD-only OSDs.
The RocksDB read and write requests for OSD management will kill any rotating rust.

The best option is to rebuild all OSDs and externalize their RocksDB onto SSDs. Add two SSDs to each OSD node and put the RocksDBs of four OSDs on each SSD.
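
On Proxmox that roughly means recreating each OSD with a separate DB device. The device names below are examples, and the exact option names and units should be checked against your pveceph/ceph-volume version:

  # Proxmox wrapper (DB size in GiB)
  pveceph osd create /dev/sdX --db_dev /dev/nvme0n1 --db_size 70
  # or directly with ceph-volume, pointing block.db at a pre-created partition/LV
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1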
 
Hi gurubert!

Thanks for your answer. We tried some time ago with add-on boards holding 256 GB NVMe units (no more SATA connections available in some nodes), but the experience was terrible: RocksDB ended up corrupted every now and then and OSDs had to be rebuilt again and again. Also, since the DB needs 4% of the total Ceph disk size, that means going to NVMe units with huge capacity (12 TB x 4 OSDs = 48 TB -> 4% => ~2 TB), which are expensive, and after the previous bad experience it is difficult to get approval for them from management...

We set up 8 partitions on those NVMe units and put each RocksDB (block.db) on one of them. Could exceeding the suggested maximum of 4 OSDs per SSD be what caused these issues?

Writing this I realize I'll really need to purchase Learning CEPH 2nd ed. ASAP...

Ruben.
 
The 4% number is ancient and was never practical. A better number is 70 GB per RocksDB.

An NVMe can easily hold 8 RocksDBs, but your issues may come from the limited space that 8 x 32 GB partitions offer. RocksDB needs more room than that to compact itself.
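
If you want to check whether a block.db has already spilled over onto the slow HDD, something like this should show it (osd.0 is just an example, run the daemon command on the node hosting that OSD):

  # cluster-wide warning if any DB spilled over to the slow device
  ceph health detail | grep -i spillover
  # per-OSD BlueFS/DB usage counters
  ceph daemon osd.0 perf dump bluefs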
 
Thank you very much!
I'll try to get those NVMe units (1 TB, to be on the safe side) and run some tests on a three-node test cluster I have (pending being added to the current cluster as soon as I can find spare time to do the in-place upgrade from 6.4 to 7.2)...

It doesn't seem to be an easy task at all, judging by all the problems people are reporting.

Regards!
 
This is what I use to increase IOPS on a Ceph cluster using SAS drives, YMMV (a rough command sketch follows the list):

- Set write cache enable (WCE) to 1 on SAS drives
- Set VM cache to none
- Set VM CPU type to 'host'
- Set RBD pool to use the 'krbd' option
- Use the VirtIO SCSI single controller and enable the IO thread and discard options
- On Linux VMs, set the I/O scheduler to none/noop and install qemu-guest-agent
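
Roughly what that looks like from the CLI; VMID 100, /dev/sda and the storage name 'ceph-pool' are placeholders, so double-check every option before touching a production cluster:

  # enable the drive write cache on a SAS disk (use hdparm -W1 for SATA)
  sdparm --set WCE=1 --save /dev/sda
  # host CPU type and the VirtIO SCSI single controller
  qm set 100 --cpu host --scsihw virtio-scsi-single
  # re-specify the existing disk with no cache, IO thread and discard
  qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=none,iothread=1,discard=on
  # map RBD images through the kernel client
  pvesm set ceph-pool --krbd 1
  # inside a Linux guest: scheduler and guest agent
  echo none > /sys/block/sda/queue/scheduler
  apt install qemu-guest-agent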
 
Hi jdancer!

I checked and found the following:

- Set write cache enable (WCE) to 1 on SAS drives (mine are SATA, but the cache is already enabled)
- Set VM cache to none (default on my setup)
- On Linux VMs, set the I/O scheduler to none/noop and install qemu-guest-agent (default on my setup)

I will try these after investigating how they can affect stability, as it is a production cluster and we can't afford to break it (perhaps I'll use the small test cluster first and try to kill it before going to production).

- Set VM CPU type to 'host'
- Set RBD pool to use the 'krbd' option
- Use the VirtIO SCSI single controller and enable the IO thread and discard options (using VirtIO SCSI single, but without those options enabled)

Thank you very much for your support, it's great to find such nice people on the forum!

Ruben.
 
