Performance Issues with Ceph OSDs

Volker Lieder

Well-Known Member
Nov 6, 2017
Hi,
we have two clusters, each with an NVMe-backed Ceph.
When we run a performance test such as ceph daemon osd.X bench on both clusters,
we receive the following results:

Good Cluster:
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.34241843399999999,
"bytes_per_sec": 3135759402.4859071,
"iops": 747.6233011450546
}

Slow Cluster:
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 3.7230354280000002,
"bytes_per_sec": 288404943.96713537,
"iops": 68.761096946510165
}

{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 3.7165428359999999,
"bytes_per_sec": 288908771.23742104,
"iops": 68.881218728404292
}

{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 3.6945760989999998,
"bytes_per_sec": 290626527.9772222,
"iops": 69.290763849549819
}

{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 3.7006634480000002,
"bytes_per_sec": 290148466.37304908,
"iops": 69.176785081159849
}
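The bench can be repeated over several OSDs of a node roughly like this (the OSD IDs below are only placeholders; the command has to run on the node hosting those OSDs):
Code:
# Default 1 GiB write bench with 4 MiB blocks, per OSD admin socket
for id in 0 1 2 3; do
    echo "osd.$id:"
    ceph daemon osd.$id bench
done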

Any ideas what could cause the second cluster to show such different results?

We checked the OSD performance because some VMs have ICMP delays between a few ms and 3000-5000 ms when pinging them.
The network has already been checked; perhaps it is caused by I/O waits.
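One way to check for I/O waits directly on a PVE node is, for example (a rough sketch; iostat comes from the sysstat package):
Code:
# Per-device stats every 5 seconds on the PVE node.
# Sustained high %util and rising w_await on the NVMe devices would point to I/O waits.
apt install sysstat        # if not already installed
iostat -x 5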

Any ideas?

Debian 12 with pve-manager 8.4.1 is installed.
82 OSDs are installed.

root@pve-node01:~# ceph -s
  cluster:
    id:     ...
    health: HEALTH_WARN
            7 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve-node01,pve-node03,pve-node05 (age 2w)
    mgr: pve-node05(active, since 95m), standbys: pve-node03, pve-node01
    osd: 82 osds: 82 up (since 17h), 81 in (since 37h)

  data:
    pools:   2 pools, 2049 pgs
    objects: 11.32M objects, 42 TiB
    usage:   110 TiB used, 173 TiB / 283 TiB avail
    pgs:     2048 active+clean
             1    active+clean+scrubbing+deep

  io:
    client: 17 MiB/s rd, 185 MiB/s wr, 1.18k op/s rd, 4.31k op/s wr

Best regards,
Volker
 
What is your NVMe model?
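For example, something like this lists the drive models on each node (the second command needs the nvme-cli package):
Code:
# Model per block device
lsblk -d -o NAME,MODEL,SIZE
# NVMe-specific details (requires nvme-cli)
nvme list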

We checked the OSD performance because some VMs have ICMP delays between a few ms and 3000-5000 ms when pinging them.
That seems unrelated: ICMP is network, and slow storage can't impact it.
However, a slow network between the OSDs could impact I/O. (Have you tried some ICMP tests directly between your Proxmox nodes?)
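For example, something along these lines between two nodes on the Ceph/storage network (the address is a placeholder; iperf3 has to be installed on both nodes for the bandwidth part):
Code:
# Latency over the storage network
ping -c 200 -i 0.2 <storage-ip-of-other-node>
# Bandwidth check; run "iperf3 -s" on the other node first
iperf3 -c <storage-ip-of-other-node> -t 30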

7 daemons have recently crashed
This should be investigated.
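For example, the crash module can show what crashed:
Code:
# List recent crashes and inspect one of them
ceph crash ls
ceph crash info <crash-id>
# After reviewing, archive them so the HEALTH_WARN clears
ceph crash archive-all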
 
The network has already been checked; perhaps it is caused by I/O waits.
As mentioned above, a slow network could/will impact the Ceph cluster. Do you have a dedicated network for Ceph as described in the docs?

Those slow pings of 3-5 seconds are bothersome and are usually network-related.
You may have to rethink your network infrastructure.

What does general I/O look like in those VMs, both disk read and write?
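For example, inside the VM something like this would show it (vmstat is standard; pidstat comes from the sysstat package):
Code:
# Blocked processes (b) and I/O wait (wa) every 5 seconds
vmstat 5
# Per-process disk read/write, to see which process produces the load
pidstat -d 5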
 
Hi, we have dedicated networks for Ceph and for the VMs (2 x 10 Gbit with LACP per segment).
ICMP between the PVE nodes is fine; ICMP and performance are only bad for one VM, and the problem persists even when the VM is migrated to another PVE node.
Most other pings are fine. The VM is a big database server with 700 GB RAM.
nc tests are fine between all nodes on all connections.
The current investigation focuses on the big VM, while all other communication runs fine.
We will check I/O inside the VM again next.

Best regards,
Volker
 
Here is some more information from debugging the setup:
The cluster consists of 7 PVE nodes, each with 1 TB RAM.
The VM with the problems (ICMP delays, shell freezes several times per hour) has the following settings:
Code:
agent: 1
balloon: 0
boot: order=scsi0;ide2;net0
cores: 16
cpu: host
ide2: none,media=cdrom
memory: 819200
meta: creation-qemu=7.2.0,ctime=1684047848
name: n1
net0: virtio=32:FC:F5:3B:FA:12,bridge=vmbr1,tag=2800
numa: 1
onboot: 1
ostype: l26
scsi0: nvme01:vm-181-disk-0,iothread=1,size=300G
scsi1: nvme01:vm-181-disk-1,backup=0,iothread=1,size=2136G

Here are ping results to VM n1 from another VM in the same VLAN:
Code:
64 bytes from n1 (10.100.10.181): icmp_seq=121 ttl=64 time=0.159 ms
64 bytes from n1 (10.100.10.181): icmp_seq=122 ttl=64 time=0.108 ms
64 bytes from n1 (10.100.10.181): icmp_seq=123 ttl=64 time=5162 ms
64 bytes from n1 (10.100.10.181): icmp_seq=124 ttl=64 time=4138 ms
64 bytes from n1 (10.100.10.181): icmp_seq=125 ttl=64 time=3114 ms
64 bytes from n1 (10.100.10.181): icmp_seq=126 ttl=64 time=2091 ms
64 bytes from n1 (10.100.10.181): icmp_seq=127 ttl=64 time=1067 ms

There are roughly 100 VMs in the whole cluster.
No other VM has the problems described for VM n1.

If VM n1 is restarted, the problems start again after 24-28 hours.

top on pve-node04, where n1 is a guest:
Code:
top - 11:49:01 up 20 days, 31 min,  1 user,  load average: 7.56, 7.72, 8.06
Tasks: 1010 total,   1 running, 1009 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.2 us,  1.2 sy,  0.0 ni, 84.9 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 1019409.+total, 115341.1 free, 820354.1 used,  90660.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 199055.8 avail Mem

Is there something special we have to take care of with such a big VM?

Best regards,
Volker
 
Just to summarize (to my understanding) for the sake of clarity.
Are we looking at one problem or multiple?


From the original post we have:

  • Slow Cluster vs Good Cluster in ceph daemon osd.X bench testing.
  • We then have: ceph HEALTH_WARN & 7 daemons have recently crashed.
  • We also have "some VMs have ICMP delays between a few ms and 3000-5000 ms when pinging them".

In a subsequent post we have:

  • The general icmp between pve-nodes is OK. (I assume this is from the above 'Slow Cluster').
  • Negative (comparative?) icmp and performance of a single VM (I assume n1) even if the vm migrates to another pve-node.
  • The VM is a big database server with 700GB RAM.

Please correct me if anything is wrong in my summary.


Other things to possibly look into:

1. What does top look like from the VM n1 point of view? It sounds like it has a heavy workload.
2. Is that workload CPU, disk, or NW intensive - or a mixture?
3. What happens if you migrate the VM n1 to the 'Good Cluster' above (if at all possible - maybe clone?)? Does this resolve its issues?
4. What are the HW/NW/workload differences between the Slow Cluster and the Good Cluster (if any)?

Good luck.
 
Hi,
sorry for that confusion.

The main issue is that VM n1 freezes in the shell, has delays while writing database operations, and has high ICMP latencies to VMs in the same subnet.
Other VMs on the same Ceph, cluster, and node seem fine.

While investigating this issue we compared the two Ceph clusters, the one hosting n1 and another one, and noticed the difference in IOPS between them.

For the moment we changed some SCSI disk options:
scsi0: nvme01:vm-181-disk-0,iothread=1,size=300G,ssd=1
scsi1: nvme01:vm-181-disk-1,backup=0,iothread=1,size=2136G,ssd=1
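For reference, the same change can be made from the CLI with qm set (VM ID 181 is taken from the config above); the new flag is only picked up on the next VM start:
Code:
qm set 181 --scsi0 'nvme01:vm-181-disk-0,iothread=1,size=300G,ssd=1'
qm set 181 --scsi1 'nvme01:vm-181-disk-1,backup=0,iothread=1,size=2136G,ssd=1'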

To check whether this solves the issue we have to wait 24-28 hours, since the problems only start after this period following a reboot.

The next step could be to connect the cluster hosting n1 to the other Ceph cluster, try to move the storage there, and see whether the problem depends on the NVMes or the platform.
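A rough sketch of that step, assuming the other cluster's RBD pool has been added to this PVE cluster as a storage (the storage name nvme02 is only a placeholder):
Code:
# Move both disks of VM 181 to the storage backed by the other Ceph cluster;
# without --delete the old disks are kept as "unused" entries in the config
qm disk move 181 scsi0 nvme02
qm disk move 181 scsi1 nvme02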

To answer your questions:
In a subsequent post we have:
  • The general icmp between pve-nodes is OK. (I assume this is from the above 'Slow Cluster').
  • Negative (comparative?) icmp and performance of a single VM (I assume n1) even if the vm migrates to another pve-node.
  • The VM is a big database server with 700GB RAM.
- ICMP and nc between the PVE nodes are fine, both on the management network and on the storage network
- Yes, only n1 has bad ICMP performance/delays, even if we migrate the VM to another PVE node
- Yes, it is a big database server, but with 800 GB RAM

Best regards,
Volker