PVE 8.2.2 and Ceph: all OSDs report slow ops

Minotaur

Member
Jun 20, 2019
Hello!

I'm testing a Ceph setup on a single node; 12 HDDs are connected via a SAS9211-4i controller:
Code:
root@pve-1:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         32.74786  root default
-3         32.74786      host pve-1
 0    hdd   2.72899          osd.0       up         0  1.00000
 1    hdd   2.72899          osd.1       up   1.00000  1.00000
 2    hdd   2.72899          osd.2       up   1.00000  1.00000
 3    hdd   2.72899          osd.3       up   1.00000  1.00000
 4    hdd   2.72899          osd.4       up   1.00000  1.00000
 5    hdd   2.72899          osd.5       up   1.00000  1.00000
 6    hdd   2.72899          osd.6       up   1.00000  1.00000
 7    hdd   2.72899          osd.7       up   1.00000  1.00000
 8    hdd   2.72899          osd.8       up         0  1.00000
 9    hdd   2.72899          osd.9       up         0  1.00000
10    hdd   2.72899          osd.10      up         0  1.00000
11    hdd   2.72899          osd.11      up   1.00000  1.00000

The problem is that even the single PG in the default .mgr pool is stuck inactive because an OSD reports slow ops:

Code:
root@pve-1:~# ceph -s
  cluster:
    id:     02e2a493-9957-45f8-a634-f2be25df74fc
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            1 slow ops, oldest one blocked for 385 sec, osd.3 has slow ops

  services:
    mon: 1 daemons, quorum pve-1 (age 9m)
    mgr: pve-1(active, since 9m)
    osd: 12 osds: 12 up (since 8m), 8 in (since 6m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   248 MiB used, 22 TiB / 22 TiB avail
    pgs:     100.000% pgs unknown
             1 unknown

I tried moving the affected OSDs out of the bucket, and it seems that all 12 OSDs have slow ops. They definitely cannot all be faulty.
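In case it is useful, the blocked ops can be inspected with something like the following (a rough sketch; run directly on the node, and osd.3 is simply the one named in the warning):
Code:
ceph health detail                      # which PGs/OSDs are affected
ceph pg dump_stuck inactive             # list the stuck PG(s)
ceph daemon osd.3 dump_ops_in_flight    # look at the ops blocked on osd.3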
Any help is kindly appreciated! Thanks.
 
Thank you for the reply.
Isn't osd_crush_chooseleaf_type = 0 a way to use a single-node cluster?
Yes and no. AFAIR (it's been a while since I played with that setting), it had to be set in ceph.conf before creating the first monitor. If you add it afterwards, you need to manually edit the CRUSH map.
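For reference, the manual edit usually goes roughly like this (a sketch only; the exact rule line and file names depend on your cluster, so double-check before injecting anything):
Code:
ceph osd getcrushmap -o crush.bin        # export the compiled CRUSH map
crushtool -d crush.bin -o crush.txt      # decompile it to editable text
# in crush.txt, change the replicated rule's
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c crush.txt -o crush.new      # recompile
ceph osd setcrushmap -i crush.new        # inject the edited map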

If you want to test drive Ceph, create 3 or more VMs on that host, install PVE using nested virtualization and deploy Ceph across those VMs. You can practice everything that way, except pure performance, of course.

PS: please, next time you ask for help, tell us you are using a non-standard configuration :)
 
How many cores and how much RAM does this single host have? What VMs are you running on top of the host, and how "wide" are they? Is this RBD or CephFS you are testing? Starving the host of memory can crash OSDs, and maxing the CPU will increase IO wait latency, crashing OSDs, etc.

I have used a single Ceph node for years without any major issues. But that host has 32 cores, 256 GB of RAM, 19 OSDs (mixed SATA SSD and NVMe) and runs "light" VMs for testing; my other PVE nodes externally connect to the standalone host for CephFS and RBD access "sometimes".

But do know that if you lose an OSD, those PGs are gone, as is any data in the pool, since there are no replicas. So treat this setup as RAID 0 and make sure you are backing up your data so you can rebuild the pool and restore. Otherwise, do this correctly and build a three-node Ceph cluster.
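If you want to double-check what a given pool would actually survive, something like this shows its replica count and placement rule (<poolname> is just a placeholder for your pool):
Code:
ceph osd pool get <poolname> size        # number of replicas the pool keeps
ceph osd pool get <poolname> min_size    # replicas required before IO stops
ceph osd pool get <poolname> crush_rule  # rule that decides where PGs land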
 
Yes and no. AFAIR (it's been a while since I played with that setting), it had to be set in ceph.conf before creating the first monitor. If you add it afterwards, you need to manually edit the CRUSH map.
I've done this. On another server with an onboard SATA controller everything works fine. The problematic one has the SAS9211-4i controller, and slow ops are observed on each of the 12 drives connected to it.
 
How many cores and how much RAM does this single host have? What VMs are you running on top of the host, and how "wide" are they? Is this RBD or CephFS you are testing?
It is a two-socket E5-2620 machine, 6 cores per socket, with 64 GB RAM. The 12 HDDs are connected via a SAS9211-4i controller flashed with the latest available IT-mode firmware. It has no VMs yet; Ceph reports slow ops for the single PG in the default .mgr pool on each of the 12 installed HDDs. I'm going to remove Ceph and test local LVM storage.
 
Just a short update: after a fresh installation of PVE 8.2, updates, and Ceph, the problem did not appear again. I don't know what the root cause was; now everything looks fine. Thank you all for the replies.
 
Hello,

What do you want to accomplish? Ceph only makes sense in setups with at least three nodes.
 
It is a two-socket E5-2620 machine, 6 cores per socket, with 64 GB RAM. The 12 HDDs are connected via a SAS9211-4i controller flashed with the latest available IT-mode firmware. It has no VMs yet; Ceph reports slow ops for the single PG in the default .mgr pool on each of the 12 installed HDDs. I'm going to remove Ceph and test local LVM storage.
Is the 9211-4i connected to a backplane or SATA/SAS expanders? The -4i only has four 6 Gbps channels, and 12 HDDs could saturate that quite easily with sequential reads/writes, to say nothing of a bad cable, a bad expander, or a damaged backplane. Have you benchmarked each drive to ensure they are hitting OEM specs? What about each SATA/SAS path/channel? Are the HDDs in good health (no pending sectors, no CRC errors)? Ceph will quite literally be as fast as your slowest OSD within the same CRUSH map on the same host.
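Per drive, something along these lines is usually enough to spot a bad disk or path (a rough sketch; replace sdX with the actual device, and note the fio run is read-only against the raw disk):
Code:
smartctl -a /dev/sdX | grep -iE 'reallocat|pending|crc'   # health counters
dmesg -T | grep -iE 'mpt2sas|mpt3sas|i/o error'           # link/controller errors (the 9211-4i normally shows up under mpt2sas/mpt3sas)
fio --name=seqread --filename=/dev/sdX --rw=read --bs=1M \
    --direct=1 --ioengine=libaio --runtime=30 --time_based \
    --group_reporting                                     # sequential read throughput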

Also, 6 cores per socket with 32 GB per socket (if your 64 GB is split evenly) may not be enough since you are running 12 OSDs here. Depending on how threaded your single Ceph node is, you might need to drop the OSD count or opt for more cores per socket. Also, the 2620s are very long in the tooth for performance today and lack a few modern instructions found on the 2600 v4s. Here is a good write-up on this: https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/ You might be able to drop in 2670 v2s on that motherboard for a really cheap performance gain.

I saw that you reinstalled and it seems to be better now, but PGs will degrade if the root of this issue is not figured out. All it takes is one OSD to drop out to start this cycle all over again. Before calling it "fixed", I would dig into each HDD to make sure they are healthy and you are able to get the correct performance out of each and every one of them.
 
