PVE 8.2.2 and Ceph: all OSDs report slow ops

Minotaur

Member
Jun 20, 2019
Hello!

I'm testing a Ceph setup on a single node; 12 HDDs are connected via a SAS9211-4i controller:
Code:
root@pve-1:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         32.74786  root default
-3         32.74786      host pve-1
 0    hdd   2.72899          osd.0       up         0  1.00000
 1    hdd   2.72899          osd.1       up   1.00000  1.00000
 2    hdd   2.72899          osd.2       up   1.00000  1.00000
 3    hdd   2.72899          osd.3       up   1.00000  1.00000
 4    hdd   2.72899          osd.4       up   1.00000  1.00000
 5    hdd   2.72899          osd.5       up   1.00000  1.00000
 6    hdd   2.72899          osd.6       up   1.00000  1.00000
 7    hdd   2.72899          osd.7       up   1.00000  1.00000
 8    hdd   2.72899          osd.8       up         0  1.00000
 9    hdd   2.72899          osd.9       up         0  1.00000
10    hdd   2.72899          osd.10      up         0  1.00000
11    hdd   2.72899          osd.11      up   1.00000  1.00000

The problem is that even the single PG in the default .mgr pool is stuck inactive because an OSD reports slow ops:

Code:
root@pve-1:~# ceph -s
  cluster:
    id:     02e2a493-9957-45f8-a634-f2be25df74fc
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            1 slow ops, oldest one blocked for 385 sec, osd.3 has slow ops

  services:
    mon: 1 daemons, quorum pve-1 (age 9m)
    mgr: pve-1(active, since 9m)
    osd: 12 osds: 12 up (since 8m), 8 in (since 6m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   248 MiB used, 22 TiB / 22 TiB avail
    pgs:     100.000% pgs unknown
             1 unknown

I tried moving the affected OSDs out of the bucket, and it seems that all 12 OSDs have slow ops. They definitely cannot all be faulty.
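In case it is useful, the blocked ops can be inspected with something like the following (a rough sketch; run directly on the node, and osd.3 is simply the one named in the warning):
Code:
ceph health detail                      # which PGs/OSDs are affected
ceph pg dump_stuck inactive             # list the stuck PG(s)
ceph daemon osd.3 dump_ops_in_flight    # look at the ops blocked on osd.3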
Any help is kindly appreciated! Thanks.
 
Thank you for the reply.
Isn't osd_crush_chooseleaf_type = 0 a way to use a single-node cluster?
Yes and no. AFAIR (it's been a while since I played with that setting), it had to be set in ceph.conf before creating the first monitor. If you add it afterwards, you need to manually edit the CRUSH map.
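For reference, the manual edit usually goes roughly like this (a sketch only; the exact rule line and file names depend on your cluster, so double-check before injecting anything):
Code:
ceph osd getcrushmap -o crush.bin        # export the compiled CRUSH map
crushtool -d crush.bin -o crush.txt      # decompile it to editable text
# in crush.txt, change the replicated rule's
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c crush.txt -o crush.new      # recompile
ceph osd setcrushmap -i crush.new        # inject the edited map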

If you want to test drive Ceph, create 3 or more VMs on that host, install PVE using nested virtualization and deploy Ceph across those VMs. You can practice everything that way, except pure performance, of course.

PS: please, next time you ask for help, tell us you are using a non-standard configuration :)
 
How many cores and how much RAM does this single host have? What VMs are you running on top of the host, and how "wide" are they? Is this RBD or CephFS you are testing? Starving the host of memory can crash OSDs, and maxing the CPU will increase IO wait latency, crashing OSDs, etc.

I have used a single Ceph node for years without any major issues. But that host has 32 cores, 256 GB of RAM, 19 OSDs (mixed SATA SSD and NVMe) and runs "light" VMs for testing; my other PVE nodes externally connect to the standalone host for CephFS and RBD access "sometimes".

But do know that if you lose an OSD, those PGs are gone, as is any data in the pool, since there are no replicas. So treat this setup as RAID 0 and make sure you are backing up your data so you can rebuild the pool and restore. Otherwise, do this correctly and build a three-node Ceph cluster.
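If you want to double-check what a given pool would actually survive, something like this shows its replica count and placement rule (<poolname> is just a placeholder for your pool):
Code:
ceph osd pool get <poolname> size        # number of replicas the pool keeps
ceph osd pool get <poolname> min_size    # replicas required before IO stops
ceph osd pool get <poolname> crush_rule  # rule that decides where PGs land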
 
Yes and no. AFAIR (it's been a while since I played with that setting), it had to be set in ceph.conf before creating the first monitor. If you add it afterwards, you need to manually edit the CRUSH map.
I've done this. On another server with an onboard SATA controller everything works fine. The problematic one has the SAS9211-4i controller, and slow ops are observed on each of the 12 drives connected to it.
 
How many cores and how much RAM does this single host have? What VMs are you running on top of the host, and how "wide" are they? Is this RBD or CephFS you are testing?
It is a two-socket E5-2620 machine, 6 cores per socket, with 64 GB RAM. The 12 HDDs are connected via a SAS9211-4i controller flashed with the latest available IT-mode firmware. It has no VMs yet; Ceph reports slow ops for the single PG in the default .mgr pool on each of the 12 installed HDDs. I'm going to remove Ceph and test local LVM storage.
 
Just a short update: after a fresh installation of PVE 8.2, updates, and Ceph, the problem did not appear again. I don't know what the root cause was; now everything looks fine. Thank you all for the replies.
 
Hello,

What do you want to accomplish? Ceph only makes sense in setups with at least three nodes.
 
It is a two-socket E5-2620 machine, 6 cores per socket, with 64 GB RAM. The 12 HDDs are connected via a SAS9211-4i controller flashed with the latest available IT-mode firmware. It has no VMs yet; Ceph reports slow ops for the single PG in the default .mgr pool on each of the 12 installed HDDs. I'm going to remove Ceph and test local LVM storage.
Is the 9211-4i connected to a backplane or SATA/SAS expanders? The -4i only has four 6 Gbps channels, and 12 HDDs could saturate that quite easily with sequential reads/writes, to say nothing of a bad cable, a bad expander, or a damaged backplane. Have you benchmarked each drive to ensure they are hitting OEM specs? What about each SATA/SAS path/channel? Are the HDDs in good health (no pending sectors, no CRC errors)? Ceph will quite literally be as fast as your slowest OSD within the same CRUSH map on the same host.
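Per drive, something along these lines is usually enough to spot a bad disk or path (a rough sketch; replace sdX with the actual device, and note the fio run is read-only against the raw disk):
Code:
smartctl -a /dev/sdX | grep -iE 'reallocat|pending|crc'   # health counters
dmesg -T | grep -iE 'mpt2sas|mpt3sas|i/o error'           # link/controller errors (the 9211-4i normally shows up under mpt2sas/mpt3sas)
fio --name=seqread --filename=/dev/sdX --rw=read --bs=1M \
    --direct=1 --ioengine=libaio --runtime=30 --time_based \
    --group_reporting                                     # sequential read throughput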

Also, 6 cores per socket with 32 GB per socket (if your 64 GB is split evenly) may not be enough since you are running 12 OSDs here. Depending on how threaded your single Ceph node is, you might need to drop the OSD count or opt for more cores per socket. Also, the 2620s are very long in the tooth for performance today and lack a few modern instructions found on the 2600 v4s. Here is a good write-up on this: https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/ You might be able to drop in 2670 v2s on that motherboard for a really cheap performance gain.

I saw that you reinstalled and it seems to be better now, but PGs will degrade if the root of this issue is not figured out. All it takes is one OSD to drop out to start this cycle all over again. Before calling it "fixed", I would dig into each HDD to make sure they are healthy and you are able to get the correct performance out of each and every one of them.
 
