How to stop or remove slow ops in Ceph

huky

Renowned Member
Jul 1, 2016
Chongqing, China
My Ceph cluster is unhealthy:
Code:
            1 filesystem is degraded
            11 PGs pending on creation
            Reduced data availability: 202 pgs inactive, 6 pgs down
            Degraded data redundancy: 269/10009374 objects degraded (0.003%), 17 pgs degraded, 3 pgs undersized
            2 daemons have recently crashed
            17 slow ops, oldest one blocked for 6512 sec, daemons [osd.30,osd.32,osd.35] have slow ops.

How can I find and stop these ops?
 
Hi,

daemons [osd.30,osd.32,osd.35] have slow ops.

Those integers are the OSD IDs, so the first thing would be to check the health and status of those disks (e.g., SMART health data) and of the hosts those OSDs reside on; also check dmesg (the kernel log) and the journal for any errors from the disks or the Ceph daemons.
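For example (the device name is a placeholder, adjust it and the OSD ID to the affected disks):

Code:
# SMART health of the disk backing an OSD (replace sdX with the real device)
smartctl -a /dev/sdX

# kernel log with human-readable timestamps, filtered for disk errors
dmesg -T | grep -iE 'error|fail|sd'

# journal of one of the affected OSD daemons
journalctl -u ceph-osd@30 --since "1 hour ago"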

Which Ceph and PVE version is in use in that setup?
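Both can be read out on any node with:

Code:
pveversion -v
ceph versions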

What does the setup look like in general? I.e., how many nodes, what networks (and bandwidth), how many OSDs per node, which type of disk tech (NVMe, SSD, or spinner), ...?
Sometimes this can stem from an overloaded part of the cluster.
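A quick way to look for an overloaded or unusually slow OSD, for instance:

Code:
# per-OSD commit/apply latency in ms; one OSD standing out is suspicious
ceph osd perf

# space usage and PG count per OSD, grouped by host
ceph osd df tree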

You may get more details on why those operations are slow by following:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#debugging-slow-requests
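For instance, on the node where one of the affected OSDs runs (osd.30 taken from the health output above):

Code:
# operations currently stuck in the OSD, with their age and current step
ceph daemon osd.30 dump_ops_in_flight

# recently completed slow operations, with per-event timestamps
ceph daemon osd.30 dump_historic_ops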
 
Thanks.
The disks' SMART data is healthy.
PVE is v6, upgraded from v5; 9 nodes.
Ceph is Nautilus, upgraded from Luminous.
The cluster has 43 OSDs (most of them 2 TB), including 9 SSDs (one SSD per node), and it was working normally.

I added a 4 TB OSD on January 13, and the process went smoothly.
Then I added a 10 TB OSD on January 14, and the whole cluster and all VMs and CTs became very slow. Now, after two days, it is still not usable.
I want to stop the Ceph rebalance so that I can use the VMs and CTs now (one way to pause it is sketched below, after the status output).

Thanks again.

Code:
  cluster:
    id:     225397cb-7b69-4c24-8c34-f43951f42974
    health: HEALTH_WARN
            1 filesystem is degraded
            12 PGs pending on creation
            Reduced data availability: 202 pgs inactive
            Degraded data redundancy: 269/9985113 objects degraded (0.003%), 17 pgs degraded, 3 pgs undersized
            2 daemons have recently crashed
            68 slow ops, oldest one blocked for 1930 sec, daemons [osd.11,osd.30,osd.32,osd.35] have slow ops.
 
  services:
    mon: 3 daemons, quorum node003,node009,node008 (age 31m)
    mgr: node003(active, since 31m), standbys: node009, node008
    mds: cephfs1:3/3 {0=node003=up:replay,1=node009=up:resolve,2=node008=up:resolve}
    osd: 44 osds: 44 up (since 32m), 44 in (since 2d); 177 remapped pgs
 
  task status:
    scrub status:
        mds.node003: idle
        mds.node008: idle
        mds.node009: idle
 
  data:
    pools:   7 pools, 2720 pgs
    objects: 3.33M objects, 12 TiB
    usage:   38 TiB used, 57 TiB / 94 TiB avail
    pgs:     0.368% pgs unknown
             7.059% pgs not active
             269/9985113 objects degraded (0.003%)
             307894/9985113 objects misplaced (3.084%)
             2518 active+clean
             174  activating+remapped
             14   activating+degraded
             10   unknown
             3    activating+undersized+degraded+remapped
             1    activating
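To pause the rebalance itself, Ceph has cluster-wide OSD flags; a minimal sketch (note this only pauses data movement, the inactive/activating PGs still have to finish peering before I/O to them resumes):

Code:
# stop backfill, recovery, and rebalancing; client I/O to active PGs continues
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# later, when the cluster is responsive again, let it heal
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance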