Ceph Timeout on one node

ScottDavis · May 31, 2024

One node is very slow to respond, and I get timeout errors when trying to add an OSD, etc.

ceph -s shows this ...

Code:

root@pmox01-scan-hq:~# ceph -s
  cluster:
    id:     7363a620-944a-4321-ad70-d12dd688bac7
    health: HEALTH_WARN
            clock skew detected on mon.pmox03-scan-hq, mon.pmox01-scan-hq
            Degraded data redundancy: 128 pgs undersized
            17304 slow ops, oldest one blocked for 80760 sec, mon.pmox01-scan-hq has slow ops
 
  services:
    mon: 3 daemons, quorum pmox02-scan-hq,pmox03-scan-hq,pmox01-scan-hq (age 10h)
    mgr: pmox02-scan-hq(active, since 23h), standbys: pmox03-scan-hq
    osd: 4 osds: 4 up (since 22h), 4 in (since 22h); 1 remapped pgs
 
  data:
    pools:   2 pools, 129 pgs
    objects: 2 objects, 1.0 MiB
    usage:   110 MiB used, 7.0 TiB / 7.0 TiB avail
    pgs:     2/6 objects misplaced (33.333%)
             128 active+undersized
             1   active+clean+remapped

Any ideas as to what is causing the issue? Other two nodes are fine.

gurubert · Jun 1, 2024

Restart the MON in pmox01-scan-hq to see if the slow ops vanish.
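
For reference, a minimal sketch of what that restart could look like. It assumes a package-based (non-cephadm) Ceph install, as Proxmox normally sets up, where each MON runs as a systemd unit named ceph-mon@<id>.service and the ID matches the node name shown in ceph -s. The helper names mon_unit and restart_mon are made up for illustration.

```shell
# Build the systemd unit name for a MON from its ID.
# Assumption: package-based (non-cephadm) deployment, where units are
# named ceph-mon@<id>.service and the ID is the node name.
mon_unit() {
    printf 'ceph-mon@%s.service\n' "$1"
}

# Restart one MON and show its status (run as root on that node).
restart_mon() {
    unit=$(mon_unit "$1")
    systemctl restart "$unit" && systemctl status "$unit" --no-pager
}

# Example (not executed here):
#   restart_mon pmox01-scan-hq
#   ceph -s    # then check whether the slow ops count drops
```

If the unit name differs on your system, systemctl list-units 'ceph-mon*' should show the actual one.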

ScottDavis · Jun 3, 2024

gurubert said:
Restart the MON in pmox01-scan-hq to see if the slow ops vanish.

Done that a few times now. Same issue persists.

gurubert · Jun 3, 2024

Does the filesystem for /var/lib/ceph on that node have any issues?
What kind of storage is used there?

ScottDavis · Jun 3, 2024

gurubert said:
Does the filesystem for /var/lib/ceph on that node have any issues?
What kind of storage is used there?

How do I check that?

It's just basic BlueStore OSDs on Ceph storage.

The monitors show as running, but the manager on that node is still 'unknown', and I get a timeout when I try to view monitor or storage info on that node. No replication either.

gurubert · Jun 3, 2024

No, each MON has a local database stored in /var/lib/ceph/mon/…
The filesystem holding that directory is crucial for MON performance.

I have seen Ceph clusters where this filesystem was stored on cheap SD cards which were not able to deliver the performance needed for the MON operation.
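
One rough way to check this, as a sketch rather than a definitive benchmark: see which device and filesystem hold the MON directory, then time a burst of small synchronous writes there. The function names mon_fs_info and mon_sync_write_test, the 4 KiB block size, and the default count of 1000 are arbitrary choices for illustration. The oflag=dsync option assumes GNU dd, which is what Debian-based systems ship.

```shell
# Report which device and filesystem type hold a directory.
# Defaults to the MON data directory mentioned above.
mon_fs_info() {
    df -hT "${1:-/var/lib/ceph/mon}"
}

# Rough synchronous-write check: write N blocks of 4 KiB, each one
# flushed to the device before the next (oflag=dsync, GNU dd).
# dd prints the elapsed time and throughput on stderr.
#   $1 = directory to test (default: /var/lib/ceph/mon)
#   $2 = number of blocks  (default: 1000)
mon_sync_write_test() {
    dir="${1:-/var/lib/ceph/mon}"
    testfile="$dir/.sync-write-test.$$"
    dd if=/dev/zero of="$testfile" bs=4k count="${2:-1000}" oflag=dsync
    status=$?
    rm -f "$testfile"    # always clean up the temporary file
    return "$status"
}

# Example (not executed here, run as root on the affected node):
#   mon_fs_info
#   mon_sync_write_test
```

Running the same test on one of the healthy nodes gives a baseline to compare against. The test writes about 4 MB and adds load, so it is best run while the node is otherwise idle.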

ScottDavis · Jun 3, 2024

gurubert said:
No, each MON has a local database stored in /var/lib/ceph/mon/…
The filesystem holding that directory is crucial for MON performance.

I have seen Ceph clusters where this filesystem was stored on cheap SD cards which were not able to deliver the performance needed for the MON operation.

Interesting. The install was done on SD cards, with the Ceph storage on enterprise SSDs.

gurubert · Jun 3, 2024

Do not run a Proxmox installation on SD cards. They wear out way too fast.

ScottDavis · Jun 3, 2024

gurubert said:
Do not run a Proxmox installation on SD cards. They wear out way too fast.

This is for a three-node test cluster, so we can trial it before making any decision on whether to deploy new hardware in production.
