Ceph - got timeout (500) on almost all nodes now

GoZippy
All the Proxmox nodes seem connected and working fine, but Ceph is giving me fits.


Every node shows a timeout when I use the GUI to check the Ceph status. In fact, every node is unresponsive to ceph commands at the CLI except node2, which is the node I decided to keep as the last known-good monitor and manager. I was previously told to take all the others down, delete the other monitors, and try to rebuild from the one known-good mon.
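In case it narrows things down, these are the checks I can run from the nodes (just a sketch of what I mean, nothing exotic; node2 is the surviving mon here):

Code:
# which monitors the clients are told to contact
grep mon_host /etc/pve/ceph.conf

# monitor map as the surviving mon sees it (run on node2)
ceph mon dump

# are the mon/mgr services actually running on this node?
systemctl status ceph-mon@node2 ceph-mgr@node2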

On node2 --->
All seems to be working; well, there are errors for sure, but it is at least responding.


Interestingly, while writing this I found node2 running at 100% CPU, 100% RAM, and 100% swap for some reason; it was unresponsive for a bit, even though it has no VMs on it.
ceph-mgr was using a lot of memory for some reason.

When I rebooted node2, everything seemed to go back to normal,

but I still get the timeout on all the other nodes.
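If it happens again I will probably try restarting just the mgr daemon instead of rebooting the whole node (a sketch, assuming the standard Ceph systemd unit names on Proxmox):

Code:
# check memory use of the mgr and bounce only that daemon
systemctl status ceph-mgr@node2
systemctl restart ceph-mgr@node2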



Any ideas where to start troubleshooting this mess?
 
Just in case it helps: ceph -s

Code:
root@node2:~# ceph -s
  cluster:
    id:     cfa7f7e5-64a7-48dd-bd77-466ff1e77bbb
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            6 osds down
            2 hosts (2 osds) down
            Reduced data availability: 484 pgs inactive, 52 pgs down, 29 pgs peering
            Degraded data redundancy: 24966/267340 objects degraded (9.339%), 19 pgs degraded, 30 pgs undersized
            111 pgs not deep-scrubbed in time
            111 pgs not scrubbed in time
            267 slow ops, oldest one blocked for 111265 sec, daemons [osd.1,mon.node2] have slow ops.

  services:
    mon: 1 daemons, quorum node2 (age 4d)
    mgr: node2(active, since 4m)
    mds: 1/1 daemons up
    osd: 14 osds: 4 up (since 4d), 10 in (since 5w)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   6 pools, 512 pgs
    objects: 133.67k objects, 516 GiB
    usage:   720 GiB used, 212 GiB / 932 GiB avail
    pgs:     78.320% pgs unknown
             16.211% pgs not active
             24966/267340 objects degraded (9.339%)
             401 unknown
             52  down
             29  peering
             17  active+undersized+degraded
             11  active+undersized
             2   undersized+degraded+peered
 
ceph health detail

Code:
root@node2:~# ceph health detail

HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 6 osds down; 2 hosts (2 osds) down; Reduced data availability: 484 pgs inactive, 52 pgs down, 29 pgs peering; Degraded data redundancy: 24966/267340 objects degraded (9.339%), 19 pgs degraded, 30 pgs undersized; 111 pgs not deep-scrubbed in time; 111 pgs not scrubbed in time; 267 slow ops, oldest one blocked for 111270 sec, daemons [osd.1,mon.node2] have slow ops.

[WRN] FS_DEGRADED: 1 filesystem is degraded

    fs cephfs is degraded

[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs

    mds.node2(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 415117 secs

[WRN] OSD_DOWN: 6 osds down

    osd.0 (root=default,host=stack1) is down

    osd.6 (root=default,host=node7) is down

    osd.8 (root=default,host=node900) is down

    osd.9 (root=default,host=node900) is down

    osd.10 (root=default,host=node900) is down

    osd.11 (root=default,host=node900) is down

[WRN] OSD_HOST_DOWN: 2 hosts (2 osds) down

    host stack1 (root=default) (1 osds) is down

    host node7 (root=default) (1 osds) is down

[WRN] PG_AVAILABILITY: Reduced data availability: 484 pgs inactive, 52 pgs down, 29 pgs peering

    pg 7.cc is stuck inactive for 5m, current state unknown, last acting []

    pg 7.cd is stuck inactive for 5m, current state unknown, last acting []

    pg 7.ce is stuck inactive for 5m, current state unknown, last acting []

    pg 7.cf is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d0 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d1 is down, acting [1]

    pg 7.d2 is down, acting [1]

    pg 7.d3 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d4 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d5 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d6 is down, acting [1]

    pg 7.d7 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d8 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.d9 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.da is stuck inactive for 5m, current state unknown, last acting []

    pg 7.db is stuck inactive for 5m, current state unknown, last acting []

    pg 7.dc is stuck inactive for 5m, current state unknown, last acting []

    pg 7.dd is stuck inactive for 5m, current state unknown, last acting []

    pg 7.de is stuck inactive for 5m, current state unknown, last acting []

    pg 7.df is stuck peering for 2M, current state peering, last acting [1,14]

    pg 7.e0 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e1 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e2 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e4 is stuck peering for 2M, current state peering, last acting [1]

    pg 7.e5 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e6 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e7 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e8 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.e9 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.ea is stuck inactive for 5m, current state unknown, last acting []

    pg 7.eb is down, acting [1]

    pg 7.ec is stuck inactive for 5m, current state unknown, last acting []

    pg 7.ed is stuck inactive for 5m, current state unknown, last acting []

    pg 7.ee is stuck peering for 2M, current state peering, last acting [1]

    pg 7.ef is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f0 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f1 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f2 is stuck peering for 2M, current state peering, last acting [1]

    pg 7.f3 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f4 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f5 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f6 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f7 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f8 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.f9 is stuck inactive for 5m, current state unknown, last acting []

    pg 7.fa is stuck peering for 2M, current state peering, last acting [1]

    pg 7.fb is stuck inactive for 5m, current state unknown, last acting []

    pg 7.fc is stuck inactive for 5m, current state unknown, last acting []

    pg 7.fd is stuck inactive for 5m, current state unknown, last acting []

    pg 7.fe is stuck inactive for 5m, current state unknown, last acting []

    pg 7.ff is stuck inactive for 5m, current state unknown, last acting []

[WRN] PG_DEGRADED: Degraded data redundancy: 24966/267340 objects degraded (9.339%), 19 pgs degraded, 30 pgs undersized

    pg 1.f is stuck undersized for 30h, current state undersized+degraded+peered, last acting [1]

    pg 1.12 is stuck undersized for 30h, current state undersized+degraded+peered, last acting [1]

    pg 2.3 is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.18 is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.1f is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.20 is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.29 is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.2b is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.2e is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.50 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 2.5c is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 2.66 is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 3.b is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 3.11 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 3.13 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 5.18 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 5.1b is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 5.1d is stuck undersized for 30h, current state active+undersized, last acting [1]

    pg 6.6 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 6.12 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 6.14 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.d is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.2d is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.3d is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.40 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.4f is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.80 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.86 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.a6 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

    pg 7.e3 is stuck undersized for 30h, current state active+undersized+degraded, last acting [1]

[WRN] PG_NOT_DEEP_SCRUBBED: 111 pgs not deep-scrubbed in time

    pg 7.fa not deep-scrubbed since 2022-02-19T19:53:41.989589-0600

    pg 7.f2 not deep-scrubbed since 2022-02-15T03:16:01.368385-0600

    pg 7.ee not deep-scrubbed since 2022-02-19T10:55:56.230313-0600

    pg 7.eb not deep-scrubbed since 2022-02-18T03:05:15.432523-0600

    pg 7.e4 not deep-scrubbed since 2022-02-19T17:06:07.782522-0600

    pg 7.e3 not deep-scrubbed since 2022-02-15T13:09:13.702290-0600

    pg 7.df not deep-scrubbed since 2022-02-16T06:46:59.842350-0600

    pg 7.d6 not deep-scrubbed since 2022-02-15T21:28:42.039202-0600

    pg 7.d2 not deep-scrubbed since 2022-02-14T19:56:50.595631-0600

    pg 7.d1 not deep-scrubbed since 2022-02-19T04:33:55.597278-0600

    pg 7.c8 not deep-scrubbed since 2022-02-14T02:24:38.501030-0600

    pg 7.c3 not deep-scrubbed since 2022-02-18T12:42:04.417557-0600

    pg 7.c1 not deep-scrubbed since 2022-02-20T05:33:46.915820-0600

    pg 7.bc not deep-scrubbed since 2022-02-17T14:59:08.550113-0600

    pg 7.b7 not deep-scrubbed since 2022-02-15T06:10:40.049759-0600

    pg 7.b4 not deep-scrubbed since 2022-02-20T00:16:25.868411-0600

    pg 7.b2 not deep-scrubbed since 2022-02-17T07:40:27.996673-0600

    pg 7.b1 not deep-scrubbed since 2022-02-14T10:35:41.165786-0600

    pg 7.af not deep-scrubbed since 2022-02-18T12:38:01.059448-0600

    pg 7.ac not deep-scrubbed since 2022-02-19T23:41:19.833720-0600

    pg 7.aa not deep-scrubbed since 2022-02-18T09:39:32.101527-0600

    pg 7.a7 not deep-scrubbed since 2022-02-19T13:10:10.638701-0600

    pg 7.a6 not deep-scrubbed since 2022-02-18T04:03:20.186108-0600

    pg 7.a2 not deep-scrubbed since 2022-02-19T20:51:56.955783-0600

    pg 7.9c not deep-scrubbed since 2022-02-20T02:32:06.967256-0600

    pg 7.93 not deep-scrubbed since 2022-02-18T11:38:27.327522-0600

    pg 7.8a not deep-scrubbed since 2022-02-14T13:56:15.049478-0600

    pg 7.87 not deep-scrubbed since 2022-02-18T11:47:22.052407-0600

    pg 7.86 not deep-scrubbed since 2022-02-18T11:24:35.008269-0600

    pg 7.80 not deep-scrubbed since 2022-02-19T08:18:42.350978-0600

    pg 2.5c not deep-scrubbed since 2022-02-18T14:24:19.997986-0600

    pg 2.55 not deep-scrubbed since 2022-04-08T21:24:01.962090-0500

    pg 7.56 not deep-scrubbed since 2022-02-14T11:40:21.380875-0600

    pg 7.55 not deep-scrubbed since 2022-02-14T07:24:09.654127-0600

    pg 2.50 not deep-scrubbed since 2022-02-19T10:22:42.460227-0600

    pg 7.49 not deep-scrubbed since 2022-02-20T09:07:58.018585-0600

    pg 7.4f not deep-scrubbed since 2022-02-14T20:21:45.642100-0600

    pg 7.40 not deep-scrubbed since 2022-02-17T03:38:21.851149-0600

    pg 2.43 not deep-scrubbed since 2022-02-16T02:01:35.870049-0600

    pg 2.3c not deep-scrubbed since 2022-02-15T10:47:28.171861-0600

    pg 7.3e not deep-scrubbed since 2022-02-16T04:29:59.058175-0600

    pg 7.3d not deep-scrubbed since 2022-02-18T20:15:58.369274-0600

    pg 2.37 not deep-scrubbed since 2022-02-13T07:44:51.931653-0600

    pg 7.31 not deep-scrubbed since 2022-02-18T02:05:56.266003-0600

    pg 7.36 not deep-scrubbed since 2022-02-15T12:10:23.425667-0600

    pg 2.33 not deep-scrubbed since 2022-02-19T10:44:21.304468-0600

    pg 2.30 not deep-scrubbed since 2022-02-18T15:50:31.124333-0600

    pg 7.2b not deep-scrubbed since 2022-02-18T00:48:49.497798-0600

    pg 2.2e not deep-scrubbed since 2022-02-17T03:10:06.379365-0600

    pg 2.2d not deep-scrubbed since 2022-02-19T20:59:15.969475-0600

    61 more pgs...

[WRN] PG_NOT_SCRUBBED: 111 pgs not scrubbed in time

    pg 7.fa not scrubbed since 2022-02-19T19:53:41.989589-0600

    pg 7.f2 not scrubbed since 2022-02-19T23:05:03.708307-0600

    pg 7.ee not scrubbed since 2022-02-19T10:55:56.230313-0600

    pg 7.eb not scrubbed since 2022-02-19T14:14:53.148503-0600

    pg 7.e4 not scrubbed since 2022-02-19T17:06:07.782522-0600

    pg 7.e3 not scrubbed since 2022-02-20T06:49:28.385202-0600

    pg 7.df not scrubbed since 2022-02-20T00:21:01.742549-0600

    pg 7.d6 not scrubbed since 2022-02-19T19:49:35.166568-0600

    pg 7.d2 not scrubbed since 2022-02-20T01:13:33.831961-0600

    pg 7.d1 not scrubbed since 2022-02-20T05:53:20.885760-0600

    pg 7.c8 not scrubbed since 2022-02-19T13:59:52.730228-0600

    pg 7.c3 not scrubbed since 2022-02-19T12:58:11.560726-0600

    pg 7.c1 not scrubbed since 2022-02-20T05:33:46.915820-0600

    pg 7.bc not scrubbed since 2022-02-20T03:34:16.839466-0600

    pg 7.b7 not scrubbed since 2022-02-20T02:42:04.377479-0600

    pg 7.b4 not scrubbed since 2022-02-20T00:16:25.868411-0600

    pg 7.b2 not scrubbed since 2022-02-19T14:40:46.699789-0600

    pg 7.b1 not scrubbed since 2022-02-19T06:07:18.893033-0600

    pg 7.af not scrubbed since 2022-02-19T17:32:05.851796-0600

    pg 7.ac not scrubbed since 2022-02-19T23:41:19.833720-0600

    pg 7.aa not scrubbed since 2022-02-19T19:37:24.360624-0600

    pg 7.a7 not scrubbed since 2022-02-19T13:10:10.638701-0600

    pg 7.a6 not scrubbed since 2022-02-20T13:10:43.756729-0600

    pg 7.a2 not scrubbed since 2022-02-19T20:51:56.955783-0600

    pg 7.9c not scrubbed since 2022-02-20T02:32:06.967256-0600

    pg 7.93 not scrubbed since 2022-02-19T16:15:35.431782-0600

    pg 7.8a not scrubbed since 2022-02-20T06:31:56.712967-0600

    pg 7.87 not scrubbed since 2022-02-19T20:03:33.304921-0600

    pg 7.86 not scrubbed since 2022-02-19T14:20:07.795002-0600

    pg 7.80 not scrubbed since 2022-02-19T08:18:42.350978-0600

    pg 2.5c not scrubbed since 2022-02-19T14:39:54.909546-0600

    pg 2.55 not scrubbed since 2022-04-08T21:24:01.962090-0500

    pg 7.56 not scrubbed since 2022-02-19T12:18:58.843074-0600

    pg 7.55 not scrubbed since 2022-02-19T12:46:44.064498-0600

    pg 2.50 not scrubbed since 2022-02-19T10:22:42.460227-0600

    pg 7.49 not scrubbed since 2022-02-20T09:07:58.018585-0600

    pg 7.4f not scrubbed since 2022-02-19T16:53:03.909236-0600

    pg 7.40 not scrubbed since 2022-02-19T18:38:01.693652-0600

    pg 2.43 not scrubbed since 2022-02-19T16:23:47.316977-0600

    pg 2.3c not scrubbed since 2022-02-19T21:29:17.693747-0600

    pg 7.3e not scrubbed since 2022-02-19T14:06:47.043142-0600

    pg 7.3d not scrubbed since 2022-02-19T22:37:04.805939-0600

    pg 2.37 not scrubbed since 2022-02-19T18:19:33.044190-0600

    pg 7.31 not scrubbed since 2022-02-19T12:09:45.145492-0600

    pg 7.36 not scrubbed since 2022-02-19T09:52:54.395219-0600

    pg 2.33 not scrubbed since 2022-02-20T12:13:59.044765-0600

    pg 2.30 not scrubbed since 2022-02-19T17:49:43.253252-0600

    pg 7.2b not scrubbed since 2022-02-20T13:18:35.667059-0600

    pg 2.2e not scrubbed since 2022-02-19T12:04:55.944390-0600

    pg 2.2d not scrubbed since 2022-02-19T20:59:15.969475-0600

    61 more pgs...

[WRN] SLOW_OPS: 267 slow ops, oldest one blocked for 111270 sec, daemons [osd.1,mon.node2] have slow ops.



ceph df

Code:
root@node2:~# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    932 GiB  212 GiB  720 GiB   720 GiB      77.29
TOTAL  932 GiB  212 GiB  720 GiB   720 GiB      77.29

--- POOLS ---
POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
cephfs_data             1   32  4.0 GiB    1.62k  7.1 GiB   0.33    1.2 TiB
cephfs_metadata         2  128  2.6 KiB        4  196 KiB      0    1.2 TiB
CStore1                 3   32   27 GiB    8.08k   47 GiB   2.15    1.2 TiB
device_health_metrics   5   32      0 B        1      0 B      0    2.1 TiB
Ceph-USDivide200        6   32  7.8 GiB    2.19k   12 GiB   0.55    1.4 TiB
CPool1                  7  256  354 GiB  121.77k  651 GiB  23.30    1.1 TiB
root@node2:~#
 
I just turned off my Proxmox firewall after 4 months of head-scratching and playing with all sorts of settings, and all of a sudden it all started communicating and rebalancing. I think it was a bug or hang-up with the upgrade.
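For anyone hitting the same thing, this is roughly what I mean (a sketch; the host name is from my cluster, adjust to yours). pve-firewall is the Proxmox firewall service, and Ceph needs its monitor and OSD ports open between nodes:

Code:
# is the Proxmox firewall active on this node?
pve-firewall status

# temporarily stop it to test whether it is blocking Ceph traffic
pve-firewall stop

# from another node, check that the Ceph ports are reachable
# (3300/6789 = monitors, 6800-7300 = OSDs and MGR)
nc -zv node2 3300
nc -zv node2 6789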
 
I'm getting a similar thing, but with a perfectly healthy Ceph cluster (two pools)
Why are you using so many MONs, MGRs, and MDS daemons? I just noticed it because this thread got pushed up by Rigel. That is overkill in most situations: you usually do not need more than 5 MONs in a cluster, and usually even fewer MGRs than MONs. I'm just curious :)

That error usually appears while a MON election is running, i.e. the MON leader role is being handed over to another MON. This also happens on 3-node systems when one mon goes offline (node reboots or the mon is stopped manually). The error should disappear once a new mon leader has been elected.
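If you want to confirm that an election is what is going on, the quorum state usually makes it obvious (a sketch; run it on any node that can reach a monitor):

Code:
# quick view: which mons are in quorum and which one is the leader
ceph mon stat

# more detail, including the current election epoch
ceph quorum_status --format json-pretty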
 
