[SOLVED] Ceph Pacific Cluster Crash Shortly After Upgrade

branto
Jul 15, 2021
Hello All,

I am new to Proxmox and could use your help. I've done some troubleshooting but don't know what to do next. Hoping someone can provide some guidance.

I had been running a 4-host, 30-guest Proxmox 6.4 HCI cluster with Ceph Octopus for about a month and it was working pretty well. I upgraded to 7.0 and that went OK. Once I upgraded to Pacific, that's where things went off the rails. Now the CephFS is completely inaccessible; only one of the four OSDs is up (two are in) and the rest are down.

Whenever I try to bring one of the down OSDs back up, it crashes immediately.
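In case it helps, this is roughly how I have been pulling the crash details so far (standard Ceph/systemd tooling; adjust the OSD id to whatever is failing on each host):

Code:
# list recent crashes and show the backtrace of one of them
ceph crash ls
ceph crash info <crash-id>

# systemd journal of a crashing OSD (osd.3 on this node)
journalctl -u ceph-osd@3 --no-pager -n 200

# full OSD log, which contains the assert/backtrace
less /var/log/ceph/ceph-osd.3.log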

Here's some output that I'd appreciate you kind souls reviewing:

ceph -s

root@sin-dc-196m-hv-cpu-001:~# ceph -s
cluster:
id: 9f0013c5-930e-4dc4-b359-02e8f0af74ad
health: HEALTH_WARN
1 filesystem is degraded
3 MDSs report slow metadata IOs
1 osds down
2 hosts (2 osds) down
1 nearfull osd(s)
Reduced data availability: 193 pgs inactive
4 pool(s) nearfull
8 daemons have recently crashed

services:
mon: 4 daemons, quorum sin-dc-196m-hv-cpu-001,sin-dc-196m-hv-cpu-002,sin-dc-196m-hv-cpu-003,sin-dc-196m-hv-cpu-004 (age 27h)
mgr: sin-dc-196m-hv-cpu-004(active, since 43h), standbys: sin-dc-196m-hv-cpu-001, sin-dc-196m-hv-cpu-002, sin-dc-196m-hv-cpu-003
mds: 3/3 daemons up, 1 standby
osd: 4 osds: 1 up (since 45h), 2 in (since 43h)

data:
volumes: 0/1 healthy, 1 recovering
pools: 4 pools, 193 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
193 unknown

The ceph -s output on sin-dc-196m-hv-cpu-002, -003, and -004 is identical to the above.

ceph health detail will be posted in a followup message due to post size limitations.
 
root@sin-dc-196m-hv-cpu-001:~# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 3 MDSs report slow metadata IOs; 1 osds down; 2 hosts (2 osds) down; 1 nearfull osd(s); Reduced data availability: 193 pgs inactive; 4 pool(s) nearfull; 8 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs acme.lab.cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 3 MDSs report slow metadata IOs
mds.sin-dc-196m-hv-cpu-004(mds.1): 3 slow metadata IOs are blocked > 30 secs, oldest blocked for 167575 secs
mds.sin-dc-196m-hv-cpu-001(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 162522 secs
mds.sin-dc-196m-hv-cpu-002(mds.2): 3 slow metadata IOs are blocked > 30 secs, oldest blocked for 99347 secs
[WRN] OSD_DOWN: 1 osds down
osd.1 (root=default,host=sin-dc-196m-hv-cpu-003) is down
[WRN] OSD_HOST_DOWN: 2 hosts (2 osds) down
host sin-dc-196m-hv-cpu-002 (root=default) (1 osds) is down
host sin-dc-196m-hv-cpu-003 (root=default) (1 osds) is down
[WRN] OSD_NEARFULL: 1 nearfull osd(s)
osd.2 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 193 pgs inactive
pg 4.4d is stuck inactive for 44h, current state unknown, last acting []
pg 4.4e is stuck inactive for 44h, current state unknown, last acting []
pg 4.4f is stuck inactive for 44h, current state unknown, last acting []
pg 4.50 is stuck inactive for 44h, current state unknown, last acting []
pg 4.51 is stuck inactive for 44h, current state unknown, last acting []
pg 4.52 is stuck inactive for 44h, current state unknown, last acting []
pg 4.53 is stuck inactive for 44h, current state unknown, last acting []
pg 4.54 is stuck inactive for 44h, current state unknown, last acting []
pg 4.55 is stuck inactive for 44h, current state unknown, last acting []
pg 4.56 is stuck inactive for 44h, current state unknown, last acting []
pg 4.57 is stuck inactive for 44h, current state unknown, last acting []
pg 4.58 is stuck inactive for 44h, current state unknown, last acting []
pg 4.59 is stuck inactive for 44h, current state unknown, last acting []
pg 4.5a is stuck inactive for 44h, current state unknown, last acting []
pg 4.5b is stuck inactive for 44h, current state unknown, last acting []
pg 4.5c is stuck inactive for 44h, current state unknown, last acting []
pg 4.5d is stuck inactive for 44h, current state unknown, last acting []
pg 4.5e is stuck inactive for 44h, current state unknown, last acting []
pg 4.5f is stuck inactive for 44h, current state unknown, last acting []
pg 4.60 is stuck inactive for 44h, current state unknown, last acting []
pg 4.61 is stuck inactive for 44h, current state unknown, last acting []
pg 4.62 is stuck inactive for 44h, current state unknown, last acting []
pg 4.63 is stuck inactive for 44h, current state unknown, last acting []
pg 4.64 is stuck inactive for 44h, current state unknown, last acting []
pg 4.65 is stuck inactive for 44h, current state unknown, last acting []
pg 4.66 is stuck inactive for 44h, current state unknown, last acting []
pg 4.67 is stuck inactive for 44h, current state unknown, last acting []
pg 4.68 is stuck inactive for 44h, current state unknown, last acting []
pg 4.69 is stuck inactive for 44h, current state unknown, last acting []
pg 4.6a is stuck inactive for 44h, current state unknown, last acting []
pg 4.6b is stuck inactive for 44h, current state unknown, last acting []
pg 4.6c is stuck inactive for 44h, current state unknown, last acting []
pg 4.6d is stuck inactive for 44h, current state unknown, last acting []
pg 4.6e is stuck inactive for 44h, current state unknown, last acting []
pg 4.6f is stuck inactive for 44h, current state unknown, last acting []
pg 4.70 is stuck inactive for 44h, current state unknown, last acting []
pg 4.71 is stuck inactive for 44h, current state unknown, last acting []
pg 4.72 is stuck inactive for 44h, current state unknown, last acting []
pg 4.73 is stuck inactive for 44h, current state unknown, last acting []
pg 4.74 is stuck inactive for 44h, current state unknown, last acting []
pg 4.75 is stuck inactive for 44h, current state unknown, last acting []
pg 4.76 is stuck inactive for 44h, current state unknown, last acting []
pg 4.77 is stuck inactive for 44h, current state unknown, last acting []
pg 4.78 is stuck inactive for 44h, current state unknown, last acting []
pg 4.79 is stuck inactive for 44h, current state unknown, last acting []
pg 4.7a is stuck inactive for 44h, current state unknown, last acting []
pg 4.7b is stuck inactive for 44h, current state unknown, last acting []
pg 4.7c is stuck inactive for 44h, current state unknown, last acting []
pg 4.7d is stuck inactive for 44h, current state unknown, last acting []
pg 4.7e is stuck inactive for 44h, current state unknown, last acting []
pg 4.7f is stuck inactive for 44h, current state unknown, last acting []
[WRN] POOL_NEARFULL: 4 pool(s) nearfull
pool 'device_health_metrics' is nearfull
pool 'acme.lab.cephfs_data' is nearfull
pool 'acme.lab.cephfs_metadata' is nearfull
pool 'lab-hc-ssdpool' is nearfull
[WRN] RECENT_CRASH: 8 daemons have recently crashed
osd.3 crashed on host sin-dc-196m-hv-cpu-001.domain.name.removed at 2021-07-12T22:43:42.839926Z
osd.3 crashed on host sin-dc-196m-hv-cpu-001.domain.name.removed at 2021-07-12T22:44:15.780599Z
osd.3 crashed on host sin-dc-196m-hv-cpu-001.domain.name.removed at 2021-07-12T22:44:46.725679Z
osd.3 crashed on host sin-dc-196m-hv-cpu-001.domain.name.removed at 2021-07-12T22:45:17.570109Z
osd.0 crashed on host sin-dc-196m-hv-cpu-002.domain.name.removed at 2021-07-12T23:25:36.445383Z
osd.0 crashed on host sin-dc-196m-hv-cpu-002.domain.name.removed at 2021-07-12T23:25:57.585415Z
osd.0 crashed on host sin-dc-196m-hv-cpu-002.domain.name.removed at 2021-07-12T23:26:16.921570Z
osd.0 crashed on host sin-dc-196m-hv-cpu-002.domain.name.removed at 2021-07-12T23:26:36.074884Z
root@sin-dc-196m-hv-cpu-001:~#

The ceph health detail output on sin-dc-196m-hv-cpu-002, -003, and -004 is identical to the above.
 
I have exactly the same situation. After the Ceph Pacific update it worked well for a couple of hours, and then the whole CephFS melted down. Proxmox 6.4 always worked like a charm. It looks like the Pacific update crashed the whole CephFS.

At the moment the situation looks like this:
Code:
  cluster:
    id:     e5afa215-2a06-49c4-9b68-f0d708f68ffa
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            1 nearfull osd(s)
            Reduced data availability: 161 pgs inactive
            Degraded data redundancy: 108998/163497 objects degraded (66.667%), 143 pgs degraded, 161 pgs undersized
            3 pool(s) nearfull
            6182 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 78m)
    mgr: pve1(active, since 82m), standbys: pve2, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 1 up (since 82m), 2 in (since 20h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   3 pools, 161 pgs
    objects: 54.50k objects, 211 GiB
    usage:   208 GiB used, 30 GiB / 238 GiB avail
    pgs:     100.000% pgs not active
             108998/163497 objects degraded (66.667%)
             143 undersized+degraded+peered
             18  undersized+peered

"ceph heath detail" looks like this:
Code:
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 1 osds down; 1 host (1 osds) down; 1 nearfull osd(s); Reduced data availability: 161 pgs inactive; Degraded data redundancy: 108998/163497 objects degraded (66.667%), 143 pgs degraded, 161 pgs undersized; 3 pool(s) nearfull; 6210 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.pve2(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 4787 secs
[WRN] OSD_DOWN: 1 osds down
    osd.1 (root=default,host=pve2) is down
[WRN] OSD_HOST_DOWN: 1 host (1 osds) down
    host pve2 (root=default) (1 osds) is down
[WRN] OSD_NEARFULL: 1 nearfull osd(s)
    osd.0 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 161 pgs inactive
    pg 1.3 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.4 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.5 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.6 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.7 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.8 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.9 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.a is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.b is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.1d is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.1e is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.1f is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.20 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.21 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.22 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.23 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.24 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.25 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.26 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.27 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.28 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.29 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2a is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2b is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2c is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2d is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2e is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.2f is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.30 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.31 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.32 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.33 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.34 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.35 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.36 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.37 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.38 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.39 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 1.7f is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 2.0 is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.4 is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.5 is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.6 is stuck inactive for 20h, current state undersized+degraded+peered, last acting [0]
    pg 2.7 is stuck inactive for 21M, current state undersized+degraded+peered, last acting [0]
    pg 2.8 is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.9 is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.a is stuck inactive for 20h, current state undersized+peered, last acting [0]
    pg 2.b is stuck inactive for 7w, current state undersized+degraded+peered, last acting [0]
    pg 2.1c is stuck inactive for 21M, current state undersized+degraded+peered, last acting [0]
    pg 2.1d is stuck inactive for 21M, current state undersized+degraded+peered, last acting [0]
    pg 2.1e is stuck inactive for 20h, current state undersized+peered, last acting [0]
[WRN] PG_DEGRADED: Degraded data redundancy: 108998/163497 objects degraded (66.667%), 143 pgs degraded, 161 pgs undersized
    pg 1.3 is undersized+degraded+peered, acting [0]
    pg 1.4 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.5 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.6 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.7 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.8 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.9 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.a is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.b is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.1d is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.1e is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.1f is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.20 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.21 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.22 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.23 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.24 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.25 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.26 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.27 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.28 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.29 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2a is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2b is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2c is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2d is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2e is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.2f is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.30 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.31 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.32 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.33 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.34 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.35 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.36 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.37 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.38 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.39 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 1.7f is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.0 is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.4 is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.5 is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.6 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.7 is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.8 is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.9 is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.a is stuck undersized for 84m, current state undersized+peered, last acting [0]
    pg 2.b is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.1c is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.1d is stuck undersized for 84m, current state undersized+degraded+peered, last acting [0]
    pg 2.1e is stuck undersized for 84m, current state undersized+peered, last acting [0]
[WRN] POOL_NEARFULL: 3 pool(s) nearfull
    pool 'cephfs_data' is nearfull
    pool 'cephfs_metadata' is nearfull
    pool 'device_health_metrics' is nearfull
[WRN] RECENT_CRASH: 6210 daemons have recently crashed
    osd.2 crashed on host pve3 at 2021-07-17T21:34:12.860855Z
    osd.2 crashed on host pve3 at 2021-07-17T21:34:34.797254Z
    osd.2 crashed on host pve3 at 2021-07-17T21:34:56.436257Z
    osd.2 crashed on host pve3 at 2021-07-17T21:35:21.354660Z
    osd.1 crashed on host pve2 at 2021-07-17T21:36:58.392341Z
    osd.1 crashed on host pve2 at 2021-07-17T21:37:19.515643Z
    osd.1 crashed on host pve2 at 2021-07-17T21:37:39.712476Z
    osd.1 crashed on host pve2 at 2021-07-17T21:38:00.249149Z
    osd.2 crashed on host pve3 at 2021-07-17T22:37:02.180118Z
    osd.2 crashed on host pve3 at 2021-07-17T22:37:22.848071Z
    osd.2 crashed on host pve3 at 2021-07-17T22:37:47.338571Z
    osd.1 crashed on host pve2 at 2021-07-17T22:37:48.677341Z
    osd.1 crashed on host pve2 at 2021-07-17T22:38:09.623846Z
    osd.1 crashed on host pve2 at 2021-07-17T22:38:30.395264Z
    osd.1 crashed on host pve2 at 2021-07-17T22:38:50.962771Z
    osd.1 crashed on host pve2 at 2021-07-17T22:39:11.324262Z
    osd.1 crashed on host pve2 at 2021-07-17T22:39:31.800564Z
    osd.1 crashed on host pve2 at 2021-07-17T22:39:52.590122Z
    osd.1 crashed on host pve2 at 2021-07-17T22:40:13.308751Z
    osd.1 crashed on host pve2 at 2021-07-17T22:40:33.878776Z
    osd.1 crashed on host pve2 at 2021-07-17T22:40:54.252907Z
    osd.1 crashed on host pve2 at 2021-07-17T22:41:15.072678Z
    osd.1 crashed on host pve2 at 2021-07-17T22:41:36.067645Z
    osd.1 crashed on host pve2 at 2021-07-17T22:41:56.820660Z
    osd.1 crashed on host pve2 at 2021-07-17T22:42:17.589467Z
    osd.1 crashed on host pve2 at 2021-07-17T22:42:38.346118Z
    osd.1 crashed on host pve2 at 2021-07-17T22:42:59.086581Z
    osd.1 crashed on host pve2 at 2021-07-17T22:43:19.849973Z
    osd.1 crashed on host pve2 at 2021-07-17T22:43:40.530506Z
    osd.1 crashed on host pve2 at 2021-07-17T22:44:01.005473Z
    and 6180 more

I suppose there is something fundamentally wrong with the Ceph Pacific version, but I can't find the reason why it acts like this. I'm just wondering whether there is some way to fix this, or to step back to the previous version.
 
I feel your pain. As you can see, there isn't any activity on this thread... I'm hoping someone will help, but it looks unlikely.
 
Just wondering whether there is any workaround to save the files and start over if fixing this is impossible... I have read a lot of Ceph and Proxmox forums by now but haven't found any useful tips on how to fix this situation. I would already be very pleased if I could at least save the data somehow. I have no idea how to fix this...
 
Everyone who had problems with the upgrade posted logs with full or near full pools/OSDs. This is a situation which should be avoided at all costs.
The Ceph mailing list is probably the best place to raise such problems.
And as the saying goes: no backup - no pity.
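For reference, the fill levels that trigger those warnings can be checked with something like:

Code:
# overall and per-pool usage
ceph df detail

# per-OSD usage and the nearfull/full ratios in effect
ceph osd df tree
ceph osd dump | grep ratio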
 
Well, my backups are a couple of weeks old, so it would be nice to get this working at least far enough to save the current state of the virtual machines. Before the update the pools/OSDs worked without issues or nearfull alerts, and there was no indication that pool usage could have this kind of impact.
 
Maybe they went nearfull because of the rebalancing, true. I would still contact the Ceph experts through the respective channels. Those are a lot more experienced with disaster recovery than the Proxmox forum (no offense [at all!]).
 
Perhaps you are right. It's just strange that Ceph worked like a charm until the Pacific update, when this meltdown happened... There must be something fundamentally different from before, but I don't understand what it could be. I can't find anything wrong in the network configuration either, which is weird. Quite often the cause is a configuration or behaviour change after an update, but this time I can't find anything obvious. Well, perhaps someone can solve this. Fortunately this is my home/hobby hyperconverged system and not a business system, but it does make me worried about the upgrade. It would be a total disaster if business systems melted down like this, because in a business case nobody has time to dig for a solution; the maintenance window is usually only a couple of hours. In a way it's actually good for me that this happened: now I know the Octopus-to-Pacific upgrade can behave like this, and I have time to find out whether the situation is permanent or can be solved somehow. If it turns out to be permanent, the conclusion is much the same: I need to find something more stable.
 
There are already a lot of people who did the upgrade without any issues, so yes, finding the cause would be highly interesting.
 
Mkay... I found something interesting:

On the first node...
Code:
root@pve1:~# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2021-07-19 01:47:32 EEST; 1h 4min ago
   Main PID: 4992 (ceph-osd)
      Tasks: 70
     Memory: 789.4M
        CPU: 11.397s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
             └─4992 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Jul 19 01:47:32 pve1 systemd[1]: Starting Ceph object storage daemon osd.0...
Jul 19 01:47:32 pve1 systemd[1]: Started Ceph object storage daemon osd.0.
Jul 19 01:47:42 pve1 ceph-osd[4992]: 2021-07-19T01:47:42.589+0300 7fd5dd04df00 -1 osd.0 3368 log_to_monitors {default=true}
Jul 19 01:48:04 pve1 ceph-osd[4992]: 2021-07-19T01:48:04.386+0300 7fd5d5555700 -1 osd.0 3368 set_numa_affinity unable to identify public interface 'eno1' numa node: (0) Success

"set_numa_affinity unable to identify public interface 'eno1' numa node: (0) Success"
So... I don't know what this is, but it looks like it fails yet reports success. I suppose it's not working as it should. What else could be wrong...

And ip a tells me that the interface eno1 exists:
Code:
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fc:3f:db:03:5d:49 brd ff:ff:ff:ff:ff:ff
    altname enp0s31f6
    inet 192.168.199.13/24 brd 192.168.199.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::fe3f:dbff:fe03:5d49/64 scope link
       valid_lft forever preferred_lft forever
But... the altname is actually enp0s31f6. I suppose that's the real name of the interface. So eno1 is a renamed enp0s31f6???

This osd starts anyway...


On the second node:
ceph-osd@1 dies immediately after it starts and seems to be stuck in an endless restart loop.
Code:
root@pve2:~# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2021-07-19 03:17:10 EEST; 10s ago
   Main PID: 70495 (ceph-osd)
      Tasks: 25
     Memory: 550.5M
        CPU: 5.939s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
             └─70495 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph

Jul 19 03:17:10 pve2 systemd[1]: Starting Ceph object storage daemon osd.1...
Jul 19 03:17:10 pve2 systemd[1]: Started Ceph object storage daemon osd.1.

and after starting...
Code:
root@pve2:~# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: activating (auto-restart) (Result: signal) since Mon 2021-07-19 03:17:21 EEST; 2s ago
    Process: 70495 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 1 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 70495 (code=killed, signal=ABRT)
        CPU: 6.293s

ip a shows:
Code:
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether c8:d3:ff:9a:bd:d0 brd ff:ff:ff:ff:ff:ff
    altname enp0s31f6
    inet 192.168.199.14/24 brd 192.168.199.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::cad3:ffff:fe9a:bdd0/64 scope link
       valid_lft forever preferred_lft forever
So eno1 is renamed enp0s31f6 again???


On the third node:
ceph-osd@2 also dies immediately after it starts and seems to be stuck in an endless restart loop. The situation is identical to the second node.

But... ip a shows this:
Code:
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ec:b1:d7:6e:29:73 brd ff:ff:ff:ff:ff:ff
    inet 192.168.199.15/24 brd 192.168.199.255 scope global enp2s0
       valid_lft forever preferred_lft forever
    inet6 fe80::eeb1:d7ff:fe6e:2973/64 scope link
       valid_lft forever preferred_lft forever
The hardware of the nodes is identical, but here there is no altname...

There must be some difference between Octopus and Pacific. I suppose this is some kind of network problem inside the node, but how to fix it... And if this conclusion is right, there must be a lot of systems under a dark cloud after the upgrade went wrong... :rolleyes:
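I still need to dig the actual abort reason out of the OSD side; if I understand the tooling right, the backtrace should show up either via the crash module or in the OSD log, so something along these lines (on pve2 for osd.1):

Code:
# crash reports collected by the crash module
ceph crash ls
ceph crash info <crash-id>

# or pull the assert/backtrace straight from the OSD log
grep -E -B 2 -A 20 'FAILED ceph_assert|Caught signal' /var/log/ceph/ceph-osd.1.log | tail -n 80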
 

Can you maybe post your /etc/pve/ceph.conf ?
 
Sure, and thank you very much for your time.

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.199.0/24
     fsid = e5afa215-2a06-49c4-9b68-f0d708f68ffa
     mon_allow_pool_delete = true
     mon_host = 192.168.199.13 192.168.199.14 192.168.199.15
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.199.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve3]
     host = pve3
     mds standby for name = pve

[mds.pve1]
     host = pve1
     mds standby for name = pve

[mds.pve2]
     host = pve2
     mds standby for name = pve

Corosync and the payload traffic are in a different network segment. As you can see, the Ceph configuration is very simple, and it worked like a charm before.


BTW...

Actually, I found something interesting in this thread:
https://forum.proxmox.com/threads/ceph-16-2-pacific-cluster-crash.92367/

My system is running Ceph 16.2.4, and there is a bug that is perhaps the main reason why the OSDs fail after starting:
https://github.com/ceph/ceph/pull/41655 and https://tracker.ceph.com/issues/50656

The workaround might be to set "bluestore_allocator = bitmap", but I still need to find out how to use that and where to put it so it works as expected.

There is also a fixed version, 16.2.5. Proxmox staff member t.lamprecht says in that thread: "FYI: that release is available on our Ceph pacific test repository already".

Fingers crossed this can be solved... Hopefully... :oops::)
 

Fingers crossed indeed :)

Beyond that, as you're using names for your host definitions, I'd check that they are either in all of the /etc/hosts files across all nodes (with correct addresses) or resolvable by 'host' etc. This could be causing your interface binding issues, especially if it's picking up the wrong addresses somehow. That said, if corosync etc. is working fine, this is likely all OK, unless its configuration is based on addresses instead of names.
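A quick sanity check for that could be something like:

Code:
# run on every node; each name should resolve to the expected 192.168.199.x address
getent hosts pve1 pve2 pve3
grep pve /etc/hosts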
 
I can confirm that this workaround for Ceph 16.2.4 solved my problem, and my system is up and running again.

There is a file called /etc/ceph/ceph.conf to which I added the following two lines:
Code:
[osd]
     bluestore_allocator = bitmap

After the Ceph 16.2.5 update, the system should work without these two lines... at least I suppose so, but we'll see what happens. Thanks to everyone who shared the information with me. :)
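For anyone else trying this: the OSDs need to be restarted to pick up a ceph.conf change, so per node something like the following (adjust the OSD id), and the running value can be checked over the admin socket:

Code:
systemctl restart ceph-osd@1
# run on the node hosting osd.1
ceph daemon osd.1 config get bluestore_allocator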
 
It seems that the "set_numa_affinity unable to identify public interface" message is also a bug, but a harmless one, so it can be ignored...
 
A couple of minutes ago I updated Ceph 16.2.4 to 16.2.5 and removed the bluestore_allocator = bitmap value. Everything is fine at the moment and the system runs smoothly.

After running ceph crash archive-all, Ceph looks nice and clean again. If the situation stays the same for the next couple of days, or perhaps a week, I think this bug disaster is solved. :)
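To double-check, something like this should confirm the running version and that no new crashes are piling up:

Code:
# versions reported by all running daemons
ceph versions

# only new (unarchived) crash reports
ceph crash ls-new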

 

Good. You're only running 3 OSDs? I hope this is a test system only; it looks kind of dangerous to run a Ceph cluster with only one disk per node.
 
