CEPH Crash after 8.2.4 update

I have a very serious issue: I am getting "got timeout (500)" on my Ceph cluster after doing the 8.2.4 update. All Ceph nodes are down. The PVE instances on all of them are up and responsive, but all my Ceph storage is unreachable. I really need some help on this one.
 
cluster:
    id:     8f47be00-ff2e-4265-8fc8-91ce4b8d1671
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            noscrub,nodeep-scrub flag(s) set
            12 osds down
            Reduced data availability: 545 pgs inactive

services:
    mon: 4 daemons, quorum CHC-000-Prox01,CHC-000-Prox03,CHC-000-ProxNAS03-1,CHC-000-ProxNAS01-1 (age 38h)
    mgr: CHC-000-ProxNAS03(active, since 38h), standbys: CHC-000-ProxNAS01, CHC-000-Prox03, CHC-000-Prox01
    mds: 1/1 daemons up, 1 standby
    osd: 26 osds: 7 up (since 37h), 19 in (since 23h)
         flags noscrub,nodeep-scrub

data:
    volumes: 0/1 healthy, 1 recovering
    pools:   6 pools, 545 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             545 unknown
 
Your problem is that you have 12 OSDs down; you need to look at why they are down and get them back online.
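To see which OSDs are affected and on which hosts, something like this works from any node:

ceph osd tree | grep -i down
ceph osd stat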

Check the ceph service logs for the OSDs and see what's causing them to crash/not start.
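For example (replace 15 with the ID of one of your down OSDs):

journalctl -u ceph-osd@15.service -e --no-pager
systemctl status ceph-osd@15.service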
 
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@15.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@15.service: Failed with result 'exit-code'.
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@23.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@23.service: Failed with result 'exit-code'.
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@14.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@14.service: Failed with result 'exit-code'.
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@21.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@21.service: Failed with result 'exit-code'.
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@25.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@25.service: Failed with result 'exit-code'.
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@5.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 12:25:54 CHC-000-ProxNAS03 systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.
 
Yeah, you need to go into the logs of one of the services to get more detail.

In "/var/log/ceph", check one of the ceph-osd.xx.log files.
 
This is what I have in one of the log files for one of the OSDs that are down...

2024-08-21T14:58:10.324-0600 72799340d6c0 1 bdev(0x579c9de4f180 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-08-21T14:58:10.324-0600 72799340d6c0 0 bdev(0x579c9de4f180 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument
2024-08-21T14:58:10.325-0600 72799340d6c0 1 bdev(0x579c9de4f180 /var/lib/ceph/osd/ceph-1/block) open size 2000397795328 (0x1d1c1000000, 1.8 TiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2024-08-21T14:58:10.325-0600 72799340d6c0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-1/block size 1.8 TiB
2024-08-21T14:58:10.325-0600 72799340d6c0 1 bdev(0x579c9de4f180 /var/lib/ceph/osd/ceph-1/block) close
2024-08-21T14:58:10.603-0600 72799340d6c0 1 bdev(0x579c9de4ee00 /var/lib/ceph/osd/ceph-1/block) close
2024-08-21T14:58:10.848-0600 72799340d6c0 1 objectstore numa_node 0
2024-08-21T14:58:10.848-0600 72799340d6c0 0 starting osd.1 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
2024-08-21T14:58:10.853-0600 72799340d6c0 -1 unable to parse network: 172.30.250.0.0/24
2024-08-21T14:58:21.093-0600 7a23a2e886c0 0 set uid:gid to 64045:64045 (ceph:ceph)
2024-08-21T14:58:21.093-0600 7a23a2e886c0 0 ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable), process ceph-osd, pid 1073050
2024-08-21T14:58:21.093-0600 7a23a2e886c0 0 pidfile_write: ignore empty --pid-file
2024-08-21T14:58:21.103-0600 7a23a2e886c0 1 bdev(0x58a44aaf8e00 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-08-21T14:58:21.103-0600 7a23a2e886c0 0 bdev(0x58a44aaf8e00 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument
2024-08-21T14:58:21.104-0600 7a23a2e886c0 1 bdev(0x58a44aaf8e00 /var/lib/ceph/osd/ceph-1/block) open size 2000397795328 (0x1d1c1000000, 1.8 TiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2024-08-21T14:58:21.104-0600 7a23a2e886c0 1 bluestore(/var/lib/ceph/osd/ceph-1) _set_cache_sizes cache_size 3221225472 meta 0.45 kv 0.45 data 0.06
2024-08-21T14:58:21.104-0600 7a23a2e886c0 1 bdev(0x58a44aaf9180 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-08-21T14:58:21.104-0600 7a23a2e886c0 0 bdev(0x58a44aaf9180 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument
2024-08-21T14:58:21.105-0600 7a23a2e886c0 1 bdev(0x58a44aaf9180 /var/lib/ceph/osd/ceph-1/block) open size 2000397795328 (0x1d1c1000000, 1.8 TiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2024-08-21T14:58:21.105-0600 7a23a2e886c0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-1/block size 1.8 TiB
2024-08-21T14:58:21.105-0600 7a23a2e886c0 1 bdev(0x58a44aaf9180 /var/lib/ceph/osd/ceph-1/block) close
2024-08-21T14:58:21.386-0600 7a23a2e886c0 1 bdev(0x58a44aaf8e00 /var/lib/ceph/osd/ceph-1/block) close
2024-08-21T14:58:21.653-0600 7a23a2e886c0 1 objectstore numa_node 0
2024-08-21T14:58:21.653-0600 7a23a2e886c0 0 starting osd.1 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
 
Is that the end of the log? The log is showing the OSD starting its boot process, but no error.
 
Sorry, this is the last line in that log:

2024-08-21T14:58:21.656-0600 7a23a2e886c0 -1 unable to parse network: 172.30.250.0.0/24

I do not understand this. I can ping from machine to machine across that network with no issue, so I do not understand why it would say it is unable to parse it.
 
Check your ceph.conf file.

172.30.250.0.0/24 is not a valid network: it has five octets instead of four.

I imagine you mean 172.30.250.0/24

Update this in your ceph.conf file and then restart your servers.
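For reference, the network settings usually look something like this in /etc/pve/ceph.conf (which /etc/ceph/ceph.conf is symlinked to on Proxmox); whether the typo is in cluster_network, public_network, or both is something you will need to check yourself:

[global]
     cluster_network = 172.30.250.0/24
     public_network = 172.30.250.0/24

After fixing it, restarting just the Ceph daemons should also do the trick, e.g.:

systemctl restart ceph.target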
 
