[SOLVED] PVE not booting after Ceph installation

UnLock

I have a three-node PVE 8 cluster with Ceph installed; the nodes are named pfsense-1, pfsense-2 and r730. I have been running PVE for about a year and recently installed Ceph on these nodes. It worked well, but when I rebooted the r730 node it would not come back up (I waited for 15 hours). I reinstalled r730 and that seemed to fix it, but when I rebooted the pfsense-2 node the same problem occurred (I waited for 4 hours), and the stuck screen was identical to what I had seen on r730. Below are the "ceph -s" output, a photo of the stuck screen, and the "ceph osd tree" output.

  cluster:
    id:     4823fd4b-a059-4c88-b287-e26ae916d3fb
    health: HEALTH_WARN
            1/3 mons down, quorum pfsense-1,r730
            Degraded data redundancy: 202217/606651 objects degraded (33.333%), 176 pgs degraded, 193 pgs undersized

  services:
    mon: 3 daemons, quorum pfsense-1,r730 (age 4h), out of quorum: pfsense-2
    mgr: pfsense-1(active, since 4d), standbys: r730
    mds: 1/1 daemons up, 1 standby
    osd: 21 osds: 16 up (since 4h), 16 in (since 4h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 202.22k objects, 788 GiB
    usage:   2.6 TiB used, 17 TiB / 20 TiB avail
    pgs:     202217/606651 objects degraded (33.333%)
             176 active+undersized+degraded
             17 active+undersized

[Attached screenshot 2023-08-20 22.35.31.jpeg: the stuck boot screen]
ID   CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         27.01759  root default
 -7          7.27745      host pfsense-1
  1    ssd   1.45549          osd.1           up   1.00000  1.00000
  3    ssd   1.45549          osd.3           up   1.00000  1.00000
  5    ssd   1.45549          osd.5           up   1.00000  1.00000
  7    ssd   1.45549          osd.7           up   1.00000  1.00000
  8    ssd   1.45549          osd.8           up   1.00000  1.00000
-10          7.27745      host pfsense-2
 10    ssd   1.45549          osd.10        down         0  1.00000
 12    ssd   1.45549          osd.12        down         0  1.00000
 13    ssd   1.45549          osd.13        down         0  1.00000
 15    ssd   1.45549          osd.15        down         0  1.00000
 17    ssd   1.45549          osd.17        down         0  1.00000
 -3         12.46269      host r730
  0    hdd   1.20079          osd.0           up   1.00000  1.00000
  2    hdd   1.20079          osd.2           up   1.00000  1.00000
  4    hdd   1.20079          osd.4           up   1.00000  1.00000
  6    hdd   1.20079          osd.6           up   1.00000  1.00000
  9    hdd   1.20079          osd.9           up   1.00000  1.00000
 11    hdd   1.20079          osd.11          up   1.00000  1.00000
 14    hdd   1.20079          osd.14          up   1.00000  1.00000
 16    hdd   1.20079          osd.16          up   1.00000  1.00000
 18    hdd   1.20079          osd.18          up   1.00000  1.00000
 19    hdd   1.20079          osd.19          up   1.00000  1.00000
 20    ssd   0.45479          osd.20          up   1.00000  1.00000
 
It turned out that this wasn't about Ceph at all. Proxmox gets stuck waiting for the networking service to start, and that start job has no time limit. It can be worked around by booting into recovery mode and running "systemctl stop networking", "systemctl start networking" and "ifup -a"; after pressing Ctrl+D to leave recovery mode, Proxmox boots normally. However, it is impractical to need a manual procedure on every reboot. I am wondering why this happens and how I can resolve it permanently. Thanks.
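For clarity, the whole workaround from the recovery-mode shell looks like this (the journalctl line is only an optional check to see where the start job hangs, not a required part of the fix):

journalctl -b -u networking.service   # optional: see which interface the networking start job is waiting on
systemctl stop networking
systemctl start networking
ifup -a
# then press Ctrl+D to continue the normal boot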
 
If you are running PVE 8, it could be the NTP problem: remove ntp and ntpsec and replace them with chrony.
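On a PVE 8 node (Debian 12 based) that swap is usually just the standard apt commands, assuming the stock Debian package names ntp, ntpsec and chrony:

apt update
apt remove --purge ntp ntpsec
apt install chrony

On Debian, chrony normally starts automatically after installation; you can verify with "systemctl status chrony".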
 