[SOLVED] PVE not booting after Ceph installation

UnLock

I have a three-node PVE 8 cluster with Ceph installed; the nodes are named pfsense-1, pfsense-2 and r730. I have been running PVE for about a year and recently installed Ceph on these nodes. It worked well, but when I rebooted the r730 node it would not come back up (I waited for 15 hours). I reinstalled r730 and that seemed to fix it, but when I rebooted the pfsense-2 node the same problem occurred (I waited for 4 hours), and the stuck screen was identical to what I had seen on r730. Below are the "ceph -s" output, a photo of the stuck screen, and the "ceph osd tree" output.

  cluster:
    id:     4823fd4b-a059-4c88-b287-e26ae916d3fb
    health: HEALTH_WARN
            1/3 mons down, quorum pfsense-1,r730
            Degraded data redundancy: 202217/606651 objects degraded (33.333%), 176 pgs degraded, 193 pgs undersized

  services:
    mon: 3 daemons, quorum pfsense-1,r730 (age 4h), out of quorum: pfsense-2
    mgr: pfsense-1(active, since 4d), standbys: r730
    mds: 1/1 daemons up, 1 standby
    osd: 21 osds: 16 up (since 4h), 16 in (since 4h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 202.22k objects, 788 GiB
    usage:   2.6 TiB used, 17 TiB / 20 TiB avail
    pgs:     202217/606651 objects degraded (33.333%)
             176 active+undersized+degraded
             17 active+undersized

[Attached screenshot 2023-08-20 22.35.31.jpeg: the stuck boot screen]
ID   CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         27.01759  root default
 -7          7.27745      host pfsense-1
  1    ssd   1.45549          osd.1           up   1.00000  1.00000
  3    ssd   1.45549          osd.3           up   1.00000  1.00000
  5    ssd   1.45549          osd.5           up   1.00000  1.00000
  7    ssd   1.45549          osd.7           up   1.00000  1.00000
  8    ssd   1.45549          osd.8           up   1.00000  1.00000
-10          7.27745      host pfsense-2
 10    ssd   1.45549          osd.10        down         0  1.00000
 12    ssd   1.45549          osd.12        down         0  1.00000
 13    ssd   1.45549          osd.13        down         0  1.00000
 15    ssd   1.45549          osd.15        down         0  1.00000
 17    ssd   1.45549          osd.17        down         0  1.00000
 -3         12.46269      host r730
  0    hdd   1.20079          osd.0           up   1.00000  1.00000
  2    hdd   1.20079          osd.2           up   1.00000  1.00000
  4    hdd   1.20079          osd.4           up   1.00000  1.00000
  6    hdd   1.20079          osd.6           up   1.00000  1.00000
  9    hdd   1.20079          osd.9           up   1.00000  1.00000
 11    hdd   1.20079          osd.11          up   1.00000  1.00000
 14    hdd   1.20079          osd.14          up   1.00000  1.00000
 16    hdd   1.20079          osd.16          up   1.00000  1.00000
 18    hdd   1.20079          osd.18          up   1.00000  1.00000
 19    hdd   1.20079          osd.19          up   1.00000  1.00000
 20    ssd   0.45479          osd.20          up   1.00000  1.00000
 
It turned out that this wasn't about Ceph at all. Proxmox gets stuck waiting for the networking service to start, and that start job has no time limit. It can be worked around by booting into recovery mode and running "systemctl stop networking", "systemctl start networking" and "ifup -a"; after pressing Ctrl+D to leave recovery mode, Proxmox boots normally. However, it is impractical to need a manual procedure on every reboot. I am wondering why this happens and how I can resolve it permanently. Thanks.
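For clarity, the whole workaround from the recovery-mode shell looks like this (the journalctl line is only an optional check to see where the start job hangs, not a required part of the fix):

journalctl -b -u networking.service   # optional: see which interface the networking start job is waiting on
systemctl stop networking
systemctl start networking
ifup -a
# then press Ctrl+D to continue the normal boot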
 
If you are running PVE 8, it could be the NTP problem: remove ntp and ntpsec and replace them with chrony.
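On a PVE 8 node (Debian 12 based) that swap is usually just the standard apt commands, assuming the stock Debian package names ntp, ntpsec and chrony:

apt update
apt remove --purge ntp ntpsec
apt install chrony

On Debian, chrony normally starts automatically after installation; you can verify with "systemctl status chrony".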
 