Hi,
After a hardware failure in a 3-node Proxmox VE 6.3 cluster, I replaced the hardware and re-joined the new node.
The replaced node is called hystou1; the other two nodes are hystou2 and hystou3.
I had a couple of minor issues when re-joining the new node since it has the same name, and I had to remove conflicting entries in /etc/ssh/ssh_known_hosts on the remaining nodes. Eventually, the cluster came back as expected. Except Ceph.
When enabling Ceph on the replaced node, it installed the required packages with the same (Nautilus) version as on the other nodes:
Code:
# cat /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-nautilus buster main
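For completeness, a quick way to confirm the new node really runs the same release as the quorum nodes (as I understand it, `ceph versions` needs a working connection to the monitors, so it has to be run from a quorum node):

```shell
# On hystou1: report the locally installed Ceph release
ceph --version

# From a quorum node (hystou2/3): list the release every running
# daemon reports, to spot any mismatch across the cluster
ceph versions
```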
The /mnt/pve/cephfs share is properly mounted and the Ceph block devices are available for the VMs, but neither the MON nor the OSD restarted on the new node (hystou1). There is not even a systemd startup unit for them.

Before fixing the OSD, I want to first fix the MON.
I manually added the link to the Ceph config file since it was missing:
Code:
ln -s /etc/pve/ceph.conf /etc/ceph/
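For reference, this is the approach I was planning to follow to rebuild the monitor, based on my reading of the Ceph and Proxmox docs (the data-directory path and the exact behaviour of `pveceph mon create` are my assumptions):

```shell
# From a quorum node: drop the dead monitor from the monmap
ceph mon remove hystou1

# On hystou1: clear any stale monitor data directory
# (path assumed from the default layout)
rm -rf /var/lib/ceph/mon/ceph-hystou1

# Recreate the monitor through the Proxmox wrapper, which I
# assume also sets up the systemd unit
pveceph mon create
```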
From the GUI, I cannot delete and re-create the monitor process, since it is in an undefined state (entry has no host error). Everything seems 'as expected' from the terminal as well:
Code:
root@hystou1:~# ceph status
  cluster:
    id:     3ce42410-da82-4467-bc36-258e1f2217b1
    health: HEALTH_WARN
            Degraded data redundancy: 35111/726129 objects degraded (4.835%), 40 pgs degraded, 40 pgs undersized
            1/3 mons down, quorum hystou3,hystou2

  services:
    mon: 3 daemons, quorum hystou3,hystou2 (age 2h), out of quorum: hystou1
    mgr: hystou3(active, since 5d), standbys: hystou2
    mds: cephfs:1 {0=hystou3=up:active} 1 up:standby
    osd: 3 osds: 2 up (since 23h), 2 in (since 23h)

  task status:
    scrub status:
        mds.hystou3: idle

  data:
    pools:   3 pools, 72 pgs
    objects: 345.51k objects, 578 GiB
    usage:   1.1 TiB used, 254 GiB / 1.3 TiB avail
    pgs:     35111/726129 objects degraded (4.835%)
             40 active+undersized+degraded
             32 active+clean

  io:
    client: 5.8 MiB/s rd, 290 KiB/s wr, 7 op/s rd, 43 op/s wr
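To see how the cluster itself currently records hystou1, and whether the monmap entry really lacks an address, I believe this can be checked with:

```shell
# From a quorum node: print the current monmap, including the
# address registered for each monitor
ceph mon dump
```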
When I follow the Ceph documentation to re-create the MON from any of the quorum nodes, I get the following error:
Code:
root@hystou2:~# ceph-mon -i `hostname` --extract-monmap /tmp/monmap
2021-03-14 18:45:45.503 7f61f3ed8400 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/ceph-hystou2/store.db/LOCK: Resource temporarily unavailable
2021-03-14 18:45:45.503 7f61f3ed8400 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-hystou2': (22) Invalid argument
root@hystou2:~#
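If I understand the error, the local mon daemon on hystou2 is still running and holds the RocksDB lock, so `--extract-monmap` cannot open its store. A sketch of the two workarounds I found (my assumption, not verified on this cluster):

```shell
# Option A: fetch the monmap from the live cluster instead;
# no daemon needs to be stopped
ceph mon getmap -o /tmp/monmap

# Option B: stop the local monitor first, extract, then restart it
systemctl stop ceph-mon@hystou2
ceph-mon -i hystou2 --extract-monmap /tmp/monmap
systemctl start ceph-mon@hystou2
```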
Do you have any idea, or a guide I should follow, to fix the Ceph cluster after a hardware failure on Proxmox VE 6.3?