[SOLVED] Reinstall Proxmox-Ceph Node after Crash

Fladi

Renowned Member
Feb 27, 2015
Hi all,

we had a major outage of our cluster, which consists of 3 PVE hosts serving containers and VMs. In addition, there are 3 Proxmox-based Ceph servers.
I was preparing to switch from 4.4 to 5.x.

1. Ceph-3 didn't come up after a reboot. It turned out that the HBA controller holding the boot disks is no longer visible to the BIOS and therefore can't be used as a boot device. The controller itself is not the problem; it runs fine once the system is booted. I tried different controllers, but no add-on controller shows up in the BIOS at all. I worked around this by attaching the boot disks to the onboard SAS controller.

2. While working on 1), another Ceph node crashed. Ceph-1 had two SATA DOMs holding the Proxmox installation. Both seem to be damaged; I can't access them anymore.

So, Ceph-3 has booted with a lot of ZFS errors but is running. The Ceph cluster is in the process of healing and some VMs are available again, but the overall state is still an error (see attached image).
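
For what it's worth, I'm watching the recovery from one of the surviving nodes with the standard Ceph CLI (nothing Proxmox-specific here):

Code:
# overall cluster state and recovery progress
ceph -s
# details on the current warnings/errors
ceph health detail
# which OSDs are up/down and where they sit in the CRUSH tree
ceph osd tree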

However, I would like to get Ceph-1 online again as soon as possible. There is an "unused" SSD in that machine which is configured as part of a separate SSD pool. As that pool is empty, I could use the SSD as a system disk and boot from it.
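
I assume that before repurposing the SSD I'd first have to remove its old OSD entry from the cluster, run from one of the surviving nodes (osd.X is a placeholder for the actual OSD ID of that SSD):

Code:
# the OSD is already down since Ceph-1 is dead; mark it out
ceph osd out osd.X
# remove it from the CRUSH map, delete its auth key, and drop the OSD entry
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm osd.X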

What would be the best way to reinstall Ceph-1? A "normal" installation with the same IP/ID as before and then joining the cluster? Will Ceph detect the OSDs on this machine again? Or reinstall it as a "new" node?
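
If reinstalling under the same name/IP is the right approach, I imagine the rejoin would look roughly like this; this is an untested sketch, and the ceph-disk step only applies as long as we are still on our pre-Luminous Ceph version:

Code:
# on a surviving node: drop the stale cluster entry for the dead node first
pvecm delnode ceph-1
# on the freshly installed node: join the existing PVE cluster
pvecm add <ip-of-a-surviving-node>
# install the Ceph packages, then let ceph-disk pick up the existing OSDs
pveceph install
ceph-disk activate-all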

Oh, what a weekend...

Thanks and best regards
 

Attachments

  • ceph-health.png (62.7 KB)