Hi all,
we had a major outage of our cluster consisting of 3 PVE hosts which serve containers and VMs. In addition there are 3 Proxmox-based Ceph servers.
I was preparing to switch from 4.4 to 5.x.
1. Ceph-3 didn't come up after the reboot. It turned out that the HBA controller with the boot disks is no longer visible to the BIOS and thus can't be used as a boot device. The controller itself is not the problem; it is running fine. I tried different controllers, but no additional controller shows up in the BIOS. I got around this by attaching the boot disks to the onboard SAS controller.
2. While working on 1), another Ceph node crashed. Ceph-1 had two SATA DOMs with the Proxmox installation. Both seem to be damaged; I can't access them anymore.
So, Ceph-3 has booted with a lot of ZFS errors but is running. The Ceph cluster is in the process of healing and some VMs are available again, but the overall state is still error (see image).
However, I would like to get Ceph-1 online again as soon as possible. I have an "unused" SSD in there which is configured as part of a separate SSD pool. As that pool is empty, I could use the SSD as a system disk and boot from it.
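To free up that SSD, my rough plan would be to retire its OSD from the cluster first, run from one of the surviving Ceph nodes. The OSD id below is just a placeholder for whatever id that SSD had; please correct me if this is not the right way to remove an OSD whose host is down:

    ceph osd out <id>                  # mark the OSD out (it is already down anyway)
    ceph osd crush remove osd.<id>     # remove it from the CRUSH map
    ceph auth del osd.<id>             # delete its authentication key
    ceph osd rm <id>                   # remove the OSD entry from the cluster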
What would be the best way to reinstall Ceph-1? Install it "normally", give it the same IP/ID as before and join the cluster? Will Ceph detect the OSDs on this machine again? Or reinstall it as a "new" node?
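For the existing OSDs on Ceph-1, what I would try after the reinstall, assuming they were created with ceph-disk as is usual on 4.4, is roughly this (once ceph.conf and the keyrings are back on the node, e.g. after rejoining the PVE cluster):

    ceph-disk activate-all    # scan and start any existing OSD partitions on this node
    ceph osd tree             # check whether the OSDs show up under ceph-1 again
    ceph -s                   # watch cluster health while they rejoin and backfill

I am not sure whether udev would pick the OSDs up automatically on boot, hence the explicit activate-all.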
Oh, what a weekend...
Thanks and best regards