Just for reference, in case anyone reading this is looking for help...
I ended up firing up a new virtual machine (Debian 11) to which I attached a copy of the faulty (virtual) drive. The drive had no partition table, but I managed to repair the (raw) drive with TestDisk. Once I had access to the data, I made a backup copy on another machine using rsync.
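For anyone going down the same route, here is a rough sketch of the commands involved. The device name (/dev/sdb), mount point and remote host/paths are placeholders only, adjust them to your own setup:

```bash
# Identify the attached copy of the faulty disk (placeholder: /dev/sdb)
lsblk

# Run TestDisk interactively on the raw device to search for and
# rewrite the lost partition table (Analyse -> Quick/Deeper Search -> Write)
testdisk /dev/sdb

# Once the partition is back, mount it read-only and copy the data
# off to another machine with rsync (host and paths are examples)
mount -o ro /dev/sdb1 /mnt/recovered
rsync -avh --progress /mnt/recovered/ backupuser@backuphost:/srv/recovery/
```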
Finally, we rebuilt the whole cluster, but before that we replaced the internal HDDs in node2 and P3 with SSDs.
It has been running for a few days now. Lessons learnt, of course: we now take a daily backup to an external drive (using PBS). The other lesson is that a cluster (even with Ceph replication) is not three machines with replication; it should be seen as a whole that can tolerate one node being down. If that ever happens, and knowing that the Ceph settings require a minimum of 2 replicas, never ever touch (reboot or stop) the remaining working nodes before you have replaced the faulty one.
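For reference, the Ceph replication settings mentioned above can be checked from any node. A minimal sketch; the pool name (`mypool` here) is a placeholder, use your own:

```bash
# Number of replicas Ceph keeps for each object (size) and the minimum
# number of replicas that must be available for I/O to continue (min_size)
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# Overall cluster health; with size=3/min_size=2, losing one node keeps
# the pool writable, losing a second one blocks I/O
ceph status
```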