Hi folks,
I added a new node to my cluster today. Then I noticed the new node's network configuration might have an issue: it couldn't reach the Ceph IP ranges. So I restarted the network service with "systemctl restart networking".
After that, the disaster happened: all 13 of my other nodes rebooted one by one, and obviously all the VMs on them went down as well.
Once all the nodes came back online, I saw unusual issues: one node couldn't ping the others, and one Ceph node never came up even though the OS booted and everything seemed fine.
I suspected the new node had somehow caused all of this, so I shut it down, and everything went back to normal.
I went through all the logs I could think of (syslog, daemon.log, kern.log, etc.) but found nothing; everything looked fine, and then a reboot suddenly happened.
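For reference, this is roughly how I've been searching the journals so far; a rough sketch that assumes the stock Proxmox VE unit names (corosync, pve-cluster, pve-ha-lrm, pve-ha-crm, watchdog-mux), so adjust if your setup differs:

```shell
#!/bin/sh
# Sketch: pull cluster/HA/watchdog journal entries that might show
# why fencing fired. Unit names below are the Proxmox VE defaults
# (an assumption on my part); change them to match your install.
UNITS="corosync pve-cluster pve-ha-lrm pve-ha-crm watchdog-mux"
for u in $UNITS; do
    # Only query journalctl if it exists, so this is safe to paste anywhere
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -u "$u" --since "-2 days" --no-pager 2>/dev/null \
            | grep -iE 'fence|quorum|totem|membership|watchdog' || true
    fi
done
```

Even with this, nothing obviously pointed to the cause on my nodes.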
It seems something triggered fencing on all the nodes, but why, and what?
I would really appreciate any help finding the root cause of this disaster.