Hey,
yesterday I started to upgrade our cluster. Most of the 13 nodes where running 6.1-8 but two or three newer ones already used 6.2-10 as they where added in the last weeks or got upgraded already. No Problem so far. But yesterday I wanted to upgrade another node (node5) and after running the upgrade, just after apt stopped the upgrade process, everything crashed and the whole cluster went offline. All 13 nodes rebooted simultaneously in the same 30 seconds. It looks like the HA fancing was the problem and triggered the reboot but I have no glue why this was the case. The network was working properly and after the reboot everything was okay again. I will add some logs from different nodes...maybe someone can see something I have overseen.
Last part of apt term log:
Syslog from two randome node just before reboot:
node17: https://drop.leah.is/eTrgFxN9
node6: https://drop.leah.is/Rx9wTitp
And this is the syslog of the host I've upgraded:
https://drop.leah.is/GctH7J3Z
yesterday I started to upgrade our cluster. Most of the 13 nodes where running 6.1-8 but two or three newer ones already used 6.2-10 as they where added in the last weeks or got upgraded already. No Problem so far. But yesterday I wanted to upgrade another node (node5) and after running the upgrade, just after apt stopped the upgrade process, everything crashed and the whole cluster went offline. All 13 nodes rebooted simultaneously in the same 30 seconds. It looks like the HA fancing was the problem and triggered the reboot but I have no glue why this was the case. The network was working properly and after the reboot everything was okay again. I will add some logs from different nodes...maybe someone can see something I have overseen.
Last part of apt term log:
Bash:
Processing triggers for ca-certificates (20200601~deb10u1) ...^M
Updating certificates in /etc/ssl/certs...^M
0 added, 0 removed; done.^M
Running hooks in /etc/ca-certificates/update.d...^M
done.^M
Processing triggers for initramfs-tools (0.133+deb10u1) ...^M
update-initramfs: Generating /boot/initrd.img-5.4.44-2-pve^M
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
Syslog from two randome node just before reboot:
node17: https://drop.leah.is/eTrgFxN9
node6: https://drop.leah.is/Rx9wTitp
And this is the syslog of the host I've upgraded:
https://drop.leah.is/GctH7J3Z