Scenario:
We want to go from a single 2x10G network bond on our PVE 7.4 compute nodes to 2x10G + 2x25G (and upgrade to 8.0 at a later point).
8 nodes on 7.4 in one cluster: 5 "compute nodes" (prox10/11/12/13/14, running the VMs, no local storage for VMs) and 3 "storage nodes" (proxstore11/12/13, Ceph only, no VMs running on them).
We added the new 25G NICs to the compute nodes and then worked through the 5 nodes one by one: evacuate the node by bulk-migrating its VMs to another node, reconfigure /etc/network/interfaces to add the new bond and move the guest bridge vmbr0 over to it, then "ifreload -a" to apply the changes (sketched below).
This actually worked fine for all five nodes over the course of 1-2 hours.
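For reference, the per-node change was essentially the following (a simplified sketch only - interface names, bond options and the address are placeholders, not our exact config):
Code:
# /etc/network/interfaces (sketch of the state after the change)
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2            # existing 2x10G ports (placeholder names)
        bond-mode 802.3ad
        bond-miimon 100

auto bond1
iface bond1 inet manual
        bond-slaves ens2f0 ens2f1        # new 2x25G ports (placeholder names)
        bond-mode 802.3ad
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24            # placeholder address
        gateway 192.0.2.1
        bridge-ports bond1               # guest bridge moved from the old bond to the new one
        bridge-stp off
        bridge-fd 0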
One minute after the last of the 5 nodes was ifreloaded, ALL compute nodes hard-rebooted!
We then thought maybe we had overlooked the HA watchdog (the lrm / crm services had NOT been stopped on any of the 8 nodes)
-> but then why would ALL nodes reboot? Shouldn't it be only the node that believes it is separated from the others?
Why would quorate members reboot - all at once?
Also, when such things happened in the past we would get an email saying "node XYZ has been fenced" -> this time no such email was sent for any node.
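In hindsight, the step we apparently skipped on each node before touching the network - disarming HA so the watchdog cannot fire - would, as far as we understand it, look roughly like this (standard PVE service names assumed; not an official procedure):
Code:
# on the node about to be reconfigured, before changing the network
systemctl stop pve-ha-lrm    # local resource manager - the service that holds the node's watchdog
systemctl stop pve-ha-crm    # cluster resource manager
# ... edit /etc/network/interfaces, ifreload -a, verify connectivity ...
systemctl start pve-ha-crm
systemctl start pve-ha-lrm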
After the reboot all nodes came back online; the few VMs that had been configured for HA were started again automatically on one of the nodes, and we could successfully start all of the other affected VMs manually.
And now nothing further has happened for 15 hours.
Note also:
One of the 5 nodes (prox14) was not running any VMs anyway because we had just reconfigured its network, and one of the remaining 4 (prox10) did not host a "monitored" VM, so only 3 of the 5 (of 8) nodes had HA state "active", while the others (including the HA master) were (and still are) "idle" - why did all compute nodes reboot, and none of the Ceph nodes?
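For reference, the per-node HA states mentioned above can be checked like this:
Code:
# quorum state, current HA master, per-node lrm state (active/idle) and managed services
ha-manager status
# list of guests actually under HA management
ha-manager config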
Is it safe now to migrate VMs and so on? What exactly triggered the meltdown?
The last node, prox14, was ifreloaded at 16:32; the nodes crashed at 16:33:26.
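For anyone wanting to correlate with the attached per-node logs: the window around that timestamp can be pulled on each node with something along these lines (date taken from the attachment names; the unit list is our guess at what is relevant):
Code:
journalctl --since "2023-10-31 16:30:00" --until "2023-10-31 16:40:00" \
    -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux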
Package versions (pveversion -v):
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
Attachments:
proxstore13-20231031.txt (21.8 KB)
prox10-20231031.txt (15.9 KB)
prox11-20231031.txt (17 KB)
prox12-20231031.txt (15.7 KB)
prox13-20231031.txt (12.6 KB)
prox14-20231031.txt (30.3 KB)
proxstore11-20231031.txt (29.1 KB)
proxstore12-20231031.txt (22.7 KB)