I have a 28-node PVE cluster running PVE 7.4-16. All nodes are Dell R640/R650 servers with 1.5 or 2 TB of RAM and Intel Xeon Gold CPUs. They all have 2 x 1 GbE NICs and 4 x 25 GbE NICs.
We are connected to an external Ceph cluster.
Node network config (a rough sketch of the corresponding corosync ring/priority setup follows the list):
1 x 1 GbE NIC for management/SSH/GUI access, connected to a 1 GbE top-of-rack switch
1 x 1 GbE NIC for cluster ring 2 with the highest priority, connected to a dedicated 1 GbE switch used only for this purpose and only by this cluster
2 x 25 GbE OVS bond for VM traffic, the migration network and cluster ring 0
2 x 25 GbE OVS bond for ceph-public and cluster ring 1
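Roughly, the ring/priority mapping in corosync.conf looks like this. Addresses, node IDs and the priority values for links 0 and 1 are placeholders; the only part I can vouch for is that link 2 is the preferred (highest-priority) link, which matches the "pri: 40" in the log further down:
Code:
# corosync.conf excerpt (sketch - addresses, node IDs and most values are placeholders)
totem {
  version: 2
  ip_version: ipv4
  interface {
    linknumber: 0               # 25 GbE OVS bond: VM traffic / migration
    knet_link_priority: 10
  }
  interface {
    linknumber: 1               # 25 GbE OVS bond: ceph-public
    knet_link_priority: 20
  }
  interface {
    linknumber: 2               # dedicated 1 GbE switch
    knet_link_priority: 40      # highest priority -> preferred knet link
  }
}
nodelist {
  node {
    name: pve191
    nodeid: 19                  # placeholder
    quorum_votes: 1
    ring0_addr: 10.0.0.191      # placeholder addresses
    ring1_addr: 10.0.1.191
    ring2_addr: 10.0.2.191
  }
  # ... 27 more node entries ...
}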
Four nodes in this cluster had HA-enabled VMs running on them, and the HA (CRM) master was located on a fifth node (which had no HA-enabled VMs running).
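For reference, ha-manager status is how we see which node holds the HA master role; the output below is only illustrative (node names, timestamps and service IDs are placeholders, not taken from our cluster at the time):
Code:
# ha-manager status (illustrative output - names/times/IDs are placeholders)
quorum OK
master pve195 (active, Sat Nov 11 10:09:41 2023)
lrm pve191 (active, Sat Nov 11 10:09:44 2023)
lrm pve192 (active, Sat Nov 11 10:09:45 2023)
lrm pve193 (idle, Sat Nov 11 10:09:52 2023)
service vm:101 (pve191, started)
service vm:102 (pve192, started)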
During a planned DR test for a customer (who did not have any VMs running on any of the 5 nodes mentioned above), we pulled the power from the two hypervisors hosting that customer's VMs. (Yes, we know this is not great for the VMs, risks corruption, etc.; that's not the point here.)
To our surprise, the 4 hypervisors running HA-enabled VMs, plus the node that was HA master at the time, rebooted very soon after.
The log below is from one of those 5 hypervisors that were fenced (all 5 have similar logs). It shows that host 17 and host 22 became unavailable (expected, since they had their power removed), and then:
Code:
Nov 11 10:11:13 pve191 watchdog-mux[815]: client watchdog expired - disable watchdog updates
The fuller log from the same node:
Code:
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] link: host: 17 link: 0 is down
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] link: host: 17 link: 1 is down
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] link: host: 17 link: 2 is down
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 has no active links
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 has no active links
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]: [KNET ] host: host: 17 has no active links
Nov 11 10:10:31 pve191 corosync[1405]: [TOTEM ] Token has not been received in 14925 ms
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] link: host: 22 link: 0 is down
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] link: host: 22 link: 1 is down
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] link: host: 22 link: 2 is down
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 has no active links
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 has no active links
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]: [KNET ] host: host: 22 has no active links
Nov 11 10:11:13 pve191 watchdog-mux[815]: client watchdog expired - disable watchdog updates
-- Boot 003ce3afa06e4086bcf8d1bf8a73ebfe --
We have performed similar DR tests before; the new factor in the cluster is that HA is now enabled for some VMs.
Is it possible to figure out why the fencing happened? Can I provide more logs of some kind to help anyone tell me whether there are issues in my configuration?
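If it helps, I can gather the following from each node with the standard tools (or anything else that is suggested):
Code:
# cluster / corosync state
pvecm status
corosync-cfgtool -s                  # per-link knet status
corosync-cmapctl | grep members      # runtime membership

# HA state
ha-manager status

# journal around the time of the test (adjust the date/time window)
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux \
    --since "YYYY-MM-DD 10:05" --until "YYYY-MM-DD 10:15"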
Output of pveversion -v:
proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u4
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
Many thanks
Bjørn