We are running a 7-node Proxmox VE 4.4 cluster with VM storage on LVM logical volumes, carved from volume groups whose physical volumes live on a shared iSCSI SAN.
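For reference, a minimal sketch of how such a setup is typically declared in /etc/pve/storage.cfg (the portal, target and names below are placeholders, not our actual values); the volume group itself was created manually on the multipath device:

iscsi: san
        portal 192.0.2.10
        target iqn.2001-05.com.example:storage.target01
        content none

lvm: vm-store
        vgname vg_san
        shared 1
        content images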
It seems that either our iSCSI devices or the sheer number of VM LVs slow down the OS probing that runs while updating GRUB, so that grub configuration can take more than 60 seconds and risks the software watchdog firing an NMI. To try to avoid such NMIs, we have been doing our PVE patching like this:
# let's get all non-essential disk devices out of the way...
vgexport -a
umount /mnt/pve/backupA
umount /mnt/pve/backupB
sleep 2
# close multipath, only used by our iSCSI devices
dmsetup remove_all
# log out of iSCSI
iscsiadm -m session -u
# now run update/upgrade(s)
apt-get update
# skip apt-get upgrade,
# see https://forum.proxmox.com/threads/upgrade-issues.32727/#post-162695
#apt-get -y upgrade
# go directly to dist-upgrade
apt-get -y dist-upgrade
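Since the slowdown appears to come from os-prober scanning every LV during update-grub, one mitigation we are considering (our assumption, not yet verified on this cluster) is to disable OS probing entirely on these hosts, as no foreign OS lives on them:

# disable os-prober so update-grub does not scan all LVs
# (assumes no other OS needs to be auto-detected on these hosts)
echo 'GRUB_DISABLE_OS_PROBER=true' >> /etc/default/grub
update-grub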
The vgexport unfortunately has the side effect that the other nodes then see the volume groups as exported and thus can't live-migrate VMs until the patched node has rebooted. Is there a better way to take the iSCSI devices out of play on a single node before patching?
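One alternative we are wondering about (a hedged sketch, untested here) would be to only deactivate the LVs locally with vgchange instead of exporting the VG; vgexport flags the VG in its metadata for every node, while vgchange -an affects only the local node:

# deactivate all LVs in the shared VG on this node only;
# unlike vgexport, this changes nothing in the VG metadata,
# so the other nodes are unaffected
vgchange -an vg_san   # vg_san is a placeholder for our shared VG
# ...then continue with umount/dmsetup/iscsiadm as above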
This weekend we patched our 4.4 cluster from this level:
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
openvswitch-switch: 2.6.0-2
to this level:
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
openvswitch-switch: 2.6.0-2
Here we saw NMIs firing on a few nodes, both when dismantling our iSCSI devices as shown above, and also in a test without dismantling them, simply by running 'apt-get -y dist-upgrade' after first live-migrating all VMs to other nodes.
Getting an NMI fired by the HA software watchdog is very disturbing, especially during a kernel patch, as it may leave the node unbootable...
Are we doing the shared LVM/iSCSI setup correctly?
How can we avoid such NMIs firing during patching?
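For the watchdog itself, one workaround we are considering (our assumption, based on how the PVE 4.x HA stack arms the watchdog; corrections welcome) is to stop the HA services before patching so the watchdog is released cleanly, and re-arm them afterwards:

# stop HA services so watchdog-mux releases the watchdog
# before the long-running dist-upgrade
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
apt-get update && apt-get -y dist-upgrade
# re-arm HA once the node is patched (or simply reboot)
systemctl start pve-ha-crm
systemctl start pve-ha-lrm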