SW watchdog sometimes fires NMIs while patching

stefws

Running a 7-node 4.4 cluster with VM storage on LVs in volume groups whose PVs come from a shared iSCSI SAN.
It seems either our iSCSI devices or the number of VM LVs cause slow OS probing during grub updates, with the risk that the SW watchdog fires an NMI during grub configuration because it takes more than 60 seconds. So, to try to avoid such NMIs, we have been patching PVE like this:

# let's get all non-essential disk devices out of the way...
vgexport -a
umount /mnt/pve/backupA
umount /mnt/pve/backupB
sleep 2
# close multipath, only used by our iSCSI devices
dmsetup remove_all
# log out of iSCSI
iscsiadm -m session -u

# now run update/upgrade(s)
apt-get update
# skip apt-get upgrade,
# see https://forum.proxmox.com/threads/upgrade-issues.32727/#post-162695
#apt-get -y upgrade
# go directly to dist-upgrade
apt-get -y dist-upgrade

Only the vgexport has the unwanted side effect that the other nodes then see their volume groups as exported and thus can't live-migrate VMs until the patched node has rebooted. Is there a better way to remove the iSCSI devices from a single node before patching?
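
One alternative we have been wondering about (an untested sketch, not something we have verified on this setup; the VG names below are placeholders): deactivate the shared volume groups only on the node being patched with vgchange -an, which is node-local and does not set the exported flag in the VG metadata that the other nodes read, then tear down multipath and log out of iSCSI as before:

Code:
# deactivate the LVs of the shared VGs on this node only (VG names are examples)
vgchange -an vg_san1 vg_san2
umount /mnt/pve/backupA
umount /mnt/pve/backupB
sleep 2
# flush the now-unused multipath maps backed by the iSCSI LUNs
multipath -F
# log out of all iSCSI sessions
iscsiadm -m session -u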

This weekend we patched our 4.4 cluster from this level:

proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-108
pve-firmware: 1.1-10
libpve-common-perl: 4.0-91
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
openvswitch-switch: 2.6.0-2

to this level:

proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
openvswitch-switch: 2.6.0-2

Here we saw NMIs being fired on a few nodes, both when dismantling our iSCSI devices as shown above and also in a test without dismantling the iSCSI devices, just by running 'apt-get -y dist-upgrade' after first live-migrating all VMs to other nodes.

Getting an NMI fired from the HA SW watchdog is very disturbing, especially during a kernel patch, as it may leave the node unbootable...

Are we doing the shared LVMs/iSCSI correctly?

How to avoid such NMIs firing during patching?
 

There was an issue with the corosync package (<= 2.4.2-1) which could lead to corosync service downtime during the upgrade (between the "unpacking corosync-pve.." and "setting up corosync-pve.." output lines). This issue should be gone for corosync-pve >= 2.4.2-2~pve4.

If you have persistent logs enabled, you should be able to verify whether this was the cause by looking at the journal from the start of the upgrade onwards. If you see corosync.service stopping, but not starting again before the reboot/fence, the node was probably fenced because of this situation.
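
For example, something along these lines (the timestamp is only an illustration, adjust it to when your upgrade started):

Code:
# show what the corosync unit did from the start of the upgrade onwards
journalctl -u corosync --since "2017-03-12 14:20"
# a "Stopping Corosync Cluster Engine" with no later start of the unit
# before the reboot points to this issue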

Normally this would not be an issue, but if you have a lot of pending updates, or if the update process takes long for some other reason (like the grub updates in your case, or the recent bug in the ceph packages that caused upgrades to hang right in the middle), it can be quite problematic.
 
Hm, we're currently on corosync 2.4.2-1, so we still have this potential issue?

Would this count as the situation you describe, where corosync is stopped (@14:23:11) but not started again before the node is fenced by the watchdog (@14:24:04) and reboots (@14:28:00)?

Mar 12 14:22:49 n1 iscsid: Connection3:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.67.2,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection4:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.67.2,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection8:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.67.1,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection2:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.66.2,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection7:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.67.1,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection6:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.66.1,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection1:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.66.2,3260] through [iface: default] is shutdown.
Mar 12 14:22:49 n1 iscsid: Connection5:0 to [target: iqn.1986-03.com.hp:storage.msa1040.151725e557, portal: 10.45.66.1,3260] through [iface: default] is shutdown.
Mar 12 14:23:05 n1 systemd[1]: Reloading.
Mar 12 14:23:10 n1 systemd[1]: Stopped Import ZFS pools by cache file.
Mar 12 14:23:10 n1 systemd[1]: Stopping ZFS file system shares...
Mar 12 14:23:10 n1 systemd[1]: Stopped ZFS file system shares.
Mar 12 14:23:10 n1 systemd[1]: Stopping Mount ZFS filesystems...
Mar 12 14:23:10 n1 systemd[1]: Stopped Mount ZFS filesystems.
Mar 12 14:23:10 n1 systemd[1]: Stopped Import ZFS pools by device scanning.
Mar 12 14:23:10 n1 systemd[1]: Stopping ZFS startup target.
Mar 12 14:23:10 n1 systemd[1]: Stopped target ZFS startup target.
Mar 12 14:23:10 n1 systemd[1]: Reloading.
Mar 12 14:23:10 n1 systemd[1]: Stopping ZFS Event Daemon (zed)...
Mar 12 14:23:10 n1 zed[5313]: Exiting
Mar 12 14:23:10 n1 systemd[1]: Stopped ZFS Event Daemon (zed).
Mar 12 14:23:10 n1 systemd[1]: Reloading.
Mar 12 14:23:11 n1 systemd[1]: Stopping Corosync Cluster Engine...
Mar 12 14:23:11 n1 corosync[46003]: Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Mar 12 14:23:11 n1 pmxcfs[5596]: [confdb] crit: cmap_dispatch failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [status] crit: cpg_dispatch failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [status] crit: cpg_leave failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [dcdb] crit: cpg_dispatch failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [dcdb] crit: cpg_leave failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [quorum] crit: quorum_dispatch failed: 2
Mar 12 14:23:11 n1 pmxcfs[5596]: [status] notice: node lost quorum
Mar 12 14:23:11 n1 pve-ha-crm[6152]: status change slave => wait_for_quorum
Mar 12 14:23:12 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:12 n1 pmxcfs[5596]: [quorum] crit: can't initialize service
Mar 12 14:23:12 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:12 n1 pmxcfs[5596]: [confdb] crit: can't initialize service
Mar 12 14:23:12 n1 pmxcfs[5596]: [dcdb] notice: start cluster connection
Mar 12 14:23:12 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:12 n1 pmxcfs[5596]: [dcdb] crit: can't initialize service
Mar 12 14:23:12 n1 pmxcfs[5596]: [status] notice: start cluster connection
Mar 12 14:23:12 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:12 n1 pmxcfs[5596]: [status] crit: can't initialize service
Mar 12 14:23:12 n1 corosync[46003]: Waiting for corosync services to unload:.[ OK ]
Mar 12 14:23:12 n1 systemd[1]: Stopped Corosync Cluster Engine.
Mar 12 14:23:12 n1 systemd[1]: Reloading.
Mar 12 14:23:13 n1 systemd[1]: Reloading PVE API Daemon.
Mar 12 14:23:13 n1 pvedaemon[47220]: send HUP to 6146
Mar 12 14:23:13 n1 pvedaemon[6146]: received signal HUP
Mar 12 14:23:13 n1 pvedaemon[6146]: server closing
Mar 12 14:23:13 n1 pvedaemon[6146]: server shutdown (restart)
Mar 12 14:23:13 n1 systemd[1]: Reloaded PVE API Daemon.
Mar 12 14:23:13 n1 systemd[1]: Reloading PVE Status Daemon.
Mar 12 14:23:14 n1 pve-ha-lrm[6161]: lost lock 'ha_agent_n1_lock - cfs lock update failed - Permission denied
Mar 12 14:23:14 n1 pvestatd[47227]: send HUP to 6107
Mar 12 14:23:14 n1 pvestatd[6107]: received signal HUP
Mar 12 14:23:14 n1 systemd[1]: Reloaded PVE Status Daemon.
Mar 12 14:23:14 n1 systemd[1]: Reloading PVE API Proxy Server.
Mar 12 14:23:14 n1 pvedaemon[6146]: restarting server
Mar 12 14:23:14 n1 pvedaemon[6146]: worker 34675 finished
Mar 12 14:23:14 n1 pvedaemon[6146]: worker 31603 finished
Mar 12 14:23:14 n1 pvedaemon[6146]: worker 47559 finished
...
Mar 12 14:23:15 n1 spiceproxy[42569]: restarting server
Mar 12 14:23:15 n1 spiceproxy[42569]: worker 42570 finished
Mar 12 14:23:15 n1 spiceproxy[42569]: starting 1 worker(s)
Mar 12 14:23:15 n1 spiceproxy[42569]: worker 47479 started
Mar 12 14:23:18 n1 systemd-udevd[42065]: timeout '/sbin/blkid -o udev -p /dev/dm-4'
Mar 12 14:23:18 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:18 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:18 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:18 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:19 n1 pve-ha-lrm[6161]: status change active => lost_agent_lock
Mar 12 14:23:19 n1 systemd-udevd[42065]: timeout: killing '/sbin/blkid -o udev -p /dev/dm-4' [42115]
Mar 12 14:23:19 n1 systemd-udevd[42065]: '/sbin/blkid -o udev -p /dev/dm-4' [42115] terminated by signal 9 (Killed)
Mar 12 14:23:19 n1 systemd-udevd[42090]: timeout '/sbin/blkid -o udev -p /dev/dm-7'
Mar 12 14:23:20 n1 systemd-udevd[42090]: timeout: killing '/sbin/blkid -o udev -p /dev/dm-7' [42658]
Mar 12 14:23:20 n1 systemd-udevd[42090]: '/sbin/blkid -o udev -p /dev/dm-7' [42658] terminated by signal 9 (Killed)
Mar 12 14:23:24 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:24 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:24 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:24 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:24 n1 kernel: [2747378.971657] audit: type=1400 audit(1489325004.931:7): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/bin/lxc-start" pid=3505 comm="apparmor_parser"
Mar 12 14:23:25 n1 kernel: [2747379.233019] audit: type=1400 audit(1489325005.191:8): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default" pid=3507 comm="apparmor_parser"
Mar 12 14:23:25 n1 kernel: [2747379.233242] audit: type=1400 audit(1489325005.191:9): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-cgns" pid=3507 comm="apparmor_parser"
Mar 12 14:23:25 n1 kernel: [2747379.233450] audit: type=1400 audit(1489325005.191:10): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-mounting" pid=3507 comm="apparmor_parser"
Mar 12 14:23:25 n1 kernel: [2747379.233685] audit: type=1400 audit(1489325005.191:11): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-with-nesting" pid=3507 comm="apparmor_parser"
Mar 12 14:23:25 n1 systemd[1]: Reloading.
Mar 12 14:23:25 n1 systemd[1]: Reloading.
Mar 12 14:23:25 n1 systemd[1]: Started LXC network bridge setup.
Mar 12 14:23:25 n1 systemd[1]: Starting LXC Container Monitoring Daemon...
Mar 12 14:23:25 n1 systemd[1]: Started LXC Container Monitoring Daemon.
Mar 12 14:23:25 n1 systemd[1]: Started LXC Container Initialization and Autoboot Code.
Mar 12 14:23:30 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:30 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:30 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:30 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:36 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:36 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:36 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:36 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:42 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:42 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:42 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:42 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:48 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:48 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:48 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:48 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:23:54 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:23:54 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:23:54 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:23:54 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:24:00 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:24:00 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:24:00 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:24:00 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:24:04 n1 watchdog-mux[5329]: client watchdog expired - disable watchdog updates
Mar 12 14:24:06 n1 pmxcfs[5596]: [quorum] crit: quorum_initialize failed: 2
Mar 12 14:24:06 n1 pmxcfs[5596]: [confdb] crit: cmap_initialize failed: 2
Mar 12 14:24:06 n1 pmxcfs[5596]: [dcdb] crit: cpg_initialize failed: 2
Mar 12 14:24:06 n1 pmxcfs[5596]: [status] crit: cpg_initialize failed: 2
Mar 12 14:28:00 n1 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="5340" x-info="http://www.rsyslog.com"] start
Mar 12 14:28:00 n1 systemd-modules-load[614]: Module 'fuse' is builtin
Mar 12 14:28:00 n1 systemd-modules-load[614]: Inserted module '8021q'
Mar 12 14:28:00 n1 systemd-modules-load[614]: Inserted module 'bonding'
Mar 12 14:28:00 n1 systemd-modules-load[614]: Inserted module 'vhost_net'
Mar 12 14:28:00 n1 systemd[1]: Started Load Kernel Modules.
Mar 12 14:28:00 n1 systemd[1]: Mounting FUSE Control File System...
...

If so, what should we do about it? Wait for corosync 2.4.2-2 and hope the issue disappears?

How can we possibly improve the speed of grub configuration during kernel patching with our iSCSI devices, or do it properly instead of 'vgexport -a' breaking live migration on the other nodes until reboot?
 
If so, what should we do about it? Wait for corosync 2.4.2-2 and hope the issue disappears?

I suggest disabling fencing when upgrading from 2.4.2-1 to 2.4.2-2~pve4+1, and re-enabling after the upgrade is done.

How can we possibly improve the speed of grub configuration during kernel patching with our iSCSI devices, or do it properly instead of 'vgexport -a' breaking live migration on the other nodes until reboot?

I haven't yet had time to investigate this further - but with the corosync issue out of the way, long grub updates should not be problematic anymore.
 
I suggest disabling fencing when upgrading from 2.4.2-1 to 2.4.2-2~pve4+1, and re-enabling after the upgrade is done.
How do we disable fencing just during the next corosync patch, and is it always a good idea?

I haven't yet had time to investigate this further - but with the corosync issue out of the way, long grub updates should not be problematic anymore.
Previous talk of this issue here in this forum suggested removing the Debian package os-prober, only I can't find such a package...
 
How do we disable fencing just during the next corosync patch, and is it always a good idea?

Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service

then run the upgrade, and once it is done:

Code:
systemctl start pve-ha-lrm.service pve-ha-crm.service

Yes, just for the upgrade, and only for this one next corosync upgrade.
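
Put together, the upgrade of one node could then look roughly like this (a sketch of the commands above):

Code:
# disable HA fencing on this node for the duration of the upgrade
systemctl stop pve-ha-lrm.service pve-ha-crm.service
apt-get update
apt-get -y dist-upgrade
# re-enable HA fencing once the upgrade has finished
systemctl start pve-ha-lrm.service pve-ha-crm.service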


Previous talk of this issue here in this forum suggested removing the Debian package os-prober, only I can't find such a package...

If you don't have it installed, it cannot be the cause of your problems ;) os-prober in Jessie is pretty problematic, as it can mount filesystems already in use by a guest, causing data corruption. os-prober in Stretch will be a bit better, but as its only use is for dual-booting operating systems, it will stay disabled in a default PVE installation.
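
If in doubt, a quick way to check whether the package is present at all:

Code:
# reports the package as not installed (and exits non-zero) if it was never pulled in
dpkg -s os-prober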
 
Right, thanks. So far 3 out of 7 nodes have been patched to corosync 2.4.2-2 with fencing turned off and no NMIs yet.

Why does it seem like purging an old kernel goes through grub configuration twice?

root@n5:~# dpkg --purge pve-kernel-4.4.35-2-pve
(Reading database ... 51999 files and directories currently installed.)
Removing pve-kernel-4.4.35-2-pve (4.4.35-79) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 4.4.35-2-pve /boot/vmlinuz-4.4.35-2-pve
update-initramfs: Deleting /boot/initrd.img-4.4.35-2-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 4.4.35-2-pve /boot/vmlinuz-4.4.35-2-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.44-1-pve
Found initrd image: /boot/initrd.img-4.4.44-1-pve
Found linux image: /boot/vmlinuz-4.4.40-1-pve
Found initrd image: /boot/initrd.img-4.4.40-1-pve
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
Adding boot menu entry for EFI firmware configuration
done
Purging configuration files for pve-kernel-4.4.35-2-pve (4.4.35-79) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 4.4.35-2-pve /boot/vmlinuz-4.4.35-2-pve
update-initramfs: Deleting /boot/initrd.img-4.4.35-2-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 4.4.35-2-pve /boot/vmlinuz-4.4.35-2-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.44-1-pve
Found initrd image: /boot/initrd.img-4.4.44-1-pve
Found linux image: /boot/vmlinuz-4.4.40-1-pve
Found initrd image: /boot/initrd.img-4.4.40-1-pve
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
Adding boot menu entry for EFI firmware configuration
done

Especially annoying when grub configuration is slow.
 
If you don't have it installed, it cannot be the cause of your problems ;) os-prober in Jessie is pretty problematic, as it can mount filesystems already in use by a guest, causing data corruption. os-prober in Stretch will be a bit better, but as its only use is for dual-booting operating systems, it will stay disabled in a default PVE installation.
We haven't installed it, and it makes sense that we can't find it if PVE doesn't install it by default. We don't need to install os-prober either, as we don't dual-boot our PVE nodes, of course :)

Still wondering what makes grub configuration slow when our iSCSI devices are attached compared to when they are not...
https://forum.proxmox.com/threads/iscsi-luns-or-vm-image-lvms-slows-grub-during-updating-pve.33184
 
Right, thanks. So far 3 out of 7 nodes have been patched to corosync 2.4.2-2 with fencing turned off and no NMIs yet.

Why does it seem like purging an old kernel goes through grub configuration twice?

Especially annoying when grub configuration is slow.

Because purging is actually a two-step process if the package is still installed:
first, remove the package,
second, purge its configuration.

The kernel hooks run on both steps.
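
i.e. the single 'dpkg --purge' call is roughly equivalent to the following (just an illustration, using the kernel package from the output above):

Code:
# step 1: remove the package (runs the postrm hooks, including zz-update-grub)
dpkg --remove pve-kernel-4.4.35-2-pve
# step 2: purge its configuration (runs the same hooks a second time)
dpkg --purge pve-kernel-4.4.35-2-pve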
 
