Proxmox Cluster Upgrade from 4.4-78 to 4.4-87 (current) => Hard Crash

Jospeh Huber

Hi,

Yesterday I did a rolling upgrade of one of our clusters from 4.4-78 to 4.4-87 (current).
I did the upgrade via the GUI, which executes "apt-get dist-upgrade".
There was a full crash on all nodes: all VMs crashed and the systems rebooted automatically.
The symptom was the same on all nodes: the last package being processed was "pve-firmware", followed by the total crash:
Unpacking pve-firmware (1.1-11) over (1.1-10)…

After the reboot I had to execute...
Code:
dpkg --configure -a
...and then the upgrade could be continued.
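
For reference, the recovery steps described above can be sketched as a small shell helper. This is only a minimal sketch, not a tested procedure: the names `run`, `resume_upgrade` and `DRY_RUN` are made up for illustration, and the `dpkg --audit` step is an addition that merely reports packages left in a half-installed or unconfigured state.

```shell
#!/bin/sh
# Hypothetical helper: print each command instead of running it,
# unless DRY_RUN is explicitly set to 0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Resume an upgrade that was interrupted while dpkg was unpacking packages.
resume_upgrade() {
    run dpkg --audit            # added step: list packages in a broken state
    run dpkg --configure -a     # finish configuring what was already unpacked
    run apt-get dist-upgrade    # then continue the interrupted upgrade
}
```

Calling `resume_upgrade` as-is only prints the three commands; setting `DRY_RUN=0` first (as root) would actually run them.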

On the other nodes I stopped all VMs manually and then ran the upgrade.
We have been using Proxmox for many years and have done countless upgrades, but such a crash has never happened before...
Has anybody else observed such hard crashes during an upgrade?

/var/log/syslog:

May 9 19:11:53 vmhostX systemd-timesyncd[22845]: interval/delta/delay/jitter/drift 128s/-0.000s/0.009s/0.000s/-11ppm
May 9 19:11:53 vmhostX pmxcfs[1710]: [quorum] crit: quorum_initialize failed: 2
May 9 19:11:53 vmhostX pmxcfs[1710]: [confdb] crit: cmap_initialize failed: 2
May 9 19:11:53 vmhostX pmxcfs[1710]: [dcdb] crit: cpg_initialize failed: 2
May 9 19:11:53 vmhostX pmxcfs[1710]: [status] crit: cpg_initialize failed: 2
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@
....

/var/log/apt/history.log

Start-Date: 2017-05-09 19:10:14
Commandline: apt-get dist-upgrade
Install: libappconfig-perl:amd64 (1.66-1, automatic), libjs-extjs:amd64 (6.0.1-1, automatic), libpve-guest-common-perl:amd64 (1.0-2, automatic), pve-kernel-4.4.59-1-pve:amd64 (4.4.59-87, automatic), libtemplate-perl:amd64 (2.24-1.2+b1, automatic), libpve-http-server-perl:amd64 (1.0-4, automatic)
Upgrade: bind9-host:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), liblwres90:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), libevent-2.0-5:amd64 (2.0.21-stable-2, 2.0.21-stable-2+deb8u1), libgnutls-openssl27:amd64 (3.3.8-6+deb8u4, 3.3.8-6+deb8u5), libpve-common-perl:amd64 (4.0-85, 4.0-94), multiarch-support:amd64 (2.19-18+deb8u7, 2.19-18+deb8u9), libdns100:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), libisc-export95:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), postfix:amd64 (2.11.3-1, 2.11.3-1+deb8u2), zfs-initramfs:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), libisccfg90:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), libssl1.0.0:amd64 (1.0.1t-1+deb8u5, 1.0.1t-1+deb8u6), libpve-storage-perl:amd64 (4.0-71, 4.0-76), rpcbind:amd64 (0.2.1-6+deb8u1, 0.2.1-6+deb8u2), libbind9-90:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), libzfs2linux:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), tcpdump:amd64 (4.6.2-5+deb8u1, 4.9.0-1~deb8u1), pve-manager:amd64 (4.4-5, 4.4-13), uidmap:amd64 (4.2-3+deb8u1, 4.2-3+deb8u3), libgnutls-deb0-28:amd64 (3.3.8-6+deb8u4, 3.3.8-6+deb8u5), libicu52:amd64 (52.1-8+deb8u4, 52.1-8+deb8u5), qemu-server:amd64 (4.0-102, 4.0-110), python-cephfs:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), vim-common:amd64 (7.4.488-7+deb8u1, 7.4.488-7+deb8u3), librbd1:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), pve-kernel-4.4.35-2-pve:amd64 (4.4.35-78, 4.4.35-79), spiceterm:amd64 (2.0-1, 2.0-2), libtiff5:amd64 (4.0.3-12.3+deb8u2, 4.0.3-12.3+deb8u3), libradosstriper1:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libfreetype6:amd64 (2.5.2-3+deb8u1, 2.5.2-3+deb8u2), zfs-zed:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), libuutil1linux:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), libtirpc1:amd64 (0.2.5-1, 0.2.5-1+deb8u1), librados2:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libc-bin:amd64 (2.19-18+deb8u7, 2.19-18+deb8u9), libc6:amd64 (2.19-18+deb8u7, 2.19-18+deb8u9), zfsutils:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), 
zfsutils-linux:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), dnsutils:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), ceph-base:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), pve-ha-manager:amd64 (1.0-38, 1.0-40), udev:amd64 (215-17+deb8u6, 215-17+deb8u7), libqb0:amd64 (1.0-1, 1.0.1-1), base-files:amd64 (8+deb8u7, 8+deb8u8), python-ceph:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libzpool2linux:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), ceph-osd:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), spl:amd64 (0.6.5.8-pve7~bpo80, 0.6.5.9-pve8~bpo80), initramfs-tools:amd64 (0.120+deb8u2, 0.120+deb8u3), lxcfs:amd64 (2.0.5-pve2, 2.0.6-pve1), vncterm:amd64 (1.2-1, 1.3-2), eject:amd64 (2.1.5+deb1+cvs20081104-13.1, 2.1.5+deb1+cvs20081104-13.1+deb8u1), pve-cluster:amd64 (4.0-48, 4.0-49), libudev1:amd64 (215-17+deb8u6, 215-17+deb8u7), binutils:amd64 (2.25-5, 2.25-5+deb8u1), pve-qemu-kvm:amd64 (2.7.1-1, 2.7.1-4), libnvpair1:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), ceph:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libpve-access-control:amd64 (4.0-19, 4.0-23), vim-tiny:amd64 (7.4.488-7+deb8u1, 7.4.488-7+deb8u3), pve-container:amd64 (1.0-90, 1.0-99), libnvpair1linux:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), samba-libs:amd64 (4.2.14+dfsg-0+deb8u2, 4.2.14+dfsg-0+deb8u5), wget:amd64 (1.16-1+deb8u1, 1.16-1+deb8u2), libuutil1:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), ceph-common:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), mysql-common:amd64 (5.5.54-0+deb8u1, 5.5.55-0+deb8u1), librgw2:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), systemd-sysv:amd64 (215-17+deb8u6, 215-17+deb8u7), libjasper1:amd64 (1.900.1-debian1-2.4+deb8u1, 1.900.1-debian1-2.4+deb8u3), ca-certificates:amd64 (20141019+deb8u2, 20141019+deb8u3), pve-libspice-server1:amd64 (0.12.8-1, 0.12.8-2), libmysqlclient18:amd64 (5.5.54-0+deb8u1, 5.5.55-0+deb8u1), python-rbd:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libzpool2:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), libcorosync4-pve:amd64 (2.4.0-1, 
2.4.2-2~pve4+1), smbclient:amd64 (4.2.14+dfsg-0+deb8u2, 4.2.14+dfsg-0+deb8u5), systemd:amd64 (215-17+deb8u6, 215-17+deb8u7), vim:amd64 (7.4.488-7+deb8u1, 7.4.488-7+deb8u3), lxc-pve:amd64 (2.0.6-5, 2.0.7-4), passwd:amd64 (4.2-3+deb8u1, 4.2-3+deb8u3), ceph-mon:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), libcephfs1:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1), proxmox-ve:amd64 (4.4-78, 4.4-87), login:amd64 (4.2-3+deb8u1, 4.2-3+deb8u3), pve-docs:amd64 (4.4-1, 4.4-4), libzfs2:amd64 (0.6.5.8-pve13~bpo80, 0.6.5.9-pve15~bpo80), tzdata:amd64 (2016j-0+deb8u1, 2017b-0+deb8u1), openssl:amd64 (1.0.1t-1+deb8u5, 1.0.1t-1+deb8u6), libwbclient0:amd64 (4.2.14+dfsg-0+deb8u2, 4.2.14+dfsg-0+deb8u5), libsystemd0:amd64 (215-17+deb8u6, 215-17+deb8u7), libgnutlsxx28:amd64 (3.3.8-6+deb8u4, 3.3.8-6+deb8u5), novnc-pve:amd64 (0.5-8, 0.5-9), corosync-pve:amd64 (2.4.0-1, 2.4.2-2~pve4+1), libisccfg-export90:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), samba-common:amd64 (4.2.14+dfsg-0+deb8u2, 4.2.14+dfsg-0+deb8u5), libdns-export100:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), pve-firmware:amd64 (1.1-10, 1.1-11), locales:amd64 (2.19-18+deb8u7, 2.19-18+deb8u9), libirs-export91:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), libisccc90:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), vim-runtime:amd64 (7.4.488-7+deb8u1, 7.4.488-7+deb8u3), libsmbclient:amd64 (4.2.14+dfsg-0+deb8u2, 4.2.14+dfsg-0+deb8u5), libisc95:amd64 (9.9.5.dfsg-9+deb8u9, 9.9.5.dfsg-9+deb8u10), python-rados:amd64 (10.2.5-1~bpo80+1, 10.2.7-1~bpo80+1)
 
corosync-pve:amd64 (2.4.0-1, 2.4.2-2~pve4+1)

Do you have HA enabled on those nodes? There was a bug in the old corosync package that could trigger a fence when an upgrade took too long (given the number of upgrades accumulated in this case and the log you posted, I guess this is what happened). The current corosync package no longer suffers from this issue, because corosync is now restarted after upgrading instead of being stopped before the upgrade and started again afterwards.
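
To see on another node which corosync-pve version an upgrade run started from, the versions can be pulled out of /var/log/apt/history.log. A rough sketch, assuming the entries have the `name:amd64 (old, new)` form shown in the log above; `pkg_versions` is a hypothetical helper name, the amd64 architecture is hard-coded, and the package name is used as a plain grep pattern.

```shell
#!/bin/sh
# Hypothetical helper: read apt history text on stdin and print the
# contents of the parentheses of the first entry for the given package,
# with the comma removed. For an Upgrade entry that is "<old> <new>".
pkg_versions() {
    pkg="$1"
    # The leading space keeps "corosync-pve" from matching inside a longer name.
    grep -o " ${pkg}:amd64 ([^)]*)" |
        head -n 1 |
        sed -e 's/^[^(]*(//' -e 's/)$//' -e 's/, / /'
}

# Example (not run here): pkg_versions corosync-pve < /var/log/apt/history.log
```

Which version counts as "old" in the sense of the bug is not stated here, so the output still has to be compared by hand against the version that this upgrade installed.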
 
do you have HA enabled on those nodes?
Yes, we have HA enabled in this cluster.

OK, so this should not happen with future upgrades...

To avoid this problem while upgrading our other clusters, I have to disable HA during the upgrade process, right?
 
Yes, that should be safe.
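
How exactly HA gets disabled is not spelled out in this thread. One possible approach, and this is an assumption, is to stop the HA services before the upgrade: `pve-ha-lrm` on every node first, then `pve-ha-crm`, and start them again afterwards. A minimal sketch with made-up helper names (`run`, `ha_stop_lrm`, `ha_stop_crm`, `ha_start`) that only prints the commands by default:

```shell
#!/bin/sh
# Hypothetical helper: print each command instead of running it,
# unless DRY_RUN is explicitly set to 0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption (not stated in this thread): "disabling HA" means stopping
# the HA services, LRM on all nodes first and only then CRM.
ha_stop_lrm() { run systemctl stop pve-ha-lrm; }
ha_stop_crm() { run systemctl stop pve-ha-crm; }

# Bring HA back once the upgrade has finished on this node.
ha_start() {
    run systemctl start pve-ha-crm
    run systemctl start pve-ha-lrm
}
```

The intended order would be `ha_stop_lrm` on each node, then `ha_stop_crm` on each node, then the upgrade, then `ha_start`. It is worth checking the result with `ha-manager status` before relying on it.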