pve7 upgrade on small cluster. one node crashes at boot

alexskysilk

Distinguished Member
Oct 16, 2015
1,516
261
153
Chatsworth, CA
www.skysilk.com
I had a small cluster I'm upgrading as a test. all appeared well, until one node crashed for no apparent reason.

Subsequent boots yielded looping crashes (video attached from kvm.) I booted to 5.4 kernel and it booted right up. working and non working nodes on the same firmware, and hardware. I should also mention these are Epyc based machines which may be instructive.

Any thoughts on further troubleshooting?

Code:
pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.4.124-1-pve)
pve-manager: 7.0-5 (running version: 7.0-5/cce9b25f)
pve-kernel-5.11: 7.0-3
pve-kernel-helper: 7.0-3
pve-kernel-5.4: 6.4-4
pve-kernel-5.11.22-1-pve: 5.11.22-1
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.13-pve1
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.0.0-1+pve5
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 7.0-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-4
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-6
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-1
lxcfs: 4.0.8-pve1
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 1.1.10-1
proxmox-backup-file-restore: 1.1.10-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.1-4
pve-cluster: 7.0-2
pve-container: 4.0-3
pve-docs: 7.0-3
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.2-2
pve-i18n: 2.3-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-5
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.4-pve1
 

Attachments

  • video_2-7-2021_11-36-21.zip
    484 KB · Views: 11
Do all of the servers in your cluster have the same hardware and only one of them shows this behavior after the upgrade?
 
At the time it was only one; it was a cluster of 3 servers (identical, HPE CL3150g10) but that was because one was still pre-reboot (running 6.4) and the other- it didnt survive reboot.

The proximate cause was an incompatibility between kernel ver 5.11 and the bios version on the hosts. bios update corrected the issue.
 
  • Like
Reactions: Dominic

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!