Updated to 7.1 and having boot issues

Astraea

Renowned Member
Aug 25, 2018
I updated my cluster of Proxmox nodes to 7.1 and everything went smoothly except for one machine. I am running on older hardware, but all from the same generation more or less. The updates on my 5 HP DL380 G5 servers went without a hitch, though one of them did need a second reboot after the upgrade to get back working again. The one that is having the issues is my Dell PE2950 G3 that will not boot with the latest kernel but will boot with the previous 5.11 kernel.

Package versions from the 2950:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.11.22-7-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.14-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

Package versions from one of the HP servers:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.14-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

Not sure what the issue is, but when the Dell is trying to boot it gets a timeout error of:
Kernel panic: hung_task_timeout_secs blocked for more than 120 seconds.
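For reference, the 120-second threshold in that message comes from the kernel's hung-task watchdog; a minimal sketch of inspecting it once the box is up on the working kernel (this only explains where the message comes from, not the underlying hang):
Code:
# show the hung-task watchdog timeout, 120 seconds by default
sysctl kernel.hung_task_timeout_secs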
 
The one that is having the issues is my Dell PE2950 G3 that will not boot with the latest kernel but will boot with the previous 5.11 kernel.
Isn't that one with a CPU released in ~2007? If so, I'd guess it does not get much testing anymore, which could explain an uncaught regression.

Not sure what the issue is, but when the Dell is trying to boot it gets a timeout error of:
Kernel panic: hung_task_timeout_secs blocked for more than 120 seconds.
Anything else before/after that? This could just be a consequence of some other error...

Also, did you install the latest available BIOS/firmware version?
 
Yes, the CPU was released in Q4 2007, and from a quick check on one of the HPs, theirs are from Q1 2008. I will try to grab a screenshot of the output; it seems to repeatedly try to keep going but prints similar or the same errors again and again. I just double-checked and I have the latest BIOS for this system.

Would having all the packages the same except the kernel have any negative effects, with that machine being in a cluster with other nodes? I just upgraded the storage side of my home data centre, so next time I have some funds the compute side will get the attention.

I just checked the iDRAC log and it does capture part of the boot but cuts off before the error messages start.
 
Would having all the packages the same except the kernel have any negative effects, with that machine being in a cluster with other nodes? I just upgraded the storage side of my home data centre, so next time I have some funds the compute side will get the attention.

In general, it can be OK to boot an older kernel for a while, especially if the other option is having no running system at all.
Too big a difference in things like kernel version, but also CPU model, can cause issues with VM live-migration in a cluster.
The best thing here can be to set a fixed CPU model for the VMs that isn't host - for example kvm64, which exposes to the guest the lowest combination of CPU flags that still makes sense for virtualization; that way live-migration should still work.
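A minimal sketch of pinning that CPU model on a guest, assuming a hypothetical VM ID 100 (the same setting is available in the GUI under the VM's Hardware > Processors):
Code:
# use the generic kvm64 CPU model instead of "host" so live-migration between mixed nodes keeps working
qm set 100 --cpu kvm64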
 
Oh, and we have some old dusty workstation around with a CPU from the same era; I can see if I get to dust it off and boot it again soon(ish), and then update it to that kernel to see if it makes trouble too.
 
Here are the CPU models that I have in use, if this helps. I am hoping to get a chance to restart the Dell server today and get more from the output; I just have to get some work-related tasks done before I can tinker with it again.

Node 1 = Supermicro X8DTi-LN4F with dual X5670 which is running without issues
Node 2 = HP DL380 G5 with dual E5440 which needed an extra restart after the upgrade but is fine now and restarts fine as well now
Node 3 = HP DL380 G5 with dual E5345 which is running without issues
Node 4 = HP DL380 G5 with dual E5345 which is running without issues
Node 5 = HP DL380 G5 with dual E5345 which is running without issues
Node 6 = HP DL385 G5 with dual AMD Opteron 2356 which is running without issues
Node 7 = Dell 2950 G3 with dual X5460 which is unable to run the 5.13 kernel but is totally fine on the 5.11 kernel.
 
I restarted the machine having the boot issues, took a video, and extracted the following images from it showing the various errors and other output.
 

Attachments: 01.jpg, 02.jpg, 03.jpg, 04.jpg, 05.jpg, 06.jpg (console output during boot)
Just updated the server that was causing issues and it stuck at the same spot as shown in image 2 of my previous post. I'll try and look into it more later this week or next and see if it is the same error or if it has changed at all.
 
Just updated the server that was causing issues and it stuck at the same spot as shown in image 2 of my previous post. I'll try and look into it more later this week or next and see if it is the same error or if it has changed at all.
I also found a strange issue with one of my two affected servers: when you get to the GRUB screen and manually choose the new kernel from the advanced menu, does it boot? I found this to be an issue and tried diagnosing why; I even compared the boot entry from the default menu with the one from the advanced menu and there was no difference. Then I set GRUB_TIMEOUT=10 and my server boots. This is a RAIDZ1 on 4x 4TB Seagate drives (brand new).
On my other server, which runs an Intel RAID card with 4x 2TB SAS drives in RAID10, just doing the update worked.
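A minimal sketch of that workaround, assuming the standard GRUB setup that PVE installs:
Code:
# /etc/default/grub -- give GRUB a visible 10 second menu instead of booting immediately
GRUB_TIMEOUT=10

# regenerate the GRUB config so the change takes effect
update-grub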
 
I wanted to update and close this thread off, as I just updated today (Dec 4, 2021) to the latest set of packages and I was able to load the new kernel. For the record, and to help others later, here are the package versions that worked for this old Dell.

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.14-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
I also found a strange issue with one of my two affected servers: when you get to the GRUB screen and manually choose the new kernel from the advanced menu, does it boot? I found this to be an issue and tried diagnosing why; I even compared the boot entry from the default menu with the one from the advanced menu and there was no difference. Then I set GRUB_TIMEOUT=10 and my server boots. This is a RAIDZ1 on 4x 4TB Seagate drives (brand new).
On my other server, which runs an Intel RAID card with 4x 2TB SAS drives in RAID10, just doing the update worked.

Setting GRUB_TIMEOUT to 10 solved the boot problem on my PowerEdge 2950 III (G3) + Intel E5420.

proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-kernel-5.13.19-2-pve: 5.13.19-4

Thanks a lot !

EDIT (Jan 12, 2022): after multiple shutdowns/restarts, the boot problem came back :-(

For the record, there is no boot problem with my (older) PowerEdge 2950 II (G2) + Intel E5130.
 
I just updated to the latest kernel and the problem is back again; this time even trying to load an older kernel has not helped.
 
I just updated to the latest kernel and the problem is back again; this time even trying to load an older kernel has not helped.
Then it sounds like a different problem though? The kernel boot is stateless w.r.t. CPU/mainboard/..., so if the new kernel were at fault, booting the older, previously working one should fix things in any case.

Is it really showing the exact same symptoms as in the screen-grabs of the console in your original posts above?
 
It is the same screen from what I remember, though I did not do an exhaustive test yesterday. I will dive into it more this week and report back.
 
I had some time today to have a second look and the screen is the same. I was able to boot using the previous kernel, so I am not sure where it is hanging on boot; could it be a timeout issue? I am using hardware RAID because the controller cannot be put into IT mode, though I only have 2 SSDs in the server in a mirrored RAID.
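A sketch of checking where the failed boot got to, assuming persistent journal storage (the Debian/PVE default) and that the hang happened late enough for anything to reach the disk:
Code:
# kernel messages from the previous boot, if the journal was flushed before the hang
journalctl -k -b -1 --no-pager | tail -n 100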
 
The machine with the issue is able to boot: Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4
The rest of the other machines (6 in total) are running: Linux 5.13.19-4-pve #1 SMP PVE 5.13.19-9

Now some of the other machines will sometimes not boot on the first try, but after a restart (sometimes two) they are fine and boot without issues. The CPUs in the one with issues are dual X5460.
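A quick way to compare the running kernels across the cluster, assuming SSH access between nodes; the hostnames node1..node7 are placeholders:
Code:
# print the running kernel of every node (hostnames are placeholders)
for n in node1 node2 node3 node4 node5 node6 node7; do
    echo -n "$n: "; ssh root@$n uname -r
done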
 
By disabling ACPI at the kernel level, the Dell PowerEdge 2950 III can boot again.
Code:
$ grep acpi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="acpi=off"

Don't forget to run update-grub after the modification.
 
This needs to be confirmed over time, but a priori "acpi=off" can be replaced with "noapic".
For this to work, it seems necessary to switch the "Demand-Based Power Management" parameter to "Disabled" in the BIOS.
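A sketch of that variant, mirroring the acpi=off example above (the BIOS change has to be made separately):
Code:
# /etc/default/grub -- try noapic instead of disabling ACPI entirely
GRUB_CMDLINE_LINUX_DEFAULT="noapic"

# regenerate the GRUB config afterwards
update-grub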
 
