One host not loading correct Kernel after upgrade

iwannabfishn

I have a 4-node cluster that I upgraded from PVE kernel version 6.8.4-2 to 6.8.12-11. Three of the four upgraded normally, but one still shows kernel version 6.8.4-2. I have restarted the host and it still runs the old version. I checked for updates and it finds none. When I compare the package versions on the Summary tab between the three that worked and the one that didn't, there is only one difference.

The first line of the package versions differs. Here is the line from the host that doesn't show the correct kernel version.
proxmox-ve: 8.4.0 (running kernel: 6.8.4-2-pve)

Here is the same line from a host that is showing the correct kernel version.
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)

I am not sure how to get this one host to boot the new kernel. All the other lines in the package versions match across my hosts. Please let me know if you know how to fix this, or should I just reinstall the OS on this host with the latest version?
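For reference, something like this should show the running kernel versus the installed kernel packages on the affected host (the package-name pattern is a guess and may differ between PVE releases):

Code:
# Kernel that is actually running right now
uname -r

# Proxmox kernel packages that are installed
dpkg -l | grep -E 'proxmox-kernel|pve-kernel'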
 
On the affected node, what does this show:
Code:
proxmox-boot-tool kernel list
(please post output using the code-editor: </> on the formatting bar)
 
Code:
Manually selected kernels:
None.

Automatically selected kernels:
6.8.12-11-pve
6.8.12-8-pve
6.8.4-2-pve
 
This seems like odd behavior. You appear to have also previously installed the 6.8.12-8-pve kernel, yet only the previous (original install?) 6.8.4-2-pve is being booted.

What does this output:
Code:
cat /boot/grub/grub.cfg | grep "Proxmox VE GNU/Linux"
 
root@pve-ucsdr-06:~# cat /boot/grub/grub.cfg | grep "Proxmox VE GNU/Linux"
menuentry 'Proxmox VE GNU/Linux' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-d423eb6d-e7a4-4762-802d-20868cb26609' {
submenu 'Advanced options for Proxmox VE GNU/Linux' $menuentry_id_option 'gnulinux-advanced-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.12-11-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.12-11-pve-advanced-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.12-11-pve (recovery mode)' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.12-11-pve-recovery-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.12-8-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.12-8-pve-advanced-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.12-8-pve (recovery mode)' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.12-8-pve-recovery-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.4-2-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.4-2-pve-advanced-d423eb6d-e7a4-4762-802d-20868cb26609' {
menuentry 'Proxmox VE GNU/Linux, with Linux 6.8.4-2-pve (recovery mode)' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.4-2-pve-recovery-d423eb6d-e7a4-4762-802d-20868cb26609' {
 
Code:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""

# If your computer has multiple operating systems installed, then you
# probably want to run os-prober. However, if your computer is a host
# for guest OSes installed via LVM or raw disk devices, running
# os-prober can cause damage to those guest OSes as it mounts
# filesystems to look for things.
#GRUB_DISABLE_OS_PROBER=false

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
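Since GRUB_DEFAULT=0 in that config, the first menuentry (6.8.12-11-pve here) should be what GRUB picks. A quick way to double-check that no saved entry is overriding it (assuming a grubenv file is present, which it may not be on every install):

Code:
grep GRUB_DEFAULT /etc/default/grub
grub-editenv list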
 
Did you install Proxmox VE on top of Debian instead of using the Proxmox installer? Did you install Proxmox VE a long time ago and migrated to the latest version? What is the output of proxmox-boot-tool status? Maybe Proxmox updates one ESP or drive but your system boots from another (which is never updated)?
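For the proxmox-boot-tool question, run this as root on the affected node and post the output:

Code:
proxmox-boot-tool status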
 
I installed Proxmox from the Proxmox installer when it was version 7 and have used the Proxmox updates to keep it current. I have upgraded in the past with no issues; not sure what happened this time. Here is the output of the command.

Code:
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
E: /etc/kernel/proxmox-boot-uuids does not exist.
 
Maybe it's easier to just remove the node from the cluster, reinstall PVE 8.4 fresh and join it to the cluster? A lot has changed around the boot partition setup between early PVE 7 and the current PVE. How much time do you want to spend debugging the current node (which is not under proxmox-boot-tool control, and it's not simply GRUB selecting another kernel)? Maybe use this as an exercise to practice what happens when a node breaks down and needs to be replaced instead?
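A rough outline of the remove/re-join, with placeholders - double-check against the cluster manager documentation before running anything:

Code:
# On one of the remaining cluster members, after the broken node is powered off:
pvecm delnode <nodename>

# Optionally clean up the leftover node directory afterwards:
# rm -r /etc/pve/nodes/<nodename>

# After reinstalling PVE on the node, join it back (run on the freshly installed node):
pvecm add <IP-of-an-existing-cluster-member>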
 
While I totally agree with leesteken that this may be the point to decide on reinstalling this node - maybe a little more probing first (if you want)? I'm thinking that your situation has to do with EFI/boot issues.

What do these output:

Code:
lsblk

efibootmgr -v
Please provide the output using the code-editor as I have done (</> from the formatting bar).
 
Thanks for your help. I am just about ready to reinstall. That would probably take less time than I have spent troubleshooting :)

Code:
lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                 8:0    0   1.1T  0 disk
├─sda1                              8:1    0  1007K  0 part
├─sda2                              8:2    0     1G  0 part
└─sda3                              8:3    0   1.1T  0 part
  ├─pve--OLD--EAFE0A64-swap       252:4    0     8G  0 lvm 
  ├─pve--OLD--EAFE0A64-root       252:5    0    96G  0 lvm 
  ├─pve--OLD--EAFE0A64-data_tmeta 252:6    0    10G  0 lvm 
  │ └─pve--OLD--EAFE0A64-data     252:9    0 975.7G  0 lvm 
  └─pve--OLD--EAFE0A64-data_tdata 252:7    0 975.7G  0 lvm 
    └─pve--OLD--EAFE0A64-data     252:9    0 975.7G  0 lvm 
sdb                                 8:16   0 223.6G  0 disk
├─sdb1                              8:17   0  1007K  0 part
├─sdb2                              8:18   0     1G  0 part
└─sdb3                              8:19   0 222.6G  0 part
  ├─pve-swap                      252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                      252:1    0  65.6G  0 lvm  /
  ├─pve-data_tmeta                252:2    0   1.3G  0 lvm 
  │ └─pve-data                    252:8    0 130.3G  0 lvm 
  └─pve-data_tdata                252:3    0 130.3G  0 lvm 
    └─pve-data                    252:8    0 130.3G  0 lvm 
sdc                                 8:32   0 223.6G  0 disk
root@pve-06:~# efibootmgr -v
EFI variables are not supported on this system.
 
My guess is that Proxmox/GRUB are updating /dev/sdb2 but your motherboard BIOS is set to boot from /dev/sda2, which is old and no longer updated.

EDIT: Why did you not remove the old drive that is (or should be) no longer used?
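If you do want to keep this install rather than reinstall, one possible fix (a sketch, assuming /dev/sdb really is the disk holding the current root LV - verify with lsblk first) would be to make sure GRUB is installed on that disk and then point the BIOS boot order at it, or simply detach/wipe the old /dev/sda so the BIOS can no longer boot from it:

Code:
# Legacy/BIOS boot, so GRUB sits in the MBR of whichever disk the BIOS picks.
# Re-install GRUB on the disk of the current installation and refresh its config:
grub-install /dev/sdb
update-grub
# Then set /dev/sdb first in the BIOS boot order (or remove the old disk).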
 
I think it's time:
  • Remove node from cluster.
  • Remove any unnecessary drives from the node / wipe them (see the sketch below).
  • Attach a monitor & keyboard / KVM to the node & reinstall fresh.
  • Join node to cluster.
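For the wipe step, something like this (with /dev/sdX as a placeholder for the old disk - double-check the device name before running, this is destructive):

Code:
# Identify the old disk first
lsblk -o NAME,SIZE,MODEL,SERIAL

# Clear its filesystem/LVM signatures (DESTRUCTIVE - only on the old, unused disk!)
wipefs -a /dev/sdX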
 
Or simply shut down the node and remove the old drive
I did not advise that because I'm worried/expecting that the node won't reboot correctly after disk changes. If that happens - he'll mess up the cluster. Better to remove it now & reinstall, then join.