Proxmox not using new kernel

SpinningRust

Hello everyone,

for some mysterious reason, one of my nodes (3-node cluster with external VM storage) will not boot into any kernel newer than 5.4.78. While booting up, the only options on the proxmox-boot-tool screen are kernel versions 5.4.78 and 5.4.73. An ls in /boot, however, shows newer versions (see the second code block).
I just upgraded all nodes to PVE 7, and while the other two nodes adopted the new kernel without problems, this node didn't. All nodes share the exact same hardware and should also have the same software/configuration. During the upgrade from PVE 6 to 7, no errors occurred on any node.
I'll attach the package versions and would be happy about any advice on how to solve this weird error.

Regards

John Tanner


Code:
proxmox-ve: 7.0-2 (running kernel: 5.4.78-2-pve)
pve-manager: 7.0-9 (running version: 7.0-9/228c9caa)
pve-kernel-helper: 7.0-4
pve-kernel-5.11: 7.0-3
pve-kernel-5.4: 6.4-4
pve-kernel-5.11.22-1-pve: 5.11.22-2
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve6
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.1.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-4
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-2
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.4-1
proxmox-backup-file-restore: 2.0.4-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-4
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-10
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.4-pve1


Code:
drwxr-xr-x  5 root root   23 Jul 16 10:22 ./
drwxr-xr-x 19 root root   25 Mar 18 13:27 ../
-rw-r--r--  1 root root 247K Jul  2 16:22 config-5.11.22-1-pve
-rw-r--r--  1 root root 232K Jun 23 13:47 config-5.4.124-1-pve
-rw-r--r--  1 root root 232K Nov 16  2020 config-5.4.73-1-pve
-rw-r--r--  1 root root 232K Dec  3  2020 config-5.4.78-2-pve
drwxr-xr-x  2 root root    2 Mar 12 14:32 efi/
drwxr-xr-x  5 root root    8 Jul 16 10:22 grub/
-rw-r--r--  1 root root  56M Jul 16 09:19 initrd.img-5.11.22-1-pve
-rw-r--r--  1 root root  47M Jul 15 08:48 initrd.img-5.4.124-1-pve
-rw-r--r--  1 root root  41M Mar 12 14:37 initrd.img-5.4.73-1-pve
-rw-r--r--  1 root root  41M Apr  9 10:56 initrd.img-5.4.78-2-pve
-rw-r--r--  1 root root 179K Aug 15  2019 memtest86+.bin
-rw-r--r--  1 root root 181K Aug 15  2019 memtest86+_multiboot.bin
drwxr-xr-x  2 root root    8 Jul 16 09:19 pve/
-rw-r--r--  1 root root 5.5M Jul  2 16:22 System.map-5.11.22-1-pve
-rw-r--r--  1 root root 4.6M Jun 23 13:47 System.map-5.4.124-1-pve
-rw-r--r--  1 root root 4.5M Nov 16  2020 System.map-5.4.73-1-pve
-rw-r--r--  1 root root 4.6M Dec  3  2020 System.map-5.4.78-2-pve
-rw-r--r--  1 root root  14M Jul  2 16:22 vmlinuz-5.11.22-1-pve
-rw-r--r--  1 root root  12M Jun 23 13:47 vmlinuz-5.4.124-1-pve
-rw-r--r--  1 root root  12M Nov 16  2020 vmlinuz-5.4.73-1-pve
-rw-r--r--  1 root root  12M Dec  3  2020 vmlinuz-5.4.78-2-pve
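
For reference, on a system managed by proxmox-boot-tool the kernels that are actually synced to the ESP(s) - as opposed to what merely sits in /boot - can be listed with the tool itself; a minimal sketch:

Code:
proxmox-boot-tool kernel list    # kernels selected for syncing to the configured ESP(s)
proxmox-boot-tool refresh        # re-copy kernels/initrds and regenerate the loader config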
 
could you post the output of proxmox-boot-tool status from a working and a non-working node?
 

"stuck" or non-working node:
Code:
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
WARN: /dev/disk/by-uuid/D6CF-1F28 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
D6D0-5D81 is configured with: uefi (versions: 5.11.22-1-pve, 5.4.124-1-pve, 5.4.78-2-pve)



working node:
Code:
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
0ED0-BC90 is configured with: uefi (versions: 5.11.22-1-pve, 5.4.124-1-pve)
0ED1-4864 is configured with: uefi (versions: 5.11.22-1-pve, 5.4.124-1-pve)

The non-working node does indeed look unhealthy, but how do I fix this? The rpool in question is healthy.
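
The WARN line itself points at removing the stale UUID from /etc/kernel/proxmox-boot-uuids; a minimal sketch of that step, assuming the missing D6CF-1F28 partition really is gone for good:

Code:
cat /etc/kernel/proxmox-boot-uuids   # ESP UUIDs the tool currently tracks
proxmox-boot-tool clean              # drop entries whose vfat partition no longer exists
proxmox-boot-tool status             # verify only existing ESPs remain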
 
Please also provide the outputs of:
* `lsblk`
* ` blkid /dev/disk/by-id/*`
 
Code:
NAME                        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                           8:0    0   1.5T  0 disk 
└─35000cca050abd018         253:0    0   1.5T  0 mpath
  ├─35000cca050abd018-part1 253:2    0   1.5T  0 part 
  └─35000cca050abd018-part9 253:3    0     8M  0 part 
sdb                           8:16   0   1.5T  0 disk 
└─35000cca050abc524         253:1    0   1.5T  0 mpath
  ├─35000cca050abc524-part1 253:5    0   1.5T  0 part 
  └─35000cca050abc524-part9 253:6    0     8M  0 part 
sdc                           8:32   0 186.3G  0 disk 
└─sdc3                        8:35   0 185.8G  0 part 
sdd                           8:48   0 186.3G  0 disk 
├─sdd1                        8:49   0  1007K  0 part 
├─sdd2                        8:50   0   512M  0 part 
└─sdd3                        8:51   0 185.8G  0 part 
sde                           8:64   0  13.6T  0 disk 
└─hp-san                    253:4    0  13.6T  0 mpath
  ├─PLS-vm--103--disk--1    253:7    0     4M  0 lvm   
  ├─PLS-vm--100--disk--0    253:8    0   128G  0 lvm   
  ├─PLS-vm--103--disk--0    253:9    0   128G  0 lvm   
  ├─PLS-vm--101--disk--0    253:10   0     8G  0 lvm   
  ├─PLS-vm--104--disk--0    253:11   0    32G  0 lvm   
  ├─PLS-vm--105--disk--0    253:12   0    16G  0 lvm   
  ├─PLS-vm--102--disk--0    253:13   0    32G  0 lvm   
  ├─PLS-vm--106--disk--0    253:14   0  1000G  0 lvm   
  ├─PLS-vm--107--disk--0    253:15   0    32G  0 lvm   
  ├─PLS-vm--108--disk--0    253:16   0    64G  0 lvm   
  ├─PLS-vm--109--disk--0    253:17   0    32G  0 lvm   
  ├─PLS-vm--110--disk--0    253:18   0    32G  0 lvm   
  ├─PLS-vm--110--disk--1    253:19   0   128G  0 lvm   
  ├─PLS-vm--110--disk--2    253:20   0    32G  0 lvm   
  ├─PLS-vm--110--disk--3    253:21   0   128G  0 lvm   
  ├─PLS-vm--111--disk--0    253:22   0    32G  0 lvm   
  ├─PLS-vm--111--disk--1    253:23   0    32G  0 lvm   
  └─PLS-vm--110--disk--4    253:24   0    32G  0 lvm   
zd0                         230:0    0   128G  0 disk 
├─zd0p1                     230:1    0     1M  0 part 
├─zd0p2                     230:2    0     1G  0 part 
└─zd0p3                     230:3    0   127G  0 part 
zd16                        230:16   0   128G  0 disk 
├─zd16p1                    230:17   0     1M  0 part 
├─zd16p2                    230:18   0     1G  0 part 
└─zd16p3                    230:19   0   127G  0 part


Code:
root@middle:~# blkid /dev/disk/by-id/
root@middle:~#
 
The '*' was missing, hence no output.

Could you also provide the output of both the lsblk and blkid commands from the working node?

How is the system set up?
sda and sdb look like they're multipathed on an iSCSI/FC SAN?
Could it be that the system was set up with a ZFS RAID1 and at some point you replaced one of the disks?
 


Code:
sda                                   8:0    0   1.5T  0 disk 
└─hp-san                            253:0    0   1.5T  0 mpath
  ├─hp-san-part1                    253:2    0   1.5T  0 part 
  └─hp-san-part9                    253:3    0     8M  0 part 
sdb                                   8:16   0   1.5T  0 disk 
└─35000cca050abcb98                 253:1    0   1.5T  0 mpath
  ├─35000cca050abcb98-part1         253:5    0   1.5T  0 part 
  └─35000cca050abcb98-part9         253:6    0     8M  0 part 
sdc                                   8:32   0 186.3G  0 disk 
├─sdc1                                8:33   0  1007K  0 part 
├─sdc2                                8:34   0   512M  0 part 
└─sdc3                                8:35   0 185.8G  0 part 
sdd                                   8:48   0 186.3G  0 disk 
├─sdd1                                8:49   0  1007K  0 part 
├─sdd2                                8:50   0   512M  0 part 
└─sdd3                                8:51   0 185.8G  0 part 
sde                                   8:64   0  13.6T  0 disk 
└─3600c0ff000267d927f555b6001000000 253:4    0  13.6T  0 mpath
  ├─PLS-vm--103--disk--1            253:7    0     4M  0 lvm   
  ├─PLS-vm--100--disk--0            253:8    0   128G  0 lvm   
  ├─PLS-vm--103--disk--0            253:9    0   128G  0 lvm   
  ├─PLS-vm--101--disk--0            253:10   0     8G  0 lvm   
  ├─PLS-vm--104--disk--0            253:11   0    32G  0 lvm   
  ├─PLS-vm--105--disk--0            253:12   0    16G  0 lvm   
  ├─PLS-vm--102--disk--0            253:13   0    32G  0 lvm   
  ├─PLS-vm--106--disk--0            253:14   0  1000G  0 lvm   
  ├─PLS-vm--107--disk--0            253:15   0    32G  0 lvm   
  ├─PLS-vm--108--disk--0            253:16   0    64G  0 lvm   
  ├─PLS-vm--109--disk--0            253:17   0    32G  0 lvm   
  ├─PLS-vm--110--disk--0            253:18   0    32G  0 lvm   
  ├─PLS-vm--110--disk--1            253:19   0   128G  0 lvm   
  ├─PLS-vm--110--disk--2            253:20   0    32G  0 lvm   
  ├─PLS-vm--110--disk--3            253:21   0   128G  0 lvm   
  ├─PLS-vm--111--disk--0            253:22   0    32G  0 lvm   
  ├─PLS-vm--111--disk--1            253:23   0    32G  0 lvm   
  └─PLS-vm--110--disk--4            253:24   0    32G  0 lvm   
zd0                                 230:0    0   128G  0 disk 
├─zd0p1                             230:1    0     1M  0 part 
├─zd0p2                             230:2    0     1G  0 part 
└─zd0p3                             230:3    0   127G  0 part 
zd16                                230:16   0   128G  0 disk 
├─zd16p1                            230:17   0     1M  0 part 
├─zd16p2                            230:18   0     1G  0 part 
└─zd16p3                            230:19   0   127G  0 part


About the system setup:
All 3 nodes are running on a DL380 Gen9; sda and sdb are unused SSDs.
The external storage is a multipathed iSCSI target on an HP MSA 2040 and contains all VM/CT data.
And lastly, yes: the problematic node was set up with a ZFS RAID1, and one of the SSDs disappeared quite some time ago. After reinserting it and running a resilver, everything (seemingly) worked again.

The command outputs were too long for a single post, so please refer to the attached txt file.
 

Attachments

  • commands.txt (28.1 KB)
Did you follow the guide at:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#_zfs_administration
(changing a failed bootable device)?

I'd try to add the 512M vfat partition to the replaced disk and register it with proxmox-boot-tool format/init, as described there.

I hope this helps!
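
For reference, a rough sketch of the sequence from that chapter; the device names are placeholders, and the sgdisk steps apply only to a blank replacement disk - they must not be run against a disk whose ZFS partition is already back in rpool:

Code:
# Replicate the partition table from the healthy boot disk to a blank replacement
sgdisk /dev/sdX -R /dev/sdY          # sdX = healthy disk, sdY = replacement (placeholders)
sgdisk -G /dev/sdY                   # randomize the GUIDs on the copy
# Resilver the ZFS partition (skip if the pool is already healthy again)
zpool replace -f rpool <old-zfs-partition> /dev/sdY3
# Make the 512M vfat partition (partition 2 in the default layout) bootable and register it
proxmox-boot-tool format /dev/sdY2
proxmox-boot-tool init /dev/sdY2
proxmox-boot-tool status             # should now list both ESP UUIDs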
 
No, unfortunately I didn't follow that guide...
But how do I find out which drive was the failed one? I do not remember which one "failed" :(
* check the output of `zpool status`

I still think that sda and sdb are not local to the server (but I could be wrong)

I'd guess that sdc is the one that was swapped out, and sdd is the one that was in `rpool` from the beginning

As said - based on the command outputs these are just guesses - check with `zpool status` and then check the partition tables with `parted` or `fdisk`

I hope this helps!
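
A minimal sketch of those checks (purely read-only; sdc/sdd are taken from the lsblk output above and may need adjusting):

Code:
zpool status rpool                                    # shows which partitions back the mirror
lsblk -o NAME,SIZE,FSTYPE,PARTTYPE /dev/sdc /dev/sdd  # which disk is missing the 1007K/512M partitions
fdisk -l /dev/sdc /dev/sdd                            # compare the two partition tables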
 
OK, I think I have broken the pool now, as proxmox-boot-tool continuously gives me this error no matter which device path I choose:

Code:
invalid vdev specification
the following errors must be manually repaired:
/dev/disk/by-id/dm-name-35000cca050abc524-part3 is part of active pool 'rpool'
root@middle:~# proxmox-boot-tool format /dev/disk/by-id/dm-name-35000cca050abc524-part2
UUID="12523253704678969671" SIZE="536870912" FSTYPE="zfs_member" PARTTYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b" PKNAME="" MOUNTPOINT=""
E: cannot determine parent device of '/dev/disk/by-id/dm-name-35000cca050abc524-part2' - please provide a partition, not a full disk.

EDIT: device path as in: the same device has several entries in /dev/disk/by-id/, and no matter which one I choose, it cannot find the parent device...
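
One hedged guess: proxmox-boot-tool reads the parent device from lsblk's PKNAME column, which is empty for the multipath (dm-name-*) aliases, as the output above shows. A purely diagnostic sketch for finding the underlying /dev/sdX path to point the tool at instead (nothing is modified):

Code:
multipath -ll 35000cca050abc524                              # which sdX devices back this multipath map
lsblk -s /dev/disk/by-id/dm-name-35000cca050abc524-part2     # inverse tree down to the parent device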
 
