Upgrade from 7.0.10 (Kernel 5.11.22-7) to 7.2.4 (Kernel 5.15.35-1): Cannot Process Volume Group PVE

Tmanok

Why Hello There!

This evening I performed a regular upgrade on a client's long-standing (~2 years) hyperconverged PVE cluster. The cluster has some history but has mostly been trouble-free. It is built on largely homogeneous hardware: all Gen8 HPE ProLiant servers, with one node being a different model from the rest. The boot devices are single SSDs.

The software upgrades went without issue: no warnings or errors from apt/dpkg, and migrating guests off each PVE node to the other nodes before rebooting worked as expected. However, two things changed their behaviour upon reboot:
  1. The HBA (MegaRAID RAID controller in IT/HBA mode) now halts at startup and asks me to press "X", reporting the following error:
    Code:
    Caution: Memory conflict detected. You may face boot problem.
    L2/L3 Cache error was detected on the RAID controller.
    Please contact technical support to resolve this issue. Press "X" to
    continue or else power off your system, replace the controller and reboot.
    Of course, a cluster full of HBAs doesn't all fail simultaneously right after four Linux machines upgrade their kernels, so I doubt this is a genuine hardware fault. Still, I will need to refresh its memory in the CTRL+R configuration menu, or, if someone has a method of validating the data on the boot partitions from within Linux, I would greatly appreciate it (a rough sketch of what I mean follows this list).

  2. GRUB boots kernel 5.15.35-1, which then hits the following repeated error:
    Code:
    Volume group "pve" not foundCannot Process Volume Group pve
    (Then it ends with the following)
    Code:
    Gave up waiting for root file system device. Common problems:
    (TL;DR: it suggests increasing the root delay or checking for missing modules, then I get dropped into BusyBox.)
    The problem with increasing the root delay is that there is already a noticeable delay: the errors repeat about 10-15 times before dropping to BusyBox, so no reasonable additional delay is going to help. (The sketch below also shows the LVM checks I would run from the BusyBox shell.)
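For anyone who wants to point me at specific commands, this is roughly the kind of check I have in mind. It is only a sketch, and the device names below are placeholders for whatever partitions your boot SSD actually uses (check with lsblk):

Code:
# Sketch only: /dev/sdX partition names are placeholders, adjust to your layout (lsblk).
# Read-only check of the boot/ESP filesystem, no repairs attempted:
fsck -n /dev/sda2
# Check the LVM metadata on the physical volume backing the "pve" VG:
pvck /dev/sda3
vgck pve
# From the BusyBox (initramfs) shell on 5.15, see whether the disk and the VG are visible at all:
lvm pvscan
lvm vgscan
lvm vgchange -ay pve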
The upgrade produced the same apt output on all nodes (they are always upgraded at the same time and have the same software installed):

Code:
The following packages were automatically installed and are no longer required:
  bsdmainutils golang-docker-credential-helpers libsecret-1-0 libsecret-common
  libzpool4linux pve-kernel-5.4.114-1-pve pve-kernel-5.4.119-1-pve
  python3-asn1crypto python3-dockerpycreds
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  gnutls-bin libdrm-common libdrm2 libepoxy0 libgbm1 libgnutls-dane0 libjs-qrcodejs
  libjson-glib-1.0-0 libjson-glib-1.0-common libopts25 libposix-strptime-perl
  libproxmox-rs-perl libtpms0 libunbound8 libvirglrenderer1 libwayland-server0
  libzpool5linux proxmox-websocket-tunnel pve-kernel-5.11.22-7-pve pve-kernel-5.15
  pve-kernel-5.15.35-1-pve swtpm swtpm-libs swtpm-tools
The following packages will be upgraded:
  base-files bind9-dnsutils bind9-host bind9-libs bsdextrautils bsdutils btrfs-progs
  ceph ceph-base ceph-common ceph-fuse ceph-mds ceph-mgr ceph-mgr-modules-core
  ceph-mon ceph-osd corosync cryptsetup-bin curl dirmngr distro-info-data dnsutils
  eject fdisk gnupg gnupg-l10n gnupg-utils gpg gpg-agent gpg-wks-client
  gpg-wks-server gpgconf gpgsm gpgv gzip krb5-locales libarchive13 libblkid1
  libc-bin libc-dev-bin libc-devtools libc-l10n libc6 libc6-dev libc6-i386
  libcephfs2 libcfg7 libcmap4 libcorosync-common4 libcpg4 libcryptsetup12
  libcurl3-gnutls libcurl4 libexpat1 libfdisk1 libflac8 libgmp10 libgssapi-krb5-2
  libgssrpc4 libjaeger libjs-jquery-ui libk5crypto3 libknet1 libkrad0 libkrb5-3
  libkrb5support0 libldap-2.4-2 libldap-common libldb2 liblzma5 libmount1 libnozzle1
  libnss-systemd libnss3 libntfs-3g883 libnvpair3linux libpam-modules
  libpam-modules-bin libpam-runtime libpam-systemd libpam0g libperl5.32
  libproxmox-acme-perl libproxmox-acme-plugins libproxmox-backup-qemu0
  libpve-access-control libpve-cluster-api-perl libpve-cluster-perl
  libpve-common-perl libpve-guest-common-perl libpve-http-server-perl libpve-rs-perl
  libpve-storage-perl libpve-u2f-server-perl libquorum5 librados2 libradosstriper1
  librbd1 librgw2 libsasl2-2 libsasl2-modules-db libseccomp2 libsmartcols1
  libsmbclient libssl1.1 libsystemd0 libtiff5 libudev1 libuuid1 libuutil3linux
  libvotequorum8 libwbclient0 libxml2 libzfs4linux linux-libc-dev locales lxc-pve
  lxcfs lynx lynx-common mount novnc-pve ntfs-3g openssl perl perl-base
  perl-modules-5.32 proxmox-backup-client proxmox-backup-file-restore
  proxmox-mini-journalreader proxmox-ve proxmox-widget-toolkit pve-cluster
  pve-container pve-docs pve-edk2-firmware pve-firewall pve-firmware pve-ha-manager
  pve-i18n pve-kernel-5.11 pve-kernel-5.11.22-3-pve pve-kernel-helper
  pve-lxc-syscalld pve-manager pve-qemu-kvm pve-xtermjs python3-ceph-argparse
  python3-ceph-common python3-cephfs python3-ldb python3-pil python3-rados
  python3-rbd python3-reportbug python3-rgw python3-waitress qemu-server reportbug
  rsync samba-common samba-libs smartmontools smbclient spl systemd systemd-sysv
  systemd-timesyncd sysvinit-utils tasksel tasksel-data tzdata udev usb.ids
  util-linux uuid-runtime vim-common vim-tiny wget xxd xz-utils zfs-initramfs
  zfs-zed zfsutils-linux zlib1g
185 upgraded, 24 newly installed, 0 to remove and 0 not upgraded.
Need to get 505 MB of archives.
After this operation, 949 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Please let me know if you have any guidance; I appreciate your support.


Tmanok
 
hi,

* are you able to boot an older kernel at the grub menu? if yes, does it make a difference?

* have you checked for any BIOS upgrades for the servers?

* was any hardware changed recently?
 
Maybe one of these known issues [1] is related:

intel_iommu now defaults to on. The kernel config of the new 5.15 series enables the intel_iommu parameter by default - this can cause problems with older hardware (issues were reported with e.g. HP DL380 g8 servers, and Dell R610 servers - so hardware older than 10 years)
The issue can be fixed by explicitly disabling intel_iommu on the kernel commandline (intel_iommu=off) following the reference documentation - https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline
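For a single-SSD LVM install like this one, the nodes are normally GRUB-booted, so the change would look roughly like the following. The existing contents of GRUB_CMDLINE_LINUX_DEFAULT on your nodes will differ, so treat this as a sketch:

Code:
# /etc/default/grub: append intel_iommu=off to the existing default command line, e.g.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
# then regenerate the GRUB configuration and reboot:
update-grub

Systems booted via systemd-boot (proxmox-boot-tool managed ESPs, typically ZFS installs) would instead edit /etc/kernel/cmdline and run proxmox-boot-tool refresh, as described in the linked documentation.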

Certain systems may need to explicitly enable iommu=pt (SR-IOV pass-through) on the kernel command line - https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline
There are some reports for this to solve issues with Avago/LSI RAID controllers, for example in a Dell R340 Server booted in legacy mode.
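If you want to try either parameter once without persisting anything, you can also edit the boot entry directly at the GRUB menu: highlight the 5.15 entry, press e, append the parameters to the end of the line starting with linux, then boot with Ctrl-X. Roughly like this (your existing line will differ; only the appended parameters matter, and the root= value shown is just the usual PVE LVM default):

Code:
linux /boot/vmlinuz-5.15.35-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=off iommu=pt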

[1] https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.2
 
hi,

* are you able to boot an older kernel at the grub menu? if yes, does it make a difference?

* have you checked for any BIOS upgrades for the servers?

* was any hardware changed recently?
Hi Oguz,

  1. Yes. That is what we have been doing in the meantime: going back to 5.11 boots without issue, but we have noticed poor disk performance (the snippet after this list shows how we keep the older kernel as the GRUB default for now).
  2. The BIOS is not the latest, but admittedly we don't have a copy of the latest BIOS or the latest HBA firmware.
  3. No HW changes at all.
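For anyone else hitting this: to keep the nodes on the 5.11 kernel for now, we simply select it under GRUB's "Advanced options" submenu at boot. To make it the default until this is sorted out, something along these lines should work (the entry titles below are placeholders; take the exact strings from your own grub.cfg):

Code:
# list the exact menu entry titles GRUB generated:
grep -E "menuentry '|submenu '" /boot/grub/grub.cfg
# in /etc/default/grub, point GRUB_DEFAULT at "<submenu title>><entry title>", e.g. (placeholder titles):
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.11.22-7-pve"
# regenerate the configuration:
update-grub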
Hi Neobin,

You may be on the right track about why we cannot boot 5.15. However, that does not explain the disk performance degradation we see on 5.11, so booting into 5.15 with intel_iommu disabled is worth a try. Also, to clarify: HP Gen8 corresponds to Dell's R#20 series, so the Dell R620 is the counterpart of the HP DL360 Gen8 and the R720 of the DL380 Gen8. ;)

Thank you both for your time and support!

Tmanok
 
