Boot Failure on upgrade to VE 7.2 with kernel 5.15.35-1-pve

I'm suffering from the same problem. I'm using a MicroServer Gen8 (legacy BIOS only, unfortunately). The SSD in question is connected to a P222 RAID card in HBA mode.

Setting `intel_iommu=on iommu=pt` seems to have fixed it for me. It's not spewing errors, and although it reported twice on startup that it could not import the pool, its status is online right now.

I've started a scrub to see if that triggers something, but so far so good.
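For anyone following along, the scrub and the status check are just the standard ZFS commands - a rough sketch, with "rpool" as a stand-in for the actual pool name:

Code:
# start a scrub of the pool ("rpool" is an example name - substitute your own)
zpool scrub rpool
# check progress, result and per-device error counters
zpool status -v rpool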
 
The scrub completed without errors. The disk still shows up and there are no errors.

Setting `intel_iommu=on iommu=pt` seems to be the solution, at least for me.
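For anyone who wants to try the same workaround, this is roughly how it can be added on a legacy-BIOS/GRUB install (a sketch only; on a UEFI system booting ZFS via systemd-boot you would edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

Code:
# in /etc/default/grub, extend the default kernel command line:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# then regenerate the bootloader configuration and reboot:
update-grub
reboot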
 
Setting `intel_iommu=on iommu=pt` seems to have fixed it for me. It's not spewing errors, and although it reported twice on startup that it could not import the pool, its status is online right now.
Thanks for the confirmation - I'll update the known issues page of the 7.2 release :)
 
Is this issue a particular PVE issue or a general kernel issue?
Since the linked thread which suggests adding intel_iommu=on iommu=pt concerns the stock Debian kernel (version 5.10), I would assume that it does not affect only PVE, but would also affect other Linux distributions with a somewhat current kernel (on new systems booted in legacy mode, which is also something that is not too common in my experience).
 
Since the linked thread which suggests adding intel_iommu=on iommu=pt concerns the stock Debian kernel (version 5.10), I would assume that it does not affect only PVE, but would also affect other Linux distributions with a somewhat current kernel (on new systems booted in legacy mode, which is also something that is not too common in my experience).
Do you recommend adding "intel_iommu=on iommu=pt" to all systems, even those that start without any problems? I have multiple DL360G6 Proxmox clusters; some start, others need to be started using the old kernel, which is quite weird IMHO. Same hardware, same install, same version.

Another thing:
Code:
dpkg --get-selections | grep ^pve-kernel
pve-kernel-5.13                  hold
pve-kernel-5.13.19-6-pve         hold
pve-kernel-5.15                  install
pve-kernel-5.15.35-1-pve         install
pve-kernel-helper                install
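(For what it's worth, the same holds can also be listed and managed with apt-mark, e.g.:)

Code:
# list packages currently on hold
apt-mark showhold
# set or release a hold on a specific kernel package
apt-mark hold pve-kernel-5.13.19-6-pve
apt-mark unhold pve-kernel-5.13.19-6-pve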
When running "apt update" I see there is a newer build of the same kernel package:
Code:
apt list --upgradable 2>/dev/null | grep pve-kernel-5.15.35-1-pve
pve-kernel-5.15.35-1-pve/stable 5.15.35-3 amd64 [upgradable from: 5.15.35-2]
As I am 1000 km away from these clusters, I hesitate to upgrade. Or will this fix the issue?

R.
 
Hi,

I switched to UEFI mode on the server and reinstalled Proxmox. It seems to be working this way, but not everybody has the option to reinstall the server if it's in production:

(screenshot attached)
 
Do you recommend adding "intel_iommu=on iommu=pt" to all systems,
No - blindly adding options to the kernel command line which are only needed in a few (more or less well-defined) cases is never a good idea!

Some other servers (also HP Gen8, but seemingly with a different BIOS version/RAID controller), for example, only seem to boot cleanly if you set `intel_iommu=off`.


As I am 1000 km away from these clusters, I hesitate to upgrade. Or will this fix the issue?

The changes between 5.15.35-2 and 5.15.35-3 are related to some Aquantia 10GbE NICs (which are very unlikely to be in a machine as old as the DL360G6) - so no, this is not guaranteed to fix the issue.
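You can also check yourself what changed between the two builds before upgrading - for example (works if the repository ships the changelog):

Code:
# show the package changelog of the pending update
apt changelog pve-kernel-5.15.35-1-pve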

If the servers are so far away - can you connect to them via IPMI/iLO? (those changes can also be made via iLO)
If not - I'd suggest scheduling a maintenance window, some time not too far in the future, where someone is close to the servers to get them started properly.
(Running an old kernel, which does not get any updates anymore, is not a good choice from a security perspective.)
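If you want to test the new kernel on a remote box a bit more safely, one option (assuming GRUB, with the known-good 5.13 kernel kept installed as the default entry) is to boot the new kernel only once - exact menu entry names vary, so check your /boot/grub/grub.cfg:

Code:
# requires GRUB_DEFAULT=saved in /etc/default/grub (run update-grub after changing it)
grub-set-default "Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.13.19-6-pve"
# boot the 5.15 entry only on the next reboot; a power cycle via iLO falls back to the saved default
grub-reboot "Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.15.35-1-pve"
reboot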
 
The issue is that these DL360G6 servers are from the same batch, with identical hardware and identical firmware. I have access to iLO, but not to the console. I can ask a colleague to keep an eye on the boot process, but I'd rather do that myself.

Another thing: why change the kernel in stable versions? 7.1 to 7.2 is not a major upgrade like 7.x to 8.x. Precisely to avoid these unpredictable issues, only patches should be applied in a stable release; "Proxmox stable" != "Proxmox testing". Due to other upgrades you force us to install this newer kernel, with the risk of issues. Update and upgrade are different things, IMHO ;-)
 
Another thing: why change the kernel in stable versions? 7.1 to 7.2 is not a major upgrade like 7.x to 8.x.
That's something Proxmox VE has been doing for at least 3 major releases.
PVE uses the kernels from Ubuntu (with a few patches on top) and thus we follow their release cycle. 5.13 is the kernel for Ubuntu 21.10 - it's not an LTS kernel and does not get any security updates anymore. Now (and almost certainly for the rest of the 7.x releases of PVE) it follows kernel 5.15 from 22.04 (which is both an LTS kernel and an LTS Ubuntu release).

Changes of the default kernel series are (again, I can only speak for the past 4 years) always done on PVE point releases (7.1 -> 7.2 in this case).

Due to other upgrades you force us to install this newer kernel
It gets pulled in by default - yes, but you can always pin to another kernel.
The more important point here, IMHO, is that we ship it as the default kernel, as it's the one kernel that gets security updates.

While I see that new kernel series do come with some changes that require manually adapting configs on certain systems, it's nothing that happens to a large range of systems (else I'd assume that our internal test lab (comprising different pieces of hardware from the past 10 years) would have shown this, or we would have more reports on our support channels).

If you want to do the adaptations and checks only once per major release, for now I'd suggest upgrading when the X.2 release comes out (for the last 3 PVE versions, these were the point releases with the kernel series that was used until the next major version came out).

Else, if possible, I'd suggest using one of those servers as a first test installation, and having that one with an accessible console (especially if the iLO license does not allow accessing the console remotely). Changes to config defaults can always happen (especially between kernel series), and regressions on older hardware due to those are not too uncommon.

I hope this helps!
 
Ok, if I look at how it works for Debian (I didn't know the kernel was an Ubuntu version), I'd expect that PVE would just use the LTS 5.10 kernel that comes with Debian Bullseye. Patched for PVE of course, but still the 5.10 version. Things that happened to some of us last week could have been avoided, IMHO. I suppose PVE won't switch to Debian Bookworm either when it goes from 7.4 to 7.5. In my opinion it is not A Good Thing to change kernel versions during minor upgrades. That said, I have been using PVE for many years now, and until last week kernel upgrades have never been a problem. But anyway, I'd suggest: Debian does not upgrade the kernel, so PVE shouldn't upgrade it either. Remember many people run PVE in production environments; there is no place for experiments. But that's IMHO of course ;-)
 
Hi, I had issues with the latest kernel on my HP MicroServer Gen8 using the built-in RAID controller (with the common tweak of using the ODD slot as the boot device).

Syslog:
Code:
May 06 02:31:27 homeserver.tatooine.lan kernel: WARNING: CPU: 2 PID: 254 at drivers/iommu/intel/iommu.c:2391 __domain_mapping.cold+0x175/0x1a3
May 06 02:31:27 homeserver.tatooine.lan kernel: Modules linked in: binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter softdog bonding tls nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel mgag200 kvm drm_kms_helper irqbypass crct10dif_pclmul cec ghash_clmulni_intel rc_core cdc_acm i2c_algo_bit aesni_intel crypto_simd cryptd rapl fb_sys_fops syscopyarea sysfillrect intel_cstate pcspkr serio_raw acpi_ipmi sysimgblt ipmi_si hpilo ipmi_devintf ie31200_edac ipmi_msghandler mac_hid acpi_power_meter zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c gpio_ich crc32_pclmul
May 06 02:31:27 homeserver.tatooine.lan kernel:  ahci libahci xhci_pci xhci_pci_renesas lpc_ich psmouse uhci_hcd ixgbe xhci_hcd ehci_pci tg3 xfrm_algo dca ehci_hcd mdio
May 06 02:31:27 homeserver.tatooine.lan kernel: CPU: 2 PID: 254 Comm: kworker/2:1H Tainted: P        W IO      5.15.30-2-pve #1
May 06 02:31:27 homeserver.tatooine.lan kernel: Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015
May 06 02:31:27 homeserver.tatooine.lan kernel: Workqueue: kblockd blk_mq_requeue_work
May 06 02:31:27 homeserver.tatooine.lan kernel: RIP: 0010:__domain_mapping.cold+0x175/0x1a3
May 06 02:31:27 homeserver.tatooine.lan kernel: Code: 4c 89 ee 48 c7 c7 18 57 85 8d 4c 89 5d b8 e8 a2 c1 fa ff 8b 05 3f ce 22 01 4c 8b 5d b8 85 c0 74 09 83 e8 01 89 05 2e ce 22 01 <0f> 0b e9 68 41 b3 ff 48 63 d1 be 01 00 00 00 48 c7 c7 60 e8 10 8e
May 06 02:31:27 homeserver.tatooine.lan kernel: RSP: 0018:ffffb0b4c1897860 EFLAGS: 00010046
May 06 02:31:27 homeserver.tatooine.lan kernel: RAX: 0000000000000000 RBX: ffff8c2a8506cf48 RCX: ffff8c2db6ea0988
May 06 02:31:27 homeserver.tatooine.lan kernel: RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8c2db6ea0980
May 06 02:31:27 homeserver.tatooine.lan kernel: RBP: ffffb0b4c18978e0 R08: 0000000000000003 R09: 0000000000000001
May 06 02:31:27 homeserver.tatooine.lan kernel: R10: ffff8c2a8c840100 R11: 00000000001906cf R12: ffff8c2a84df4200
May 06 02:31:27 homeserver.tatooine.lan kernel: R13: 00000000000b1fe9 R14: 00000001906cf801 R15: 0000000000000004
May 06 02:31:27 homeserver.tatooine.lan kernel: FS:  0000000000000000(0000) GS:ffff8c2db6e80000(0000) knlGS:0000000000000000
May 06 02:31:27 homeserver.tatooine.lan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 06 02:31:27 homeserver.tatooine.lan kernel: CR2: 00007ffeeb3e1fc0 CR3: 00000001e2610002 CR4: 00000000000626e0
May 06 02:31:27 homeserver.tatooine.lan kernel: Call Trace:
May 06 02:31:27 homeserver.tatooine.lan kernel:  <TASK>
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? exc_invalid_op+0x19/0x70
May 06 02:31:27 homeserver.tatooine.lan kernel:  intel_iommu_map_pages+0xdc/0x120
May 06 02:31:27 homeserver.tatooine.lan kernel:  __iommu_map+0xda/0x270
May 06 02:31:27 homeserver.tatooine.lan kernel:  __iommu_map_sg+0x8e/0x120
May 06 02:31:27 homeserver.tatooine.lan kernel:  iommu_map_sg_atomic+0x14/0x20
May 06 02:31:27 homeserver.tatooine.lan kernel:  iommu_dma_map_sg+0x348/0x4e0
May 06 02:31:27 homeserver.tatooine.lan kernel:  __dma_map_sg_attrs+0x66/0x70
May 06 02:31:27 homeserver.tatooine.lan kernel:  dma_map_sg_attrs+0xe/0x20
May 06 02:31:27 homeserver.tatooine.lan kernel:  ata_qc_issue+0x173/0x240
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? ata_scsi_mode_select_xlat+0x5d0/0x5d0
May 06 02:31:27 homeserver.tatooine.lan kernel:  __ata_scsi_queuecmd+0x150/0x430
May 06 02:31:27 homeserver.tatooine.lan kernel:  ata_scsi_queuecmd+0x45/0x90
May 06 02:31:27 homeserver.tatooine.lan kernel:  scsi_queue_rq+0x3dd/0xbe0
May 06 02:31:27 homeserver.tatooine.lan kernel:  blk_mq_dispatch_rq_list+0x13c/0x800
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? sbitmap_get+0xb4/0x1e0
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? sbitmap_get+0x121/0x1e0
May 06 02:31:27 homeserver.tatooine.lan kernel:  __blk_mq_do_dispatch_sched+0xba/0x2d0
May 06 02:31:27 homeserver.tatooine.lan kernel:  __blk_mq_sched_dispatch_requests+0xd6/0x150
May 06 02:31:27 homeserver.tatooine.lan kernel:  blk_mq_sched_dispatch_requests+0x35/0x60
May 06 02:31:27 homeserver.tatooine.lan kernel:  __blk_mq_run_hw_queue+0x34/0xb0
May 06 02:31:27 homeserver.tatooine.lan kernel:  __blk_mq_delay_run_hw_queue+0x162/0x170
May 06 02:31:27 homeserver.tatooine.lan kernel:  blk_mq_run_hw_queue+0x83/0x120
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? blk_mq_sched_insert_request+0xb2/0x110
May 06 02:31:27 homeserver.tatooine.lan kernel:  blk_mq_run_hw_queues+0x4a/0xd0
May 06 02:31:27 homeserver.tatooine.lan kernel:  blk_mq_requeue_work+0x16f/0x1a0
May 06 02:31:27 homeserver.tatooine.lan kernel:  process_one_work+0x22b/0x3d0
May 06 02:31:27 homeserver.tatooine.lan kernel:  worker_thread+0x53/0x410
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? process_one_work+0x3d0/0x3d0
May 06 02:31:27 homeserver.tatooine.lan kernel:  kthread+0x12a/0x150
May 06 02:31:27 homeserver.tatooine.lan kernel:  ? set_kthread_struct+0x50/0x50
May 06 02:31:27 homeserver.tatooine.lan kernel:  ret_from_fork+0x22/0x30
May 06 02:31:27 homeserver.tatooine.lan kernel:  </TASK>
May 06 02:31:27 homeserver.tatooine.lan kernel: ---[ end trace 39e439ed2c78483b ]---
May 06 02:31:27 homeserver.tatooine.lan kernel: DMAR: ERROR: DMA PTE for vPFN 0xb1fea already set (to b1fea003 not 1906d0801)

May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-11): error count since last fsck: 25
May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-12): error count since last fsck: 1
May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-9): error count since last fsck: 7
May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-11): initial error at time 1651795415: ext4_validate_block_bitmap:390
May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-9): initial error at time 1651795421: ext4_lookup:1785
May 15 21:16:58 homeserver.tatooine.lan kernel: EXT4-fs (dm-12): initial error at time 1651795417: ext4_get_journal_inode:5130

The machine did not boot correctly. File systems were corrupted and it lost dev/pve-data...

After editing /etc/default/grub, kernel 5.15.x worked fine:

Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
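# note: "intel_iommu=off" added to the line below to avoid the DMAR / __domain_mapping errors shown above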
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
GRUB_CMDLINE_LINUX=""

The server now starts correctly, with no more errors.
 
Confirmed: modifying GRUB with

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

on the Dell T310 corrected the failure to boot with the MPTSAS card.
 
The R340 should be new enough - but I'd suggest trying the following:
* make sure the latest firmware updates are installed for all components (the lifecycle manager on Dell servers is quite convenient for this)
* boot into the 5.15 kernel - if the issue persists, follow:
https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.2 (the known-issues part about setting intel_iommu=off)
* boot into the 5.15 kernel again

let us know if any of this fixed your issues

yes this should work remotely as well
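Once it is back up, you can verify over SSH which kernel was actually booted and whether the command line options took effect:

Code:
# running kernel version
uname -r
# kernel command line the system was booted with
cat /proc/cmdline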
Finally had some time to make a permanent fix for this, since it was a production server.

Steps taken:
- update BIOS firmware to the latest version (2.9.1)
- update PERC controller firmware to the latest version (25.5.9.0001)
- in the BIOS, change RAID to HBA mode
- reboot into 5.13.19-6
- update PVE kernel to version 5.15.39-1 to avoid this from the roadmap: "As the setting has been reverted in newer pve-kernel-5.15 packages the issue is now mostly relevant during installation from the ISO."
- now boots normally, problem fixed.
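For reference, the installed and available kernel builds (and the one currently running) can be checked like this - the version pattern below just matches the packages mentioned above:

Code:
# list all available builds of the 5.15 kernel packages
apt list --all-versions 'pve-kernel-5.15*'
# show the running kernel and installed Proxmox package versions
pveversion -v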
 