Random 6.8.4-2-pve kernel crashes

It's been more than 2 weeks since I applied the changes suggested by jsterr. Happy to report the node is stable so far. Going to re-enable VM backups on this node and that will be the ultimate stability test. Fingers crossed.
Thanks! Any downsides or side-effects you have noticed after enabling?
 
The focus was on zero kernel panics, so I didn't look for any downsides. No side effects AFAICT (other than the split-lock issues that only showed up after I applied your suggested boot config).

No apparent slowdown in NVMe throughput (I didn't run any benchmarks, though).

Time will tell. I'm just going to wait and see whether Proxmox can back up the VMs/LXCs over the next 2 weeks without triggering kernel panics. Fingers crossed.
 
Some updates on this:

One of our developers spent some time investigating this and found that it can be triggered by using mdraid (a software RAID technology with no real integrity checking, which we recommend against as it can easily cause broken raids when using direct IO) and fault injection.

Some more background; skip to the end if you're only interested in the kernel version that fixes this.
The investigation led to a problematic patch that was introduced in the upstream 6.6 Linux kernel, namely 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine"). We worked with upstream to fix this edge case, resulting in a patch that was backported into our 6.8 kernel about a week ago (for completeness' sake: a first revision of this fix, from a week earlier, still had some other problems that we found).
While the original problem was in the common block layer, we could only reproduce it with mdraid, so the problem seems to be at least strongly amplified when using mdraid. It might also be the cause of some other, rarer or setup-specific issues with similar symptoms.
Because of this, we could not notice this issue early on in our production loads, as we focus on testing recommended setups, nor did we get a report through Enterprise Support, where we could have looked into it much more quickly.
Since this is a data race, running different kernels, especially for relatively short periods of time, is not really an indicator that those kernels were unaffected. While we can't rule out the possibility that some other change in newer kernels has increased the likelihood of triggering this problem, it's more likely that this issue went under our and others' radar because mdraid is used less often, given better/safer options like ZFS or btrfs that provide real data integrity and security checks. That some more Proxmox VE users are affected may have to do with the fact that some hosting providers use mdraid in their default Proxmox VE server templates, even though it's against our recommendation for the best supported setup.
Disabling command queuing, for example via the libata.force=noncq option, seems to sidestep this issue by preventing the race from occurring; at least that's our current understanding.
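As a rough sketch of how that workaround could be applied on a GRUB-booted host: the helper below appends a parameter to the GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file. The name add_kernel_param is made up for illustration, and the sketch assumes GNU sed and a double-quoted, single-line GRUB_CMDLINE_LINUX_DEFAULT entry. Hosts booted through systemd-boot keep their command line in /etc/kernel/cmdline instead, so this would not apply there as written.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Assumes GNU sed and a double-quoted, single-line GRUB_CMDLINE_LINUX_DEFAULT entry.
add_kernel_param() {
    param="$1"
    file="$2"
    # Leave the file untouched if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Intended use on a GRUB-booted host (as root), followed by a reboot:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param libata.force=noncq /etc/default/grub
#   update-grub
#   cat /proc/cmdline    # after the reboot, to confirm the parameter is active
```

Working on a backup copy first keeps the change easy to revert once a fixed kernel is installed.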

Anyhow, the kernel that includes the patch is proxmox-kernel-6.8 in version 6.8.8-1 (the jump from 6.8.4 to 6.8.8 has nothing to do with it, though; we backported the fix separately).
This package was uploaded to no-subscription a few moments ago.
It definitely fixes our reproducer, which shows symptoms similar to those mentioned in this thread. While that's naturally not a 100% guarantee, we hope that the issue you're facing is also addressed, ideally including cases where mdraid is not used.
We'd appreciate feedback, but please try to avoid mixing other issues into this; one can be affected by more than one problem at the same time, after all.
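
To check whether a host is already running a kernel at or above that version, a version-string comparison along the following lines may help. This is only a sketch: kernel_at_least is a made-up helper name, it relies on GNU sort -V, and a matching version string does not by itself prove that the patch is present in a custom-built kernel.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
# A trailing "-pve" (as printed by `uname -r` on Proxmox VE) is stripped first.
kernel_at_least() {
    have="${1%-pve}"
    want="$2"
    # GNU sort -V orders version strings; if the wanted version sorts first
    # (or both are equal), the given kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Intended use on the host:
#   if kernel_at_least "$(uname -r)" 6.8.8-1; then
#       echo "running kernel is 6.8.8-1 or newer"
#   else
#       echo "running kernel is older than 6.8.8-1"
#   fi
```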
 
Please double-check the dependencies, because apt update && apt dist-upgrade on all my up-to-date no-subscription installations wants to remove proxmox-ve (besides installing proxmox-kernel-6.8.8-1-pve-signed and several updates).
Code:
The following packages were automatically installed and are no longer required:
  fonts-font-logos libjs-qrcodejs libjs-sencha-touch libpve-cluster-api-perl proxmox-archive-keyring proxmox-default-kernel proxmox-firewall
  proxmox-headers-6.8.4-3-pve proxmox-kernel-helper proxmox-mail-forward proxmox-offline-mirror-docs proxmox-offline-mirror-helper
Use 'sudo apt autoremove' to remove them.
The following packages will be REMOVED:
  proxmox-ve pve-manager
The following NEW packages will be installed:
  proxmox-headers-6.8.8-1-pve proxmox-kernel-6.8.8-1-pve-signed
The following packages have been kept back:
  pve-container
The following packages will be upgraded:
  libnvpair3linux libpve-cluster-api-perl libpve-cluster-perl libpve-guest-common-perl libpve-notify-perl libpve-rs-perl libpve-storage-perl libuutil3linux
  libzfs4linux libzpool5linux proxmox-headers-6.8 proxmox-kernel-6.8 pve-cluster pve-esxi-import-tools pve-firmware pve-ha-manager spl zfs-initramfs zfs-zed
  zfsutils-linux
20 upgraded, 2 newly installed, 2 to remove and 1 not upgraded.
PS: Any word on a (upstream) fix for amdgpu with the new kernel?

EDIT:
https://forum.proxmox.com/threads/apt-want-to-remove-proxmox-ve-during-update.149103/
https://forum.proxmox.com/threads/you-are-attempting-to-remove-the-meta-package-proxmox-ve.111616/
https://forum.proxmox.com/threads/proxmox-8-2-2-updates-to-kernel-6-8-8-1-are-failing.149109/
https://forum.proxmox.com/threads/u...e-meta-package-proxmox-ve.149102/#post-674884
https://forum.proxmox.com/threads/u...remove-proxmox-ve-package.149101/#post-674885
 
Pretty sure part of this is the pve-manager 8.2.4 update. Something in the repo got hosed, maybe.
 
Thanks. Going to revert the custom kernel boot params and continue to test.

Getting various issues trying to apply updates. First, I get these:
Code:
W: (pve-apt-hook) !! WARNING !!
W: (pve-apt-hook) You are attempting to remove the meta-package 'proxmox-ve'!
W: (pve-apt-hook)
W: (pve-apt-hook) If you really want to permanently remove 'proxmox-ve' from your system, run the following command
W: (pve-apt-hook)       touch '/please-remove-proxmox-ve'
W: (pve-apt-hook) run apt purge proxmox-ve to remove the meta-package
W: (pve-apt-hook) and repeat your apt invocation.
W: (pve-apt-hook)
W: (pve-apt-hook) If you are unsure why 'proxmox-ve' would be removed, please verify
W: (pve-apt-hook)       - your APT repository settings
W: (pve-apt-hook)       - that you are using 'apt full-upgrade' to upgrade your system
E: Sub-process /usr/share/proxmox-ve/pve-apt-hook returned an error code (1)
E: Failure running script /usr/share/proxmox-ve/pve-apt-hook
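
One way to spot this situation before committing to an upgrade is to simulate it and look for removal lines. The sketch below assumes the "Remv <package> ..." line format printed by apt-get -s (simulate); would_remove is a made-up helper that reads such output on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-get -s dist-upgrade` output on stdin and
# succeed if the given package is scheduled for removal.
# Assumes simulated removals are printed as "Remv <package> [<version>]".
would_remove() {
    pkg="$1"
    awk -v pkg="$pkg" '$1 == "Remv" && $2 == pkg { found = 1 } END { exit !found }'
}

# Intended use on the host:
#   apt-get update
#   if apt-get -s dist-upgrade | would_remove proxmox-ve; then
#       echo "upgrade would remove proxmox-ve; check repositories first" >&2
#   fi
```

Because the simulation changes nothing on disk, it can be repeated after each apt update until the removal no longer shows up.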

Then trying to update manually by running an apt upgrade, I get this:
Code:
:~# apt upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
  proxmox-kernel-6.8.8-1-pve-signed
The following packages have been kept back:
  libpve-cluster-api-perl libpve-cluster-perl libpve-notify-perl libpve-rs-perl pve-container pve-manager
The following packages will be upgraded:
  libnvpair3linux libpve-guest-common-perl libpve-storage-perl libuutil3linux libzfs4linux libzpool5linux proxmox-kernel-6.8 pve-cluster
  pve-esxi-import-tools pve-firmware pve-ha-manager spl zfs-initramfs zfs-zed zfsutils-linux
15 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 0 B/241 MB of archives.
After this operation, 582 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Reading changelogs... Done
(Reading database ... 59255 files and directories currently installed.)
Preparing to unpack .../00-libnvpair3linux_2.2.4-pve1_amd64.deb ...
Unpacking libnvpair3linux (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../01-pve-cluster_8.0.7_amd64.deb ...
Unpacking pve-cluster (8.0.7) over (8.0.6) ...
Preparing to unpack .../02-libpve-storage-perl_8.2.2_all.deb ...
Unpacking libpve-storage-perl (8.2.2) over (8.2.1) ...
Preparing to unpack .../03-libpve-guest-common-perl_5.1.3_all.deb ...
Unpacking libpve-guest-common-perl (5.1.3) over (5.1.2) ...
Preparing to unpack .../04-libuutil3linux_2.2.4-pve1_amd64.deb ...
Unpacking libuutil3linux (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../05-libzfs4linux_2.2.4-pve1_amd64.deb ...
Unpacking libzfs4linux (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../06-libzpool5linux_2.2.4-pve1_amd64.deb ...
Unpacking libzpool5linux (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../07-pve-firmware_3.12-1_all.deb ...
Unpacking pve-firmware (3.12-1) over (3.11-1) ...
Selecting previously unselected package proxmox-kernel-6.8.8-1-pve-signed.
Preparing to unpack .../08-proxmox-kernel-6.8.8-1-pve-signed_6.8.8-1_amd64.deb ...
Unpacking proxmox-kernel-6.8.8-1-pve-signed (6.8.8-1) ...
Preparing to unpack .../09-proxmox-kernel-6.8_6.8.8-1_all.deb ...
Unpacking proxmox-kernel-6.8 (6.8.8-1) over (6.8.4-3) ...
Preparing to unpack .../10-pve-esxi-import-tools_0.7.1_amd64.deb ...
Unpacking pve-esxi-import-tools (0.7.1) over (0.7.0) ...
Preparing to unpack .../11-pve-ha-manager_4.0.5_amd64.deb ...
Unpacking pve-ha-manager (4.0.5) over (4.0.4) ...
Preparing to unpack .../12-spl_2.2.4-pve1_all.deb ...
Unpacking spl (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../13-zfs-initramfs_2.2.4-pve1_all.deb ...
Unpacking zfs-initramfs (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../14-zfsutils-linux_2.2.4-pve1_amd64.deb ...
Unpacking zfsutils-linux (2.2.4-pve1) over (2.2.3-pve2) ...
Preparing to unpack .../15-zfs-zed_2.2.4-pve1_amd64.deb ...
Unpacking zfs-zed (2.2.4-pve1) over (2.2.3-pve2) ...
Setting up libnvpair3linux (2.2.4-pve1) ...
Setting up pve-firmware (3.12-1) ...
Setting up proxmox-kernel-6.8.8-1-pve-signed (6.8.8-1) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 6.8.8-1-pve /boot/vmlinuz-6.8.8-1-pve
update-initramfs: Generating /boot/initrd.img-6.8.8-1-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
run-parts: executing /etc/kernel/postinst.d/proxmox-auto-removal 6.8.8-1-pve /boot/vmlinuz-6.8.8-1-pve
run-parts: executing /etc/kernel/postinst.d/zz-proxmox-boot 6.8.8-1-pve /boot/vmlinuz-6.8.8-1-pve
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 6.8.8-1-pve /boot/vmlinuz-6.8.8-1-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.8-1-pve
Found initrd image: /boot/initrd.img-6.8.8-1-pve
Found linux image: /boot/vmlinuz-6.8.4-3-pve
Found initrd image: /boot/initrd.img-6.8.4-3-pve
Found memtest86+ 64bit EFI image: /boot/memtest86+x64.efi
Adding boot menu entry for UEFI Firmware Settings ...
done
Setting up pve-cluster (8.0.7) ...
Setting up spl (2.2.4-pve1) ...
Setting up libpve-storage-perl (8.2.2) ...
Setting up libuutil3linux (2.2.4-pve1) ...
Setting up pve-esxi-import-tools (0.7.1) ...
Setting up proxmox-kernel-6.8 (6.8.8-1) ...
Setting up pve-ha-manager (4.0.5) ...
watchdog-mux.service is a disabled or a static unit, not starting it.
Setting up libzpool5linux (2.2.4-pve1) ...
Setting up libzfs4linux (2.2.4-pve1) ...
Setting up zfsutils-linux (2.2.4-pve1) ...
Setting up zfs-initramfs (2.2.4-pve1) ...
Setting up libpve-guest-common-perl (5.1.3) ...
Setting up zfs-zed (2.2.4-pve1) ...
Processing triggers for libc-bin (2.36-9+deb12u7) ...
Processing triggers for pve-manager (8.2.2) ...
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.8.8-1-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
Processing triggers for pve-ha-manager (4.0.5) ...
root@pve-3:~# apt upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  libpve-cluster-api-perl libpve-cluster-perl libpve-notify-perl libpve-rs-perl pve-container pve-manager
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.


And when I try to update the kept back packages, I get:
Code:
apt upgrade   libpve-cluster-api-perl libpve-cluster-perl libpve-notify-perl libpve-rs-perl pve-container pve-manager

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 pve-container : Depends: proxmox-backup-client (>= 3.2.5-1) but 3.2.3-1 is to be installed
E: Broken packages
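
When apt reports an unmet dependency like this, comparing the version the repository currently offers with the required one can show whether the package index is stale or inconsistent. A minimal sketch, assuming the usual "Candidate:" line in apt-cache policy output; candidate_version is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <package>` output on stdin and
# print the candidate version (the one apt would install).
# Assumes the output contains a line of the form "  Candidate: <version>".
candidate_version() {
    awk '$1 == "Candidate:" { print $2; exit }'
}

# Intended use on the host:
#   apt-cache policy proxmox-backup-client | candidate_version
# A result older than 3.2.5-1 would match the error above: pve-container
# asks for proxmox-backup-client >= 3.2.5-1, which the index does not offer.
```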
 

It is fortunate that the general use of mdraid helped expose a problem that (the additional layer of integrity checks provided by) some filesystems may have kept masked until now.

Proceeding to thoroughly test now. Let's start with some read bursts :)

Code:
root@ryzen-cobaya:~# pveversion
pve-manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.8-1-pve)

root@ryzen-cobaya:~# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md1 : active raid10 sdb1[0] sdd1[3] sdc1[1] sda1[2]
      1953258496 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  check =  2.7% (53767936/1953258496) finish=121.2min speed=261188K/sec
      bitmap: 0/1 pages [0KB], 1048576KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      976628736 blocks super 1.2 [2/2] [UU]
      [===>.................]  check = 18.1% (177504768/976628736) finish=12.9min speed=1027842K/sec
      bitmap: 1/1 pages [4KB], 1048576KB chunk

unused devices: <none>
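
For keeping an eye on running checks without reading the whole file, the progress lines can be extracted as in the sketch below. md_check_progress is a made-up helper that assumes the /proc/mdstat layout shown above (an "mdN :" header line followed by a "check = X%" progress line).

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "<array> <percent>" for every array that currently reports a check.
md_check_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the array being described
        / check = / {
            # the percentage is the field right after the standalone "="
            for (i = 1; i <= NF; i++)
                if ($i == "=") { print name, $(i + 1); break }
        }
    '
}

# Intended use on the host:
#   md_check_progress < /proc/mdstat
# A check can be started by hand through the md sysfs interface, e.g.:
#   echo check > /sys/block/md0/md/sync_action
```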
 
This was due to an unrelated mishap in the no-subscription repo w.r.t. the dependencies of the pve-container and proxmox-backup-client packages, which was fixed a few hours ago (and is now guarded against in our tooling for the future). You can simply try again: refresh the package index with apt update and then upgrade via apt full-upgrade (FYI, the apt commands dist-upgrade and full-upgrade are aliases and do the same thing).
 
Yup. Dependency confirmed resolved and the stray 6 packages are updated now. Will schedule for a reboot later today and see how it goes.

Update: No go with kernel 6.8.8 for me, as I'm hitting a driver issue with my QNA-T310G1S dongle. I have no access to Windows machines with Thunderbolt, so I can't even update the firmware. Looks like I'm going to be stuck with the older kernel for a while yet. Hopefully others can confirm there are no more I/O issues with this new kernel.
 
I've pinned the 6.5 kernel (6.5.13-5-pve) because I need long uptime stability. What problems will this cause?
 
See my post here - maybe it will help you.
 
That worked! I'll start testing 6.8.8-1 now and see how it goes. Will report back.

Quick update:
So far so good. Did a VM backup and copied a 40 GB file back and forth between the VM and an NFS share. Did see 2x split lock traps, but the VM didn't freeze. No kernel panics on the host. So sustained I/O looks problem-free for now. But these tests mean nothing, as I didn't perform this test before I applied the noncq boot param. lol.

Anyhow, I'll let this problem node continue operating as usual and see how it goes. With 6.8.4, the node would not run for more than 3 days without a kernel panic.
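
To put a number on those split lock traps, the kernel log can be filtered. The sketch below assumes the traps are logged with the text "split_lock trap"; that wording is an assumption about the kernel's message format and should be verified against the actual log. count_split_locks is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines on stdin that mention a split lock trap.
# ASSUMPTION: the kernel logs these events with the text "split_lock trap".
count_split_locks() {
    grep -c 'split_lock trap'
}

# Intended use on the host (kernel messages of the current boot):
#   journalctl -k -b --no-pager | count_split_locks
```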
 
Hi Proxmox staff,

I am pleased to report that since installing kernel: 6.8.8-1-pve I have not experienced any problems in this regard. The current kernel has now been running for about 7 days without interruption and without any problems. Many thanks for your support.
 
Wish I were this lucky. Just had a hard crash on this problem node with 6.8.8-1 (I was trying to compile an application, which is pretty CPU intensive).

Nothing on the console, nothing useful in the journal from last boot.
Code:
Jun 25 09:43:51 pve-3 sshd[530192]: Received disconnect from 172.16.b.c port 49152:11: disconnected by user
Jun 25 09:43:51 pve-3 sshd[530192]: Disconnected from user root 172.16.b.c port 49152
Jun 25 09:43:51 pve-3 sshd[530192]: pam_unix(sshd:session): session closed for user root

Had to do a hard power off. Guess this is as good as it gets. At least I think this kernel is still better than 6.8.4, and it is fine under heavy load on the NVMe media (all backups complete now, where previously they would trigger a kernel fault).
 
Guess this is as good as it gets
Good that it's crashing less; not good that it still crashes occasionally.
Was trying to compile an application which is pretty CPU intensive
Have you previously run a similar job successfully (on an older kernel/updates)?

Maybe you need better/more CPU & RAM. What do you have now? Have you properly tested the RAM? Thermals & cooling?
 
Thanks for reporting back. As @gfngfn256 mentioned, a freeze while compiling could also hint at a thermal problem or faulty RAM. If possible, I'd suggest running memtest86+ to rule out the latter.

Do I understand correctly that the freeze during load on the NVMe also happened with kernel 6.5?
Does the freeze during compile also happen on kernel 6.5?

Can you post the output of the following commands?
Code:
uname -a
cat /proc/cmdline
lsblk --ascii -M -o +HOTPLUG,FSTYPE,MODEL,TRAN
 
Good that it's crashing less; not good that it still crashes occasionally.

Have you previously run a similar job successfully (on an older kernel/updates)?

Maybe you need better/more CPU & RAM. What do you have now? Have you properly tested the RAM? Thermals & cooling?
CPU is an i9-13900H with 64 GB of RAM. I ran memtest86+ when I first got this machine. It's probably 6 months old now and I never ran a memory test again. Will try to find a time to retest. Microcode is on 20240531-1. Will have to set up something to monitor thermals & cooling, I guess. Great tips, thanks.
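For the thermal monitoring mentioned above, here is a minimal sketch, assuming the host exposes the standard Linux sysfs thermal zones under /sys/class/thermal. The function name print_zone_temps and its optional directory argument are hypothetical and only there for illustration; which zones exist and what they are called depends on the hardware.

```shell
#!/bin/sh
# Print "<zone type> <temperature in degrees C>" for every thermal zone.
# The optional first argument (an assumption for illustration) overrides
# the sysfs directory; by default /sys/class/thermal is used.
print_zone_temps() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        # Skip zones without a readable temperature (or an unmatched glob).
        [ -r "$zone/temp" ] || continue
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        millideg=$(cat "$zone/temp")
        # sysfs reports the value in millidegrees Celsius.
        echo "$type $((millideg / 1000))"
    done
}
```

Running something like `while sleep 10; do date; print_zone_temps; done >> /root/temps.log` during a compile would leave a record of the last readings before a hard crash (the log path is just an example).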


Do I understand correctly that the freeze during load on the NVMe also happened with kernel 6.5?
No, 6.5 is good. Technically it's not a freeze; there's a kernel stack dump first. The kernel faults only happened in 6.8.x.

Previously I pinned to 6.5 and that has been very stable. I only switched to the newer 6.8.8 recently.
Does the freeze during compile also happen on kernel 6.5?
No, but that freeze could just be a coincidence. I agree it's likely thermal related... Gonna play more with this...
Can you post the output of the following commands?
Code:
uname -a
cat /proc/cmdline
lsblk --ascii -M -o +HOTPLUG,FSTYPE,MODEL,TRAN
Code:
Linux pve-3 6.8.8-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-1 (2024-06-10T11:42Z) x86_64 GNU/Linux

Code:
BOOT_IMAGE=/boot/vmlinuz-6.8.8-1-pve root=/dev/mapper/pve-root ro quiet nomodeset thunderbolt.host_reset=false
Had to add that host_reset param, otherwise my QNAP TB dongle will not work in 6.8.8-1.
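To confirm after a reboot that such a parameter actually reached the running kernel, a small check against /proc/cmdline can help. This is only a sketch: the helper name has_cmdline_param is made up for illustration, and its optional second argument exists just so the check can be tried against a pasted command line.

```shell
#!/bin/sh
# Return success if the exact parameter appears on the kernel command line.
# $1: parameter to look for, e.g. thunderbolt.host_reset=false
# $2: optional command line string (illustrative; defaults to /proc/cmdline)
has_cmdline_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # Pad with spaces so only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `has_cmdline_param thunderbolt.host_reset=false && echo "host_reset override is active"` would print the message on the node above.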

Also, I've been meaning to replace this NVMe. I'm pretty certain all my woes are due to this no-name drive. It's probably just a fluke that this drive was problem-free before.
Code:
       NAME                         MAJ:MIN RM    SIZE RO TYPE MOUNTPOINTS HOTPLUG FSTYPE      MODEL             TRAN
       nvme0n1                      259:0    0    1.9T  0 disk                   0             QUANXING N700 2TB nvme
       |-nvme0n1p1                  259:1    0   1007K  0 part                   0                               nvme
       |-nvme0n1p2                  259:2    0    512M  0 part /boot/efi         0 vfat                          nvme
       |-nvme0n1p3                  259:3    0  896.4G  0 part                   0 LVM2_member                   nvme
       | |-pve-swap                 252:0    0     16G  0 lvm  [SWAP]            0 swap                         
       | |-pve-root                 252:1    0  115.1G  0 lvm  /                 0 ext4                         
   ,-> | |-pve-data_tmeta           252:2    0     96M  0 lvm                    0                               
   '-> | `-pve-data_tdata           252:4    0  765.1G  0 lvm                    0                               
    |  `-nvme0n1p4                  259:4    0 1010.8G  0 part                   0 LVM2_member                   nvme
,-> |    |-vmdata-vmstore_tmeta     252:3    0    128M  0 lvm                    0                               
'-> |    `-vmdata-vmstore_tdata     252:5    0 1010.6G  0 lvm                    0                               
 |  `--pve-data-tpool               252:6    0  765.1G  0 lvm                    0                               
 |     |-pve-data                   252:7    0  765.1G  1 lvm                    0                               
 |     |-pve-vm--1191--disk--0      252:8    0     32G  0 lvm                    0                               
 |     |-pve-vm--9002--disk--0      252:9    0     32G  0 lvm                    0                               
 |     |-pve-vm--10802--disk--0     252:10   0     40G  0 lvm                    0 ext4                         
 |     |-pve-vm--5500--disk--0      252:11   0      4M  0 lvm                    0                               
 |     |-pve-vm--5500--disk--1      252:12   0      4M  0 lvm                    0                               
 |     |-pve-vm--5500--disk--2      252:13   0     60G  0 lvm                    0                               
 |     |-pve-vm--22252--disk--0     252:14   0      8G  0 lvm                    0 ext4                         
 |     |-pve-vm--5501--fleece--0    252:15   0     60G  0 lvm                    0                               
 |     |-pve-vm--5501--fleece--1    252:16   0    200G  0 lvm                    0                               
 |     |-pve-vm--10801--disk--1     252:17   0     60G  0 lvm                    0 ext4                         
 |     `-pve-vm--10801--disk--0     252:18   0     80G  0 lvm                    0 ext4                         
 `-----vmdata-vmstore-tpool         252:19   0 1010.6G  0 lvm                    0                               
       |-vmdata-vmstore             252:20   0 1010.6G  1 lvm                    0                               
       |-vmdata-vm--8001--disk--0   252:21   0      3G  0 lvm                    0 ext4                         
       |-vmdata-vm--19999--disk--0  252:22   0     40G  0 lvm                    0 ext4                         
       |-vmdata-vm--9001--disk--0   252:23   0      4M  0 lvm                    0                               
       |-vmdata-vm--9001--disk--1   252:24   0     80G  0 lvm                    0                               
       |-vmdata-vm--30000--disk--0  252:25   0      4M  0 lvm                    0                               
       |-vmdata-vm--30000--disk--1  252:26   0     32G  0 lvm                    0                               
       |-vmdata-vm--30000--disk--2  252:27   0      4M  0 lvm                    0                               
       |-vmdata-vm--9001--disk--2   252:28   0      4M  0 lvm                    0                               
       |-vmdata-vm--2105--disk--0   252:29   0     60G  0 lvm                    0 ext4                         
       |-vmdata-vm--8000--disk--0   252:30   0     20G  0 lvm                    0 ext4                         
       |-vmdata-vm--8003--disk--0   252:31   0      2G  0 lvm                    0 ext4                         
       |-vmdata-vm--8004--disk--0   252:32   0      8G  0 lvm                    0 ext4                         
       |-vmdata-vm--10800--disk--0  252:33   0     40G  0 lvm                    0 ext4                         
       |-vmdata-vm--8005--disk--0   252:34   0      4G  0 lvm                    0 ext4                         
       |-vmdata-vm--8006--disk--0   252:35   0      4G  0 lvm                    0 ext4                         
       |-vmdata-vm--8007--disk--0   252:36   0      4G  0 lvm                    0 ext4                         
       |-vmdata-vm--8008--disk--0   252:37   0      4G  0 lvm                    0 ext4                         
       |-vmdata-vm--9850--disk--0   252:38   0      8G  0 lvm                    0 ext4                         
       |-vmdata-vm--5501--disk--0   252:39   0      4M  0 lvm                    0                               
       |-vmdata-vm--5501--disk--1   252:40   0      4M  0 lvm                    0                               
       |-vmdata-vm--5501--disk--2   252:41   0     60G  0 lvm                    0                               
       |-vmdata-vm--10501--disk--0  252:42   0     16G  0 lvm                    0 ext4                         
       |-vmdata-vm--10600--disk--0  252:43   0     20G  0 lvm                    0 ext4                         
       |-vmdata-vm--5501--disk--3   252:44   0    200G  0 lvm                    0                               
       |-vmdata-vm--8002--disk--0   252:45   0      8G  0 lvm                    0 ext4                         
       |-vmdata-vm--5500--fleece--0 252:46   0     60G  0 lvm                    0                               
       |-vmdata-vm--8009--disk--0   252:47   0     40G  0 lvm                    0 ext4                         
       `-vmdata-vm--2105--disk--1   252:48   0     60G  0 lvm                    0 ext4
 
CPU is a i9-13900H
I'll assume from this that you are using a Mini PC (cheap Chinese?) coupled with a no-brand 2TB NVMe, so I would definitely check thermals inside the enclosure.

Also, I'd check the PSU. These cheap machines usually come with a cheap PSU, which can be under-powered, unstable, or both. Add to that your QNA-T310G1S dongle, which is bus-powered and has a healthy appetite for current.

Good luck.
 