Random 6.8.4-2-pve kernel crashes

Stoiko Ivanov · May 3, 2024

Der Harry said:
For production it's not - for everybody - an option.

Yes, that's clear. My current theory is that some of the issues with 6.8 kernels are simply due to intel_iommu now defaulting to on. Users who did not need it before (and did not know that it was broken on their system) now have it enabled and run into problems. The problems should go away if they set it back to off (which now has to be done explicitly).

If someone had intel_iommu enabled and working before, and it broke between 6.5 and 6.8, that would need to be investigated separately.

Der Harry said:
For testing I am more than happy to turn it off!

It would be great if you could report back on whether it changes or fixes anything on your systems. Thanks!

kwull · May 3, 2024

On an E3-1240 V2, QEMU was not able to start a VM, and there were no meaningful error messages in the logs. Reverting the kernel to 6.5.13-5-pve solved the issue.

Der Harry · May 3, 2024

Stoiko Ivanov said:
For systems using Intel CPUs, please try adding `intel_iommu=off` to the kernel cmdline, see: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline (for both systemd-boot and GRUB, hitting `e` lets you edit the cmdline before the kernel is booted)

That tutorial is broken

... I ran update-grub (my setup is Debian 12 with Proxmox on top)

Bash:

root@nuc:~# uname -a
Linux nuc.xxx 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux
root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=UUID=fa27a3ec-e659-4b5d-8416-ac160913f16b ro quiet
root@nuc:~# cat /etc/kernel/cmdline
intel_iommu=off

That worked:

Code:

#/etc/default/grub
#GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"

Bash:

root@nuc:~# uname -a
Linux nuc.xxx 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux
root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=UUID=fa27a3ec-e659-4b5d-8416-ac160913f16b ro quiet intel_iommu=off

Now I make it cook!
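
For anyone repeating this check: a small sketch that tests whether a parameter really reached the running kernel, instead of eyeballing /proc/cmdline. The helper name `cmdline_has` is made up for illustration, and it only matches whole, space-separated tokens.

```shell
#!/bin/sh
# cmdline_has PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole, space-separated token in CMDLINE.
# CMDLINE defaults to the command line of the running kernel.
# (Hypothetical helper, not part of any Proxmox tooling.)
cmdline_has() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if cmdline_has intel_iommu=off; then
    echo "intel_iommu=off is active"
else
    echo "intel_iommu=off is NOT on the running kernel's cmdline"
fi
```

If it reports that the parameter is missing after a reboot, the edit most likely went into the file for the other bootloader. As far as I know, changes to /etc/default/grub need update-grub and changes to /etc/kernel/cmdline need proxmox-boot-tool refresh before they take effect.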

Stoiko Ivanov · May 3, 2024

Der Harry said:
That tutorial is broken ... I did a update-grub (my setup is a Debian 12 with Pmox on Top)

Where? The tutorial says that you need to edit /etc/default/grub if you boot with GRUB, and /etc/kernel/cmdline if you boot with systemd-boot (which is used for ZFS on / on UEFI-booted systems without Secure Boot).

Der Harry said:
Now I make it cook!

Did it fix the issue on your NUC?

Der Harry · May 3, 2024

Stoiko Ivanov said:
where? - the tutorial says that you need to edit /etc/default/grub if you use grub for booting, and /etc/kernel/cmdline if you use systemd-boot (which is used for ZFS on /, in UEFI booted systems, but not using secure boot)....

did it fix the issue on your NUC?

OK, never mind, then it's my fault.

intel_iommu=off did the trick (so far). The little NUC is exploding. I'll try amd_iommu=off on my Ryzen tonight. I think we have the smoking gun.

Stoiko Ivanov · May 3, 2024

Der Harry said:
I try to use amd_iommu=off tonight on my Ryzen. I think we have the smoking gun

I don't think that this is the issue on the Ryzen:
* amd_iommu has always (or at least since PVE 5.x) defaulted to on, so there was no change there
* intel_iommu used to default to off; this changed to on with kernel 6.8, which is (probably) where the issue with your NUC came from.

Do you have any trace/dmesg/journal from the Ryzen showing where things don't work out/panic/crash?
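
To see whether an IOMMU is actually active on a given host (whatever the default is), one rough check is to count the entries under /sys/kernel/iommu_groups. This is only a sketch: the helper name `count_iommu_groups` is made up, and "zero groups means no active IOMMU" is the usual rule of thumb, not a guarantee.

```shell
#!/bin/sh
# count_iommu_groups [DIR]
# Prints the number of entries in DIR (default: /sys/kernel/iommu_groups).
# Zero usually means that no IOMMU is active for the running kernel.
# (Hypothetical helper for illustration.)
count_iommu_groups() {
    dir=${1:-/sys/kernel/iommu_groups}
    count=0
    for entry in "$dir"/*; do
        # With no matches the glob stays literal, so test for existence.
        if [ -e "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "IOMMU groups: $(count_iommu_groups)"
```

Comparing the number between a 6.5 boot and a 6.8 boot on the same Intel machine should show whether the new default changed anything there. The kernel log (dmesg, filtered for DMAR or AMD-Vi) gives more detail.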

Der Harry · May 3, 2024

Stoiko Ivanov said:
I don't think that this is the issue on the ryzen:
* amd_iommu has always (or at least since PVE 5.x) defaulted to on - so there was no change there
* intel_iommu has defaulted to off, this changed to on with kernel 6.8 - which is where the issue with your NUC (probably) came from.

Do you have any trace/dmesg/journal from the ryzen showing where things don't work out/panic/crash?

This is my Ryzen kernel command line.

Code:

BOOT_IMAGE=/boot/vmlinuz-6.5.13-5-pve root=UUID=xxx ro quiet amd_iommu=on iommu=pt video=vesafb:off video=efifb:off initcall_blacklist=sysfb_init clocksource=tsc tsc=reliable

Yes, here it is: https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760/#post-657335

On the Ryzen there was also an LXC container running (that's why I thought it was nfsd).

I am willing to debug that via UART/serial console if needed. It's a big issue for the root servers at Hetzner.

Here is my idea: there is a repo for Debian / Ubuntu with more recent kernels: https://github.com/zabbly/linux#installation

Let me try a 6.x kernel with x >= 8, just to check whether it's a PVE or a Linux thing. I can do that on the NUC. I should probably run some VMs via KVM.

What do you think?

Der Harry · May 3, 2024

Stoiko Ivanov said:
I don't think that this is the issue on the ryzen:
...

Voilà

https://www.phoronix.com/news/Linux-6.8-Scheduler - that might be the reason...

Code:

clocksource=tsc tsc=reliable

And my parameters might also be interfering with that on top.

I remember reading something about a bug affecting CPUs with 16 or more cores that made it, or almost made it, into the kernel and that Linus himself reverted (I might be wrong on that).

kwull · May 3, 2024

kwull said:
On E3-1240 V2 QEMU was not able to start a VM - no meaningful error messages in logs. Reverted kernel to 6.5.13-5-pve solved the issue

BTW, intel_iommu=off did the same trick on 6.8.4-2-pve. So far everything works on 6.8 with intel_iommu=off. Thanks!

m-electronics · May 3, 2024

I'm chiming in here as well, after reading about this problem on Facebook. I cannot confirm it for my Hetzner dedicated server with an i7-6700. It runs kernel

Code:

6.8.4-2-pve

with IOMMU enabled for Intel, and there are no problems at all. My Proxmox at home with an Intel i7-4785T also runs this kernel, likewise with IOMMU enabled.

Do I still need to watch out for anything, or not?

Der Harry · May 3, 2024

Stoiko Ivanov said:
Do you have any trace/dmesg/journal from the ryzen showing where things don't work out/panic/crash?

I installed the linux-image-6.8.8-zabbly+ kernel on the NUC under Debian 12 (as referenced here).

DON'T do this on Proxmox

Bash:

curl -fsSL https://pkgs.zabbly.com/key.asc | gpg --show-keys --fingerprint
mkdir -p /etc/apt/keyrings/
curl -fsSL https://pkgs.zabbly.com/key.asc -o /etc/apt/keyrings/zabbly.asc

sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-kernel-stable.sources
Enabled: yes
Types: deb
URIs: https://pkgs.zabbly.com/kernel/stable
Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
Components: main
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/zabbly.asc

EOF'
apt-get update
apt-get install linux-zabbly
reboot

It works super fine. The new scheduler feels much better.

I still got a crash the first time I booted it. I had VirtualBox 7.0.12 installed on that machine. It's probably a "them" problem, but maybe it's also connected to the KVM issues with pve 6.6. I am attaching it here.

After uninstalling VirtualBox and rebooting: no more crashes.

Conclusion:

Code:

pve-6.8.4 crashed without intel_iommu=off (!), even before starting a VM

6.8.8-zabbly+ crashes with the virtualbox 7.0.12 kmods (we don't care, but still interesting)

6.8.8-zabbly+ works (at least for desktop) without intel_iommu=off (I will test some KVM VMs)

Ramalama · May 3, 2024

All of my servers run PVE 8.2.2 with kernel 6.8.4:
- Genoa 9374F, Asus rs520a-e12-rs12u
- Genoa 9374F, Asus rs520a-e12-rs12u
- Ryzen 7 5800X, X570D4I-2T
- i3-1315U, NUC 13
- i3-1115G4, NUC 11
- 2x E5-2637 v3, HPE ML350 G9
- Xeon Silver 4210R, HPE DL360 G10
- Xeon Silver 4210R, HPE DL360 G10
- 2x E5-2620 v3, DL360 G9
- E3-1275 v5, FUJITSU D3417-B2

No issues at all, no crashes; they all run as well as they can.
Since it's unknown what the issue is, this may be helpful!
None has less than 4 days of uptime, some have 14+ days, and the longest uptime is 22 days,
because 22 days ago I started moving to the 6.8.4 kernel, and a kernel switch sadly needs a reboot xD

I don't remember when the kernel was released, but I bet it was already available in the test repo 22 days ago.
Cheers

Der Harry · May 3, 2024

Stoiko Ivanov said:
I don't think that this is the issue on the ryzen:
* amd_iommu has always (or at least since PVE 5.x) defaulted to on - so there was no change there

The 6.8.6-zabbly+ and 6.8.8-zabbly+ kernels (https://github.com/zabbly/linux) also work together with Debian 12's ancient KVM/QEMU.

Code:

# read more here https://wiki.debian.org/KVM
apt install -y qemu-system libvirt-daemon-system
apt install -y virtinst

cook.sh

Code:

# os variant is 11 - Debian 12 is too old :)
# $1 is a suffix for the VM name
virt-install --virt-type kvm --name "bookworm-amd64-$1" \
        --cdrom ./debian-12.5.0-amd64-netinst.iso \
        --os-variant debian11 \
        --disk size=10 --memory 1024

cook2.sh

Code:

# os variant is 11 - Debian 12 is too old :)
# $1 is a suffix for the VM name; installs over the network with a serial console
virt-install --virt-type kvm --name "bookworm-amd64-$1" \
        --location https://deb.debian.org/debian/dists/bookworm/main/installer-amd64/ \
        --os-variant debian11 \
        --disk size=10 --memory 1024 \
        --graphics none \
        --console pty,target_type=serial \
        --extra-args "console=ttyS0"

The little NUC is on fire, but it's running stable.

Conclusion: the pve 6.8.4 kernel is somehow broken.

MasterChat · May 4, 2024

For now this worked for me:
Execute command: proxmox-boot-tool kernel list
Check for an older version (my previous one was 6.5.13-1-pve)
Execute command: proxmox-boot-tool kernel pin 6.5.13-1-pve --next-boot
Or select another version from the list. If there is no other version:
Execute command: apt install pve-kernel-6.2.16-4-pve
Reboot the server, then:
Execute command: proxmox-boot-tool kernel pin 6.2.16-4-pve --next-boot
Reboot the server.
I don't know if this is correct, but it works.
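
The same steps as a rough script sketch, for picking the newest installed kernel of a given series and pinning it. The two helper functions are made up for illustration, and the sample text only assumes what the output of proxmox-boot-tool kernel list roughly looks like; check it against your own host before relying on it.

```shell
#!/bin/sh
# list_pve_kernels: reads `proxmox-boot-tool kernel list` output on stdin and
# keeps only lines that look like PVE kernel versions (e.g. 6.5.13-5-pve).
list_pve_kernels() {
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+-[0-9]+-pve$'
}

# newest_in_series SERIES: from version lines on stdin, prints the highest
# version that starts with "SERIES." (e.g. SERIES=6.5), using version sort.
newest_in_series() {
    awk -v prefix="$1." 'index($0, prefix) == 1' | sort -V | tail -n 1
}

# Demo with made-up sample output (the exact format is an assumption).
sample='Manually selected kernels:
None.

Automatically selected kernels:
6.5.13-1-pve
6.5.13-5-pve
6.8.4-2-pve'

printf '%s\n' "$sample" | list_pve_kernels | newest_in_series 6.5

# On a real host (as root), something like:
#   target=$(proxmox-boot-tool kernel list | list_pve_kernels | newest_in_series 6.5)
#   proxmox-boot-tool kernel pin "$target"
#   reboot
```

Note that --next-boot pins the kernel for a single boot only. Without that flag the pin stays in place until it is removed again with proxmox-boot-tool kernel unpin.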

Der Harry · May 4, 2024

MasterChat said:
For now this worked for me :
Execute command: proxmox-boot-tool kernel list
Check the older version (My previous was 6.5.13-1-pve)
Execute command: proxmox-boot-tool kernel pin 6.5.13-1-pve --next-boot
Or select the version from the list, If there is no other version:
Execute command: apt install pve-kernel-6.2.16-4-pve
Reboot the server, then:
Execute command: proxmox-boot-tool kernel pin 6.2.16-4-pve --next-boot
Reboot the server.
Don't know if this is correct but it works

why 6.2?

Do you have 6.5 issues?

This is about 6.8 issues... Please open a new thread for 6.5 issues!

JimmyB · May 4, 2024

Hello

Same problem here: I didn't experience any crashes on kernel 6.5.x and earlier.

After the upgrade to 6.8.4, the system crashed, usually a few hours after a reboot.

When crashed:
- no more web UI,
- the PVE host IP still answers ping,
- the SSH port is open but no remote connection is possible,
- nmap shows open TCP ports but none of them respond,
- when trying to log in or shut down (ctrl-alt-del) the host seems to react, but I never get a session and finally have to do a cold reboot.

I added intel_iommu=off in GRUB, but after less than 24h it crashed once again.

Nothing in logs.
On screen, errors with journald, as if the disks were lost.

Only one LAMP (+postfix / roundcube) VM is running on the host.

I'll try to get back to 6.5.13...

Hardware is a Shuttle DS10 with Intel Core i5-8265U.

Regards

Ramalama · May 4, 2024

Ramalama said:
I have on all Servers PVE 8.2.2 with Kernel 6.8.4
- Genoa 9374F, Asus rs520a-e12-rs12u
- Genoa 9374F, Asus rs520a-e12-rs12u
- Ryzen 7 5800X, X570D4I-2T
- i3-1315U, NUC 13
- i3-1115G4, NUC 11
- 2x E5-2637 v3, HPE ML350 G9
- Xeon Silver 4210R, HPE DL360 G10
- Xeon Silver 4210R, HPE DL360 G10
- 2x E5-2620 v3, DL360 G9
- E3-1275 v5, FUJITSU D3417-B2

No issues at all, no crashes, all run as Perfect as they can.
Since its unknown what the issue is, that may be helpfull!
None has less uptime as 4 days, some have 14days+, and the longest uptime is 22 days.
Because 22 days ago i started to move to the 6.8.4 kernel and a kernel switch needs sadly a reboot xD

I dont remember when the kernel was released, but i bet it was 22 days ago available in the test repo already.
Cheers

I woke up this morning and a CentOS 7 VM with kernel 6.8.8 from ELRepo had frozen.
The task that made the VM freeze was a backup job inside the VM.

The VM mounts a folder from my backup server over Samba:
//172.88.8.84/Nachtsicherung /mnt1/backup cifs credentials=/etc/fstab-bsmount.txt,iocharset=utf8,vers=3.0 0 0
The backup server at 172.88.8.84 is a PVE 8.2.2 server with the 6.8.4 kernel; I just run a Samba server natively on it.
The storage on the backup server is ZFS.

The script inside the VM started at exactly 0 o'clock, exactly when the VM froze. It simply compresses some folders with gzip directly into the /mnt1/backup folder, i.e. onto the Samba destination (the backup server).

So the VM froze either because of a Samba issue or because of a ZFS issue on the backup server.
I think it's actually both, tbh.
I run a far too new kernel on CentOS 7 (6.8.8), and the Samba package is pretty old (Samba v. 4.10.16).

However, I sadly have no logs, because there aren't any errors. I'm still searching, though.
In my opinion this freeze is not really related to Proxmox, but I didn't have any crashes before. I have changed some things now, and if I find something out, I will report back.
Cheers

PS: Only the VM froze.
No crashes on any Proxmox server.

fhloston · May 5, 2024

fhloston said:
Also crashes here with 6.8.4 and not with 6.5.13:

Supermicro X10DRU-i+, Bios 3.4
E5-2620 v4
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
88:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I have 11 machines running these X10-DRU-i+ boards, only the three ceph nodes with the ConnectX-3 and no vms at all crash.
The other VM hosts without ConnectX-3 run just fine with kernel 6.8.4.

Still crashes with intel_iommu=off and BIOS updated to 3.5.

Back to 6.5.13 for now.

JimmyB · May 5, 2024

I haven't checked for a BIOS update yet; I'll do that as well.
But for now, I've pinned 6.5.13.

After the last crash there were some kernel messages on screen, but only the <TASK> part, related to spin_lock and md_*. After that, I only got some nft logs and a message about journald before the system died completely 2 minutes later.

Nothing else in journal and logs.

DooMMasteR · May 6, 2024

On the positive side, I have been running a B550-based AMD Ryzen 7 PRO 5750G with IOMMU and 14 groups

Code:

[    0.439221] iommu: Default domain type: Translated
[    0.439221] iommu: DMA domain TLB invalidation policy: lazy mode

for over 10 days now. My kernel cmdline is:

BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet amd_pstate=passive amd_pstate.shared_mem=1 mem_encrypt=on kvm_amd.sev=1

I have had 0 crashes so far, but the platform is probably a little less common: it uses, e.g., a 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05), and it's also Cezanne (Zen 3).

lshw: https://paste.stratum0.org/?23d721d7ef45e317#BUm4mKZx22WXGFHirtcMk9SHmhhAWbpvaiiUTbaATbW9
