Random 6.8.4-2-pve kernel crashes

Der Harry · May 7, 2024

leesteken said:
It's just me at home on a day off. 6.8.8-zabbly+ won't boot because it lacks ZFS and Ubuntu-6.8.0-31 does not come with amdgpu(?).
I tried the Ubuntu 24.04 installer (with kernel 6.8.0-31) and it also crashed the GPU when amdgpu is loaded automatically. So my issue is an upstream kernel 6.8 or Ubuntu issue.
I could try a nested Proxmox (without ZFS) and GPU passthrough to test further, but it looks like Ubuntu kernel 6.8 is just an unlucky choice (at the moment).

Please read carefully.

- 6.8.8-zabbly+ is a Debian 12 only thing (I always made clear statements about this!)
- zabbly's GitHub account points you to a page where a ZFS build for 6.8.8 is available

However

- zabbly & friends are not (!) a pve kernel replacement
- I introduced them just to show that 6.8.4 and 6.8.8 are > not < broken.
- By testing all sorts of combos, I can prove it's a pve kernel thing.

Don't waste time with zabbly / friends on pve. It's the pve delta / kernel flags that have issues (that is my hypothesis at this point).
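The "kernel flags" half of that hypothesis can be checked without building anything. A minimal sketch, assuming both kernels come from Debian-style packages that ship their build configuration as /boot/config-<version>; the function name config_delta is made up for this example, and it says nothing about the patch delta itself.

```shell
#!/bin/sh
# Print the CONFIG_ options that differ between two kernel config files.
# Lines starting with "<" come from the first file, ">" from the second.
# "# CONFIG_FOO is not set" lines are kept so disabled options show up too.
config_delta() {
    a=$(mktemp)
    b=$(mktemp)
    grep -E '^(# )?CONFIG_' "$1" | sort > "$a"
    grep -E '^(# )?CONFIG_' "$2" | sort > "$b"
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"    # 0 = identical, 1 = differences found
}
```

Called as `config_delta /boot/config-6.8.4-2-pve /boot/config-6.8.8-zabbly+` it would list the options that differ; the exact file names depend on which kernels are installed.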

leesteken · May 7, 2024

Der Harry said:
- By testing all sorts of combos, I can prove it's a pve kernel thing.

Don't waste time with zabbly / friends on pve. It's the pve delta / kernel flags that have issues (that is my hypothesis at this point).

I think I just proved that my amdgpu issue is not introduced by Proxmox but already broken by Ubuntu (or upstream 6.8). There are no other issues that I can reproduce.

Der Harry · May 7, 2024

leesteken said:
I think I just proved that my amdgpu issue is not introduced by Proxmox but already broken by Ubuntu (or upstream 6.8). There are no other issues that I can reproduce.

Can you test this with a

- Debian 12 installation (USB stick) and the 6.8.6-zabbly+ / 6.8.8-zabbly+ kernel.

vs.

- Ubuntu installation (USB stick) with their 6.8.x kernel?

I need a kernel tree (left to right) Git comparison and a way to make a diff. Your test would bring me very close to this.

leesteken · May 7, 2024

Der Harry said:
- Debian 12 installation (USB stick) and the 6.8.6-zabbly+ / 6.8.8-zabbly+ kernel.

Debian kernel 6.1.0-18 from the Debian 12.5 Live installer loads amdgpu fine (on a VM with RX570 passthrough). So does kernel 6.1.0-21 after updating the installation.
On 6.8.6-zabbly+, amdgpu crashes when loaded and the VM with the RX570 cannot shut down properly (internal-error), but the Proxmox host can stop the VM and reboot normally.
6.8.8-zabbly+ has identical behavior to 6.8.6-zabbly+. With 6.8.9-zabbly+, I cannot get a graphical desktop as Cinnamon crashes (before I load amdgpu).

Der Harry said:
- Ubuntu installation (USB stick) with their 6.8.x kernel?

I think I already did that with the installer kernel 6.8.0-31 (bare metal): amdgpu crashes the RX570. I'm convinced that the amdgpu RX570 issue is in Linux itself (and not specific to Ubuntu or Proxmox).

Der Harry · May 7, 2024

leesteken said:
Debian kernel 6.1.0-18 from the Debian 12.5 Live installer loads amdgpu fine (on a VM with RX570 passthrough). So does kernel 6.1.0-21 after updating the installation.
On 6.8.6-zabbly+, amdgpu crashes when loaded and the VM with the RX570 cannot shut down properly (internal-error), but the Proxmox host can stop the VM and reboot normally.
6.8.8-zabbly+ has identical behavior to 6.8.6-zabbly+. With 6.8.9-zabbly+, I cannot get a graphical desktop as Cinnamon crashes (before I load amdgpu).

I think I already did that with the installer kernel 6.8.0-31 (bare metal): amdgpu crashes the RX570. I'm convinced that the amdgpu RX570 issue is in Linux itself (and not specific to Ubuntu or Proxmox).

We are always talking bare metal (!), please clarify that.

(Please leave out irrelevant information about 6.1.x kernels.)

Debian 12.5:

- 6.8.6-zabbly+ up to 6.8.9-zabbly+ crashes with amdgpu?
- correct?

Ubuntu:

- 6.8.0-31 distro stock kernel has a reproducible crash with amdgpu?
- correct?

Conclusion:

- 6.8.x might be (in general) broken with amdgpu (in your setup)?
- this is totally irrelevant to the pve kernel?
- correct?

leesteken · May 7, 2024

Der Harry said:
We are always talking bare metal (!), please clarify that.

(Please leave out irrelevant information about 6.1.x kernels.)

I don't think this will add much to my previous post, but I'll try to make it more clear.

Der Harry said:
Debian 12.5:

- 6.8.6-zabbly+ up to 6.8.9-zabbly+ crashes with amdgpu?
- correct?

On 6.8.6, 6.8.8 and 6.8.9 the amdgpu driver crashes and leaves the RX570 in an unusable state. I did not test 6.8.7-zabbly+ (but I don't expect any change).
This was all tested on a VM with GPU passthrough (which I expect to behave the same as on bare metal w.r.t. amdgpu crashing RX570).

Der Harry said:
Ubuntu:

- 6.8.0-31 distro stock kernel has a reproducible crash with amdgpu?
- correct?

6.8.0-31 Ubuntu installer kernel, yes. This happened both on the VM with passthrough and on bare metal, identically w.r.t. amdgpu crashing the RX570.

Der Harry said:
Conclusion:

- 6.8.x might be (in general) broken with amdgpu (in your setup)?
- this is totally irrelevant to the pve kernel?
- correct?

Yes. My issue with amdgpu crashing on RX570 appears to be a generic Linux 6.8 problem (but PVE 8.2 works fine with Radeon 6950XT).

I do appreciate your work over the last couple of days, and I'm sorry that I cannot be more helpful right now.

Der Harry · May 7, 2024

leesteken said:
I do appreciate your work over the last couple of days, and I'm sorry that I cannot be more helpful right now.

We now have a connection point.

For Intel, intel_iommu=off makes the pve kernel work.

For my Ryzen 5700G I try to block the amdgpu module.

Code:

# in /etc/modprobe.d/blacklist.conf
blacklist amdgpu

# do a
update-initramfs -u

# reboot and
lsmod | grep amdgpu

^^^ can you please also try this? I just want to make sure that this might be a (temp) fix - like disabling the IOMMU for Intel.
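The two checks in those steps (is the blacklist entry present, is the module loaded after reboot) can be wrapped in small helpers. A minimal sketch: is_blacklisted and is_loaded are hypothetical names, and the optional second argument of each exists only so they can be pointed at test files. Note that a blacklist entry only stops automatic loading; an explicit modprobe amdgpu still loads the module.

```shell
#!/bin/sh
# Return 0 if module $1 has a "blacklist" entry in any *.conf file
# under directory $2 (default: /etc/modprobe.d).
is_blacklisted() {
    mod=$1
    dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*\$" "$dir"/*.conf
}

# Return 0 if module $1 is currently loaded. Reads a /proc/modules-style
# file (the same data lsmod prints), given as $2.
is_loaded() {
    mod=$1
    list=${2:-/proc/modules}
    grep -q "^${mod} " "$list"
}
```

After the reboot, `is_blacklisted amdgpu && ! is_loaded amdgpu && echo "amdgpu blocked"` gives a single yes/no answer instead of reading the lsmod output by eye.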

Stoiko Ivanov · May 7, 2024

denvercoder9 said:
Some information of my crashing Hetzner Server.

The dmesg.log you posted indicates an issue with mdraid (which we do not officially support) - please try the 6.8.4-3-pve kernel - there were quite a few fixes in that area of the linux codebase - hopefully your issue is gone with that.

leesteken · May 7, 2024

Der Harry said:
Code:

# in /etc/modprobe.d/blacklist.conf
blacklist amdgpu

# do a
update-initramfs -u

# reboot and
lsmod | grep amdgpu

^^^ can you please also try this? I just want to make sure that this might be a (temp) fix - like disabling the IOMMU for Intel.

That is what I already use as a work-around (Ryzen 2700X on 470X with RX570). That way I can still use the GPU for the host console until I start the VM with passthrough.

Der Harry · May 7, 2024

leesteken said:
That is what I already use as a work-around (Ryzen 2700X on 470X with RX570). That way I can still use the GPU for the host console until I start the VM with passthrough.

Did you ask about that crash in an Ubuntu forum?

That must be something well known with the Ubuntu kernels.

(or on the Kernel Mailing List)

Would be nice if we can split here and you give that a try.

I am trying to set up the build system for the pve kernels; we didn't get much help from the pve team here.

fhloston · May 7, 2024

Der Harry said:
We now have a connection point.

For Intel, intel_iommu=off makes the pve kernel work.

Actually not true. It still crashes here with that. Supermicro X10DRU-i+ and Intel E5-26xx v4.

Der Harry · May 7, 2024

fhloston said:
Actually not true. It still crashes here with that. Supermicro X10DRU-i+ and Intel E5-26xx v4.

That is also my feeling - but I can't reproduce it on my NUC.

I am rebuilding the Proxmox kernels on my systems now... that feels like Gentoo

fhloston · May 7, 2024

Der Harry said:
That is also my feeling - but I can't reproduce it on my NUC.

I am rebuilding the Proxmox kernels on my systems now... that feels like Gentoo

It only crashes on the 3 Ceph nodes with additional NVMe devices and Mellanox ConnectX-3 - the compute nodes do not crash. Same boards, same CPU gen, all X10DRU-i+ and E5-26xx v4.

Der Harry · May 7, 2024

fhloston said:
It only crashes on the 3 Ceph nodes with additional NVMe devices and Mellanox ConnectX-3 - the compute nodes do not crash. Same boards, same CPU gen, all X10DRU-i+ and E5-26xx v4.

Same test as before

- Install Ubuntu with the stock Ubuntu 6.8 kernel. Run QEMU/libvirtd with 10+ test machines.
- Install Debian 12. Install the 6.8.x-zabbly+ kernel. Run QEMU/libvirtd with 10+ test machines.

Come back with data ^^^ .

- In (my) setup 6.8.x-zabbly+ Kernel + qemu on debian 12 works fine.
- PVE 6.8.6-3 only works with intel_iommu=off
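When comparing results across machines it helps to confirm that intel_iommu=off really reached the running kernel and was not only added to the bootloader config. A minimal sketch that checks the kernel command line; has_kernel_param is a hypothetical helper name, and its second argument exists only so the check can be run against a saved copy of /proc/cmdline.

```shell
#!/bin/sh
# Return 0 if the exact parameter $1 (e.g. intel_iommu=off) appears on the
# kernel command line. $2 is the file to read (default: /proc/cmdline).
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # One parameter per line, then match the whole line as a fixed string.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$param"
}
```

On the running system, `has_kernel_param intel_iommu=off && echo "IOMMU disabled on cmdline"` is enough; posting the result together with a crash report removes one source of ambiguity.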

I can't hold your hand with "every" known bug that might exist in 6.8.x.

- I don't use ZFS
- I don't have Ceph
- I don't have an amdgpu on my test NUC

My game is

>> Either blame vanilla / Ubuntu 6.8.x, or remove a pve kernel patch and understand what breaks >> my << setup.
>> It may be "one" thing.

I am willing to provide all scripts on how to build your own pve kernel and investigate what in the "world" is broken.

We need to do this as focused as possible.

Stoiko Ivanov · May 7, 2024

Der Harry said:
I select the "generic-6.8.0-31.31" in the Debian Grub Boot menu

The kernel boots

basically "nothing" works - no wifi - no X11 (only in 640x480)

Could you maybe share the complete journals since boot (and/or at least `dmesg`)? That might help to get a more complete picture.

Der Harry said:
BUT: The PVE kernels (without any VMs!) crashed at this point. No VM started - just after boot. Check my logs

Der Harry said:
Next steps are clear - build 5-10 test kernels based on the Ubuntu kernel and leave out the proxmox patches - test them.

Our git-repo contains instructions on how to build the proxmox kernels:
https://git.proxmox.com/?p=pve-kern...cc26b8070c5a6b2ed9b2a33616ab1a0033883;hb=HEAD

Our delta to Ubuntu is in the patches/kernel subdir:
https://git.proxmox.com/?p=pve-kern...e72b828e42baa8d986589a9a384a2784e35ec;hb=HEAD

Bisecting between our current submodule state without any patches and the state with all patches applied should probably show where the issue with your NUC is rooted.
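The bisection idea can be sketched as a binary search over the ordered patch list. This is only an illustration under stated assumptions: a kernel with zero patches is good, a kernel with all patches is bad, and the problem appears with exactly one patch and stays from there on. first_bad_patch and the probe hook are hypothetical names; the probe would have to build a kernel with the first k patches, boot it and report whether it crashes, which this sketch does not do.

```shell
#!/bin/sh
# Find the 1-based index of the first patch that makes the kernel "bad".
# $1 = total number of patches in the series
# $2 = name of a command or function (hypothetical hook) that takes k and
#      returns 0 when a kernel built with the first k patches shows the bug.
# The probe must not modify the variables lo, hi, mid, total or probe.
first_bad_patch() {
    total=$1
    probe=$2
    lo=0          # known good: kernel with the first $lo patches
    hi=$total     # known bad:  kernel with the first $hi patches
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid"; then
            hi=$mid   # bug already present, culprit is at or before mid
        else
            lo=$mid   # still good, culprit is after mid
        fi
    done
    echo "$hi"
}
```

The search needs roughly log2(N) builds for N patches. If the kernel with zero patches already shows the bug, the first assumption does not hold and the search result is meaningless.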

I hope this helps!

Stoiko Ivanov · May 7, 2024

fhloston said:
It only crashes on the 3 Ceph nodes with additional NVMe devices and Mellanox ConnectX-3 - the compute nodes do not crash. Same boards, same CPU gen, all X10DRU-i+ and E5-26xx v4.

* do you maybe have some logs of the systems that crash?
* did you also try version: 6.8.4-3-pve?

Thanks!

Der Harry · May 7, 2024

Stoiko Ivanov said:
I hope this helps!

It's building.

I am making my own -pve-harry kernels now.

My first test is to remove all pve patches - then we are as vanilla / close to Ubuntu as possible.

Let me check if I can get access to some of these 48-core beasts to make them cook.

Der Harry · May 7, 2024

Stoiko Ivanov said:
* do you maybe have some logs of the systems that crash?
* did you also try version: 6.8.4-3-pve?

Thanks!

I can (finally) confirm it's an upstream issue.

Bash:

root@nuc:~# uname -a
Linux nuc.xxx 6.8.4-3-pve-harry #1 SMP PREEMPT_DYNAMIC Tue May  7 18:34:58 UTC 2024 x86_64 GNU/Linux
root@nuc:~# cat /proc/filesystems  | grep zfs
nodev   zfs
root@nuc:~# lsmod | grep zfs
zfs                  6217728  6
spl                   151552  1 zfs

My 6.8.4-3-pve-harry Kernel has zero of these patches: <https://github.com/proxmox/pve-kernel/tree/master/patches/kernel>

ZFS included but not used.

I guess: Houston, we don't have a problem. Ubuntu has.

Code:

[    1.350929] ------------[ cut here ]------------
[    1.350932] WARNING: CPU: 0 PID: 148 at drivers/iommu/intel/iommu.c:167 intel_iommu_probe_device+0x26d/0x8d0
[    1.350938] Modules linked in: xhci_pci_renesas intel_lpss_pci(+) i2c_i801 sdhci libahci intel_lpss idma64 xhci_hcd e1000e(+) crc32_pclmul i2c_smbus video pinctrl_sunrisepoint wmi
[    1.350954] CPU: 0 PID: 148 Comm: (udev-worker) Tainted: G          I        6.8.4-3-pve-harry #1
[    1.350957] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0073.2020.0909.1625 09/09/2020
[    1.350959] RIP: 0010:intel_iommu_probe_device+0x26d/0x8d0
[    1.350962] Code: b7 f6 0f b7 42 d4 48 8d 4a 10 66 c1 c0 08 0f b7 c0 39 c6 0f 8c 90 00 00 00 0f 8f 86 00 00 00 4c 89 fe 4c 89 f7 e8 f3 27 6a 00 <0f> 0b 48 c7 c0 ef ff ff ff 4c 89 ef 48 89 45 c0 e8 3e a9 95 ff 48
[    1.350964] RSP: 0018:ffffa99e0061b568 EFLAGS: 00010246
[    1.350967] RAX: 0000000000000000 RBX: ffff8bd681f80c10 RCX: 0000000000000000
[    1.350969] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.350970] RBP: ffffa99e0061b5b0 R08: 0000000000000000 R09: 0000000000000000
[    1.350972] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bd6802cdc00
[    1.350973] R13: ffff8bd681327600 R14: ffff8bd6802cdd48 R15: 0000000000000246
[    1.350975] FS:  0000724d7995a8c0(0000) GS:ffff8bd9f2200000(0000) knlGS:0000000000000000
[    1.350977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.350979] CR2: 0000724d7a140218 CR3: 0000000101d86002 CR4: 00000000003706f0
[    1.350981] Call Trace:
[    1.350983]  <TASK>
[    1.350986]  ? show_regs+0x6d/0x80
[    1.350991]  ? __warn+0x89/0x160
[    1.350995]  ? intel_iommu_probe_device+0x26d/0x8d0
[    1.350999]  ? report_bug+0x17e/0x1b0
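To compare traces like this one between kernels it is often enough to pull out where each WARNING fired. A minimal sketch, assuming the usual "WARNING: CPU: n PID: n at file:line function" layout seen in the trace above; list_kernel_warnings is a made-up helper name.

```shell
#!/bin/sh
# Print "source-location function+offset" for every kernel WARNING line.
# Reads the log files given as arguments, or stdin when none are given.
list_kernel_warnings() {
    sed -n 's/.*WARNING: CPU: [0-9]* PID: [0-9]* at \([^ ]*\) \([^ ]*\).*/\1 \2/p' "$@"
}
```

Typical use is `dmesg | list_kernel_warnings` on each kernel under test. For the trace above it prints `drivers/iommu/intel/iommu.c:167 intel_iommu_probe_device+0x26d/0x8d0`.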

@Stoiko Ivanov at this point, there is a way to go:

Apply whatever 6.8.6-zabbly+ does to fix these things to the pve kernel patches.

That's >> on Proxmox GmbH <<, not on me.

PM me if you want to hire me to do this.

spirit · May 8, 2024

fhloston said:
Also crashes here with 6.8.4 and not with 6.5.13:

Supermicro X10DRU-i+, Bios 3.4
E5-2620 v4
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
88:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I have 11 machines running these X10-DRU-i+ boards, only the three ceph nodes with the ConnectX-3 and no vms at all crash.
The other VM hosts without ConnectX-3 run just fine with kernel 6.8.4.

Hi, this is interesting. I also have 20 nodes working fine with 6.8.4-2 (without Ceph, without OSDs) for 3 weeks, but 2 nodes with Ceph OSDs crash within 24h.
They are Lenovo EPYC v3 servers with NVMe drives.

Do you use encryption for your OSDs? (I have a trace related to storage/dm-crypt)

I'm currently on holiday, I'll try a newer kernel version next week.

Der Harry · May 8, 2024

Der Harry said:
Did you ask about that crash in an Ubuntu forum?

That must be something well known with the Ubuntu kernels.

(or on the Kernel Mailing List)

Would be nice if we can split here and you give that a try.

I am trying to set up the build system for the pve kernels; we didn't get much help from the pve team here.

@leesteken any updates on the amdgpu thing?

I found multiple sources confirming issues and some mitigations.
