PVE Host crashing at random. Need some help

jasper9041

New Member
Jun 10, 2023
Hi everyone,

I'm having some issues with one of my pve hosts. I'm hoping you guys could point me in the right direction since I'm a bit lost at this point...

I have been encountering random crashes on this node:
  • pve web UI unreachable
  • SSH to the node: ok
  • VMs: unreachable
Seeing crash logs like this:
Code:
Jun 10 05:53:04 pve3 kernel: [47887.900969] general protection fault, probably for non-canonical address 0xeb879ed8efccc2c0: 0000 [#1] PREEMPT SMP NOPTI
Jun 10 05:53:04 pve3 kernel: [47887.900989] CPU: 9 PID: 145696 Comm: vgs Tainted: P           O       6.2.11-2-pve #1
Jun 10 05:53:04 pve3 kernel: [47887.900997] Hardware name: ASUS System Product Name/TUF GAMING B550-PLUS, BIOS 3002 02/23/2023
Jun 10 05:53:04 pve3 kernel: [47887.901006] RIP: 0010:kmem_cache_alloc+0xf1/0x330
Jun 10 05:53:04 pve3 kernel: [47887.901015] Code: ef 22 65 48 8b 50 08 48 83 78 10 00 48 8b 38 0f 84 e6 01 00 00 48 85 ff 0f 84 dd 01 00 00 41 8b 44 24 28 4d 8b 04 24 48 01 f8 <48> 8b 18 48 89 c1 49 33 9c 24 b8 00 00 00 48 89 f8 48 0f c9 48 31
Jun 10 05:53:04 pve3 kernel: [47887.901029] RSP: 0018:ffffb7ec5ddcbc10 EFLAGS: 00010286
Jun 10 05:53:04 pve3 kernel: [47887.901037] RAX: eb879ed8efccc2c0 RBX: 0000000000000dc0 RCX: 0000000000000001
Jun 10 05:53:04 pve3 kernel: [47887.901044] RDX: 0000000343878009 RSI: 0000000000000200 RDI: eb879ed8efccc250
Jun 10 05:53:04 pve3 kernel: [47887.901051] RBP: ffffb7ec5ddcbc50 R08: 0000000000037fb0 R09: 0000000000000000
Jun 10 05:53:04 pve3 kernel: [47887.901059] R10: fefefefefefefeff R11: 0000000000000000 R12: ffff917a00206f00
Jun 10 05:53:04 pve3 kernel: [47887.901066] R13: 0000000000000dc0 R14: ffff917a1672fa00 R15: ffffffff9ae38968
Jun 10 05:53:04 pve3 kernel: [47887.901073] FS:  00007fb51a63f180(0000) GS:ffff91988e640000(0000) knlGS:0000000000000000
Jun 10 05:53:04 pve3 kernel: [47887.901082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 10 05:53:04 pve3 kernel: [47887.901089] CR2: 0000564ee8d40008 CR3: 00000004111f6000 CR4: 0000000000750ee0
Jun 10 05:53:04 pve3 kernel: [47887.901096] PKRU: 55555554
Jun 10 05:53:04 pve3 kernel: [47887.901100] Call Trace:
Jun 10 05:53:04 pve3 kernel: [47887.901105]  <TASK>
Jun 10 05:53:04 pve3 kernel: [47887.901110]  __alloc_file+0x28/0xf0
Jun 10 05:53:04 pve3 kernel: [47887.901123]  ? try_to_unlazy+0x60/0xd0
Jun 10 05:53:04 pve3 kernel: [47887.901135]  alloc_empty_file+0x46/0xe0
Jun 10 05:53:04 pve3 kernel: [47887.901141]  path_openat+0x4a/0x1130
Jun 10 05:53:04 pve3 kernel: [47887.901147]  ? do_filp_open+0xb6/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901153]  ? _copy_to_user+0x25/0x40
Jun 10 05:53:04 pve3 kernel: [47887.901160]  do_filp_open+0xb6/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901166]  ? alloc_fd+0xb1/0x190
Jun 10 05:53:04 pve3 kernel: [47887.901173]  do_sys_openat2+0x9f/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901179]  __x64_sys_openat+0x56/0xa0
Jun 10 05:53:04 pve3 kernel: [47887.901185]  do_syscall_64+0x5c/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901192]  ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901198]  ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901204]  ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901210]  ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901216]  ? sysvec_reschedule_ipi+0x7b/0x120
Jun 10 05:53:04 pve3 kernel: [47887.901223]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 10 05:53:04 pve3 kernel: [47887.901230] RIP: 0033:0x7fb51ab272a2
Jun 10 05:53:04 pve3 kernel: [47887.901236] Code: c0 f6 c2 40 75 52 89 d0 45 31 d2 25 00 00 41 00 3d 00 00 41 00 74 41 64 8b 04 25 18 00 00 00 85 c0 75 65 b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 38 64 48 2b 0c 25
Jun 10 05:53:04 pve3 kernel: [47887.901250] RSP: 002b:00007ffead446420 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Jun 10 05:53:04 pve3 kernel: [47887.901259] RAX: ffffffffffffffda RBX: 00007ffead4465b0 RCX: 00007fb51ab272a2
Jun 10 05:53:04 pve3 kernel: [47887.901266] RDX: 00000000002a0000 RSI: 0000564ee95685d1 RDI: 0000000000000004
Jun 10 05:53:04 pve3 kernel: [47887.901273] RBP: 0000564ee95685d0 R08: 00007fb51adcc5c0 R09: 0073656369766564
Jun 10 05:53:04 pve3 kernel: [47887.901281] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564ee95685d1
Jun 10 05:53:04 pve3 kernel: [47887.901289] R13: 0000564ee95694e4 R14: 0000000000000004 R15: 0000000000000008
Jun 10 05:53:04 pve3 kernel: [47887.901297]  </TASK>
Jun 10 05:53:04 pve3 kernel: [47887.901301] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic iommu_v2 snd_hda_codec_hdmi drm_buddy kvm gpu_sched drm_ttm_helper crct10dif_pclmul snd_hda_intel ttm polyval_clmulni snd_intel_dspcfg polyval_generic snd_intel_sdw_acpi ghash_clmulni_intel drm_display_helper sha512_ssse3 cec snd_hda_codec aesni_intel zfs(PO) rc_core crypto_simd eeepc_wmi snd_hda_core cryptd asus_wmi snd_hwdep drm_kms_helper zunicode(PO) rapl i2c_algo_bit ledtrig_audio snd_pcm sparse_keymap snd_timer syscopyarea sysfillrect zzstd(O) platform_profile snd sysimgblt video efi_pstore wmi_bmof soundcore pcspkr k10temp zlua(O) ccp input_leds zavl(PO) icp(PO) zcommon(PO) znvpair(PO) mac_hid spl(O) vhost_net vhost
Jun 10 05:53:04 pve3 kernel: [47887.901335]  vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_generic usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c mpt3sas r8169 xhci_pci raid_class xhci_pci_renesas crc32_pclmul i2c_piix4 realtek scsi_transport_sas ahci xhci_hcd libahci wmi gpio_amdpt
Jun 10 05:53:04 pve3 kernel: [47887.901441] ---[ end trace 0000000000000000 ]---
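
Editor's note: the first line of the trace reports a "non-canonical address", and the same value shows up in RAX in the register dump. A minimal Python sketch of what that check means follows, assuming 48-bit virtual addresses (4-level paging, which is an assumption about this host and not something the log states). The function names are made up for illustration. A garbage pointer like this fits memory corruption, but on its own it does not say where the corruption comes from.

```python
import re

# Assumption: 48-bit virtual addresses (4-level paging), the common x86-64 setup.
VA_BITS = 48


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """Return True if bits 63..(va_bits-1) of addr are all 0 or all 1."""
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


# Matches the address printed on "general protection fault" lines.
GPF_RE = re.compile(r"general protection fault.*?address (0x[0-9a-f]+)")


def gpf_addresses(log_text: str) -> list[int]:
    """Extract the faulting addresses from 'general protection fault' lines."""
    return [int(m.group(1), 16) for m in GPF_RE.finditer(log_text)]


line = (
    "Jun 10 05:53:04 pve3 kernel: [47887.900969] general protection fault, "
    "probably for non-canonical address 0xeb879ed8efccc2c0: 0000 [#1] PREEMPT SMP NOPTI"
)
for addr in gpf_addresses(line):
    print(hex(addr), "canonical" if is_canonical(addr) else "non-canonical")
# prints: 0xeb879ed8efccc2c0 non-canonical
```

For comparison, R12 from the same dump (0xffff917a00206f00) passes the check, which is what a normal kernel pointer looks like.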

The pve node:
  • Motherboard: Asus TUF GAMING B550-PLUS
  • CPU: AMD Ryzen 7 5700G
  • Memory: 128GB Kingston DDR4 3200
  • Booting from a 256 GB Samsung SSD
  • A few other SSDs and HDDs for VM storage & backups
  • LSI HBA & Marvell SATA controller (PCIe) passed through to a TrueNAS SCALE VM
  • PVE: 7.4-13 with kernel 6.2 installed
I was under the impression this had something to do with the PCIe passthrough, but at this point I'm not sure anymore.
Almost all pcie devices were in the same IOMMU group, so I configured my grub with:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on pcie_acs_override=downstream,multifunction vfio-pci.ids=1b21:1064,1000:0072"
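
Editor's note: to see how devices are actually grouped, the IOMMU groups can be listed from sysfs. Below is a small Python sketch that reads /sys/kernel/iommu_groups. The function name is hypothetical and the root path is a parameter only so the function can be tried against a test directory. Keep in mind that with pcie_acs_override the kernel reports the split groups, which does not necessarily mean the hardware isolates those devices from each other.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict[int, list[str]]:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups: dict[int, list[str]] = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled, or sysfs not available here
    for group_dir in base.iterdir():
        if not group_dir.name.isdigit():
            continue  # skip anything that is not a numbered group
        devices = group_dir / "devices"
        if devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"group {number}: {', '.join(devices)}")
```

The device IDs in vfio-pci.ids can then be matched against the listed PCI addresses with `lspci -nn`.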

Some of my troubleshooting steps have included:
  • Updating PVE
  • Switching to kernel 6.2
  • Updating to the latest motherboard firmware
Any ideas?
All help is greatly appreciated.

Thanks!
- Jasper
 
I was under the impression this had something to do with the PCIe passthrough, but at this point I'm not sure anymore.
Undo the passthrough (amd_iommu=off) and see if the problem disappears. You are using pcie_acs_override, so PCIe devices might be interfering with each other, causing corruption.
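
Editor's note: after changing the GRUB line and rebooting, it is worth confirming which options the running kernel actually booted with. A rough Python sketch follows; the helper names and the WATCHED list are made up for illustration, and on a real host the input would come from /proc/cmdline.

```python
from pathlib import Path
from typing import Optional

# Options from the GRUB line in this thread that relate to passthrough.
WATCHED = ("amd_iommu", "pcie_acs_override", "vfio-pci.ids")


def parse_cmdline(cmdline: str) -> dict[str, Optional[str]]:
    """Split a kernel command line into {option: value}; bare flags map to None."""
    opts: dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        opts[key] = value if sep else None
    return opts


def passthrough_options(cmdline: str) -> dict[str, Optional[str]]:
    """Return only the watched passthrough-related options that are present."""
    opts = parse_cmdline(cmdline)
    return {key: opts[key] for key in WATCHED if key in opts}


example = (
    "quiet amd_iommu=on pcie_acs_override=downstream,multifunction "
    "vfio-pci.ids=1b21:1064,1000:0072"
)

if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    text = proc.read_text() if proc.exists() else example
    print(passthrough_options(text))
```

Run against the original GRUB line, all three options show up. After undoing the passthrough, pcie_acs_override and vfio-pci.ids should be gone from the output.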
Some of my troubleshooting steps have included:
  • Updating PVE
  • Switching to kernel 6.2
  • Updating to the latest motherboard firmware
Maybe the memory is failing and/or the drive is corrupted? Do a long memtest.
 
Had to do a lot of testing with memtest...
If I populate all the RAM slots and enable XMP, the memory starts showing lots of errors.

Eventually I gave up and disabled XMP. It has been stable for a few days now.

Fingers crossed that this has fixed all my problems.

Thanks for your help @leesteken !