Hi everyone,
I'm having issues with one of my PVE hosts and hope you can point me in the right direction, since I'm a bit lost at this point.
This node crashes at random. When it does:
- PVE web UI: unreachable
- SSH to the node: OK
- VMs: unreachable
Code:
Jun 10 05:53:04 pve3 kernel: [47887.900969] general protection fault, probably for non-canonical address 0xeb879ed8efccc2c0: 0000 [#1] PREEMPT SMP NOPTI
Jun 10 05:53:04 pve3 kernel: [47887.900989] CPU: 9 PID: 145696 Comm: vgs Tainted: P O 6.2.11-2-pve #1
Jun 10 05:53:04 pve3 kernel: [47887.900997] Hardware name: ASUS System Product Name/TUF GAMING B550-PLUS, BIOS 3002 02/23/2023
Jun 10 05:53:04 pve3 kernel: [47887.901006] RIP: 0010:kmem_cache_alloc+0xf1/0x330
Jun 10 05:53:04 pve3 kernel: [47887.901015] Code: ef 22 65 48 8b 50 08 48 83 78 10 00 48 8b 38 0f 84 e6 01 00 00 48 85 ff 0f 84 dd 01 00 00 41 8b 44 24 28 4d 8b 04 24 48 01 f8 <48> 8b 18 48 89 c1 49 33 9c 24 b8 00 00 00 48 89 f8 48 0f c9 48 31
Jun 10 05:53:04 pve3 kernel: [47887.901029] RSP: 0018:ffffb7ec5ddcbc10 EFLAGS: 00010286
Jun 10 05:53:04 pve3 kernel: [47887.901037] RAX: eb879ed8efccc2c0 RBX: 0000000000000dc0 RCX: 0000000000000001
Jun 10 05:53:04 pve3 kernel: [47887.901044] RDX: 0000000343878009 RSI: 0000000000000200 RDI: eb879ed8efccc250
Jun 10 05:53:04 pve3 kernel: [47887.901051] RBP: ffffb7ec5ddcbc50 R08: 0000000000037fb0 R09: 0000000000000000
Jun 10 05:53:04 pve3 kernel: [47887.901059] R10: fefefefefefefeff R11: 0000000000000000 R12: ffff917a00206f00
Jun 10 05:53:04 pve3 kernel: [47887.901066] R13: 0000000000000dc0 R14: ffff917a1672fa00 R15: ffffffff9ae38968
Jun 10 05:53:04 pve3 kernel: [47887.901073] FS: 00007fb51a63f180(0000) GS:ffff91988e640000(0000) knlGS:0000000000000000
Jun 10 05:53:04 pve3 kernel: [47887.901082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 10 05:53:04 pve3 kernel: [47887.901089] CR2: 0000564ee8d40008 CR3: 00000004111f6000 CR4: 0000000000750ee0
Jun 10 05:53:04 pve3 kernel: [47887.901096] PKRU: 55555554
Jun 10 05:53:04 pve3 kernel: [47887.901100] Call Trace:
Jun 10 05:53:04 pve3 kernel: [47887.901105] <TASK>
Jun 10 05:53:04 pve3 kernel: [47887.901110] __alloc_file+0x28/0xf0
Jun 10 05:53:04 pve3 kernel: [47887.901123] ? try_to_unlazy+0x60/0xd0
Jun 10 05:53:04 pve3 kernel: [47887.901135] alloc_empty_file+0x46/0xe0
Jun 10 05:53:04 pve3 kernel: [47887.901141] path_openat+0x4a/0x1130
Jun 10 05:53:04 pve3 kernel: [47887.901147] ? do_filp_open+0xb6/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901153] ? _copy_to_user+0x25/0x40
Jun 10 05:53:04 pve3 kernel: [47887.901160] do_filp_open+0xb6/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901166] ? alloc_fd+0xb1/0x190
Jun 10 05:53:04 pve3 kernel: [47887.901173] do_sys_openat2+0x9f/0x160
Jun 10 05:53:04 pve3 kernel: [47887.901179] __x64_sys_openat+0x56/0xa0
Jun 10 05:53:04 pve3 kernel: [47887.901185] do_syscall_64+0x5c/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901192] ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901198] ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901204] ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901210] ? do_syscall_64+0x69/0x90
Jun 10 05:53:04 pve3 kernel: [47887.901216] ? sysvec_reschedule_ipi+0x7b/0x120
Jun 10 05:53:04 pve3 kernel: [47887.901223] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 10 05:53:04 pve3 kernel: [47887.901230] RIP: 0033:0x7fb51ab272a2
Jun 10 05:53:04 pve3 kernel: [47887.901236] Code: c0 f6 c2 40 75 52 89 d0 45 31 d2 25 00 00 41 00 3d 00 00 41 00 74 41 64 8b 04 25 18 00 00 00 85 c0 75 65 b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 38 64 48 2b 0c 25
Jun 10 05:53:04 pve3 kernel: [47887.901250] RSP: 002b:00007ffead446420 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Jun 10 05:53:04 pve3 kernel: [47887.901259] RAX: ffffffffffffffda RBX: 00007ffead4465b0 RCX: 00007fb51ab272a2
Jun 10 05:53:04 pve3 kernel: [47887.901266] RDX: 00000000002a0000 RSI: 0000564ee95685d1 RDI: 0000000000000004
Jun 10 05:53:04 pve3 kernel: [47887.901273] RBP: 0000564ee95685d0 R08: 00007fb51adcc5c0 R09: 0073656369766564
Jun 10 05:53:04 pve3 kernel: [47887.901281] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564ee95685d1
Jun 10 05:53:04 pve3 kernel: [47887.901289] R13: 0000564ee95694e4 R14: 0000000000000004 R15: 0000000000000008
Jun 10 05:53:04 pve3 kernel: [47887.901297] </TASK>
Jun 10 05:53:04 pve3 kernel: [47887.901301] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic iommu_v2 snd_hda_codec_hdmi drm_buddy kvm gpu_sched drm_ttm_helper crct10dif_pclmul snd_hda_intel ttm polyval_clmulni snd_intel_dspcfg polyval_generic snd_intel_sdw_acpi ghash_clmulni_intel drm_display_helper sha512_ssse3 cec snd_hda_codec aesni_intel zfs(PO) rc_core crypto_simd eeepc_wmi snd_hda_core cryptd asus_wmi snd_hwdep drm_kms_helper zunicode(PO) rapl i2c_algo_bit ledtrig_audio snd_pcm sparse_keymap snd_timer syscopyarea sysfillrect zzstd(O) platform_profile snd sysimgblt video efi_pstore wmi_bmof soundcore pcspkr k10temp zlua(O) ccp input_leds zavl(PO) icp(PO) zcommon(PO) znvpair(PO) mac_hid spl(O) vhost_net vhost
Jun 10 05:53:04 pve3 kernel: [47887.901335] vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_generic usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c mpt3sas r8169 xhci_pci raid_class xhci_pci_renesas crc32_pclmul i2c_piix4 realtek scsi_transport_sas ahci xhci_hcd libahci wmi gpio_amdpt
Jun 10 05:53:04 pve3 kernel: [47887.901441] ---[ end trace 0000000000000000 ]---
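Because the crashes are random, it can help to compare only the headline lines of each fault across boots. A minimal sketch, assuming logs in the same format as the trace above; the function name `summarize_oops` is illustrative, and the log sources in the usage comments are assumptions about a typical PVE 7 install:

```shell
#!/bin/sh
# Keep only the lines that identify a kernel fault: the fault message, the
# process/kernel version line, the hardware line, and the kernel-mode
# instruction pointer (code segment 0010; user-mode RIP lines use 0033).
summarize_oops() {
    grep -E 'general protection fault|Comm: |Hardware name: |RIP: 0010:'
}

# Usage (assumed log sources, adjust to your setup):
#   summarize_oops < /var/log/syslog
#   journalctl -k -b -1 --no-pager | summarize_oops
```

If the `RIP:` function and the `Comm:` process differ from crash to crash, the faults are probably not tied to one code path.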
The PVE node:
- Motherboard: ASUS TUF GAMING B550-PLUS
- CPU: AMD Ryzen 7 5700G
- Memory: 128 GB Kingston DDR4-3200
- Boot drive: 256 GB Samsung SSD
- A few other SSDs and HDDs for VM storage and backups
- LSI HBA and Marvell SATA controller (PCIe), passed through to a TrueNAS SCALE VM
- PVE: 7.4-13 with kernel 6.2 installed
Almost all PCIe devices were in the same IOMMU group, so I configured GRUB with:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on pcie_acs_override=downstream,multifunction vfio-pci.ids=1b21:1064,1000:0072"
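To check how the groups look with and without `pcie_acs_override`, here is a minimal sketch that reads them straight from sysfs. It assumes the standard `/sys/kernel/iommu_groups` layout; the function name `list_iommu_groups` and its optional root-directory argument are illustrative only.

```shell
#!/bin/sh
# Print one line per device: "<group> <PCI address> <vendor>:<device>".
# The optional argument exists only so the function can be pointed at another
# directory tree; on a real host the default sysfs path is used.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue       # glob matched nothing: no groups found
        group="${dev#"$root"/}"         # e.g. 12/devices/0000:01:00.0
        group="${group%%/*}"            # e.g. 12
        # sysfs stores the IDs as e.g. 0x1000; "?" marks a missing file
        vendor="$(cat "$dev/vendor" 2>/dev/null || echo '?')"
        device="$(cat "$dev/device" 2>/dev/null || echo '?')"
        printf '%s %s %s:%s\n' "$group" "${dev##*/}" "${vendor#0x}" "${device#0x}"
    done | sort -n                      # order by group number
}

list_iommu_groups
```

The IDs are printed in the same `vendor:device` form used by `vfio-pci.ids`, so the two passed-through controllers (`1b21:1064` and `1000:0072`) are easy to spot, along with whatever else shares their groups.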
Troubleshooting steps so far:
- Updating PVE
- Switching to kernel 6.2
- Updating to the latest motherboard firmware
All help is greatly appreciated.
Thanks!
- Jasper