kernel panic: BUG: unable to handle page fault for address: 0000000000008000

dyadyaMisha

I recently installed a new server and ran into a kernel panic. Can anyone tell me where to look for the cause?

Linux pve02 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100) x86_64 GNU/Linux

Code:
Dec 20 19:07:35 pve02 kernel: [35190.115810] BUG: unable to handle page fault for address: 0000000000008000
Dec 20 19:07:35 pve02 kernel: [35190.116350] #PF: supervisor read access in kernel mode
Dec 20 19:07:35 pve02 kernel: [35190.116817] #PF: error_code(0x0000) - not-present page
Dec 20 19:07:35 pve02 kernel: [35190.117266] PGD 0 P4D 0
Dec 20 19:07:35 pve02 kernel: [35190.117698] Oops: 0000 [#1] SMP PTI
Dec 20 19:07:35 pve02 kernel: [35190.118130] CPU: 2 PID: 31438 Comm: sshd Tainted: P           O      5.4.78-2-pve #1
Dec 20 19:07:35 pve02 kernel: [35190.118546] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 0602 09/13/2011
Dec 20 19:07:35 pve02 kernel: [35190.118983] RIP: 0010:skb_release_data+0xa9/0x180
Dec 20 19:07:35 pve02 kernel: [35190.119419] Code: 48 0f 45 fa 66 66 66 66 90 f0 ff 4f 34 75 ce e8 4d 3f 94 ff 41 0f b6 45 02 48 83 c3 01 39 d8 7f c9 49 8b 7d 08 48 85 ff 74 10 <48> 8b 1f e8 2f f5 ff ff 48 89 df 48 85 db 75 f0 4d 85 e4 74 57 41
Dec 20 19:07:35 pve02 kernel: [35190.120412] RSP: 0018:ffffa8d5c0c1fc30 EFLAGS: 00010206
Dec 20 19:07:35 pve02 kernel: [35190.120922] RAX: 0000000000000020 RBX: 0000000000000000 RCX: ffffffff875f3a00
Dec 20 19:07:35 pve02 kernel: [35190.121458] RDX: 000000000001dae4 RSI: 00000008316f5861 RDI: 0000000000008000
Dec 20 19:07:35 pve02 kernel: [35190.121984] RBP: ffffa8d5c0c1fc48 R08: 00000000000005a8 R09: ffffffff866f05b0
Dec 20 19:07:35 pve02 kernel: [35190.122515] R10: ffff89e42c4923d0 R11: 0000000000000000 R12: ffff89e418731b00
Dec 20 19:07:35 pve02 kernel: [35190.123049] R13: ffff89e40570ca40 R14: ffff89e42c49287c R15: 0000000000000000
Dec 20 19:07:35 pve02 kernel: [35190.123607] FS:  00007fef8c92fe40(0000) GS:ffff89e4bb880000(0000) knlGS:0000000000000000
Dec 20 19:07:35 pve02 kernel: [35190.124177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 20 19:07:35 pve02 kernel: [35190.124751] CR2: 0000000000008000 CR3: 000000009357e004 CR4: 00000000000626e0
Dec 20 19:07:35 pve02 kernel: [35190.125332] Call Trace:
Dec 20 19:07:35 pve02 kernel: [35190.125937]  skb_release_all+0x24/0x30
Dec 20 19:07:35 pve02 kernel: [35190.126527]  __kfree_skb+0x12/0x20
Dec 20 19:07:35 pve02 kernel: [35190.127115]  tcp_recvmsg+0x7b5/0xbd0
Dec 20 19:07:35 pve02 kernel: [35190.127707]  ? aa_sk_perm+0x43/0x180
Dec 20 19:07:35 pve02 kernel: [35190.128313]  inet_recvmsg+0x5e/0xf0
Dec 20 19:07:35 pve02 kernel: [35190.128910]  sock_recvmsg+0x66/0x70
Dec 20 19:07:35 pve02 kernel: [35190.129502]  sock_read_iter+0x8f/0xf0
Dec 20 19:07:35 pve02 kernel: [35190.130082]  new_sync_read+0x122/0x1b0
Dec 20 19:07:35 pve02 kernel: [35190.130661]  __vfs_read+0x29/0x40
Dec 20 19:07:35 pve02 kernel: [35190.131242]  vfs_read+0x99/0x160
Dec 20 19:07:35 pve02 kernel: [35190.131825]  ksys_read+0x61/0xe0
Dec 20 19:07:35 pve02 kernel: [35190.132408]  __x64_sys_read+0x1a/0x20
Dec 20 19:07:35 pve02 kernel: [35190.132990]  do_syscall_64+0x57/0x190
Dec 20 19:07:35 pve02 kernel: [35190.133581]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 20 19:07:35 pve02 kernel: [35190.134173] RIP: 0033:0x7fef8ccd2461
Dec 20 19:07:35 pve02 kernel: [35190.134754] Code: fe ff ff 50 48 8d 3d fe d0 09 00 e8 e9 03 02 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 99 62 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 41 54 49 89 d4 55 48
Dec 20 19:07:35 pve02 kernel: [35190.136028] RSP: 002b:00007ffdf1fa6db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Dec 20 19:07:35 pve02 kernel: [35190.136687] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fef8ccd2461
Dec 20 19:07:35 pve02 kernel: [35190.137352] RDX: 0000000000004000 RSI: 00007ffdf1fa6dc0 RDI: 0000000000000003
Dec 20 19:07:35 pve02 kernel: [35190.138050] RBP: 00005569812945f0 R08: 00007ffdf1faad58 R09: 00007ffdf1faad50
Dec 20 19:07:35 pve02 kernel: [35190.138726] R10: 0000000000008975 R11: 0000000000000246 R12: 00007ffdf1fa6dc0
Dec 20 19:07:35 pve02 kernel: [35190.139406] R13: 000055698082cb00 R14: 0000000000000003 R15: 00007ffdf1faae60
Dec 20 19:07:35 pve02 kernel: [35190.140087] Modules linked in: veth tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_codec_realtek irqbypass zfs(PO) snd_hda_codec_generic crct10dif_pclmul crc32_pclmul ledtrig_audio ghash_clmulni_intel aesni_intel zunicode(PO) crypto_simd zlua(PO) cryptd zavl(PO) glue_helper icp(PO) rapl snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core i915 snd_hwdep snd_pcm drm_kms_helper snd_timer snd drm soundcore i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt intel_cstate mxm_wmi pcspkr input_leds joydev eeepc_wmi mei_me usbmouse asus_wmi mei sparse_keymap wmi_bmof mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs
Dec 20 19:07:35 pve02 kernel: [35190.140110]  xor zstd_compress hid_generic usbkbd usbhid hid raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci libahci i2c_i801 xhci_pci lpc_ich r8169 realtek xhci_hcd ehci_pci ehci_hcd video wmi
Dec 20 19:07:35 pve02 kernel: [35190.147359] CR2: 0000000000008000
Dec 20 19:07:35 pve02 kernel: [35190.148242] ---[ end trace f2f583820acb9bf8 ]---
Dec 20 19:07:35 pve02 kernel: [35190.149216] RIP: 0010:skb_release_data+0xa9/0x180
Dec 20 19:07:35 pve02 kernel: [35190.150169] Code: 48 0f 45 fa 66 66 66 66 90 f0 ff 4f 34 75 ce e8 4d 3f 94 ff 41 0f b6 45 02 48 83 c3 01 39 d8 7f c9 49 8b 7d 08 48 85 ff 74 10 <48> 8b 1f e8 2f f5 ff ff 48 89 df 48 85 db 75 f0 4d 85 e4 74 57 41
Dec 20 19:07:35 pve02 kernel: [35190.152184] RSP: 0018:ffffa8d5c0c1fc30 EFLAGS: 00010206
Dec 20 19:07:35 pve02 kernel: [35190.153179] RAX: 0000000000000020 RBX: 0000000000000000 RCX: ffffffff875f3a00
Dec 20 19:07:35 pve02 kernel: [35190.154207] RDX: 000000000001dae4 RSI: 00000008316f5861 RDI: 0000000000008000
Dec 20 19:07:35 pve02 kernel: [35190.155264] RBP: ffffa8d5c0c1fc48 R08: 00000000000005a8 R09: ffffffff866f05b0
Dec 20 19:07:35 pve02 kernel: [35190.156284] R10: ffff89e42c4923d0 R11: 0000000000000000 R12: ffff89e418731b00
Dec 20 19:07:35 pve02 kernel: [35190.157309] R13: ffff89e40570ca40 R14: ffff89e42c49287c R15: 0000000000000000
Dec 20 19:07:35 pve02 kernel: [35190.158321] FS:  00007fef8c92fe40(0000) GS:ffff89e4bb880000(0000) knlGS:0000000000000000
Dec 20 19:07:35 pve02 kernel: [35190.159314] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 20 19:07:35 pve02 kernel: [35190.160249] CR2: 0000000000008000 CR3: 000000009357e004 CR4: 00000000000626e0
Dec 20 19:07:35 pve02 QEMU[31347]: kvm: Disconnect client, due to: Failed to read CMD_WRITE data: Unexpected end-of-file before all bytes were read

Code:
root@pve02:~# pveversion --verbose
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
That does indeed look like a kernel bug...
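As a rough way to see where to start reading an oops like the one above, here is a minimal Python sketch that pulls out the fields usually checked first. The function name `summarize_oops` and the trimmed `OOPS` sample are made up for illustration; the values are copied from the log. In this trace the faulting bytes `48 8b 1f` decode on x86-64 to `mov (%rdi),%rbx`, and RDI holds the same value as the faulting address (0x8000), so the kernel appears to have followed a bogus pointer while freeing an skb. That is consistent with a kernel bug or memory corruption, but the trace alone does not say which.

```python
import re

# Lines copied from the oops above, with the syslog prefix stripped.
OOPS = """\
BUG: unable to handle page fault for address: 0000000000008000
CPU: 2 PID: 31438 Comm: sshd Tainted: P           O      5.4.78-2-pve #1
RIP: 0010:skb_release_data+0xa9/0x180
Code: 48 0f 45 fa 66 66 66 66 90 f0 ff 4f 34 75 ce e8 4d 3f 94 ff 41 0f b6 45 02 48 83 c3 01 39 d8 7f c9 49 8b 7d 08 48 85 ff 74 10 <48> 8b 1f e8 2f f5 ff ff 48 89 df 48 85 db 75 f0 4d 85 e4 74 57 41
RSP: 0018:ffffa8d5c0c1fc30 EFLAGS: 00010206
RAX: 0000000000000020 RBX: 0000000000000000 RCX: ffffffff875f3a00
RDX: 000000000001dae4 RSI: 00000008316f5861 RDI: 0000000000008000
CR2: 0000000000008000 CR3: 000000009357e004 CR4: 00000000000626e0
"""


def summarize_oops(text):
    """Pull out the fields that are usually read first in an x86-64 oops."""
    fault = int(re.search(r"for address: ([0-9a-f]+)", text).group(1), 16)
    # General-purpose registers are printed as NAME: <16 hex digits>.
    regs = {
        name: int(value, 16)
        for name, value in re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    }
    # The byte wrapped in <> marks where the faulting instruction starts.
    code = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2}){2})", text)
    return {
        "fault_address": fault,
        "rip": re.search(r"RIP: 0010:(\S+)", text).group(1),
        "comm": re.search(r"Comm: (\S+)", text).group(1),
        "faulting_bytes": code.group(1) + code.group(2),
        "regs_equal_to_fault": sorted(n for n, v in regs.items() if v == fault),
    }


if __name__ == "__main__":
    for key, value in summarize_oops(OOPS).items():
        print(f"{key}: {value}")
```

The register comparison is only a heuristic: a register matching the faulting address hints at which pointer was dereferenced, nothing more.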
 
Hi,

Just wanted to chime in here. I triggered the same problem with the following setup:
- Intel XL710 NIC in host (up2date i40e+fw) + vmbr + kvm with virtio NIC
- Ubuntu 20.04 in VM with zvol (just single disk, formatted in VM with zfs)
- AMD EPYC 7302P on Supermicro H11SSL-C

When I do a zfs scrub in the VM I see no issues, and when I run iperf3 in the VM there are no issues either. But running rsync in the VM with heavy disk and network I/O crashes the host after a while, somewhere between 30 minutes and 3 hours after starting rsync.
I downgraded the host to 5.4.65-1-pve and it has been stable since. I haven't had time for more troubleshooting yet due to the holidays.

Cheers,
foobar42
 
I have a similar problem with kernel 5.4.78-2-pve:

Jan 18 20:46:49 Server01 kernel: [1068239.992143] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xa0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.992838] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x10a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.993413] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x1aa0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.993961] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x20a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.994520] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x2ca0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.995098] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x30a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.995620] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x34a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.996133] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x40a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.996635] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x4ea0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.997130] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x50a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.997643] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x58a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.998106] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x60a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.998581] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x68a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.999030] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x70a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.999508] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x7600 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.999939] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x7ca0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068240.000363] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x80a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068240.000776] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x8d00 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068240.001183] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x90a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068240.001582] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x98a0 flags=0x0000]
Jan 18 20:46:55 Server01 kernel: [1068245.974118] ------------[ cut here ]------------
Jan 18 20:46:55 Server01 kernel: [1068245.974122] NETDEV WATCHDOG: enp67s0f0 (i40e): transmit queue 45 timed out
Jan 18 20:46:55 Server01 kernel: [1068245.974148] WARNING: CPU: 73 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
Jan 18 20:46:55 Server01 kernel: [1068245.974148] Modules linked in: veth ceph libceph rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif glue_helper pcspkr ast drm_vram_helper ttm joydev input_leds drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler mac_hid zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbmouse usbkbd usbhid hid igb i2c_algo_bit dca ahci i40e libahci xhci_pci xhci_hcd i2c_piix4
Jan 18 20:46:55 Server01 kernel: [1068245.974205] CPU: 73 PID: 0 Comm: swapper/73 Tainted: P O 5.4.78-2-pve #1
Jan 18 20:46:55 Server01 kernel: [1068245.974205] Hardware name: H11DSi, BIOS 2.1 02/21/2020
Jan 18 20:46:55 Server01 kernel: [1068245.974208] RIP: 0010:dev_watchdog+0x264/0x270
Jan 18 20:46:55 Server01 kernel: [1068245.974210] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 8f d6 ea 00 01 e8 60 aa fa ff 89 d9 4c 89 ee 48 c7 c7 28 4c 63 97 48 89 c2 e8 4d 31 74 ff <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
Jan 18 20:46:55 Server01 kernel: [1068245.974211] RSP: 0018:ffffae411a120e58 EFLAGS: 00010282
Jan 18 20:46:55 Server01 kernel: [1068245.974212] RAX: 0000000000000000 RBX: 000000000000002d RCX: 0000000000000006
Jan 18 20:46:55 Server01 kernel: [1068245.974213] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff95fcce8578c0
Jan 18 20:46:55 Server01 kernel: [1068245.974213] RBP: ffffae411a120e88 R08: 0000000000000a44 R09: 0000000000000004
Jan 18 20:46:55 Server01 kernel: [1068245.974214] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000080
Jan 18 20:46:55 Server01 kernel: [1068245.974214] R13: ffff95fbee894000 R14: ffff95fbee894480 R15: ffff95fbed519f40
Jan 18 20:46:55 Server01 kernel: [1068245.974215] FS: 0000000000000000(0000) GS:ffff95fcce840000(0000) knlGS:0000000000000000
Jan 18 20:46:55 Server01 kernel: [1068245.974216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 18 20:46:55 Server01 kernel: [1068245.974216] CR2: 00000266054e5000 CR3: 0000007d45c14000 CR4: 0000000000340ee0
Jan 18 20:46:55 Server01 kernel: [1068245.974217] Call Trace:
Jan 18 20:46:55 Server01 kernel: [1068245.974220] <IRQ>
Jan 18 20:46:55 Server01 kernel: [1068245.974224] ? pfifo_fast_enqueue+0x160/0x160
Jan 18 20:46:55 Server01 kernel: [1068245.974229] call_timer_fn+0x32/0x130
Jan 18 20:46:55 Server01 kernel: [1068245.974231] run_timer_softirq+0x1a5/0x430
Jan 18 20:46:55 Server01 kernel: [1068245.974232] ? enqueue_hrtimer+0x3c/0x90
Jan 18 20:46:55 Server01 kernel: [1068245.974234] ? ktime_get+0x3c/0xa0
Jan 18 20:46:55 Server01 kernel: [1068245.974238] ? lapic_next_event+0x20/0x30
Jan 18 20:46:55 Server01 kernel: [1068245.974240] ? clockevents_program_event+0x93/0xf0
Jan 18 20:46:55 Server01 kernel: [1068245.974243] __do_softirq+0xdc/0x2d4
Jan 18 20:46:55 Server01 kernel: [1068245.974246] irq_exit+0xa9/0xb0
Jan 18 20:46:55 Server01 kernel: [1068245.974247] smp_apic_timer_interrupt+0x79/0x130
Jan 18 20:46:55 Server01 kernel: [1068245.974250] apic_timer_interrupt+0xf/0x20
Jan 18 20:46:55 Server01 kernel: [1068245.974250] </IRQ>
Jan 18 20:46:55 Server01 kernel: [1068245.974254] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Jan 18 20:46:55 Server01 kernel: [1068245.974255] Code: ff e8 57 77 84 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 4a e8 8a ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Jan 18 20:46:55 Server01 kernel: [1068245.974255] RSP: 0018:ffffae410079fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Jan 18 20:46:55 Server01 kernel: [1068245.974256] RAX: ffff95fcce86ae00 RBX: ffffffff97966a00 RCX: 000000000000001f
Jan 18 20:46:55 Server01 kernel: [1068245.974256] RDX: 0003cb9065d1cb63 RSI: 000000002c235171 RDI: 0000000000000000
Jan 18 20:46:55 Server01 kernel: [1068245.974257] RBP: ffffae410079fe88 R08: 0000000000000002 R09: 000000000002a680
Jan 18 20:46:55 Server01 kernel: [1068245.974257] R10: 000b01c5ffb8d40c R11: ffff95fcce869aa0 R12: ffff96fbb3267800
Jan 18 20:46:55 Server01 kernel: [1068245.974258] R13: 0000000000000001 R14: ffffffff97966a78 R15: ffffffff97966a60
Jan 18 20:46:55 Server01 kernel: [1068245.974259] ? cpuidle_enter_state+0x99/0x450
Jan 18 20:46:55 Server01 kernel: [1068245.974260] cpuidle_enter+0x2e/0x40
Jan 18 20:46:55 Server01 kernel: [1068245.974263] call_cpuidle+0x23/0x40
Jan 18 20:46:55 Server01 kernel: [1068245.974264] do_idle+0x22c/0x270
Jan 18 20:46:55 Server01 kernel: [1068245.974264] cpu_startup_entry+0x1d/0x20
Jan 18 20:46:55 Server01 kernel: [1068245.974265] start_secondary+0x166/0x1c0
Jan 18 20:46:55 Server01 kernel: [1068245.974269] secondary_startup_64+0xa4/0xb0
Jan 18 20:46:55 Server01 kernel: [1068245.974271] ---[ end trace c8ed8042797d0d15 ]---
Jan 18 20:46:55 Server01 kernel: [1068245.974277] i40e 0000:44:00.0 enp67s0f0: tx_timeout: VSI_seid: 390, Q 45, NTC: 0x1b5, HWB: 0x1b5, NTU: 0x1d9, TAIL: 0x1d9, INT: 0x1
Jan 18 20:46:55 Server01 kernel: [1068245.974278] i40e 0000:44:00.0 enp67s0f0: tx_timeout recovery level 1, hung_queue 45
Jan 18 20:46:55 Server01 kernel: [1068245.974976] i40e 0000:44:00.0: VSI seid 390 Tx ring 0 disable timeout
Jan 18 20:46:55 Server01 kernel: [1068246.105049] i40e 0000:44:00.0: VSI seid 392 Tx ring 128 disable timeout
Jan 18 20:46:56 Server01 kernel: [1068246.157356] vmbr925: port 1(enp67s0f0) entered disabled state
Jan 18 20:46:56 Server01 kernel: [1068246.336372] i40e 0000:44:00.1: VSI seid 393 Tx ring 128 disable timeout
Jan 18 20:46:58 Server01 ntpd[3342]: Deleting interface #4 vmbr925, 10.99.125.18#123, interface stats: received=0, sent=0, dropped=0, active_time=1068240 secs
Jan 18 20:46:59 Server01 kernel: [1068249.456122] amd_iommu_report_page_fault: 15 callbacks suppressed
Jan 18 20:46:59 Server01 kernel: [1068249.456129] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xc4a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.456656] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xd0a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.457070] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xf0a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.457473] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x100a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.457861] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xe0a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.458263] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x110a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.458632] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x11ea0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.458991] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x120a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.459340] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x12aa0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.459678] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x130a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.460006] amd_iommu_report_page_fault: 5 callbacks suppressed
Jan 18 20:46:59 Server01 kernel: [1068249.460007] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x138a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.460343] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x140a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.460668] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x14ca0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.460974] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x150a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.461267] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x158a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.461544] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x160a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.461814] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x16ca0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.462084] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x170a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.462341] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x178a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.462590] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x180a0 flags=0x0000]
Jan 18 20:46:59 Server01 kernel: [1068249.594922] irq 775: Affinity broken due to vector space exhaustion.
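With this many IO_PAGE_FAULT lines it can help to condense them before comparing hosts. Below is a small Python sketch, assuming the two message formats seen in the log above (with and without the `i40e 0000:...` prefix). `FAULT_RE` and `summarize_faults` are hypothetical names, and the sample is three lines copied from the log.

```python
import re
from collections import defaultdict

# Three lines copied from the log above; a real run would read the whole file.
LOG = """\
Jan 18 20:46:49 Server01 kernel: [1068239.992143] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xa0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.992838] i40e 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0x10a0 flags=0x0000]
Jan 18 20:46:49 Server01 kernel: [1068239.997643] AMD-Vi: Event logged [IO_PAGE_FAULT device=44:00.0 domain=0x000e address=0x58a0 flags=0x0000]
"""

# The PCI address appears either before "AMD-Vi:" or as device=... inside the brackets.
BDF = r"[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]"
FAULT_RE = re.compile(
    rf"(?:[0-9a-f]{{4}}:(?P<bdf1>{BDF}): )?"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    rf"(?:device=(?P<bdf2>{BDF}) )?"
    r"domain=(?P<domain>0x[0-9a-f]+) address=(?P<addr>0x[0-9a-f]+) flags=0x[0-9a-f]+\]"
)


def summarize_faults(log):
    """Count faults per (device, domain) and record the address range."""
    addresses = defaultdict(list)
    for m in FAULT_RE.finditer(log):
        device = m.group("bdf1") or m.group("bdf2") or "unknown"
        addresses[(device, m.group("domain"))].append(int(m.group("addr"), 16))
    return {
        key: {"count": len(vals), "lowest": min(vals), "highest": max(vals)}
        for key, vals in addresses.items()
    }


if __name__ == "__main__":
    for (device, domain), info in summarize_faults(LOG).items():
        print(f"{device} domain={domain} count={info['count']} "
              f"range={info['lowest']:#x}..{info['highest']:#x}")
```

In the log above every fault comes from the same i40e function and the addresses are all very low, which is worth mentioning in a bug report even though it does not identify the cause by itself.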
 
We are seeing similar issues on our cluster, happening at random. They started a few weeks ago. A node has frozen twice in the last few weeks, bringing down lots of VMs and services in a production environment.

We have not made any changes to underlying hardware or BIOS config recently.

Code:
Jul 28 10:31:13 px3 kernel: [514092.902026] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x40 flags=0x0000]
Jul 28 10:31:13 px3 kernel: [514092.902575] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x2040 flags=0x0000]
Jul 28 10:31:13 px3 kernel: [514092.903016] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x3040 flags=0x0000]
Jul 28 10:31:13 px3 kernel: [514092.903496] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x3640 flags=0x0000]
Jul 28 10:31:13 px3 kernel: [514092.904085] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x3c40 flags=0x0000]
Jul 28 10:31:13 px3 kernel: [514092.904668] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x5040 flags=0x0050]
Jul 28 10:31:13 px3 kernel: [514092.905081] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x5240 flags=0x0050]
Jul 28 10:31:13 px3 kernel: [514092.905479] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x5440 flags=0x0050]
Jul 28 10:31:19 px3 kernel: [514099.521581] ------------[ cut here ]------------
Jul 28 10:31:19 px3 kernel: [514099.521585] NETDEV WATCHDOG: enp1s0f2 (i40e): transmit queue 13 timed out
Jul 28 10:31:19 px3 kernel: [514099.521604] WARNING: CPU: 29 PID: 0 at net/sched/sch_generic.c:473 dev_watchdog+0x264/0x270
Jul 28 10:31:19 px3 kernel: [514099.521605] Modules linked in: rbd ceph libceph fscache dm_crypt algif_skcipher af_alg ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter 8021q garp mrp bonding softdog nfnetlink_log nfnetlink ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper pcspkr ast drm_vram_helper joydev input_leds mac_hid ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid bnxt_en ahci mpt3sas xhci_pci libahci raid_class xhci_hcd i2c_piix4 scsi_transport_sas i40e
Jul 28 10:31:19 px3 kernel: [514099.521662] CPU: 29 PID: 0 Comm: swapper/29 Tainted: P           O      5.4.124-1-pve #1
Jul 28 10:31:19 px3 kernel: [514099.521663] Hardware name: Supermicro AS -2113S-WTRT/H11SSW-NT, BIOS 2.3 11/25/2020
Jul 28 10:31:19 px3 kernel: [514099.521665] RIP: 0010:dev_watchdog+0x264/0x270
Jul 28 10:31:19 px3 kernel: [514099.521668] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 41 b6 ef 00 01 e8 50 b7 fa ff 89 d9 4c 89 ee 48 c7 c7 c0 61 23 ae 48 89 c2 e8 f5 57 15 00 <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
Jul 28 10:31:19 px3 kernel: [514099.521669] RSP: 0018:ffffb91040a60e58 EFLAGS: 00010282
Jul 28 10:31:19 px3 kernel: [514099.521671] RAX: 0000000000000000 RBX: 000000000000000d RCX: 0000000000000006
Jul 28 10:31:19 px3 kernel: [514099.521672] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff904c4e9578c0
Jul 28 10:31:19 px3 kernel: [514099.521672] RBP: ffffb91040a60e88 R08: 0000000000000bf8 R09: 0000000000000004
Jul 28 10:31:19 px3 kernel: [514099.521673] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000040
Jul 28 10:31:19 px3 kernel: [514099.521674] R13: ffff904c1a52f000 R14: ffff904c1a52f480 R15: ffff904c1a7e4f40
Jul 28 10:31:19 px3 kernel: [514099.521676] FS:  0000000000000000(0000) GS:ffff904c4e940000(0000) knlGS:0000000000000000
Jul 28 10:31:19 px3 kernel: [514099.521677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 28 10:31:19 px3 kernel: [514099.521678] CR2: 00007f03260b0000 CR3: 0000003b7a8aa000 CR4: 0000000000340ee0
Jul 28 10:31:19 px3 kernel: [514099.521679] Call Trace:
Jul 28 10:31:19 px3 kernel: [514099.521681]  <IRQ>
Jul 28 10:31:19 px3 kernel: [514099.521685]  ? pfifo_fast_enqueue+0x160/0x160
Jul 28 10:31:19 px3 kernel: [514099.521689]  call_timer_fn+0x32/0x130
Jul 28 10:31:19 px3 kernel: [514099.521691]  run_timer_softirq+0x1a5/0x430
Jul 28 10:31:19 px3 kernel: [514099.521693]  ? enqueue_hrtimer+0x3c/0x90
Jul 28 10:31:19 px3 kernel: [514099.521695]  ? ktime_get+0x3c/0xa0
Jul 28 10:31:19 px3 kernel: [514099.521698]  ? lapic_next_event+0x20/0x30
Jul 28 10:31:19 px3 kernel: [514099.521701]  ? clockevents_program_event+0x93/0xf0
Jul 28 10:31:19 px3 kernel: [514099.521704]  __do_softirq+0xdc/0x2d4
Jul 28 10:31:19 px3 kernel: [514099.521708]  irq_exit+0xa9/0xb0
Jul 28 10:31:19 px3 kernel: [514099.521709]  smp_apic_timer_interrupt+0x79/0x130
Jul 28 10:31:19 px3 kernel: [514099.521711]  apic_timer_interrupt+0xf/0x20
Jul 28 10:31:19 px3 kernel: [514099.521712]  </IRQ>
Jul 28 10:31:19 px3 kernel: [514099.521716] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Jul 28 10:31:19 px3 kernel: [514099.521717] Code: ff e8 f7 69 88 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a 76 8e ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Jul 28 10:31:19 px3 kernel: [514099.521718] RSP: 0018:ffffb910402e7e48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Jul 28 10:31:19 px3 kernel: [514099.521719] RAX: ffff904c4e96ae00 RBX: ffffffffae5669c0 RCX: 000000000000001f
Jul 28 10:31:19 px3 kernel: [514099.521720] RDX: 0001d3921f5cc157 RSI: 000000002db6dc7f RDI: 0000000000000000
Jul 28 10:31:19 px3 kernel: [514099.521721] RBP: ffffb910402e7e88 R08: 0000000000000002 R09: 000000000002a680
Jul 28 10:31:19 px3 kernel: [514099.521722] R10: 00051d67be363cf0 R11: ffff904c4e969aa0 R12: ffff904c213f5800
Jul 28 10:31:19 px3 kernel: [514099.521723] R13: 0000000000000002 R14: ffffffffae566a98 R15: ffffffffae566a80
Jul 28 10:31:19 px3 kernel: [514099.521726]  ? cpuidle_enter_state+0x99/0x450
Jul 28 10:31:19 px3 kernel: [514099.521728]  cpuidle_enter+0x2e/0x40
Jul 28 10:31:19 px3 kernel: [514099.521731]  call_cpuidle+0x23/0x40
Jul 28 10:31:19 px3 kernel: [514099.521732]  do_idle+0x22c/0x270
Jul 28 10:31:19 px3 kernel: [514099.521734]  cpu_startup_entry+0x1d/0x20
Jul 28 10:31:19 px3 kernel: [514099.521736]  start_secondary+0x166/0x1c0
Jul 28 10:31:19 px3 kernel: [514099.521739]  secondary_startup_64+0xa4/0xb0
Jul 28 10:31:19 px3 kernel: [514099.521741] ---[ end trace 7e14c924b64ce1e6 ]---
Jul 28 10:31:19 px3 kernel: [514099.521749] i40e 0000:01:00.2 enp1s0f2: tx_timeout: VSI_seid: 398, Q 13, NTC: 0x2e, HWB: 0x2e, NTU: 0x8e, TAIL: 0x8e, INT: 0x1
Jul 28 10:31:19 px3 kernel: [514099.521752] i40e 0000:01:00.2 enp1s0f2: tx_timeout recovery level 1, hung_queue 13
Jul 28 10:31:19 px3 kernel: [514099.522300] i40e 0000:01:00.2: VSI seid 398 Tx ring 0 disable timeout
Jul 28 10:31:19 px3 kernel: [514099.592532] i40e 0000:01:00.2: VSI seid 402 Tx ring 64 disable timeout
Jul 28 10:31:20 px3 kernel: [514099.823908] i40e 0000:01:00.0: VSI seid 396 Tx ring 0 disable timeout
Jul 28 10:31:20 px3 kernel: [514099.879957] i40e 0000:01:00.0: VSI seid 400 Tx ring 64 disable timeout
Jul 28 10:31:20 px3 kernel: [514099.930017] i40e 0000:01:00.3: VSI seid 399 Tx ring 0 disable timeout
Jul 28 10:31:20 px3 kernel: [514100.000525] i40e 0000:01:00.3: VSI seid 403 Tx ring 64 disable timeout
Jul 28 10:31:20 px3 kernel: [514100.050602] i40e 0000:01:00.1: VSI seid 397 Tx ring 0 disable timeout
Jul 28 10:31:20 px3 kernel: [514100.112479] i40e 0000:01:00.1: VSI seid 401 Tx ring 64 disable timeout
Jul 28 10:31:23 px3 kernel: [514102.943505] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x5640 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.944007] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x7040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.944389] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x7e40 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.944764] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x8040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.945132] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x6040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.945507] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0x9040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.945862] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xa040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.946208] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xb040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.946548] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xbd00 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.946881] i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xc040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.947206] amd_iommu_report_page_fault: 6 callbacks suppressed
Jul 28 10:31:23 px3 kernel: [514102.947207] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xcc00 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.947539] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xd040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.947866] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xdc40 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.948184] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xe040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.948497] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xec40 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.948802] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xf040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.949099] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0xfc40 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.949389] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0x10040 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.949683] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0x10a40 flags=0x0000]
Jul 28 10:31:23 px3 kernel: [514102.949965] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=0x0032 address=0x11040 flags=0x0000]
Jul 28 10:31:28 px3 kernel: [514108.737303] i40e 0000:01:00.0 enp1s0f0: tx_timeout: VSI_seid: 396, Q 45, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
Jul 28 10:31:28 px3 kernel: [514108.737312] i40e 0000:01:00.0 enp1s0f0: tx_timeout recovery level 1, hung_queue 45
Jul 28 10:31:28 px3 kernel: [514108.737808] i40e 0000:01:00.0: VSI seid 396 Tx ring 0 disable timeout
Jul 28 10:31:29 px3 kernel: [514108.795974] i40e 0000:01:00.0: VSI seid 402 Tx ring 64 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.027744] i40e 0000:01:00.1: VSI seid 397 Tx ring 0 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.088372] i40e 0000:01:00.1: VSI seid 401 Tx ring 64 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.138452] i40e 0000:01:00.2: VSI seid 398 Tx ring 0 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.200020] i40e 0000:01:00.2: VSI seid 400 Tx ring 64 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.250089] i40e 0000:01:00.3: VSI seid 399 Tx ring 0 disable timeout
Jul 28 10:31:29 px3 kernel: [514109.306358] i40e 0000:01:00.3: VSI seid 403 Tx ring 64 disable timeout
Jul 28 10:31:34 px3 kernel: [514113.994564] libceph: osd8 down
Jul 28 10:31:34 px3 kernel: [514113.994566] libceph: osd11 down
Jul 28 10:31:34 px3 kernel: [514113.994566] libceph: osd30 down
Jul 28 10:31:34 px3 kernel: [514113.994567] libceph: osd31 down
Jul 28 10:31:34 px3 kernel: [514113.994567] libceph: osd32 down

Running EPYC 7402P CPUs in Supermicro single-socket servers:
Code:
NX (Execute Disable) protection: active
Jul 28 10:36:21 px3 kernel: [    0.000000] efi: EFI v2.70 by American Megatrends
Jul 28 10:36:21 px3 kernel: [    0.000000] efi:  ACPI=0xa7693000  ACPI 2.0=0xa7693014  SMBIOS=0xa850e000  SMBIOS 3.0=0xa850d000  MEMATTR=0x9f966018  ESRT=0x9e83ba18
Jul 28 10:36:21 px3 kernel: [    0.000000] secureboot: Secure boot could not be determined (mode 0)
Jul 28 10:36:21 px3 kernel: [    0.000000] SMBIOS 3.2.0 present.
Jul 28 10:36:21 px3 kernel: [    0.000000] DMI: Supermicro AS -2113S-WTRT/H11SSW-NT, BIOS 2.3 11/25/2020
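
To get an overview of an IO_PAGE_FAULT burst like the one above, the events can be tallied per PCI device and IOMMU domain. This is only a rough sketch: the function name and the regular expression are my own and are written against the two line formats visible in this log, so other kernel versions may need adjustments.

```python
import re
from collections import Counter

# Matches both variants seen in the log above:
#   "i40e 0000:01:00.2: AMD-Vi: Event logged [IO_PAGE_FAULT domain=..."
#   "AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.2 domain=..."
FAULT_RE = re.compile(
    r"(?:(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): )?"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"(?:device=(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) )?"
    r"domain=(?P<domain>0x[0-9a-f]+) address=(?P<addr>0x[0-9a-f]+)"
)


def tally_io_page_faults(lines):
    """Count IO_PAGE_FAULT events per (device, domain) pair."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if not match:
            continue
        # Normalise "0000:01:00.2" to the short "01:00.2" form.
        device = match.group("dev") or (match.group("bdf") or "unknown")[-7:]
        counts[(device, match.group("domain"))] += 1
    return counts
```

Fed with the kernel log lines (for example, the saved output of `journalctl -k`), this should show whether the faults all come from one function of the NIC, as they appear to here (01:00.2), or from several devices.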

It's going to be a few weeks before I have time to prepare for and update the cluster to Proxmox 7. I'm not sure whether that would make any difference, but I thought it was worth bringing up. I suspect a kernel bug is causing these issues.

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.12-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
After upgrading from 6.4.1 to 7.2, I'm getting this error as well. Proxmox 6 was stable, so this has me spooked, as I just upgraded all my nodes to 7.2-3. It also caused a huge spike in IO delay that is still ongoing.

Code:
May 22 03:58:07 benson kernel: BUG: unable to handle page fault for address: 0000000000001094
May 22 03:58:07 benson kernel: #PF: supervisor read access in kernel mode
May 22 03:58:07 benson kernel: #PF: error_code(0x0000) - not-present page
May 22 03:58:07 benson kernel: PGD 0 P4D 0
May 22 03:58:07 benson kernel: Oops: 0000 [#1] SMP PTI
May 22 03:58:07 benson kernel: CPU: 2 PID: 1556580 Comm: z_wr_iss Tainted: P           O      5.15.35-1-pve #1
May 22 03:58:07 benson kernel: Hardware name: HP HP EliteDesk 800 G2 DM 35W/8055, BIOS N21 Ver. 02.32 01/30/2018
May 22 03:58:07 benson kernel: RIP: 0010:kmem_cache_alloc+0xfd/0x2e0
May 22 03:58:07 benson kernel: Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 92 01 00 00 48 85 c0 0f 84 89 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
May 22 03:58:07 benson kernel: RSP: 0018:ffffbee19de1bc70 EFLAGS: 00010202
May 22 03:58:07 benson kernel: RAX: 0000000000000094 RBX: 0000000000002000 RCX: 0000000000001094
May 22 03:58:07 benson kernel: RDX: 00000000008b3177 RSI: 0000000000042c20 RDI: 000042abb0417a40
May 22 03:58:07 benson kernel: RBP: ffffbee19de1bcb0 R08: ffffdee17fc97a40 R09: ffffbee19de1bd80
May 22 03:58:07 benson kernel: R10: 00000000c6f48b33 R11: 0000000000000000 R12: ffff9c2ed0e29500
May 22 03:58:07 benson kernel: R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
May 22 03:58:07 benson kernel: FS:  0000000000000000(0000) GS:ffff9c35cf880000(0000) knlGS:0000000000000000
May 22 03:58:07 benson kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 03:58:07 benson kernel: CR2: 0000000000001094 CR3: 0000000468e10004 CR4: 00000000003726e0
May 22 03:58:07 benson kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 22 03:58:07 benson kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 22 03:58:07 benson kernel: Call Trace:
May 22 03:58:07 benson kernel:  <TASK>
May 22 03:58:07 benson kernel:  ? spl_kmem_cache_alloc+0x79/0x790 [spl]
May 22 03:58:07 benson kernel:  spl_kmem_cache_alloc+0x79/0x790 [spl]
May 22 03:58:07 benson kernel:  ? zio_execute+0x95/0x160 [zfs]
May 22 03:58:07 benson kernel:  ? __cond_resched+0x1a/0x50
May 22 03:58:07 benson kernel:  ? mutex_lock+0x13/0x40
May 22 03:58:07 benson kernel:  ? zio_wait_for_children+0xaf/0x140 [zfs]
May 22 03:58:07 benson kernel:  ? vdev_mirror_io_start+0x113/0x280 [zfs]
May 22 03:58:07 benson kernel:  zio_write_compress+0x528/0xa00 [zfs]
May 22 03:58:07 benson kernel:  zio_execute+0x95/0x160 [zfs]
May 22 03:58:07 benson kernel:  taskq_thread+0x29b/0x4c0 [spl]
May 22 03:58:07 benson kernel:  ? wake_up_q+0x90/0x90
May 22 03:58:07 benson kernel:  ? zio_gang_tree_free+0x70/0x70 [zfs]
May 22 03:58:07 benson kernel:  ? taskq_thread_spawn+0x60/0x60 [spl]
May 22 03:58:07 benson kernel:  kthread+0x12a/0x150
May 22 03:58:07 benson kernel:  ? set_kthread_struct+0x50/0x50
May 22 03:58:07 benson kernel:  ret_from_fork+0x22/0x30
May 22 03:58:07 benson kernel:  </TASK>
May 22 03:58:07 benson kernel: Modules linked in: veth joydev input_leds hid_generic usbkbd usbmouse usbhid hid rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter 8021q garp mrp bonding tls iTCO_wdt intel_pmc_bxt iTCO_vendor_support nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek cdc_ether usbnet snd_hda_codec_generic ledtrig_audio intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec irqbypass mei_hdcp ttm crct10dif_pclmul snd_hda_core ghash_clmulni_intel drm_kms_helper snd_hwdep aesni_intel cec crypto_simd rc_core cryptd i2c_algo_bit snd_pcm fb_sys_fops hp_wmi syscopyarea r8152 platform_profile rapl snd_timer sysfillrect snd mei_me intel_cstate pcspkr sparse_keymap efi_pstore wmi_bmof mii ee1004 soundcore
May 22 03:58:07 benson kernel:  sysimgblt mei intel_pch_thermal mac_hid tpm_infineon acpi_pad vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb e1000e nvme crc32_pclmul i2c_i801 ahci xhci_pci xhci_pci_renesas i2c_smbus nvme_core libahci xhci_hcd wmi video
May 22 03:58:07 benson kernel: CR2: 0000000000001094
May 22 03:58:07 benson kernel: ---[ end trace b22404b88cc8ea3f ]---
May 22 03:58:07 benson kernel: RIP: 0010:kmem_cache_alloc+0xfd/0x2e0
May 22 03:58:07 benson kernel: Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 92 01 00 00 48 85 c0 0f 84 89 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
May 22 03:58:07 benson kernel: RSP: 0018:ffffbee19de1bc70 EFLAGS: 00010202
May 22 03:58:07 benson kernel: RAX: 0000000000000094 RBX: 0000000000002000 RCX: 0000000000001094
May 22 03:58:07 benson kernel: RDX: 00000000008b3177 RSI: 0000000000042c20 RDI: 000042abb0417a40
May 22 03:58:07 benson kernel: RBP: ffffbee19de1bcb0 R08: ffffdee17fc97a40 R09: ffffbee19de1bd80
May 22 03:58:07 benson kernel: R10: 00000000c6f48b33 R11: 0000000000000000 R12: ffff9c2ed0e29500
May 22 03:58:07 benson kernel: R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
May 22 03:58:07 benson kernel: FS:  0000000000000000(0000) GS:ffff9c35cf880000(0000) knlGS:0000000000000000
May 22 03:58:07 benson kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 03:58:07 benson kernel: CR2: 0000000000001094 CR3: 00000001655dc006 CR4: 00000000003726e0
May 22 03:58:07 benson kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 22 03:58:07 benson kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Attachment: cpu load.jpg]
Code:
root@benson:~# pveversion --verbose
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.4: 6.4-15
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-6
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
root@benson:~#
 
I think I have the same or a similar issue. I had to reboot the node.

Code:
May 23 10:03:02 s7 kernel: [893783.712903] show_signal: 8 callbacks suppressed
May 23 10:03:02 s7 kernel: [893783.712906] traps: pvescheduler[2627784] general protection fault ip:55c0dffe3f94 sp:7ffd3cb21a60 error:0 in perl[55c0dff2c000+185000]
May 23 10:04:11 s7 pvescheduler[2630746]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
May 23 10:04:48 s7 kernel: [893889.958084] BUG: Bad page state in process kvm  pfn:ffffff220738ad96
May 23 10:04:48 s7 kernel: [893889.958949] page:00000000fa32aaf3 refcount:-14506 mapcount:0 mapping:0000000000000000 index:0xffff8fd80e2b65b0 pfn:0xffffff220738ad96
May 23 10:04:48 s7 kernel: [893889.960356] memcg:ffff8fd80e2b65d0
May 23 10:04:48 s7 kernel: [893889.960356] flags: 0xffffc756b8aaeec8(waiters|dirty|workingset|slab|owner_priv_1|arch_1|private|private_2|writeback|mappedtodisk|swapbacked|mlocked|hwpoison|node=1023|zone=7|lastcpupid=0x1f1d5a)
May 23 10:04:48 s7 kernel: [893889.963103] raw: ffffc756b8aaeec8 dead000000000100 dead000000000122 ffff8fd80e2b65b0
May 23 10:04:48 s7 kernel: [893889.963103] raw: ffff8fd80e2b65b0 ffffc756862cc008 ffffc756b3610108 ffff8fd80e2b65d0
May 23 10:04:48 s7 kernel: [893889.963103] page dumped because: page still charged to cgroup
May 23 10:04:48 s7 kernel: [893889.963103] Modules linked in: joydev input_leds hid_generic usbmouse usbkbd usbhid hid uas usb_storage veth tcp_diag inet_diag ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw xt_mac ipt_REJECT nf_reject_ipv4 xt_mark xt_set xt_physdev xt_addrtype xt_comment xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip_set_hash_net ip_set nf_tables softdog bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd amdgpu snd_hda_codec_realtek kvm_amd snd_hda_codec_generic ledtrig_audio kvm snd_hda_codec_hdmi iommu_v2 gpu_sched drm_ttm_helper irqbypass ttm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi crct10dif_pclmul snd_hda_codec drm_kms_helper ghash_clmulni_intel aesni_intel snd_hda_core cec rc_core snd_hwdep crypto_simd i2c_algo_bit snd_pcm cryptd fb_sys_fops eeepc_wmi syscopyarea snd_timer asus_wmi rapl sysfillrect sysimgblt snd platform_profile soundcore
May 23 10:04:48 s7 kernel: [893889.963103]  sparse_keymap ccp video pcspkr k10temp efi_pstore wmi_bmof mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi msr drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb xhci_pci xhci_pci_renesas crc32_pclmul nvme ahci i2c_piix4 r8169 realtek xhci_hcd libahci nvme_core wmi gpio_amdpt gpio_generic
May 23 10:04:48 s7 kernel: [893889.971099] CPU: 12 PID: 228872 Comm: kvm Tainted: P           O      5.15.35-1-pve #1
May 23 10:04:48 s7 kernel: [893889.975104] Hardware name: ASUS System Product Name/PRIME B550M-K, BIOS 1401 12/03/2020
May 23 10:04:48 s7 kernel: [893889.975104] Call Trace:
May 23 10:04:48 s7 kernel: [893889.975104]  <TASK>
May 23 10:04:48 s7 kernel: [893889.975104]  dump_stack_lvl+0x4a/0x5f
May 23 10:04:48 s7 kernel: [893889.975104]  dump_stack+0x10/0x12
May 23 10:04:48 s7 kernel: [893889.975104]  bad_page.cold+0x63/0x94
May 23 10:04:48 s7 kernel: [893889.975104]  check_free_page_bad+0x66/0x70
May 23 10:04:48 s7 kernel: [893889.975104]  free_pcppages_bulk+0x1c3/0x390
May 23 10:04:48 s7 kernel: [893889.975104]  free_unref_page_commit.constprop.0+0x12b/0x170
May 23 10:04:48 s7 kernel: [893889.975104]  free_unref_page_list+0x1b3/0x320
May 23 10:04:48 s7 kernel: [893889.975104]  release_pages+0x165/0x530
May 23 10:04:48 s7 kernel: [893889.983110]  free_pages_and_swap_cache+0x48/0x60
May 23 10:04:48 s7 kernel: [893889.983110]  tlb_finish_mmu+0x89/0x1c0
May 23 10:04:48 s7 kernel: [893889.983110]  zap_page_range+0x120/0x170
May 23 10:04:48 s7 kernel: [893889.983110]  do_madvise.part.0+0x8ca/0xf20
May 23 10:04:48 s7 kernel: [893889.983110]  ? do_syscall_64+0x69/0xc0
May 23 10:04:48 s7 kernel: [893889.983110]  ? exit_to_user_mode_prepare+0x37/0x1b0
May 23 10:04:48 s7 kernel: [893889.983110]  __x64_sys_madvise+0x58/0x70
May 23 10:04:48 s7 kernel: [893889.987101]  do_syscall_64+0x5c/0xc0
May 23 10:04:48 s7 kernel: [893889.987101]  ? do_syscall_64+0x69/0xc0
May 23 10:04:48 s7 kernel: [893889.987101]  ? do_syscall_64+0x69/0xc0
May 23 10:04:48 s7 kernel: [893889.987101]  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
May 23 10:04:48 s7 kernel: [893889.987101]  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 23 10:04:48 s7 kernel: [893889.991106] RIP: 0033:0x7f3b047d5cf7
May 23 10:04:48 s7 kernel: [893889.991106] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 91 51 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 51 0c 00 f7 d8 64 89 01 48
May 23 10:04:48 s7 kernel: [893889.991106] RSP: 002b:00007f3af8958e68 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
May 23 10:04:48 s7 kernel: [893889.991106] RAX: ffffffffffffffda RBX: 0000556ee42f6350 RCX: 00007f3b047d5cf7
May 23 10:04:48 s7 kernel: [893889.991106] RDX: 0000000000000004 RSI: 0000000000200000 RDI: 00007f3a3b400000
May 23 10:04:48 s7 kernel: [893889.991106] RBP: 00000000ffffffff R08: 0000000100000000 R09: 0000000000000000
May 23 10:04:48 s7 kernel: [893889.995104] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000200000
May 23 10:04:48 s7 kernel: [893889.995104] R13: 00007f3a3b400000 R14: 00007f3af895c098 R15: 000000004f600000
May 23 10:04:48 s7 kernel: [893889.995104]  </TASK>
 
No, that is a completely different trace.
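
One way to compare reports is to reduce each one to the kind of BUG and the kernel function named in the RIP line. The earlier reports in this thread are "unable to handle page fault" with a kernel RIP (for example kmem_cache_alloc), while the report above is "Bad page state" and only carries a user-space RIP. A minimal sketch, with a made-up helper name and regular expressions written against the log lines in this thread:

```python
import re

# Kind of bug: the text after "BUG: " up to the address / process details.
BUG_RE = re.compile(r"BUG: (?P<kind>.+?)(?: for address| in process|$)")
# Kernel-mode RIP lines use code segment 0010 and name a symbol+offset/size.
RIP_RE = re.compile(r"RIP: 0010:(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def oops_signature(lines):
    """Return (bug kind, faulting kernel symbol) for one report.

    Either element is None if the corresponding line is missing.
    """
    kind = None
    symbol = None
    for line in lines:
        if kind is None:
            match = BUG_RE.search(line)
            if match:
                kind = match.group("kind").strip()
        if symbol is None:
            match = RIP_RE.search(line)
            if match:
                symbol = match.group("sym")
    return kind, symbol
```

Two reports whose signatures differ are probably not the same bug, although identical signatures do not prove a common cause either.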
 
Have there been any updates to this old thread? I am seeing similar behavior:

Code:
May 19 18:40:35 proxmox kernel: BUG: unable to handle page fault for address: 00000000000f424b
May 19 18:40:35 proxmox kernel: #PF: supervisor write access in kernel mode
May 19 18:40:35 proxmox kernel: #PF: error_code(0x0002) - not-present page
May 19 18:40:35 proxmox kernel: PGD 0 P4D 0
May 19 18:40:35 proxmox kernel: Oops: 0002 [#1] PREEMPT SMP PTI
May 19 18:40:35 proxmox kernel: CPU: 0 PID: 518 Comm: watchdog-mux Tainted: P           O       6.2.11-2-pve #1
May 19 18:40:35 proxmox kernel: Hardware name: Intel(R) Client Systems NUC8i3BEK/NUC8BEB, BIOS BECFL357.86A.0092.2023.0214.1114 02/14/2023
May 19 18:40:35 proxmox kernel: RIP: 0010:osq_lock+0x3d/0x160
May 19 18:40:35 proxmox kernel: Code: 48 89 d3 48 83 ec 10 65 8b 05 ab e9 0c 69 83 c0 01 65 48 03 1d ec 73 0b 69 c7 43 10 00 00 00 00 48 c7 03 00 00 00 00 89 43 14 <87> 07 85 c0 0f 84 cf 00 00 00 83 e8 01 49 89 fc 48 98 48 3d ff 1f
May 19 18:40:35 proxmox kernel: RSP: 0018:ffffa8d3410a7d20 EFLAGS: 00010286
May 19 18:40:35 proxmox kernel: RAX: 0000000000000001 RBX: ffff944d9dc324c0 RCX: 0000000000000000
May 19 18:40:35 proxmox kernel: RDX: 00000000000324c0 RSI: 0000000000000000 RDI: 00000000000f424b
May 19 18:40:35 proxmox kernel: RBP: ffffa8d3410a7d40 R08: 0000000000000001 R09: 0000000000000000
May 19 18:40:35 proxmox kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000f423f
May 19 18:40:35 proxmox kernel: R13: 00000000000f424b R14: ffff944646800000 R15: 0000000000000000
May 19 18:40:35 proxmox kernel: FS:  00007fe69b4ce540(0000) GS:ffff944d9dc00000(0000) knlGS:0000000000000000
May 19 18:40:35 proxmox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 18:40:35 proxmox kernel: CR2: 00000000000f424b CR3: 000000010e1ae005 CR4: 00000000003706f0
May 19 18:40:35 proxmox kernel: Call Trace:
May 19 18:40:35 proxmox kernel:  <TASK>
May 19 18:40:35 proxmox kernel:  ? schedule+0x68/0x100
May 19 18:40:35 proxmox kernel:  __mutex_lock.constprop.0+0x193/0x750
May 19 18:40:35 proxmox kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
May 19 18:40:35 proxmox kernel:  schedule_hrtimeout_range+0x13/0x20
May 19 18:40:35 proxmox kernel:  do_epoll_wait+0x631/0x770
May 19 18:40:35 proxmox kernel:  ? __pfx_ep_autoremove_wake_function+0x10/0x10
May 19 18:40:35 proxmox kernel:  __x64_sys_epoll_wait+0x5e/0x100
May 19 18:40:35 proxmox kernel:  do_syscall_64+0x59/0x90
May 19 18:40:35 proxmox kernel:  ? syscall_exit_to_user_mode+0x26/0x50
May 19 18:40:35 proxmox kernel:  ? do_syscall_64+0x69/0x90
May 19 18:40:35 proxmox kernel:  ? do_syscall_64+0x69/0x90
May 19 18:40:35 proxmox kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 19 18:40:35 proxmox kernel: RIP: 0033:0x7fe69b3f4d16
May 19 18:40:35 proxmox kernel: Code: 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 e8 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 90 48 83 ec 28 89 54 24 18 48 89 74 24
May 19 18:40:35 proxmox kernel: RSP: 002b:00007ffdb33a7488 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
May 19 18:40:35 proxmox kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe69b3f4d16
May 19 18:40:35 proxmox kernel: RDX: 000000000000000a RSI: 00007ffdb33a85e0 RDI: 0000000000000005
May 19 18:40:35 proxmox kernel: RBP: 00007ffdb33a87b0 R08: 00007ffdb33a84a0 R09: 00007ffdb33a5207
May 19 18:40:35 proxmox kernel: R10: 00000000000003e8 R11: 0000000000000246 R12: 0000555fc156c270
May 19 18:40:35 proxmox kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 19 18:40:35 proxmox kernel:  </TASK>
May 19 18:40:35 proxmox kernel: Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress intel_rapl_msr ac97_bus intel_rapl_common snd_pcm_dmaengine intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp i915 kvm_intel snd_hda_intel drm_buddy iwlmvm ttm mei_pxp mei_hdcp snd_intel_dspcfg kvm mac80211 snd_intel_sdw_acpi irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel libarc4 sha512_ssse3 snd_hda_codec drm_display_helper cec rc_core snd_hda_core aesni_intel crypto_simd btusb btrtl
May 19 18:40:35 proxmox kernel:  cryptd btbcm btintel btmtk snd_hwdep rapl iwlwifi snd_pcm wmi_bmof intel_cstate drm_kms_helper bluetooth snd_timer intel_wmi_thunderbolt pcspkr i2c_algo_bit joydev syscopyarea mei_me ecdh_generic efi_pstore sysfillrect snd input_leds soundcore ee1004 ecc sysimgblt intel_pch_thermal mei cfg80211 acpi_pad mac_hid acpi_tad zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c rtsx_pci_sdmmc nvme xhci_pci xhci_pci_renesas crc32_pclmul e1000e xhci_hcd rtsx_pci nvme_core ahci i2c_i801 i2c_smbus nvme_common libahci video wmi pinctrl_cannonlake
May 19 18:40:35 proxmox kernel: CR2: 00000000000f424b
May 19 18:40:35 proxmox kernel: ---[ end trace 0000000000000000 ]---
May 19 18:40:35 proxmox kernel: RIP: 0010:osq_lock+0x3d/0x160
May 19 18:40:35 proxmox kernel: Code: 48 89 d3 48 83 ec 10 65 8b 05 ab e9 0c 69 83 c0 01 65 48 03 1d ec 73 0b 69 c7 43 10 00 00 00 00 48 c7 03 00 00 00 00 89 43 14 <87> 07 85 c0 0f 84 cf 00 00 00 83 e8 01 49 89 fc 48 98 48 3d ff 1f
May 19 18:40:35 proxmox kernel: RSP: 0018:ffffa8d3410a7d20 EFLAGS: 00010286
May 19 18:40:35 proxmox kernel: RAX: 0000000000000001 RBX: ffff944d9dc324c0 RCX: 0000000000000000
May 19 18:40:35 proxmox kernel: RDX: 00000000000324c0 RSI: 0000000000000000 RDI: 00000000000f424b
May 19 18:40:35 proxmox kernel: RBP: ffffa8d3410a7d40 R08: 0000000000000001 R09: 0000000000000000
May 19 18:40:35 proxmox kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000f423f
May 19 18:40:35 proxmox kernel: R13: 00000000000f424b R14: ffff944646800000 R15: 0000000000000000
May 19 18:40:35 proxmox kernel: FS:  00007fe69b4ce540(0000) GS:ffff944d9dc00000(0000) knlGS:0000000000000000
May 19 18:40:35 proxmox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 18:40:35 proxmox kernel: CR2: 00000000000f424b CR3: 000000010e1ae005 CR4: 00000000003706f0
May 19 18:40:35 proxmox kernel: note: watchdog-mux[518] exited with irqs disabled
May 19 18:40:35 proxmox kernel: watchdog: watchdog0: watchdog did not stop!
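
For reading the "#PF: error_code(...)" lines in these reports, the low bits of the x86 page-fault error code can be decoded as follows. This is only an illustrative sketch: the function name and the label strings are my own, and only the five lowest bits are handled.

```python
def decode_pf_error_code(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        # Bit 0 (P): 0 = page not present, 1 = protection violation.
        "page": "protection violation" if code & 0x1 else "not-present page",
        # Bit 1 (W/R): 0 = read access, 1 = write access.
        "access": "write" if code & 0x2 else "read",
        # Bit 2 (U/S): 0 = supervisor-mode access, 1 = user-mode access.
        "mode": "user" if code & 0x4 else "supervisor",
        # Bit 3 (RSVD): a reserved bit was set in a paging structure.
        "reserved_bit": bool(code & 0x8),
        # Bit 4 (I/D): the fault was caused by an instruction fetch.
        "instruction_fetch": bool(code & 0x10),
    }
```

With this, error_code(0x0002) from the report above decodes to a supervisor write to a not-present page, and error_code(0x0000) from the earlier reports to a supervisor read of a not-present page, which matches the text the kernel printed next to them.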
 
Try disabling SMT / Hyper-Threading and power saving (C-states) in the BIOS.
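
To check from the running system what is currently in effect before and after changing the BIOS, something like the following can read the relevant sysfs entries. Treat it as a sketch under assumptions: the function names are made up, the `intel_idle` entry only exists on Intel systems using that driver, and missing files are simply reported as None.

```python
from pathlib import Path


def read_sysfs(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def power_settings(root="/"):
    """Collect SMT and C-state related settings below the given root."""
    root = Path(root)
    return {
        # "on", "off", "forceoff", "notsupported" or "notimplemented".
        "smt_control": read_sysfs(root / "sys/devices/system/cpu/smt/control"),
        # "1" if sibling threads are currently online, otherwise "0".
        "smt_active": read_sysfs(root / "sys/devices/system/cpu/smt/active"),
        # Deepest C-state the intel_idle driver is allowed to use.
        "intel_idle_max_cstate": read_sysfs(
            root / "sys/module/intel_idle/parameters/max_cstate"
        ),
        # Shows whether options such as nosmt were passed at boot.
        "cmdline": read_sysfs(root / "proc/cmdline"),
    }
```

Calling `power_settings()` on the host and printing the result is enough for a quick check. As an alternative to the BIOS, the kernel parameters `nosmt` and `intel_idle.max_cstate=` can be set on the kernel command line for testing.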
 
Strangely enough, I took the RAM out, put a 4GB stick in, and had no issues, so I put both 16GB sticks back in, and there have been no more issues for now. I don't understand why, because both sticks were originally inserted all the way; I specifically checked that before taking them out.