Never ending plague of kernel panics

TheMaskedCrusader · Nov 9, 2020

I used to get kernel panics related to the e1000e network card, so I replaced it with a Broadcom card from Amazon. I thought that fixed it, but I am still getting kernel crashes. I ran `apt-get update && apt-get upgrade` today to try to get rid of them, but they keep coming. Can someone give me insight into what is causing these errors?

Code:

[257285.534102] RIP: 0010:proc_pid_permission+0x3a/0xd0
[257285.534952] Code: 31 f6 41 55 49 89 fd 41 54 53 48 8b 47 28 48 8b 7f b8 4c 8b a0 b8 03 00 00 e8 12 3a d5 ff 48 85 c0 0f 84 83 00 00 00 48 89 c3 <41> 8b 84 24 ac 00 00 00 41 8b bc 24 a8 00 00 00 85 c0 7f 3b 41 bf
[257285.536743] RSP: 0018:ffffb94f81233c00 EFLAGS: 00010286
[257285.537635] RAX: ffff8e233bfb0000 RBX: ffff8e233bfb0000 RCX: 0000000000000000
[257285.538536] RDX: 0000000000200000 RSI: 0000000000000930 RDI: ffff8e233bf88380
[257285.539448] RBP: ffffb94f81233c28 R08: 0000000000000003 R09: ffff8e22f6f493c0
[257285.540354] R10: 6c646d632f373036 R11: 0000000000000003 R12: 0000000000000000
[257285.541254] R13: ffff8e22d6bd9588 R14: 0000000000000001 R15: ffffb94f81233db0
[257285.542155] FS:  00007f21ea7707c0(0000) GS:ffff8e233fd00000(0000) knlGS:0000000000000000
[257285.543062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[257285.543975] CR2: 00000000000000ac CR3: 00000007ef4b8005 CR4: 00000000001626e0
[257285.544891] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[257285.545799] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[257345.553085] BUG: kernel NULL pointer dereference, address: 00000000000000ac
[257345.553969] #PF: supervisor read access in kernel mode
[257345.554817] #PF: error_code(0x0000) - not-present page
[257345.555665] PGD 0 P4D 0
[257345.556505] Oops: 0000 [#47] SMP PTI
[257345.557341] CPU: 6 PID: 27033 Comm: ps Tainted: P      D    O      5.4.65-1-pve #1
[257345.558181] Hardware name: ASUS All Series/Z87-PLUS, BIOS 0801 04/19/2013
[257345.559023] RIP: 0010:pid_getattr+0x48/0xa0
[257345.559868] Code: 24 28 4c 89 e7 4c 8b b0 b8 03 00 00 e8 a1 ca f6 ff 48 c7 43 30 00 00 00 00 49 8b 7c 24 b8 31 f6 e8 ad 21 d5 ff 48 85 c0 74 29 <41> 83 be ac 00 00 00 01 49 89 c5 41 8b be a8 00 00 00 7f 20 41 0f
[257345.561631] RSP: 0018:ffffb94f83ea7db0 EFLAGS: 00010286
[257345.562511] RAX: ffff8e233bfb0000 RBX: ffffb94f83ea7e80 RCX: 000000000000000a
[257345.563401] RDX: 000000002f3b547d RSI: 0000000000000930 RDI: ffff8e233bf88380
[257345.564299] RBP: ffffb94f83ea7dd0 R08: ffff8e22d6bd9588 R09: ffffb94f83ea7e30
[257345.565194] R10: 0000000000000800 R11: 0000000000000003 R12: ffff8e22d6bd9588
[257345.566094] R13: 00000000000007ff R14: 0000000000000000 R15: 00000000000007ff
[257345.566995] FS:  00007f976a7fc7c0(0000) GS:ffff8e233fd80000(0000) knlGS:0000000000000000
[257345.567906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[257345.568824] CR2: 00000000000000ac CR3: 000000043fa2e001 CR4: 00000000001626e0
[257345.569747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[257345.570674] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[257345.571598] Call Trace:
[257345.572518]  vfs_getattr_nosec+0x98/0xc0
[257345.573413]  vfs_getattr+0x36/0x40
[257345.574282]  vfs_statx+0x8d/0xe0
[257345.575142]  __do_sys_newstat+0x3d/0x70
[257345.575995]  __x64_sys_newstat+0x16/0x20
[257345.576843]  do_syscall_64+0x57/0x190
[257345.577685]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[257345.578527] RIP: 0033:0x7f976ab40aa5
[257345.579368] Code: 00 00 00 75 05 48 83 c4 18 c3 e8 26 0d 02 00 66 0f 1f 44 00 00 48 89 f0 83 ff 01 77 30 48 89 c7 48 89 d6 b8 04 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 c3 66 90 48 8b 15 b9 13 0d 00 f7 d8 64 89
[257345.581080] RSP: 002b:00007ffcbe160b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[257345.581915] RAX: ffffffffffffffda RBX: 0000000000000060 RCX: 00007f976ab40aa5
[257345.582730] RDX: 00007f976ae2f680 RSI: 00007f976ae2f680 RDI: 0000558094c55660
[257345.583524] RBP: 0000558093a2f580 R08: 0000558094c581eb R09: 00007f976abd1e80
[257345.584297] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558094c55660
[257345.585047] R13: 0000558094c55600 R14: 0000000000000005 R15: 0000000000000000
[257345.585773] Modules linked in: input_leds hid_generic usbkbd usbhid hid veth nfsv4 nfs fscache xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables aufs sctp rpcsec_gss_krb5 iptable_filter bpfilter overlay softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common i915 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio x86_pkg_temp_thermal snd_hda_intel intel_powerclamp drm_kms_helper snd_intel_dspcfg drm coretemp snd_hda_codec i2c_algo_bit snd_hda_core fb_sys_fops snd_hwdep syscopyarea snd_pcm sysfillrect kvm_intel sysimgblt kvm snd_timer mei_hdcp snd irqbypass soundcore mei_me crct10dif_pclmul zfs(PO) mei crc32_pclmul ghash_clmulni_intel eeepc_wmi aesni_intel crypto_simd cryptd asus_wmi glue_helper rapl sparse_keymap pcspkr intel_cstate zunicode(PO) mac_hid mxm_wmi wmi_bmof zlua(PO)
[257345.585794]  zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm nfsd iw_cm ib_cm auth_rpcgss ib_core nfs_acl lockd iscsi_tcp libiscsi_tcp grace sunrpc libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c xhci_pci i2c_i801 ahci r8169 ehci_pci libahci lpc_ich realtek ehci_hcd xhci_hcd wmi video
[257345.593048] CR2: 00000000000000ac
[257345.593905] ---[ end trace 5209a205b908d837 ]---
[257345.594805] RIP: 0010:proc_pid_permission+0x3a/0xd0
[257345.595654] Code: 31 f6 41 55 49 89 fd 41 54 53 48 8b 47 28 48 8b 7f b8 4c 8b a0 b8 03 00 00 e8 12 3a d5 ff 48 85 c0 0f 84 83 00 00 00 48 89 c3 <41> 8b 84 24 ac 00 00 00 41 8b bc 24 a8 00 00 00 85 c0 7f 3b 41 bf
[257345.597432] RSP: 0018:ffffb94f81233c00 EFLAGS: 00010286
[257345.598325] RAX: ffff8e233bfb0000 RBX: ffff8e233bfb0000 RCX: 0000000000000000
[257345.599215] RDX: 0000000000200000 RSI: 0000000000000930 RDI: ffff8e233bf88380
[257345.600118] RBP: ffffb94f81233c28 R08: 0000000000000003 R09: ffff8e22f6f493c0
[257345.601024] R10: 6c646d632f373036 R11: 0000000000000003 R12: 0000000000000000
[257345.601915] R13: ffff8e22d6bd9588 R14: 0000000000000001 R15: ffffb94f81233db0
[257345.602808] FS:  00007f976a7fc7c0(0000) GS:ffff8e233fd80000(0000) knlGS:0000000000000000
[257345.603708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[257345.604618] CR2: 00000000000000ac CR3: 000000043fa2e001 CR4: 00000000001626e0
[257345.605540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[257345.606440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

TheMaskedCrusader · Nov 9, 2020

This results in a gray question mark on the node in the Proxmox user interface.

fabian · Nov 9, 2020

The call traces look similar to this issue: https://github.com/openzfs/zfs/issues/11076. Maybe provide additional input there?
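When comparing traces like the ones above against another report, it can help to pull the same few fields out of each oops: the oops counter, the command that triggered it, the faulting kernel symbol, and the faulting address. Below is a minimal Python sketch of that, assuming the dmesg output has been saved as text. The function name `summarize_oopses` and the regular expressions are illustrative assumptions, not part of any Proxmox or kernel tool.

```python
import re

# Hypothetical helper: extracts the fields that are easiest to compare
# between oops reports from saved dmesg text.
ADDR_RE = re.compile(r"NULL pointer dereference, address: ([0-9a-f]+)")
OOPS_RE = re.compile(r"Oops: \w+ \[#(\d+)\]")
COMM_RE = re.compile(r"PID: \d+ Comm: (\S+)")
RIP_RE = re.compile(r"RIP: 0010:(\S+)")  # 0010 is the kernel code segment


def summarize_oopses(text):
    """Return one dict per 'Oops:' header found in the given dmesg text."""
    records = []
    current = None
    pending_addr = None
    for line in text.splitlines():
        m = ADDR_RE.search(line)
        if m:
            # The BUG line precedes the Oops header, so hold on to it.
            pending_addr = int(m.group(1), 16)
            continue
        m = OOPS_RE.search(line)
        if m:
            current = {"count": int(m.group(1)), "address": pending_addr,
                       "comm": None, "rip": None}
            records.append(current)
            pending_addr = None
            continue
        if current is None:
            continue
        m = COMM_RE.search(line)
        if m and current["comm"] is None:
            current["comm"] = m.group(1)
            continue
        m = RIP_RE.search(line)
        if m and current["rip"] is None:
            # Only the first kernel RIP after the header is kept.
            current["rip"] = m.group(1)
    return records
```

Run over the first trace in this thread, this would report oops #47, triggered by `ps`, faulting in `pid_getattr+0x48/0xa0` at address 0xac. The counter in brackets counts oopses since boot, so a value of 47 means earlier oopses exist further up in the log, and the first one is usually the most informative to post.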

TheMaskedCrusader · Nov 10, 2020

It's a problem with pvestatd, not zfs. I'm not running ZFS.

Here's what I get when I check the status of pvestatd:

Code:

root@node02:/var# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Mon 2020-11-09 20:01:03 MST; 8min ago
  Process: 32131 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
 Main PID: 32136 (code=killed, signal=KILL)

Nov 09 20:00:52 node02 systemd[1]: Starting PVE Status Daemon...
Nov 09 20:00:52 node02 pvestatd[32136]: starting server
Nov 09 20:00:52 node02 systemd[1]: Started PVE Status Daemon.
Nov 09 20:01:03 node02 systemd[1]: pvestatd.service: Main process exited, code=killed, status=9/KILL
Nov 09 20:01:03 node02 systemd[1]: pvestatd.service: Failed with result 'signal'.

systemd reports that the main process is being killed with SIGKILL. Does anyone know how to reinstall pvestatd without wiping and reloading the entire system? I really don't want to reinstall the whole system. Removing the node and re-adding it to a cluster is a pain in the neck.
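On the question of whether ZFS is in the picture, the oops output itself carries two hints: the taint letters on the `CPU:` line and the `Modules linked in:` list. Below is a minimal Python sketch for reading both, assuming single lines copied from dmesg. The helper names `decode_taint` and `loaded_modules` are hypothetical, and only the three taint letters that appear in this thread are mapped.

```python
import re

# Assumption: only the taint letters seen in this thread are mapped here.
TAINT_MEANINGS = {
    "P": "a proprietary-licensed module is loaded",
    "D": "the kernel has already oopsed or hit a BUG since boot",
    "O": "an out-of-tree module is loaded",
}


def decode_taint(cpu_line):
    """Map each taint letter on a 'CPU: ... Tainted: ...' line to a meaning."""
    match = re.search(r"Tainted: ([A-Z ]+?)\s+\d", cpu_line)
    if match is None:
        return {}
    letters = match.group(1).replace(" ", "")
    return {c: TAINT_MEANINGS.get(c, "not mapped in this sketch") for c in letters}


def loaded_modules(modules_line):
    """Return module names from a 'Modules linked in:' line, flags stripped."""
    _, found, rest = modules_line.partition("Modules linked in:")
    if not found:
        return []
    # Tokens look like 'veth' or 'zfs(PO)'; drop the parenthesised flags.
    return [token.split("(")[0] for token in rest.split()]
```

Fed the `CPU:` and `Modules linked in:` lines from the first trace, this reports P, D and O, and a module list that includes `zfs` and `spl`. That only shows the modules are loaded. It does not by itself show that a pool is in use or that ZFS caused the oops.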

TheMaskedCrusader · Nov 10, 2020

So, this is a strange thing: it starts fine, and the pvestatd service keeps running for about 3 days. But around the 225,000-second mark, something crashes and kills the pvestatd service. I'll check dmesg when it crashes again to try to determine what the failure is.

I really don't want to reboot proxmox every 200,000 seconds.
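The bracketed dmesg timestamps are seconds since boot, so the uptime figures above can be converted to days with simple arithmetic. A small Python sketch, where the function name `uptime_days` is just an illustrative choice:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def uptime_days(dmesg_seconds):
    """Convert a dmesg timestamp (seconds since boot) to days of uptime."""
    return dmesg_seconds / SECONDS_PER_DAY


# The rough figure from this post and the timestamp of the oops in the first post.
for stamp in (225_000, 257_345.553085):
    print(f"{stamp} s = {uptime_days(stamp):.2f} days")
```

By this conversion, 225,000 seconds is about 2.6 days, and the oops logged at 257345 seconds happened just under 3 days after boot.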

TheMaskedCrusader · Dec 5, 2020

Well, I updated both my servers to the latest version using dist-upgrade, and everything was running swimmingly for 1.6 million seconds. Then the second node crashed again! I'm currently looking for alternatives to Proxmox, since I can't figure out this problem, and crashing every two weeks is getting difficult to deal with.

Code:

Nov 28 09:48:42 node02 kernel: [1598368.518745] BUG: kernel NULL pointer dereference, address: 0000000000000048
Nov 28 09:48:42 node02 kernel: [1598368.519847] #PF: supervisor read access in kernel mode
Nov 28 09:48:42 node02 kernel: [1598368.520920] #PF: error_code(0x0000) - not-present page
Nov 28 09:48:42 node02 kernel: [1598368.521989] PGD 0 P4D 0
Nov 28 09:48:42 node02 kernel: [1598368.523058] Oops: 0000 [#8685] SMP PTI
Nov 28 09:48:42 node02 kernel: [1598368.524128] CPU: 6 PID: 13432 Comm: ps Tainted: P      D    O      5.4.65-1-pve #1
Nov 28 09:48:42 node02 kernel: [1598368.525205] Hardware name: ASUS All Series/Z87-PLUS, BIOS 0801 04/19/2013
Nov 28 09:48:42 node02 kernel: [1598368.526286] RIP: 0010:do_dentry_open+0x103/0x3a0
Nov 28 09:48:42 node02 kernel: [1598368.527366] Code: df e8 01 f0 17 00 41 89 c7 85 c0 0f 85 94 01 00 00 8b 73 40 48 8b 7b 20 f0 83 44 24 fc 00 48 8b 87 70 01 00 00 48 85 c0 74 23 <48> 8b 50 28 48 8d 48 28 48 39 ca 0f 84 94 01 00 00 ba 20 00 00 00
Nov 28 09:48:42 node02 kernel: [1598368.528501] RSP: 0018:ffffbd076393fc88 EFLAGS: 00010202
Nov 28 09:48:42 node02 kernel: [1598368.529639] RAX: 0000000000000020 RBX: ffff9fba983ca700 RCX: 0000000000000001
Nov 28 09:48:42 node02 kernel: [1598368.530783] RDX: ffff9fb874fb7ab8 RSI: 0000000000008000 RDI: ffff9fb874fb7ab8
Nov 28 09:48:42 node02 kernel: [1598368.531924] RBP: ffffbd076393fcb0 R08: 0000000000000000 R09: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.533059] R10: ffff9fb5e972e540 R11: 0000000000000004 R12: ffff9fb874fb7ab8
Nov 28 09:48:42 node02 kernel: [1598368.534166] R13: ffff9fba983ca710 R14: 0000000000000000 R15: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.535246] FS:  00007f9f4b5187c0(0000) GS:ffff9fbb7fd80000(0000) knlGS:0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.536326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 28 09:48:42 node02 kernel: [1598368.537401] CR2: 0000000000000048 CR3: 00000007e2240003 CR4: 00000000001626e0
Nov 28 09:48:42 node02 kernel: [1598368.538485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.539569] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 28 09:48:42 node02 kernel: [1598368.540648] Call Trace:
Nov 28 09:48:42 node02 kernel: [1598368.541720]  vfs_open+0x2d/0x30
Nov 28 09:48:42 node02 kernel: [1598368.542784]  path_openat+0x2e9/0x16f0
Nov 28 09:48:42 node02 kernel: [1598368.543843]  ? filename_lookup.part.60+0xe0/0x170
Nov 28 09:48:42 node02 kernel: [1598368.544904]  ? seq_put_decimal_ull+0x10/0x20
Nov 28 09:48:42 node02 kernel: [1598368.545967]  do_filp_open+0x93/0x100
Nov 28 09:48:42 node02 kernel: [1598368.547034]  ? __alloc_fd+0x46/0x150
Nov 28 09:48:42 node02 kernel: [1598368.548089]  do_sys_open+0x177/0x280
Nov 28 09:48:42 node02 kernel: [1598368.549144]  __x64_sys_openat+0x20/0x30
Nov 28 09:48:42 node02 kernel: [1598368.550199]  do_syscall_64+0x57/0x190
Nov 28 09:48:42 node02 kernel: [1598368.551254]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 28 09:48:42 node02 kernel: [1598368.552311] RIP: 0033:0x7f9f4b85d1ae
Nov 28 09:48:42 node02 kernel: [1598368.553365] Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 59 65 0d 00 8b 00 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a6 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
Nov 28 09:48:42 node02 kernel: [1598368.554485] RSP: 002b:00007fff2f1a44a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Nov 28 09:48:42 node02 kernel: [1598368.555616] RAX: ffffffffffffffda RBX: 00007fff2f1a4520 RCX: 00007f9f4b85d1ae
Nov 28 09:48:42 node02 kernel: [1598368.556750] RDX: 0000000000000000 RSI: 00007fff2f1a4520 RDI: 00000000ffffff9c
Nov 28 09:48:42 node02 kernel: [1598368.557883] RBP: 000055f5be378580 R08: 00007f9f4b94756d R09: 00007f9f4b8ede80
Nov 28 09:48:42 node02 kernel: [1598368.559019] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f9f4bb4b670
Nov 28 09:48:42 node02 kernel: [1598368.560124] R13: 000055f5bf769600 R14: 0000000000000005 R15: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.561199] Modules linked in: veth rpcsec_gss_krb5 nfsv4 nfs fscache xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables aufs sctp overlay iptable_filter bpfilter softdog nfnetlink_log nfnetlink snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common ledtrig_audio snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_intel snd_intel_dspcfg snd_hda_codec coretemp i915 snd_hda_core drm_kms_helper kvm_intel kvm drm mei_hdcp snd_hwdep irqbypass i2c_algo_bit snd_pcm snd_timer fb_sys_fops snd syscopyarea crct10dif_pclmul mei_me sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel aesni_intel crypto_simd soundcore cryptd mei glue_helper input_leds zfs(PO) zunicode(PO) eeepc_wmi rapl asus_wmi zlua(PO) intel_cstate zavl(PO) pcspkr mxm_wmi mac_hid sparse_keymap wmi_bmof icp(PO) zcommon(PO)
Nov 28 09:48:42 node02 kernel: [1598368.561220]  znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm nfsd ib_cm ib_core auth_rpcgss iscsi_tcp libiscsi_tcp nfs_acl libiscsi lockd grace scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbhid hid xhci_pci i2c_i801 r8169 ahci lpc_ich ehci_pci realtek libahci xhci_hcd ehci_hcd wmi video
Nov 28 09:48:42 node02 kernel: [1598368.569684] CR2: 0000000000000048
Nov 28 09:48:42 node02 kernel: [1598368.570916] ---[ end trace ce5c8f62037cf7b0 ]---
Nov 28 09:48:42 node02 kernel: [1598368.572113] RIP: 0010:do_dentry_open+0x103/0x3a0
Nov 28 09:48:42 node02 kernel: [1598368.573271] Code: df e8 01 f0 17 00 41 89 c7 85 c0 0f 85 94 01 00 00 8b 73 40 48 8b 7b 20 f0 83 44 24 fc 00 48 8b 87 70 01 00 00 48 85 c0 74 23 <48> 8b 50 28 48 8d 48 28 48 39 ca 0f 84 94 01 00 00 ba 20 00 00 00
Nov 28 09:48:42 node02 kernel: [1598368.574454] RSP: 0018:ffffbd0760617c88 EFLAGS: 00010202
Nov 28 09:48:42 node02 kernel: [1598368.575610] RAX: 0000000000000020 RBX: ffff9fb9eeb70e00 RCX: 0000000000000001
Nov 28 09:48:42 node02 kernel: [1598368.576759] RDX: ffff9fb874fb7ab8 RSI: 0000000000008000 RDI: ffff9fb874fb7ab8
Nov 28 09:48:42 node02 kernel: [1598368.577904] RBP: ffffbd0760617cb0 R08: 0000000000000000 R09: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.579092] R10: ffff9fb5e972e540 R11: 0000000000000004 R12: ffff9fb874fb7ab8
Nov 28 09:48:42 node02 kernel: [1598368.580208] R13: ffff9fb9eeb70e10 R14: 0000000000000000 R15: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.581293] FS:  00007f9f4b5187c0(0000) GS:ffff9fbb7fd80000(0000) knlGS:0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.582380] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 28 09:48:42 node02 kernel: [1598368.583473] CR2: 0000000000000048 CR3: 00000007e2240003 CR4: 00000000001626e0
Nov 28 09:48:42 node02 kernel: [1598368.584565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 28 09:48:42 node02 kernel: [1598368.585656] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

If anyone has a suggestion for an alternative to Proxmox, please let me know. I run two servers with a combined 24 CPU cores and 56 GB of RAM. I can't use ESXi (which I was using before) because I have a cluster, and cluster support is not free through VMware. I'm not looking to spend hundreds of dollars every year on enterprise-level licensing to keep my home tinker cluster up and running, but I do want a reliable hypervisor.

datdenkikniet · Dec 5, 2020

You could try updating your BIOS for good measure. There seems to be a new version available if the motherboard model is correct. Perhaps it helps (though I'm guessing it's not particularly likely that it will)?

Have you verified that your RAM is working OK, too? Also rather unlikely since the problem seems to only affect pvestatd, but it can't hurt to check.

TheMaskedCrusader · Dec 14, 2020

datdenkikniet said:
You could try updating your BIOS for good measure. There seems to be a new version available if the motherboard model is correct. Perhaps it helps (though I'm guessing it's not particularly likely that it will)?

Have you verified that your RAM is working OK, too? Also rather unlikely since the problem seems to only affect pvestatd, but it can't hurt to check.

OK, I've flashed my BIOS and updated Proxmox on both servers in the cluster. Now we wait to see if the problem goes away.

I recently rebuilt this server, and in doing so I replaced the RAM with Kingston sticks (4x8GB). The "server" is a repurposed desktop system with dated hardware (4th-gen i7 processor). My other server is an HP ProLiant server with HP server RAM and dual Xeon processors; it's having no issues. Maybe it's just my old hardware.

I'll be interested to see if the problem comes back after the BIOS update. Thank you for that.

Cheers.
