Kernel Oops with kworker getting tainted.

TheGrandWazoo · Jan 10, 2020

Since upgrading from 6.0 to 6.1, I have had two different host machines generate the following similar error. In both cases kworker was the faulting PID.
The first system became unresponsive over time and could not be recovered. I tried to reboot the host safely but had to power-reset the physical host.
This one is responsive, but the GUI is not responding, either directly or through the cluster from another host.
I will reboot to refresh the machine.

[1406350.728555] BUG: kernel NULL pointer dereference, address: 0000000000000014
[1406350.728598] #PF: supervisor read access in kernel mode
[1406350.728624] #PF: error_code(0x0000) - not-present page
[1406350.728638] PGD 0 P4D 0
[1406350.728647] Oops: 0000 [#1] SMP PTI
[1406350.728659] CPU: 1 PID: 1426 Comm: kworker/1:2 Tainted: P W O 5.3.13-1-pve #1
[1406350.728679] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.1b 05/04/12
[1406350.728702] Workqueue: events key_garbage_collector
[1406350.728717] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
[1406350.728731] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
[1406350.728772] RSP: 0018:ffffa892966e7db8 EFLAGS: 00010246
[1406350.728786] RAX: 0000000000000000 RBX: ffff8e173b769980 RCX: ffffa892966e7e20
[1406350.728804] RDX: 0000000000000000 RSI: ffffa892966e7e20 RDI: ffff8e174ade9a00
[1406350.728821] RBP: ffffa892966e7db8 R08: 0000000000000000 R09: 000073746e657665
[1406350.728838] R10: 8080808080808080 R11: ffff8e0b4f8694c4 R12: ffff8e173b769a10
[1406350.728855] R13: ffffffff93227dd0 R14: ffff8e174fdfdb00 R15: ffff8e173b7699f8
[1406350.728872] FS: 0000000000000000(0000) GS:ffff8e0b4f840000(0000) knlGS:0000000000000000
[1406350.728891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1406350.728906] CR2: 0000000000000014 CR3: 000000040a80a003 CR4: 00000000000226e0
[1406350.728923] Call Trace:
[1406350.728935] assoc_array_subtree_iterate+0x5c/0x100
[1406350.728948] assoc_array_iterate+0x19/0x20
[1406350.728961] keyring_gc+0x43/0x80
[1406350.728971] key_garbage_collector+0x35a/0x400
[1406350.728985] process_one_work+0x20f/0x3d0
[1406350.728997] worker_thread+0x34/0x400
[1406350.729009] kthread+0x120/0x140
[1406350.729019] ? process_one_work+0x3d0/0x3d0
[1406350.729031] ? __kthread_parkme+0x70/0x70
[1406350.729045] ret_from_fork+0x35/0x40
[1406350.729056] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter ipmi_watchdog bonding nfnetlink_log nfnetlink intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mgag200 drm_vram_helper intel_cstate ttm serio_raw pcspkr drm_kms_helper drm joydev input_leds fb_sys_fops syscopyarea sysfillrect sysimgblt ioatdma i5500_temp i7core_edac ipmi_si ipmi_devintf mac_hid ipmi_msghandler vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid uas usb_storage gpio_ich mpt3sas ahci raid_class igb psmouse i2c_algo_bit
[1406350.729089] scsi_transport_sas i2c_i801 libahci lpc_ich dca
[1406350.729285] CR2: 0000000000000014
[1406350.729295] ---[ end trace 170167807202c727 ]---
[1406350.729309] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
[1406350.729323] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
[1406350.729365] RSP: 0018:ffffa892966e7db8 EFLAGS: 00010246
[1406350.729378] RAX: 0000000000000000 RBX: ffff8e173b769980 RCX: ffffa892966e7e20
[1406350.729395] RDX: 0000000000000000 RSI: ffffa892966e7e20 RDI: ffff8e174ade9a00
[1406350.729412] RBP: ffffa892966e7db8 R08: 0000000000000000 R09: 000073746e657665
[1406350.730258] R10: 8080808080808080 R11: ffff8e0b4f8694c4 R12: ffff8e173b769a10
[1406350.731095] R13: ffffffff93227dd0 R14: ffff8e174fdfdb00 R15: ffff8e173b7699f8
[1406350.731919] FS: 0000000000000000(0000) GS:ffff8e0b4f840000(0000) knlGS:0000000000000000
[1406350.732748] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1406350.733569] CR2: 0000000000000014 CR3: 000000040a80a003 CR4: 00000000000226e0

I just ran 'reboot' from an SSH session because the console would not let me log in. It seems to be shutting down, but after 10+ minutes the console is still alternating between "Stop job" messages for the PVE guests and for D-Bus.

Hope this helps.

Thanks
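
For comparing reports like the one above across hosts, a minimal sketch that pulls a short "signature" out of an oops log: the first faulting RIP symbol, the CR2 address, and the taint letters. The function name `oops_signature` and the `TAINT_FLAGS` table are illustrative, not part of any tool, and the table only covers the three letters seen in this thread. Note that the taint letters describe the running kernel as a whole, not the kworker thread; here P and O most likely come from the modules marked (PO) and (O) in the "Modules linked in" list.

```python
import re

# Meanings of the three taint letters that appear in this thread.
# This is only a subset of the kernel's full taint-flag list.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "W": "a kernel warning was issued earlier",
    "O": "out-of-tree module loaded",
}


def oops_signature(log_text):
    """Extract the first RIP symbol, CR2 value and taint letters from an oops log."""
    rip = re.search(r"RIP: 0010:(\S+)", log_text)
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", log_text)
    # The taint letters sit between "Tainted:" and the kernel version string.
    taint = re.search(r"Tainted: ([A-Z ]+?)\s+\d", log_text)
    letters = [c for c in taint.group(1) if c != " "] if taint else []
    return {
        "rip": rip.group(1) if rip else None,
        "cr2": cr2.group(1) if cr2 else None,
        "taint": [(c, TAINT_FLAGS.get(c, "not decoded in this sketch")) for c in letters],
    }
```

Calling `oops_signature()` on the saved `dmesg` text from each host gives a compact value to compare, so identical faults (same RIP symbol and CR2) are easy to spot.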

TheGrandWazoo · Jan 10, 2020

I just realized that the CPUs in these machines do NOT have the new microcode applied (for the CVEs). I will attempt that update in the near future, in case this could be causing the issue.

TheGrandWazoo · Jan 10, 2020

This is the dmesg output after the server rebooted with another kworker tainted issue.

[ 17.464801] ------------[ cut here ]------------
[ 17.464802] General protection fault in user access. Non-canonical address?
[ 17.464812] WARNING: CPU: 16 PID: 1582 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x52/0x60
[ 17.464812] Modules linked in: ipmi_ssif intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel mgag200 aes_x86_64 crypto_simd drm_vram_helper cryptd ttm glue_helper drm_kms_helper intel_cstate pcspkr serio_raw drm joydev input_leds fb_sys_fops syscopyarea sysfillrect sysimgblt ioatdma i5500_temp i7core_edac ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid hid uas usb_storage gpio_ich mpt3sas raid_class ahci psmouse igb scsi_transport_sas i2c_algo_bit i2c_i801 libahci lpc_ich dca
[ 17.464837] CPU: 16 PID: 1582 Comm: kworker/u49:5 Tainted: P O 5.3.13-1-pve #1
[ 17.464838] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.1b 05/04/12
[ 17.464840] RIP: 0010:ex_handler_uaccess+0x52/0x60
[ 17.464841] Code: c4 08 b8 01 00 00 00 5b 5d c3 80 3d 85 d6 78 01 00 75 db 48 c7 c7 58 10 54 bd 48 89 75 f0 c6 05 71 d6 78 01 01 e8 ff a1 01 00 <0f> 0b 48 8b 75 f0 eb bc 66 0f 1f 44 00 00 66 66 66 66 90 55 80 3d
[ 17.464842] RSP: 0018:ffffacaaf211bcc0 EFLAGS: 00010282
[ 17.464843] RAX: 0000000000000000 RBX: ffffffffbd002448 RCX: 0000000000000000
[ 17.464844] RDX: 0000000000000007 RSI: ffffffffbdd83f7f RDI: 0000000000000246
[ 17.464844] RBP: ffffacaaf211bcd0 R08: ffffffffbdd83f40 R09: 0000000000029fc0
[ 17.464845] R10: 0000029b92a57ef8 R11: ffffffffbdd83f40 R12: 000000000000000d
[ 17.464846] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 17.464847] FS: 0000000000000000(0000) GS:ffff9d6d8fa80000(0000) knlGS:0000000000000000
[ 17.464847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.464848] CR2: 00007f1c7fb14c4a CR3: 00000012f900a006 CR4: 00000000000206e0
[ 17.464849] Call Trace:
[ 17.464854] fixup_exception+0x4a/0x61
[ 17.464858] do_general_protection+0x4e/0x150
[ 17.464862] general_protection+0x28/0x30
[ 17.464866] RIP: 0010:strnlen_user+0x4c/0x110
[ 17.464867] Code: f8 0f 86 e1 00 00 00 48 29 f8 45 31 c9 66 66 90 0f ae e8 48 39 c6 49 89 fa 48 0f 46 c6 41 83 e2 07 48 83 e7 f8 31 c9 4c 01 d0 <4c> 8b 1f 85 c9 0f 85 96 00 00 00 42 8d 0c d5 00 00 00 00 41 b8 01
[ 17.464867] RSP: 0018:ffffacaaf211bde8 EFLAGS: 00010206
[ 17.464868] RAX: 0000000000020000 RBX: e241d0f914ca2a00 RCX: 0000000000000000
[ 17.464869] RDX: e241d0f914ca2a00 RSI: 0000000000020000 RDI: e241d0f914ca2a00
[ 17.464869] RBP: ffffacaaf211bdf8 R08: 8080808080808080 R09: 0000000000000000
[ 17.464870] R10: 0000000000000000 R11: 0000000000000000 R12: 00007fffffffefe6
[ 17.464871] R13: ffff9d6d84dacfe6 R14: 0000000000000000 R15: ffffcccbf0136b00
[ 17.464876] ? _copy_from_user+0x3e/0x60
[ 17.464880] copy_strings.isra.35+0x92/0x380
[ 17.464882] __do_execve_file.isra.42+0x5b5/0x9d0
[ 17.464886] ? kmem_cache_alloc+0x110/0x220
[ 17.464887] do_execve+0x25/0x30
[ 17.464890] call_usermodehelper_exec_async+0x188/0x1b0
[ 17.464891] ? call_usermodehelper+0xb0/0xb0
[ 17.464895] ret_from_fork+0x35/0x40
[ 17.464896] ---[ end trace 0dabfcc6524f90c6 ]---

This machine does NOT have the CVE microcode fixes.

TheGrandWazoo · Jan 10, 2020

This machine does have the CVE microcode fixes...

[ 22.919602] ------------[ cut here ]------------
[ 22.919604] General protection fault in user access. Non-canonical address?
[ 22.919611] WARNING: CPU: 13 PID: 2990 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x52/0x60
[ 22.919612] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache sch_ingress ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter bonding openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd mgag200 cryptd drm_vram_helper glue_helper ttm intel_cstate drm_kms_helper serio_raw pcspkr drm joydev input_leds fb_sys_fops syscopyarea sysfillrect sysimgblt ioatdma i5500_temp i7core_edac ipmi_si ipmi_devintf mac_hid ipmi_msghandler vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c ses enclosure hid_generic
[ 22.919643] usbmouse usbkbd usbhid hid gpio_ich mpt3sas ahci raid_class psmouse igb i2c_algo_bit scsi_transport_sas libahci i2c_i801 lpc_ich dca
[ 22.919649] CPU: 13 PID: 2990 Comm: kworker/u49:5 Tainted: P O 5.3.13-1-pve #1
[ 22.919650] Hardware name: Violin Memory Memory Gateway/X8DTH, BIOS 2.1b 05/04/12
[ 22.919651] RIP: 0010:ex_handler_uaccess+0x52/0x60
[ 22.919653] Code: c4 08 b8 01 00 00 00 5b 5d c3 80 3d 85 d6 78 01 00 75 db 48 c7 c7 58 10 94 b7 48 89 75 f0 c6 05 71 d6 78 01 01 e8 ff a1 01 00 <0f> 0b 48 8b 75 f0 eb bc 66 0f 1f 44 00 00 66 66 66 66 90 55 80 3d
[ 22.919653] RSP: 0018:ffffb2bf893fbcc0 EFLAGS: 00010282
[ 22.919654] RAX: 0000000000000000 RBX: ffffffffb7402448 RCX: 0000000000000000
[ 22.919655] RDX: 0000000000000007 RSI: ffffffffb8183f7f RDI: 0000000000000246
[ 22.919655] RBP: ffffb2bf893fbcd0 R08: ffffffffb8183f40 R09: 0000000000029fc0
[ 22.919656] R10: 000002934853fbb2 R11: ffffffffb8183f40 R12: 000000000000000d
[ 22.919657] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 22.919657] FS: 0000000000000000(0000) GS:ffff95bf4f9c0000(0000) knlGS:0000000000000000
[ 22.919658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.919659] CR2: 000055665e4ff860 CR3: 0000000a6dc0a002 CR4: 00000000000226e0
[ 22.919659] Call Trace:
[ 22.919663] fixup_exception+0x4a/0x61
[ 22.919666] do_general_protection+0x4e/0x150
[ 22.919668] general_protection+0x28/0x30
[ 22.919671] RIP: 0010:strnlen_user+0x4c/0x110
[ 22.919672] Code: f8 0f 86 e1 00 00 00 48 29 f8 45 31 c9 66 66 90 0f ae e8 48 39 c6 49 89 fa 48 0f 46 c6 41 83 e2 07 48 83 e7 f8 31 c9 4c 01 d0 <4c> 8b 1f 85 c9 0f 85 96 00 00 00 42 8d 0c d5 00 00 00 00 41 b8 01
[ 22.919673] RSP: 0018:ffffb2bf893fbde8 EFLAGS: 00010206
[ 22.919674] RAX: 0000000000020000 RBX: 8d36aeb23d31be00 RCX: 0000000000000000
[ 22.919674] RDX: 8d36aeb23d31be00 RSI: 0000000000020000 RDI: 8d36aeb23d31be00
[ 22.919675] RBP: ffffb2bf893fbdf8 R08: 8080808080808080 R09: 0000000000000000
[ 22.919675] R10: 0000000000000000 R11: 0000000000000000 R12: 00007fffffffefe7
[ 22.919676] R13: ffff95bf305fdfe7 R14: 0000000000000000 R15: ffffe1496fc17f40
[ 22.919679] ? _copy_from_user+0x3e/0x60
[ 22.919682] copy_strings.isra.35+0x92/0x380
[ 22.919683] __do_execve_file.isra.42+0x5b5/0x9d0
[ 22.919685] ? kmem_cache_alloc+0x110/0x220
[ 22.919687] do_execve+0x25/0x30
[ 22.919689] call_usermodehelper_exec_async+0x188/0x1b0
[ 22.919690] ? call_usermodehelper+0xb0/0xb0
[ 22.919693] ret_from_fork+0x35/0x40
[ 22.919694] ---[ end trace 4d6f164f06cad077 ]---

James Crook · Jan 10, 2020

Strange, we had a similar issue with our node, gonna reboot tonight. The console and web management are unresponsive, and SSH works but is very slow.

Ours are HP Gen9 with quite old HP firmware.

Containers seem to work tho...

harrijs · Jan 12, 2020

I have experienced a similar issue. I have a 3 node deployment with a specific LXC container that accesses a mounted NFS share with consistent I/O. The host that this LXC container is running on, if it is using kernel 5.3.13-1, will eventually encounter this same issue. In my case this appears to revolve around the NFS workload I am using with this specific LXC container.

I have verified that this behavior does not occur if the host is running kernel 4.15.18-24. Currently I am reserving this LXC container to a single host running the older kernel and things have been smooth for the past 7 days.

I will report back here once the next 5.x kernel is released with test results.

Code:

Jan  3 16:18:55 pve03 kernel: [75675.877088] PGD 0 P4D 0
Jan  3 16:18:55 pve03 kernel: [75675.877724] Oops: 0000 [#1] SMP PTI
Jan  3 16:18:55 pve03 kernel: [75675.878350] CPU: 15 PID: 546133 Comm: kworker/15:1 Tainted: P           O      5.3.13-1-pve #1
Jan  3 16:18:55 pve03 kernel: [75675.879010] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.0a       09/04/10
Jan  3 16:18:55 pve03 kernel: [75675.879688] Workqueue: events key_garbage_collector
Jan  3 16:18:55 pve03 kernel: [75675.880369] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
Jan  3 16:18:55 pve03 kernel: [75675.881081] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
Jan  3 16:18:55 pve03 kernel: [75675.882567] RSP: 0018:ffffaf18a5597db8 EFLAGS: 00010246
Jan  3 16:18:55 pve03 kernel: [75675.883318] RAX: 0000000000000000 RBX: ffff9f8823ca6a80 RCX: ffffaf18a5597e20
Jan  3 16:18:55 pve03 kernel: [75675.884082] RDX: 0000000000000000 RSI: ffffaf18a5597e20 RDI: ffff9f8d21015300
Jan  3 16:18:55 pve03 kernel: [75675.884847] RBP: ffffaf18a5597db8 R08: 0000000000000010 R09: 0000000000000000
Jan  3 16:18:55 pve03 kernel: [75675.885625] R10: 000000000000000f R11: 000000000000001a R12: ffff9f8823ca6b10
Jan  3 16:18:55 pve03 kernel: [75675.886397] R13: ffffffffb5827dd0 R14: ffff9f8d23c55f00 R15: ffff9f8823ca6b00
Jan  3 16:18:55 pve03 kernel: [75675.887161] FS:  0000000000000000(0000) GS:ffff9f8d2b9c0000(0000) knlGS:0000000000000000
Jan  3 16:18:55 pve03 kernel: [75675.887942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  3 16:18:55 pve03 kernel: [75675.888725] CR2: 0000000000000014 CR3: 0000000890980003 CR4: 00000000000206e0
Jan  3 16:18:55 pve03 kernel: [75675.889526] Call Trace:
Jan  3 16:18:55 pve03 kernel: [75675.890321]  assoc_array_subtree_iterate+0x5c/0x100
Jan  3 16:18:55 pve03 kernel: [75675.891111]  assoc_array_iterate+0x19/0x20
Jan  3 16:18:55 pve03 kernel: [75675.891895]  keyring_gc+0x43/0x80
Jan  3 16:18:55 pve03 kernel: [75675.892670]  key_garbage_collector+0x35a/0x400
Jan  3 16:18:55 pve03 kernel: [75675.893457]  process_one_work+0x20f/0x3d0
Jan  3 16:18:55 pve03 kernel: [75675.894232]  worker_thread+0x34/0x400
Jan  3 16:18:55 pve03 kernel: [75675.895004]  kthread+0x120/0x140
Jan  3 16:18:55 pve03 kernel: [75675.895779]  ? process_one_work+0x3d0/0x3d0
Jan  3 16:18:55 pve03 kernel: [75675.896545]  ? __kthread_parkme+0x70/0x70
Jan  3 16:18:55 pve03 kernel: [75675.897328]  ret_from_fork+0x35/0x40
Jan  3 16:18:55 pve03 kernel: [75675.898089] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 veth rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog nfnetlink_log nfnetlink intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul mgag200 crc32_pclmul ghash_clmulni_intel drm_vram_helper ttm aesni_intel drm_kms_helper snd_hda_intel aes_x86_64 snd_hda_codec snd_hda_core crypto_simd drm snd_hwdep cryptd i2c_algo_bit fb_sys_fops zfs(PO) snd_pcm syscopyarea snd_timer sysfillrect snd soundcore sysimgblt zunicode(PO) zlua(PO) zavl(PO) glue_helper icp(PO) ioatdma joydev i5500_temp i7core_edac dca input_leds ipmi_si ipmi_devintf ipmi_msghandler intel_cstate pcspkr serio_raw mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool
Jan  3 16:18:55 pve03 kernel: [75675.898138]  dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic gpio_ich psmouse usbkbd usbmouse mptsas i2c_i801 usbhid mptscsih pata_acpi lpc_ich e1000e mptbase hid scsi_transport_sas
Jan  3 16:18:55 pve03 kernel: [75675.907141] CR2: 0000000000000014
Jan  3 16:18:55 pve03 kernel: [75675.908128] ---[ end trace 2a53930208ad4d55 ]---
Jan  3 16:18:55 pve03 kernel: [75675.909131] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
Jan  3 16:18:55 pve03 kernel: [75675.910113] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
Jan  3 16:18:55 pve03 kernel: [75675.912165] RSP: 0018:ffffaf18a5597db8 EFLAGS: 00010246
Jan  3 16:18:55 pve03 kernel: [75675.913211] RAX: 0000000000000000 RBX: ffff9f8823ca6a80 RCX: ffffaf18a5597e20
Jan  3 16:18:55 pve03 kernel: [75675.914247] RDX: 0000000000000000 RSI: ffffaf18a5597e20 RDI: ffff9f8d21015300
Jan  3 16:18:55 pve03 kernel: [75675.915273] RBP: ffffaf18a5597db8 R08: 0000000000000010 R09: 0000000000000000
Jan  3 16:18:55 pve03 kernel: [75675.916308] R10: 000000000000000f R11: 000000000000001a R12: ffff9f8823ca6b10
Jan  3 16:18:55 pve03 kernel: [75675.917355] R13: ffffffffb5827dd0 R14: ffff9f8d23c55f00 R15: ffff9f8823ca6b00
Jan  3 16:18:55 pve03 kernel: [75675.918410] FS:  0000000000000000(0000) GS:ffff9f8d2b9c0000(0000) knlGS:0000000000000000
Jan  3 16:18:55 pve03 kernel: [75675.919472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  3 16:18:55 pve03 kernel: [75675.920524] CR2: 0000000000000014 CR3: 0000000890980003 CR4: 00000000000206e0

James Crook · Jan 13, 2020

We mount our backup location via NFS, so we get quite a spike in traffic.

I didn't boot on an older kernel, as the jump between the two kernel versions was huge.

Got ours running on a node with older HP firmware and so far it seems ok.

James Crook · Jan 13, 2020

So, a little bit of hardware info.

DL380 Gen9

The crashed system was:
System ROM: P89 v2.60 (05/21/2018)
System ROM Date: 05/21/2018
CPU: E5-2620 v3, with microcode loaded 2019-03-01

Though we have been running fine for 2 days on a node with older firmware:
System ROM: P89 v2.52 (10/25/2017)
System ROM Date: 10/25/2017
CPU: E5-2620 v3, without microcode

James Crook · Jan 20, 2020

So at the end of the rabbit hole, I think I found this:
https://about.gitlab.com/blog/2018/11/14/how-we-spent-two-weeks-hunting-an-nfs-bug/
I checked the output of mount -v | grep and, sure enough, Node1 had the NFS share mounted as v4 and Node2 had it mounted as v3.

I've changed both to use v3 but have not had the go-ahead to move containers back.
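
To run the same check on several nodes, here is a minimal sketch that reads mount-table text in the `/proc/mounts` format and reports the `vers=` option of every NFS mount. The function name `nfs_mount_versions` is made up for this example; it assumes NFS mounts show up with filesystem type `nfs` or `nfs4` and carry a `vers=` entry in their option list.

```python
def nfs_mount_versions(mounts_text):
    """Map each NFS mount point to the value of its vers= option (None if absent)."""
    versions = {}
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mount_point = fields[1]
        options = fields[3].split(",")
        vers = next(
            (opt.split("=", 1)[1] for opt in options if opt.startswith("vers=")),
            None,
        )
        versions[mount_point] = vers
    return versions
```

On a live node it could be called as `nfs_mount_versions(open("/proc/mounts").read())`; any mount point that still reports a version starting with `4` has not yet been remounted as v3.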

TheGrandWazoo · Jan 21, 2020

James,
Are you saying to use version 3 instead of version 4? Both of my nodes are mounted as version 4 via the `mount -v` output.

James Crook · Jan 21, 2020

TheGrandWazoo said:
James,
Are you saying to use version 3 instead of version 4? Both of my nodes are mounted as version 4 via the `mount -v` output.

Yes, the article points to an issue in version 4.0 and maybe 4.1.
My node that hasn't crashed has NFS mounted using version 3.

TheGrandWazoo · Jan 22, 2020

James Crook said:
Yes, the article points to an issue in version 4.0 and maybe 4.1.
My node that hasn't crashed has NFS mounted using version 3.

OK, thanks for that. I will give version 3 a try. There is a lot of hard work and info in that article, but I was not sure whether it was fixed in any of the kernels Proxmox is using.
Thanks for posting it.

James Crook · Jan 22, 2020

TheGrandWazoo said:
OK, thanks for that. I will give version 3 a try. There is a lot of hard work and info in that article, but I was not sure whether it was fixed in any of the kernels Proxmox is using.
Thanks for posting it.

You're right; looking at it, it's about kernels 4.14 and 4.19.

I can't confirm that the NFS change is the fix, as we haven't moved the containers back yet, and the microcode updates in the BIOS mentioned above are another difference between the nodes.

zeha · Jan 26, 2020

Just to add in to this, we also see this.

However, we also see hanging `corosync-quorum` processes, like this:

Code:

root     48408  0.0  0.0      0     0 ?        D    Jan25   0:00 [corosync-quorum]
root     48454  0.0  0.0      0     0 ?        D    04:25   0:00 [corosync-quorum]
root     48559  0.0  0.0      0     0 ?        D    12:22   0:00 [corosync-quorum]
root     48570  0.0  0.0      0     0 ?        D    Jan25   0:00 [corosync-quorum]
root     48578  0.0  0.0      0     0 ?        D    Jan25   0:00 [corosync-quorum]
root     48609  0.0  0.0      0     0 ?        D    14:23   0:00 [corosync-quorum]
root     48650  0.0  0.0      0     0 ?        D    Jan25   0:00 [corosync-quorum]
root     48673  0.0  0.0      0     0 ?        D    18:27   0:00 [corosync-quorum]
root     48682  0.0  0.0      0     0 ?        D    00:24   0:00 [corosync-quorum]
root     48709  0.0  0.0      0     0 ?        D    10:23   0:00 [corosync-quorum]

They appear to be corosync-quorumtool processes that didn't launch or terminate properly.
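
A listing like the one above can be produced by filtering `ps aux` output for processes in uninterruptible sleep (state `D`). The sketch below is one way to do that; the function name `uninterruptible_processes` is made up for this example, and it assumes the standard `ps aux` column order, where STAT is the eighth column.

```python
def uninterruptible_processes(ps_output):
    """Return (pid, command) for each `ps aux` row whose STAT starts with 'D'."""
    rows = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skips the header line and anything malformed
        if fields[7].startswith("D"):
            rows.append((int(fields[1]), fields[10]))
    return rows
```

The input can come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. A growing number of `D`-state entries over time is consistent with processes stuck waiting on the kernel, as described above.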

dmesg:

Code:

[1304379.837825] BUG: kernel NULL pointer dereference, address: 0000000000000014
[1304379.837853] #PF: supervisor read access in kernel mode
[1304379.837866] #PF: error_code(0x0000) - not-present page
[1304379.837878] PGD 0 P4D 0
[1304379.837888] Oops: 0000 [#1] SMP PTI
[1304379.837899] CPU: 34 PID: 26203 Comm: kworker/34:3 Tainted: P           O      5.3.13-1-pve #1
[1304379.837918] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
[1304379.837943] Workqueue: events key_garbage_collector
[1304379.837957] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
[1304379.837970] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[1304379.838018] RSP: 0018:ffffb82996a03db8 EFLAGS: 00010246
[1304379.838030] RAX: 0000000000000000 RBX: ffff9a7a21a43380 RCX: ffffb82996a03e20
[1304379.838045] RDX: 0000000000000000 RSI: ffffb82996a03e20 RDI: ffff9a923b4e6b00
[1304379.838060] RBP: ffffb82996a03db8 R08: 0000000000000010 R09: 0000000000000000
[1304379.838074] R10: 000000000000000f R11: 00000000fffffff0 R12: ffff9a7a21a43410
[1304379.838850] R13: ffffffffbc027dd0 R14: ffff9afbbbffeb00 R15: ffff9a7a21a43408
[1304379.839593] FS:  0000000000000000(0000) GS:ffff9a7a5ff80000(0000) knlGS:0000000000000000
[1304379.840311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1304379.841000] CR2: 0000000000000014 CR3: 0000001e3ac0a006 CR4: 00000000007626e0
[1304379.841706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1304379.842399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1304379.843126] PKRU: 55555554
[1304379.843829] Call Trace:
[1304379.844535]  assoc_array_subtree_iterate+0x5c/0x100
[1304379.845237]  assoc_array_iterate+0x19/0x20
[1304379.845932]  keyring_gc+0x43/0x80
[1304379.846610]  key_garbage_collector+0x35a/0x400
[1304379.847278]  process_one_work+0x20f/0x3d0
[1304379.847928]  worker_thread+0x34/0x400
[1304379.848563]  kthread+0x120/0x140
[1304379.849181]  ? process_one_work+0x3d0/0x3d0
[1304379.849788]  ? __kthread_parkme+0x70/0x70
[1304379.850381]  ret_from_fork+0x35/0x40
[1304379.850955] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp dm_service_time iptable_filter bpfilter openvswitch nsh nf_conncount nf_nat softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common isst_if_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel zfs(PO) aesni_intel aes_x86_64 zunicode(PO) zlua(PO) crypto_simd cryptd glue_helper zavl(PO) intel_cstate icp(PO) ipmi_ssif mgag200 drm_vram_helper ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ioatdma mei_me hpilo mei intel_rapl_perf dca ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler acpi_tad mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_vs nf_conntrack nf_defrag_ipv6
[1304379.850987]  nf_defrag_ipv4 ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ses enclosure bnx2x smartpqi mdio scsi_transport_sas libcrc32c uas bnxt_en usb_storage lpc_ich tg3 wmi
[1304379.857397] CR2: 0000000000000014
[1304379.857963] ---[ end trace a674b2ef35aa61b3 ]---
[1304379.991496] RIP: 0010:keyring_gc_check_iterator+0x30/0x40
[1304379.992138] Code: 48 83 e7 fc b8 01 00 00 00 48 89 e5 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[1304379.993272] RSP: 0018:ffffb82996a03db8 EFLAGS: 00010246
[1304379.993844] RAX: 0000000000000000 RBX: ffff9a7a21a43380 RCX: ffffb82996a03e20
[1304379.994424] RDX: 0000000000000000 RSI: ffffb82996a03e20 RDI: ffff9a923b4e6b00
[1304379.994999] RBP: ffffb82996a03db8 R08: 0000000000000010 R09: 0000000000000000
[1304379.995580] R10: 000000000000000f R11: 00000000fffffff0 R12: ffff9a7a21a43410
[1304379.996147] R13: ffffffffbc027dd0 R14: ffff9afbbbffeb00 R15: ffff9a7a21a43408
[1304379.996716] FS:  0000000000000000(0000) GS:ffff9a7a5ff80000(0000) knlGS:0000000000000000
[1304379.997280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1304379.997833] CR2: 0000000000000014 CR3: 0000001e3ac0a006 CR4: 00000000007626e0
[1304379.998384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1304379.998930] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1304379.999479] PKRU: 55555554

harrijs · Feb 4, 2020

I just want to update this thread to state that this behavior is still present on Kernel 5.3.13-2. On the particular host that the lxc is running on, we loaded Kernel 4.15.18-24 and everything is running as expected.

James Crook · Feb 5, 2020

harrijs said:
I just want to update this thread to state that this behavior is still present on Kernel 5.3.13-2. On the particular host that the lxc is running on, we loaded Kernel 4.15.18-24 and everything is running as expected.

Was it using NFS v4?

pizza · Feb 11, 2020

harrijs said:
I just want to update this thread to state that this behavior is still present on Kernel 5.3.13-2. On the particular host that the lxc is running on, we loaded Kernel 4.15.18-24 and everything is running as expected.

I have the same problem on an HP G7; kernel 5.3.13-3 also gives the kernel oops error. I/O wait on NFS goes up, and only resetting the node helps. I have set NFS to v3 in Proxmox. If you change NFS v4 to NFS v3 in the Proxmox storage settings, a reboot is needed; otherwise mount -v says it is still v4.

Here is a German topic: https://forum.proxmox.com/threads/hohe-load-kernel-oops-reboot-unmöglich.63071/#post-292282

Oliver Polterauer · Feb 12, 2020

pizza said:
I have the same problem on an HP G7; kernel 5.3.13-3 also gives the kernel oops error. I/O wait on NFS goes up, and only resetting the node helps. I have set NFS to v3 in Proxmox. If you change NFS v4 to NFS v3 in the Proxmox storage settings, a reboot is needed; otherwise mount -v says it is still v4.

Here is a German topic: https://forum.proxmox.com/threads/hohe-load-kernel-oops-reboot-unmöglich.63071/#post-292282

Actually, a restart is not needed if you can free up the node and move the resources away (for example to another host, or mounted ISO files).

We have also changed to v3 now, because we were also running into this bug several times.

pizza · Feb 12, 2020

Oliver Polterauer said:
Actually, a restart is not needed if you can free up the node and move the resources away (for example to another host, or mounted ISO files).

We have also changed to v3 now, because we were also running into this bug several times.

Unmounting the NFS v4 shares is indeed enough; Proxmox remounts the NFS shares with v3 afterwards.

harrijs · Feb 16, 2020

I did some additional digging on this, since my PVE hosts were already mounting the NFS share as v3. I had a container on the offending box that was mounting a share using NFS v4. I modified this mount to use v3, then updated to the latest 5.3.18-1 kernel, and everything has been running smoothly for 24 hours now. I will be interested to track this issue to see when v4 becomes workable again.
