PVE 7.0 BUG: kernel NULL pointer dereference, address: 00000000000000c0-PF:error_code(0x0000) - No web access no ssh

BrandonN · Sep 21, 2021

Hello for Everyone, We updated to pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-4-pve) two week ago and after that I found this bug that disable the proxmox web and ssh access and stop somehow that I have not been able to understand. I need to force reboot and the services starts ok, I used zpool and all it's ok, before the upgrade I make all the necesary configurations to be ready to update :

Switch Legacy-Boot to Proxmox Boot Tool as here says https://pve.proxmox.com/wiki/ZFS:_Switch_Legacy-Boot_to_Proxmox_Boot_Tool
We have CT with older cgroup, so we had to change cgroup version as here says https://pve.proxmox.com/pve-docs/chapter-pct.html#pct_cgroup_compat

The server was updated and starts fine as I told but I don't know how to solve this kernel BUG (you can see scroll down in bold). Our server is Dell Power Edge

Thanks in advance about any help.

Sep 20 16:03:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:03:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:03:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:04:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:04:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:04:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:05:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:05:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:05:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:06:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:06:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:06:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:07:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:07:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:07:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:08:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:08:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:08:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:09:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:09:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:09:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:10:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:10:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:10:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:11:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Sep 20 16:11:02 proxmox systemd[1]: pvesr.service: Succeeded.
Sep 20 16:11:02 proxmox systemd[1]: Finished Proxmox VE replication runner.
Sep 20 16:11:25 proxmox kernel: BUG: kernel NULL pointer dereference, address: 00000000000000c0
Sep 20 16:11:25 proxmox kernel: #PF: supervisor read access in kernel mode
Sep 20 16:11:25 proxmox kernel: #PF: error_code(0x0000) - not-present page
Sep 20 16:11:25 proxmox kernel: PGD 0 P4D 0
Sep 20 16:11:25 proxmox kernel: Oops: 0000 [#1] SMP PTI
Sep 20 16:11:25 proxmox kernel: CPU: 7 PID: 398 Comm: kworker/7:1H Tainted: P O 5.11.22-4-pve #1
Sep 20 16:11:25 proxmox kernel: Hardware name: Dell Inc. PowerEdge R720/046V88, BIOS 2.9.0 12/06/2019
Sep 20 16:11:25 proxmox kernel: Workqueue: kblockd blk_mq_timeout_work
Sep 20 16:11:25 proxmox kernel: RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
Sep 20 16:11:25 proxmox kernel: Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
Sep 20 16:11:25 proxmox kernel: RSP: 0018:ffff9b890ecb7d68 EFLAGS: 00010287
Sep 20 16:11:25 proxmox kernel: RAX: 0000000000000000 RBX: ffff9b890ecb7de8 RCX: 0000000000000002
Sep 20 16:11:25 proxmox kernel: RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff8f4bff31c000
Sep 20 16:11:25 proxmox kernel: RBP: ffff9b890ecb7da0 R08: 0000000000000000 R09: 000000000000003b
Sep 20 16:11:25 proxmox kernel: R10: 0000000000000008 R11: 0000000000000008 R12: ffff8f4bff31c000
Sep 20 16:11:25 proxmox kernel: R13: ffff8f4bff31b400 R14: 0000000000000000 R15: 0000000000000001
Sep 20 16:11:25 proxmox kernel: FS: 0000000000000000(0000) GS:ffff8f7b2f8c0000(0000) knlGS:0000000000000000
Sep 20 16:11:25 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 20 16:11:25 proxmox kernel: CR2: 00000000000000c0 CR3: 000000305b246005 CR4: 00000000000626e0
Sep 20 16:11:25 proxmox kernel: Call Trace:
Sep 20 16:11:25 proxmox kernel: ? bt_iter+0x54/0x90
Sep 20 16:11:25 proxmox kernel: blk_mq_queue_tag_busy_iter+0x1a2/0x2d0
Sep 20 16:11:25 proxmox kernel: ? blk_mq_put_rq_ref+0x60/0x60
Sep 20 16:11:25 proxmox kernel: ? blk_mq_put_rq_ref+0x60/0x60
Sep 20 16:11:25 proxmox kernel: blk_mq_timeout_work+0x5f/0x120
Sep 20 16:11:25 proxmox kernel: process_one_work+0x220/0x3c0
Sep 20 16:11:25 proxmox kernel: worker_thread+0x53/0x420
Sep 20 16:11:25 proxmox kernel: ? process_one_work+0x3c0/0x3c0
Sep 20 16:11:25 proxmox kernel: kthread+0x12b/0x150
Sep 20 16:11:25 proxmox kernel: ? set_kthread_struct+0x50/0x50
Sep 20 16:11:25 proxmox kernel: ret_from_fork+0x22/0x30
Sep 20 16:11:25 proxmox kernel: Modules linked in: binfmt_misc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute veth nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace nfs_ssc fscache ebtable_filter ebtables ip_set bonding tls softdog ip6table_nat ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_security iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw xt_tcpudp iptable_filter bpfilter nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel ipmi_ssif mgag200 drm_kms_helper crypto_simd cryptd cec glue_helper rc_core i2c_algo_bit fb_sys_fops syscopyarea dcdbas rapl sysfillrect pcspkr joydev mei_me sysimgblt input_leds mei intel_cstate ipmi_si ipmi_devintf mac_hid ipmi_msghandler acpi_power_meter vhost_net vhost vhost_iotlb tap ib_iser
Sep 20 16:11:25 proxmox kernel: rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) hid_logitech_hidpp btrfs blake2b_generic xor raid6_pq libcrc32c hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid ehci_pci crc32_pclmul lpc_ich ehci_hcd megaraid_sas tg3 wmi
Sep 20 16:11:25 proxmox kernel: CR2: 00000000000000c0
Sep 20 16:11:25 proxmox kernel: ---[ end trace 43b5fd3492cb5d6d ]---
Sep 20 16:11:25 proxmox kernel: RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
Sep 20 16:11:25 proxmox kernel: Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
Sep 20 16:11:25 proxmox kernel: RSP: 0018:ffff9b890ecb7d68 EFLAGS: 00010287
Sep 20 16:11:25 proxmox kernel: RAX: 0000000000000000 RBX: ffff9b890ecb7de8 RCX: 0000000000000002
Sep 20 16:11:25 proxmox kernel: RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff8f4bff31c000
Sep 20 16:11:25 proxmox kernel: RBP: ffff9b890ecb7da0 R08: 0000000000000000 R09: 000000000000003b
Sep 20 16:11:25 proxmox kernel: R10: 0000000000000008 R11: 0000000000000008 R12: ffff8f4bff31c000
Sep 20 16:11:25 proxmox kernel: R13: ffff8f4bff31b400 R14: 0000000000000000 R15: 0000000000000001
Sep 20 16:11:25 proxmox kernel: FS: 0000000000000000(0000) GS:ffff8f7b2f8c0000(0000) knlGS:0000000000000000
Sep 20 16:11:25 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 20 16:11:25 proxmox kernel: CR2: 00000000000000c0 CR3: 000000305b246005 CR4: 00000000000626e0
-- Reboot --
Sep 20 17:25:50 proxmox kernel: Linux version 5.11.22-4-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) ()
Sep 20 17:25:50 proxmox kernel: Command line: BOOT_IMAGE=/vmlinuz-5.11.22-4-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs systemd.unified_cgroup_hierarchy=0 quiet

t.lamprecht · Sep 21, 2021

Hi,

when does the issue occurs? Randomly, or on a lot of IO, or just after the system ran for a while?

But anyhow, with the kernel oops you posted I got enough info to check a bit around and I have a feeling that this is a regression from:
https://git.kernel.org/pub/scm/linu...y&id=a3362ff0433b9cbd545c35baae59c5788b766e53

Which came in through an upstream kernel-stable update a few weeks ago, and was later fixed by another commit that is not yet included in our kernel:
https://git.kernel.org/pub/scm/linu...y&id=ceffaa61b5bb5296e07cb7f4f494377eb659058f

We'll pick that one up for the next kernel build, in the meantime you could try using an older 5.11 version, e.g., by installing pve-kernel-5.11.22-3-pve (if not already installed) and manually select that on next boot.

cyp · Sep 21, 2021

Hi,

Same problem here, reproduced on two different machine (same hardware) with 5.11.22-4-pve.
It occurs after a few days.

Code:

Sep  7 09:11:51 pve12 kernel: [65320.444899] BUG: kernel NULL pointer dereference, address: 0000000000000000
Sep  7 09:11:51 pve12 kernel: [65320.444941] #PF: supervisor instruction fetch in kernel mode
Sep  7 09:11:51 pve12 kernel: [65320.444965] #PF: error_code(0x0010) - not-present page
Sep  7 09:11:51 pve12 kernel: [65320.444986] PGD 0 P4D 0
Sep  7 09:11:51 pve12 kernel: [65320.445000] Oops: 0010 [#1] SMP NOPTI
Sep  7 09:11:51 pve12 kernel: [65320.445017] CPU: 4 PID: 542 Comm: kworker/4:1H Tainted: P           O      5.11.22-4-pve #1
Sep  7 09:11:51 pve12 kernel: [65320.445049] Hardware name: Quanta Cloud Technology Inc. QuantaPlex T22HF-1U/S5HF MB, BIOS 3A05.ON02 03/20/2019
Sep  7 09:11:51 pve12 kernel: [65320.445083] Workqueue: kblockd blk_mq_timeout_work

Error it's not always logged because the FS seems to be the first affected(Server and VM respond to bing, screen continue to display errors...).

BrandonN · Sep 21, 2021

t.lamprecht said:
Hi,

when does the issue occurs? Randomly, or on a lot of IO, or just after the system ran for a while?

But anyhow, with the kernel oops you posted I got enough info to check a bit around and I have a feeling that this is a regression from:
https://git.kernel.org/pub/scm/linu...y&id=a3362ff0433b9cbd545c35baae59c5788b766e53

Which came in through an upstream kernel-stable update a few weeks ago, and was later fixed by another commit that is not yet included in our kernel:
https://git.kernel.org/pub/scm/linu...y&id=ceffaa61b5bb5296e07cb7f4f494377eb659058f

We'll pick that one up for the next kernel build, in the meantime you could try using an older 5.11 version, e.g., by installing pve-kernel-5.11.22-3-pve (if not already installed) and manually select that on next boot.

Hi and thanks.
It occurs when the system ran for a while, right now I didn't found any specific thing to occurs because the server has the same use than before update.
I'll try as you guided me, thanks a lot!

cyp · Sep 21, 2021

To give more info on my case, it was two fresh reinstall on servers previously running fine on Proxmox 6.
The servers where removed from a PVE6/CEPH cluster, reinstalled from PVE7 iso and joined to another PVE7/CEPH cluster (formed by same hardware servers running fine with previous kernel version).
I downgraded the servers to 5.11.22-3-pve, waiting few days to see if it's fixed.

Joe How · Sep 21, 2021

This very same happens to me last week. Yesterday it destroyed all my OSD so I could manually activate all OSDs. Today on the one node of three cause havoc with this messagess

Code:

get_health_metrics reporting 1 slow ops, oldest is osd_op(client.104774156.0:5385119 2.3f3 2:cff4b302:::rbd_data.0d04dc79291870.0000000000000008:head [write 2781184~4096 in=4096b] snapc 9860=[] ondisk+write+known_if_redirected e77207)

My CPU is :
56 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (2 Sockets)

The Kernel bug:

Code:

[Tue Sep 21 15:44:21 2021] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue Sep 21 15:44:21 2021] fwbr146i0: port 2(veth146i0) entered blocking state
[Tue Sep 21 15:44:21 2021] fwbr146i0: port 2(veth146i0) entered forwarding state
[Tue Sep 21 15:58:23 2021] STREAM_RECEIVER[2056031]: segfault at 50 ip 000055b7278387a5 sp 00007fae31849e40 error 4 in netdata[55b7276f7000+1ff000]
[Tue Sep 21 15:58:23 2021] Code: 84 c0 0f 84 0c 01 00 00 4c 89 c6 41 be c5 9d 1c 81 0f 1f 40 00 45 69 f6 93 01 00 01 48 83 c6 01 41 31 c6 0f b6 06 84 c0 75 eb <39> 6b 10 74 1e 66 0f 1f 44 00 00 48 8b 43 30 49 89 dd 48 85 c0 0f
[Tue Sep 21 16:21:32 2021] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[Tue Sep 21 16:21:32 2021] #PF: supervisor read access in kernel mode
[Tue Sep 21 16:21:32 2021] #PF: error_code(0x0000) - not-present page
[Tue Sep 21 16:21:32 2021] PGD 0 P4D 0
[Tue Sep 21 16:21:32 2021] Oops: 0000 [#1] SMP PTI
[Tue Sep 21 16:21:32 2021] CPU: 5 PID: 2067368 Comm: PLUGIN[proc] Tainted: P           O      5.11.22-4-pve #1
[Tue Sep 21 16:21:32 2021] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.11.0 11/02/2019
[Tue Sep 21 16:21:32 2021] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[Tue Sep 21 16:21:32 2021] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[Tue Sep 21 16:21:32 2021] RSP: 0018:ffffc04fcc753b08 EFLAGS: 00010287
[Tue Sep 21 16:21:32 2021] RAX: 0000000000000000 RBX: ffffc04fcc753b88 RCX: 0000000000000002
[Tue Sep 21 16:21:32 2021] RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff9902ca03d400
[Tue Sep 21 16:21:32 2021] RBP: ffffc04fcc753b40 R08: 0000000000000000 R09: 000000000000003a
[Tue Sep 21 16:21:32 2021] R10: 00000000000005e5 R11: 000000000000000d R12: ffff9902ca03d400
[Tue Sep 21 16:21:32 2021] R13: ffff9902ca00cc00 R14: 0000000000000000 R15: 0000000000000001
[Tue Sep 21 16:21:32 2021] FS:  00007f1abff2eb00(0000) GS:ffff9931ffa80000(0000) knlGS:0000000000000000
[Tue Sep 21 16:21:32 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Sep 21 16:21:32 2021] CR2: 00000000000000c0 CR3: 0000002dfbd5a004 CR4: 00000000003726e0
[Tue Sep 21 16:21:32 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue Sep 21 16:21:32 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue Sep 21 16:21:32 2021] Call Trace:
[Tue Sep 21 16:21:32 2021]  ? bt_iter+0x54/0x90
[Tue Sep 21 16:21:32 2021]  blk_mq_queue_tag_busy_iter+0x1a2/0x2d0
[Tue Sep 21 16:21:32 2021]  ? blk_add_rq_to_plug+0x50/0x50
[Tue Sep 21 16:21:32 2021]  ? blk_add_rq_to_plug+0x50/0x50
[Tue Sep 21 16:21:32 2021]  blk_mq_in_flight+0x38/0x60
[Tue Sep 21 16:21:32 2021]  diskstats_show+0x159/0x300
[Tue Sep 21 16:21:32 2021]  seq_read_iter+0x2c6/0x4b0
[Tue Sep 21 16:21:32 2021]  proc_reg_read_iter+0x51/0x80
[Tue Sep 21 16:21:32 2021]  new_sync_read+0x10d/0x190
[Tue Sep 21 16:21:32 2021]  vfs_read+0x15a/0x1c0
[Tue Sep 21 16:21:32 2021]  ksys_read+0x67/0xe0
[Tue Sep 21 16:21:32 2021]  __x64_sys_read+0x1a/0x20
[Tue Sep 21 16:21:32 2021]  do_syscall_64+0x38/0x90
[Tue Sep 21 16:21:32 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Tue Sep 21 16:21:32 2021] RIP: 0033:0x7f1ac16373c3
[Tue Sep 21 16:21:32 2021] Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 53 de ff ff 48 63 ff 50 b8 e5 00 00 00 0f 05 48 89 c7 e8 19
[Tue Sep 21 16:21:32 2021] RSP: 002b:00007f1abff2c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Tue Sep 21 16:21:32 2021] RAX: ffffffffffffffda RBX: 00007f1abff2eb00 RCX: 00007f1ac16373c3
[Tue Sep 21 16:21:32 2021] RDX: 0000000000002800 RSI: 0000555558f8a430 RDI: 0000000000000028
[Tue Sep 21 16:21:32 2021] RBP: 0000555558f89000 R08: 0000000000000000 R09: 0000000000000000
[Tue Sep 21 16:21:32 2021] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 21 16:21:32 2021] R13: 0000000000000028 R14: 0000555558f8a430 R15: 00007f1ac185e980
[Tue Sep 21 16:21:32 2021] Modules linked in: option usb_wwan usbserial nft_counter xt_state nft_compat nf_tables veth rbd libceph ebt_arp ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_mark xt_set xt_physdev xt_addrtype xt_comment xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel iptable_filter 8021q garp mrp bonding tls ipmi_watchdog xt_tcpudp xt_DSCP iptable_mangle bpfilter nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper dcdbas rapl mgag200 intel_cstate drm_kms_helper pcspkr cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt mxm_wmi mei_me mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO)
[Tue Sep 21 16:21:32 2021]  zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32_pclmul ixgbe igb ahci xfrm_algo ehci_pci i2c_algo_bit mdio lpc_ich libahci ehci_hcd dca megaraid_sas wmi
[Tue Sep 21 16:21:32 2021] CR2: 00000000000000c0
[Tue Sep 21 16:21:32 2021] ---[ end trace 02bc89a81abdd48a ]---
[Tue Sep 21 16:21:32 2021] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[Tue Sep 21 16:21:32 2021] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[Tue Sep 21 16:21:32 2021] RSP: 0018:ffffc04fcc753b08 EFLAGS: 00010287
[Tue Sep 21 16:21:32 2021] RAX: 0000000000000000 RBX: ffffc04fcc753b88 RCX: 0000000000000002
[Tue Sep 21 16:21:32 2021] RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff9902ca03d400
[Tue Sep 21 16:21:32 2021] RBP: ffffc04fcc753b40 R08: 0000000000000000 R09: 000000000000003a
[Tue Sep 21 16:21:32 2021] R10: 00000000000005e5 R11: 000000000000000d R12: ffff9902ca03d400
[Tue Sep 21 16:21:32 2021] R13: ffff9902ca00cc00 R14: 0000000000000000 R15: 0000000000000001
[Tue Sep 21 16:21:32 2021] FS:  00007f1abff2eb00(0000) GS:ffff9931ffa80000(0000) knlGS:0000000000000000
[Tue Sep 21 16:21:32 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Sep 21 16:21:32 2021] CR2: 00000000000000c0 CR3: 0000002dfbd5a004 CR4: 00000000003726e0
[Tue Sep 21 16:21:32 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue Sep 21 16:21:32 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue Sep 21 16:26:36 2021] libceph: osd0 (1)10.0.100.1:6841 socket closed (con state OPEN)
[Tue Sep 21 16:26:36 2021] libceph: osd2 (1)10.0.100.1:6865 socket closed (con state OPEN)
[Tue Sep 21 16:26:36 2021] libceph: osd3 (1)10.0.100.1:6801 socket closed (con state OPEN)
[Tue Sep 21 16:26:36 2021] libceph: osd7 (1)10.0.100.1:6857 socket closed (con state OPEN)
[Tue Sep 21 16:26:36 2021] libceph: osd6 (1)10.0.100.1:6825 socket closed (con state OPEN)
[Tue Sep 21 16:26:36 2021] libceph: osd5 (1)10.0.100.1:6833 socket closed (con state OPEN)

t.lamprecht · Sep 22, 2021

FYI, there's a new kernel package available on the pvetest repository for Proxmox VE 7.x. It includes the proposed fix for the regression of the suspicious patch I linked in my previous answer, it's pve-kernel-5.11.22-4-pve in version 5.11.22-9.

Note that due to not being able to reproduce the exact issue here (neither our production nor test lab systems showed such a crash) I cannot guarantee that this is in fact the full fix, but gut feeling tells me that's not unlikely and at least won't hurt, so feedback would be welcomed.

BrandonN · Sep 22, 2021

t.lamprecht said:
FYI, there's a new kernel package available on the pvetest repository for Proxmox VE 7.x. It includes the proposed fix for the regression of the suspicious patch I linked in my previous answer, it's pve-kernel-5.11.22-4-pve in version 5.11.22-9.

Note that due to not being able to reproduce the exact issue here (neither our production nor test lab systems showed such a crash) I cannot guarantee that this is in fact the full fix, but gut feeling tells me that's not unlikely and at least won't hurt, so feedback would be welcomed.

TYSM Thomas, As soon as I can I'll try and write the feedback.

Joe How · Sep 22, 2021

Our production suffered over the last few days so much, that I'll stick to the pve-kernel-5.11.22-3-pve version as it turns out to be stable in the last 24hrs until it will be fully confirm. THX anyway

moxfan · Sep 22, 2021

Joe How said:
Our production suffered over the last few days so much, that I'll stick to the pve-kernel-5.11.22-3-pve version as it turns out to be stable in the last 24hrs until it will be fully confirm. THX anyway

Is it possible to downgrade to pve-kernel-5.11.22-3-pve without having to choose that version manually on reboot ? I only have ssh access to the server.

Joe How · Sep 22, 2021

moxfan said:
Is it possible to downgrade to pve-kernel-5.11.22-3-pve without having to choose that version manually on reboot ? I only have ssh access to the server.

Hi, you can follow this thread

https://unix.stackexchange.com/a/327686

TwiX · Sep 22, 2021

Hi,

My cluster was also affected by this issue.
kernel 5.11.22-9 installed.

I keep you in touch too.

May I say that packages are moving so fast from testing/no-subscription to enterprise repository ?

t.lamprecht · Sep 22, 2021

moxfan said:
Is it possible to downgrade to pve-kernel-5.11.22-3-pve without having to choose that version manually on reboot ? I only have ssh access to the server.

Why not try out the proposed kernel instead? It may well fix the problem for good and does not require pinning kernel versions.

t.lamprecht · Sep 22, 2021

TwiX said:
May I say that packages are moving so fast from testing/no-subscription to enterprise repository ?

I mean it certainly is a nuisance if one is hit by a regression in their setup, so I get the sentiment

The problematic kernel was out for about two weeks before it got to enterprise, IIRC, and until then the issue did not pop up anywhere (it required almost four weeks for that), I mean at least I did not notice any report, so it was deemed to be safe enough for going on enterprise - especially as our whole production setups (and there are a few) run on it since quite early in the release stack. I certainly do not want to make up excuses here, just trying to give some context and slightly ranting about HW/setup specific issues; those are always a bit of a PITA

In general, we always need to make a balance between long enough on each non-production repo so that hopefully any issue turns up but not too long as then important bug and security fixes are delayed, so it's a bit of a trade-off.

moxfan · Sep 22, 2021

Joe How said:
Hi, you can follow this thread

https://unix.stackexchange.com/a/327686

Thanks. That has helped. The thing is, on a fresh 7.x install, there is no pve-kernel-5.11.22-3-pve in /boot/grub/grub.cfg. Is there a way to grab and just install the kernel?

Joe How · Sep 23, 2021

Kernel is like any other package, so you can install it by typing this: apt install pve-kernel-5.11.22-3-pve

moxfan · Sep 23, 2021

Joe How said:
Kernel is like any other package, so you can install it by typing this: apt install pve-kernel-5.11.22-3-pve

Excellent, thanks.

comfreak · Sep 29, 2021

Was also affected today with kernel 5.11.22-8. Updating right now...otherwise I will go back to an older kernel as suggested.

xcdr · Oct 18, 2021

Hi.

I have the same problem on many clusters and kernel 5.11.22-4, I already downgrading to 5.11.22-3.

Is there chance to resolve this issue in newer kernels and nearly future?

TwiX · Oct 18, 2021

Already fixed since kernel 5.11.22-9

PVE 7.0 BUG: kernel NULL pointer dereference, address: 00000000000000c0-PF:error_code(0x0000) - No web access no ssh

Member

Attachments

Proxmox Staff Member

Renowned Member

Member

Renowned Member

Active Member

Proxmox Staff Member

Member

Active Member

Active Member

Active Member

Renowned Member

Proxmox Staff Member

Proxmox Staff Member

Active Member

Active Member

Active Member

Well-Known Member

Renowned Member

Renowned Member

We value your privacy