Kernel failure - PVE 7

TwiX

Hi,

Yesterday a fully updated PVE 7 node crashed. It is a Dell R440 with Xeon(R) Silver 4210 CPU @ 2.20GHz (2 sockets) and 256 GB RAM.

proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-5
pve-kernel-5.0: 6.0-11
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Sep 15 19:20:43 dc-prox-25 kernel: [103877.293790] PGD 0 P4D 0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293801] Oops: 0000 [#1] SMP NOPTI
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293815] CPU: 38 PID: 426 Comm: kworker/38:1H Tainted: P IO 5.11.22-4-pve #1
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293840] Hardware name: Dell Inc. PowerEdge R440, BIOS 2.12.2 07/09/2021
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293862] Workqueue: kblockd blk_mq_timeout_work
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293884] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293902] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293953] RSP: 0018:ffffb37e8f34bd68 EFLAGS: 00010287
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293970] RAX: 0000000000000000 RBX: ffffb37e8f34bde8 RCX: 0000000000000002
Sep 15 19:20:43 dc-prox-25 kernel: [103877.293991] RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff94468eee0000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294011] RBP: ffffb37e8f34bda0 R08: 0000000000000000 R09: 0000000000000002
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294032] R10: 0000000000000008 R11: 0000000000000008 R12: ffff94468eee0000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294053] R13: ffff944690e84800 R14: 0000000000000000 R15: 0000000000000001
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294074] FS: 0000000000000000(0000) GS:ffff9446000c0000(0000) knlGS:0000000000000000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294115] CR2: 00000000000000c0 CR3: 000000287ce2c005 CR4: 00000000007726e0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294177] PKRU: 55555554
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294187] Call Trace:
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294198] ? bt_iter+0x54/0x90
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294212] blk_mq_queue_tag_busy_iter+0x1a2/0x2d0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294228] ? blk_mq_put_rq_ref+0x60/0x60
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294243] ? blk_mq_put_rq_ref+0x60/0x60
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294258] blk_mq_timeout_work+0x5f/0x120
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294274] process_one_work+0x220/0x3c0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294292] worker_thread+0x53/0x420
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294306] ? process_one_work+0x3c0/0x3c0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294321] kthread+0x12b/0x150
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294333] ? set_kthread_struct+0x50/0x50
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294347] ret_from_fork+0x1f/0x30
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294363] Modules linked in: veth 8021q garp mrp nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace nfs_ssc fscache ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_physdev xt_addrtype xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_comment xt_set xt_mark ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 irqbypass drm_kms_helper cec crct10dif_pclmul rc_core ghash_clmulni_intel i2c_algo_bit aesni_intel crypto_simd fb_sys_fops syscopyarea cryptd glue_helper dell_smbios rapl zfs(PO) intel_cstate dcdbas sysfillrect mei_me joydev dell_wmi_descriptor wmi_bmof intel_pch_thermal mei sysimgblt pcspkr input_leds efi_pstore acpi_ipmi zunicode(PO) ipmi_si
Sep 15 19:20:43 dc-prox-25 kernel: [103877.294427] zzstd(O) ipmi_devintf zlua(O) zavl(PO) acpi_power_meter ipmi_msghandler icp(PO) mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc drm ip_tables x_tables autofs4 btrfs blake2b_generic xor hid_generic usbmouse usbkbd usbhid hid raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32_pclmul ixgbe xfrm_algo xhci_pci ahci xhci_pci_renesas i2c_i801 dca mdio megaraid_sas tg3 lpc_ich i2c_smbus xhci_hcd libahci wmi
Sep 15 19:20:43 dc-prox-25 kernel: [103877.304501] CR2: 00000000000000c0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.305254] ---[ end trace 913e8515690bbea4 ]---
Sep 15 19:20:43 dc-prox-25 kernel: [103877.338816] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
Sep 15 19:20:43 dc-prox-25 kernel: [103877.339725] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
Sep 15 19:20:43 dc-prox-25 kernel: [103877.341403] RSP: 0018:ffffb37e8f34bd68 EFLAGS: 00010287
Sep 15 19:20:43 dc-prox-25 kernel: [103877.342209] RAX: 0000000000000000 RBX: ffffb37e8f34bde8 RCX: 0000000000000002
Sep 15 19:20:43 dc-prox-25 kernel: [103877.342970] RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff94468eee0000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.343738] RBP: ffffb37e8f34bda0 R08: 0000000000000000 R09: 0000000000000002
Sep 15 19:20:43 dc-prox-25 kernel: [103877.344516] R10: 0000000000000008 R11: 0000000000000008 R12: ffff94468eee0000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.345298] R13: ffff944690e84800 R14: 0000000000000000 R15: 0000000000000001
Sep 15 19:20:43 dc-prox-25 kernel: [103877.346025] FS: 0000000000000000(0000) GS:ffff9446000c0000(0000) knlGS:0000000000000000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.346750] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 15 19:20:43 dc-prox-25 kernel: [103877.347474] CR2: 00000000000000c0 CR3: 000000287ce2c005 CR4: 00000000007726e0
Sep 15 19:20:43 dc-prox-25 kernel: [103877.348205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 15 19:20:43 dc-prox-25 kernel: [103877.348935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 15 19:20:43 dc-prox-25 kernel: [103877.349660] PKRU: 55555554
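
For anyone reading the trace, here is a minimal Python sketch of how the register dump can be cross-checked against the faulting instruction. It assumes the syslog prefix has been stripped from a few lines copied from the oops above, and it only decodes one x86-64 encoding (`48 8b /r` with a 32-bit displacement), which is what the byte marked `<48>` happens to start here. The helper names (`parse_registers`, `faulting_bytes`, `decode_mov_load`, `explain`) are made up for illustration. It only shows which register and offset were involved; it does not say why the pointer was NULL.

```python
import re

# A few lines copied from the oops above, syslog prefix removed.
OOPS_LINES = [
    "RIP: 0010:blk_mq_put_rq_ref+0xa/0x60",
    "Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 "
    "0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 "
    "c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00",
    "RAX: 0000000000000000 RBX: ffffb37e8f34bde8 RCX: 0000000000000002",
    "RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffff94468eee0000",
    "CR2: 00000000000000c0 CR3: 000000287ce2c005 CR4: 00000000007726e0",
]

# Matches "NAME: <16 hex digits>" pairs such as "RAX: 0000000000000000".
REG_RE = re.compile(r"\b([A-Z0-9]{2,3}): ([0-9a-f]{16})\b")

# Register numbering used by the ModRM byte (without REX extension bits).
REG_NAMES = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def parse_registers(lines):
    """Collect register values from the dump into a dict of ints."""
    regs = {}
    for line in lines:
        for name, value in REG_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


def faulting_bytes(code_line):
    """Return the bytes starting at the one the kernel marked with <..>."""
    tokens = code_line.split()[1:]  # drop the leading "Code:"
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


def decode_mov_load(insn):
    """Decode only 'mov r64, [r64 + disp32]' (48 8b /r, ModRM mod=10)."""
    if insn[0] != 0x48 or insn[1] != 0x8B:
        raise ValueError("not a REX.W 'mov r64, r/m64' instruction")
    mod, rm = insn[2] >> 6, insn[2] & 7
    if mod != 0b10 or rm == 0b100:  # rm=100 would need a SIB byte
        raise ValueError("addressing form not handled by this sketch")
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return REG_NAMES[rm], disp


def explain(lines):
    """Return (base register, displacement, computed address, matches CR2)."""
    regs = parse_registers(lines)
    code_line = next(l for l in lines if l.startswith("Code:"))
    base, disp = decode_mov_load(faulting_bytes(code_line))
    addr = regs[base] + disp
    return base, disp, addr, addr == regs["CR2"]


if __name__ == "__main__":
    base, disp, addr, matches = explain(OOPS_LINES)
    print(f"load from [{base} + {disp:#x}] = {addr:#x}, matches CR2: {matches}")
```

In this dump the base register is zero and the computed address equals CR2, which is consistent with a NULL pointer being dereferenced at a fixed offset inside `blk_mq_put_rq_ref`.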

The weird thing is that the network was still OK, and corosync reported that all 6 nodes were reachable.
The GUI showed the affected node grayed out. Its VMs continued to respond to ping but did not work. HA did not work either: there was no fencing for the node in trouble. HA policy is default (conditional).

I hope you have an idea. The node is completely up to date (packages, BIOS, firmware, ...).
I wasn't able to collect dmesg before the manual reboot. :/
 