Kernel Panic with our new Cluster

Hey guys,

we've bought some new hardware and are seeing this kernel panic on several machines (they all have identical hardware, yet only some of them panic):

Code:
[   20.699939] ------------[ cut here ]------------
[   20.700608] kernel BUG at mm/slub.c:306!
[   20.701277] invalid opcode: 0000 [#1] SMP NOPTI
[   20.701900] CPU: 1 PID: 304 Comm: kworker/1:2 Tainted: P           OE     5.4.41-1-pve #1
[   20.702531] Hardware name: Supermicro SYS-2029BT-HNC0R/X11DPT-B, BIOS 3.2 10/19/2019
[   20.703165] Workqueue: infiniband ib_cache_event_task [ib_core]
[   20.703794] RIP: 0010:__slab_free+0x18d/0x340
[   20.704409] Code: fa 66 0f 1f 44 00 00 f0 49 0f ba 2c 24 00 0f 82 95 00 00 00 4d 3b 6c 24 20 74 11 49 0f ba 34 24 00 57 9d 0f 1f 44 00 00 eb 9b <0f> 0b 49 3b 5c 24 28 75 e8 48 8b 44 24 28 49 89 4c 24 28 49 89 44
[   20.705642] RSP: 0018:ffffb334cd577d10 EFLAGS: 00010246
[   20.706243] RAX: ffff959684a633e0 RBX: 000000008200000a RCX: ffff959684a633e0
[   20.706853] RDX: ffff959684a633e0 RSI: ffffde299f1298c0 RDI: ffff9596a0407b80
[   20.707462] RBP: ffffb334cd577db0 R08: 0000000000000001 R09: ffffffffc020a4bf
[   20.708106] R10: ffff959684a633e0 R11: 0000000000000001 R12: ffffde299f1298c0
[   20.708743] R13: ffff959684a633e0 R14: ffff9596a0407b80 R15: 0000000000000001
[   20.709357] FS:  0000000000000000(0000) GS:ffff9596a0a40000(0000) knlGS:0000000000000000
[   20.710007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.710650] CR2: 00007ffa6875cca0 CR3: 0000000442c0a006 CR4: 00000000007606e0
[   20.711277] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   20.711925] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   20.712571] PKRU: 55555554
[   20.713211] Call Trace:
[   20.713847]  ? mlx5_query_hca_vport_pkey+0xc2/0x240 [mlx5_core]
[   20.714458]  ? __switch_to_asm+0x34/0x70
[   20.715056]  ? ib_cache_update.part.20+0x15f/0x270 [ib_core]
[   20.715637]  kfree+0x22e/0x250
[   20.716238]  ib_cache_update.part.20+0x15f/0x270 [ib_core]
[   20.716813]  ? __switch_to_asm+0x40/0x70
[   20.717369]  ? __switch_to_asm+0x40/0x70
[   20.717906]  ? __switch_to_asm+0x34/0x70
[   20.718428]  ib_cache_event_task+0x3c/0x70 [ib_core]
[   20.718941]  process_one_work+0x20f/0x3d0
[   20.719427]  worker_thread+0x34/0x400
[   20.719899]  kthread+0x120/0x140
[   20.720365]  ? process_one_work+0x3d0/0x3d0
[   20.720835]  ? kthread_park+0x90/0x90
[   20.721300]  ret_from_fork+0x1f/0x40
[   20.721745] Modules linked in: sctp(E) iptable_filter(E) bpfilter(E) 8021q(E) garp(E) mrp(E) softdog(E) bonding(E) nfnetlink_log(E) nfnetlink(E) ipmi_ssif(E) intel_rapl_msr(E) intel_rapl_common(E) isst_if_common(E) skx_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) ast(E) drm_vram_helper(E) crypto_simd(E) ttm(E) cryptd(E) glue_helper(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) mei_me(E) fb_sys_fops(E) intel_cstate(E) syscopyarea(E) sysfillrect(E) intel_rapl_perf(E) pcspkr(E) joydev(E) input_leds(E) sysimgblt(E) mei(E) ioatdma(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) acpi_power_meter(E) mac_hid(E) vhost_net(E) vhost(E) tap(E) ib_iser(E) rdma_cm(E) iw_cm(E) sunrpc(E) ib_cm(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) zunicode(POE) zlua(POE) zavl(POE) icp(POE) zcommon(POE)
[   20.721772]  znvpair(POE) spl(OE) btrfs(E) xor(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) mlx5_ib(E) uas(E) usb_storage(E) hid_generic(E) usbmouse(E) usbkbd(E) usbhid(E) hid(E) ib_uverbs(E) ib_core(E) mlx5_core(E) mpt3sas(E) ixgbe(E) raid_class(E) pci_hyperv_intf(E) xfrm_algo(E) scsi_transport_sas(E) xhci_pci(E) tls(E) dca(E) mlxfw(E) mdio(E) i2c_i801(E) lpc_ich(E) ahci(E) xhci_hcd(E) libahci(E) wmi(E)
[   20.728373] ---[ end trace d9dbd1fd21bff09b ]---

Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

This results in an increasing load over time (after ~10 minutes the load is at 15.00; after 20 minutes the machine is unusable).
Any ideas? Some machines work as expected and have successfully joined our existing cluster, while others show exactly this phenomenon and are unusable. We've been debugging for a week but cannot find a clue as to what causes the issue.
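
The extra load appears to come from kernel worker threads rather than from the guests (the trace above is from kworker/1:2). A rough sketch of how to watch this, nothing here is specific to our setup:

Code:
# watch the load average climb
watch -n 10 uptime

# list kernel threads stuck in uninterruptible sleep (state D)
ps -eo pid,state,comm | awk '$2 == "D"'

# check the kernel log for further ib_core / mlx5 messages
dmesg | grep -Ei 'ib_core|mlx5'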
 
Hi,
it looks like this problem comes from the Mellanox ConnectX-5.
Is the card running the current firmware?
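
If you are not sure, you can query it e.g. with ibstat or with mstflint from the mstflint package (a quick sketch; the PCI address 04:00.0 is only an example, replace it with the address of your ConnectX-5):

Code:
# firmware version as reported by the driver
ibstat | grep -i firmware

# find the PCI address of the card
lspci | grep -i mellanox

# query the card directly
mstflint -d 04:00.0 query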
Why do you use InfiniBand instead of Ethernet?
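
If you do not strictly need IB, you could also test whether the panic goes away with the ports in Ethernet mode. A sketch using mstconfig (also from the mstflint/MFT tools; 04:00.0 is again a placeholder):

Code:
# show the current configuration of the card
mstconfig -d 04:00.0 query

# set port 1 to Ethernet mode (LINK_TYPE_P1: 1 = IB, 2 = ETH)
mstconfig -d 04:00.0 set LINK_TYPE_P1=2

A reboot is required before the new link type takes effect.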