kernel BUG at mm/slub.c:296!

Hi,

one of my ZFS nodes crashed:

Code:
[DATE59:15 2019] ------------[ cut here ]------------
[DATE59:15 2019] kernel BUG at mm/slub.c:296!
[DATE59:15 2019] invalid opcode: 0000 [#1] SMP PTI
[DATE59:15 2019] Modules linked in: nf_conntrack_proto_gre tcp_diag inet_diag 8021q garp mrp act_police cls_basic sch_ingress sch_htb veth ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_set xt_mark xt_addrtype xt_multiport xt_conntrack nf_conntrack ip_set_hash_net ip_set iptable_filter softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass mgag200 ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_pcm drm_kms_helper snd_timer snd aes_x86_64 drm crypto_simd glue_helper cryptd soundcore i2c_algo_bit fb_sys_fops intel_cstate syscopyarea sysfillrect
[DATE59:15 2019]  intel_rapl_perf pcspkr input_leds joydev sysimgblt shpchp mei_me ipmi_si mei ipmi_devintf lpc_ich ipmi_msghandler ioatdma wmi mac_hid acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx4_en raid1 hid_generic usbmouse usbkbd usbhid hid ses enclosure mlx4_core devlink i2c_i801 ahci libahci igb(O) dca mpt3sas ptp pps_core raid_class scsi_transport_sas
[DATE59:15 2019] CPU: 28 PID: 1354 Comm: txg_sync Tainted: P           O     4.15.18-11-pve #1
[DATE59:15 2019] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.3 08/23/2018
[DATE59:15 2019] RIP: 0010:__slab_free+0x1a2/0x330
[DATE59:15 2019] RSP: 0018:ffffaccb0c2b3840 EFLAGS: 00010246
[DATE59:15 2019] RAX: ffff8988e52d6540 RBX: ffff8988e52d6540 RCX: 00000001002a0014
[DATE59:15 2019] RDX: ffff8988e52d6540 RSI: fffffbe83a94b580 RDI: ffff898a7f407600
[DATE59:15 2019] RBP: ffffaccb0c2b38e0 R08: 0000000000000001 R09: ffffffffc063da5c
[DATE59:15 2019] R10: ffffaccb0c2b38f8 R11: 0000000000000000 R12: ffff8988e52d6540
[DATE59:15 2019] R13: ffff898a7f407600 R14: fffffbe83a94b580 R15: ffff8988e52d6540
[DATE59:15 2019] FS:  0000000000000000(0000) GS:ffff898a7fc80000(0000) knlGS:0000000000000000
[DATE59:15 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[DATE59:15 2019] CR2: 0000000077775000 CR3: 0000000ebe80a006 CR4: 00000000001626e0
[DATE59:15 2019] Call Trace:
[DATE59:15 2019]  ? _cond_resched+0x1a/0x50
[DATE59:15 2019]  ? _cond_resched+0x1a/0x50
[DATE59:15 2019]  kmem_cache_free+0x1af/0x1e0
[DATE59:15 2019]  ? kmem_cache_free+0x1af/0x1e0
[DATE59:15 2019]  spl_kmem_cache_free+0x13c/0x1c0 [spl]
[DATE59:15 2019]  arc_hdr_destroy+0xa7/0x1b0 [zfs]
[DATE59:15 2019]  arc_freed+0x69/0xc0 [zfs]
[DATE59:15 2019]  zio_free_sync+0x41/0x100 [zfs]
[DATE59:15 2019]  dsl_scan_free_block_cb+0x154/0x270 [zfs]
[DATE59:15 2019]  bpobj_iterate_impl+0x194/0x780 [zfs]
[DATE59:15 2019]  ? dbuf_read+0x718/0x930 [zfs]
[DATE59:15 2019]  ? dsl_scan_zil_block+0x100/0x100 [zfs]
[DATE59:15 2019]  ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[DATE59:15 2019]  ? dmu_bonus_hold+0xc6/0x1b0 [zfs]
[DATE59:15 2019]  ? bpobj_open+0xa0/0x100 [zfs]
[DATE59:15 2019]  ? _cond_resched+0x1a/0x50
[DATE59:15 2019]  ? mutex_lock+0x12/0x40
[DATE59:15 2019]  bpobj_iterate_impl+0x38e/0x780 [zfs]
[DATE59:15 2019]  ? dsl_scan_zil_block+0x100/0x100 [zfs]
[DATE59:15 2019]  bpobj_iterate+0x14/0x20 [zfs]
[DATE59:15 2019]  dsl_scan_sync+0x562/0xbb0 [zfs]
[DATE59:15 2019]  ? zio_destroy+0xbc/0xc0 [zfs]
[DATE59:15 2019]  spa_sync+0x48d/0xd50 [zfs]
[DATE59:15 2019]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[DATE59:15 2019]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[DATE59:15 2019]  thread_generic_wrapper+0x74/0x90 [spl]
[DATE59:15 2019]  kthread+0x105/0x140
[DATE59:15 2019]  ? __thread_exit+0x20/0x20 [spl]
[DATE59:15 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[DATE59:15 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[DATE59:15 2019]  ret_from_fork+0x35/0x40
[DATE59:15 2019] Code: ff ff ff 75 5d 48 83 bd 68 ff ff ff 00 0f 84 b9 fe ff ff 48 8b b5 60 ff ff ff 48 8b bd 68 ff ff ff e8 13 54 78 00 e9 a1 fe ff ff <0f> 0b 80 4d ab 80 4c 8b 45 88 31 d2 4c 8b 4d a8 4c 89 f6 e8 66
[DATE59:15 2019] RIP: __slab_free+0x1a2/0x330 RSP: ffffaccb0c2b3840
[DATE59:15 2019] ---[ end trace 76e5b73db47c32ae ]---

The node had to be hard rebooted.
After it came back, the symlinks to some of the VM disks in /dev/zvol/rpool/data/ were missing.
So those VMs could not be started, even though zfs list showed all the zvols.
Code:
file=/dev/zvol/rpool/data/vm-101-disk-0,if=none,id=drive-scsi0,format=raw,discard=on Could not open '/dev/zvol/rpool/data/vm-101-disk-0': No such file or directory

After some time, while I was busy moving VMs off the host (and some back for testing), the symlinks came back and I could once again start every VM.
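For anyone who hits the same missing-symlink problem: a quick sketch of how the device links can be compared against what ZFS reports, and how udev can be asked to recreate them. The udevadm calls are only the generic way to re-run the block device rules, not something I have verified fixes this particular case:

Code:
# list every zvol the pool knows about
zfs list -t volume -o name,volsize

# compare with the device links udev created under /dev/zvol
ls -l /dev/zvol/rpool/data/

# ask udev to re-run its rules for block devices and wait until it is done
udevadm trigger --subsystem-match=block
udevadm settle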

I also noticed this during bootup:
Code:
[DATE3:35 2019] ZFS: Loaded module v0.7.12-1, ZFS pool version 5000, ZFS filesystem version 5
[DATE3:35 2019] mlx4_en: enp131s0: Link Up
[DATE7:31 2019] INFO: task l2arc_feed:696 blocked for more than 120 seconds.
[DATE7:31 2019]       Tainted: P           O     4.15.18-11-pve #1
[DATE7:31 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[DATE7:31 2019] l2arc_feed      D    0   696      2 0x80000000
[DATE7:31 2019] Call Trace:
[DATE7:31 2019]  __schedule+0x3e0/0x870
[DATE7:31 2019]  schedule+0x36/0x80
[DATE7:31 2019]  schedule_preempt_disabled+0xe/0x10
[DATE7:31 2019]  __mutex_lock.isra.2+0x2b1/0x4e0
[DATE7:31 2019]  ? __cv_timedwait_common+0xf3/0x170 [spl]
[DATE7:31 2019]  __mutex_lock_slowpath+0x13/0x20
[DATE7:31 2019]  ? __mutex_lock_slowpath+0x13/0x20
[DATE7:31 2019]  mutex_lock+0x2f/0x40
[DATE7:31 2019]  l2arc_feed_thread+0x1a5/0xc00 [zfs]
[DATE7:31 2019]  ? __switch_to_asm+0x40/0x70
[DATE7:31 2019]  ? __switch_to+0xb2/0x4f0
[DATE7:31 2019]  ? __switch_to_asm+0x40/0x70
[DATE7:31 2019]  ? spl_kmem_free+0x33/0x40 [spl]
[DATE7:31 2019]  ? kfree+0x165/0x180
[DATE7:31 2019]  ? kfree+0x165/0x180
[DATE7:31 2019]  ? l2arc_evict+0x340/0x340 [zfs]
[DATE7:31 2019]  thread_generic_wrapper+0x74/0x90 [spl]
[DATE7:31 2019]  kthread+0x105/0x140
[DATE7:31 2019]  ? __thread_exit+0x20/0x20 [spl]
[DATE7:31 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[DATE7:31 2019]  ret_from_fork+0x35/0x40
[DATE8:41 2019]  zd0: p1 p2 p3
[DATE8:41 2019]  zd16: p1
[DATE8:41 2019]  zd48: p1
                            p1: <bsd: p5 p6 >
[DATE8:41 2019]  zd64: p1
[DATE8:42 2019]  zd96: p1 p2
[DATE8:42 2019]  zd112: p1
[DATE8:42 2019]  zd128: p1 p2


I think there might be a bug with ZFS and memory management.
How do you guys read this kernel panic?

Obviously I will update the host to the latest Proxmox VE shortly.
I will also disable the L2ARC on the SSDs for this ZFS RAID 10 HDD pool, roughly as sketched below.
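Something like this is what I have in mind; the cache device name is just a placeholder, the real one comes from zpool status:

Code:
# show which SSD is attached as the cache (L2ARC) vdev
zpool status rpool

# remove the cache device from the pool (placeholder name, use the one shown by zpool status)
zpool remove rpool <cache-device>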
 
I have since updated to the latest kernel from the no-subscription repo and removed the L2ARC.
Still waiting for someone to decode that call trace.
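For completeness, the state after the update can be checked like this (standard Proxmox tooling, nothing specific to my box):

Code:
# running kernel and package versions
uname -r
pveversion -v

# ZFS module version and pool layout (the cache vdev should be gone now)
cat /sys/module/zfs/version
zpool status rpool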
 
