Hi,
One of my ZFS nodes crashed:
Code:
[DATE59:15 2019] ------------[ cut here ]------------
[DATE59:15 2019] kernel BUG at mm/slub.c:296!
[DATE59:15 2019] invalid opcode: 0000 [#1] SMP PTI
[DATE59:15 2019] Modules linked in: nf_conntrack_proto_gre tcp_diag inet_diag 8021q garp mrp act_police cls_basic sch_ingress sch_htb veth ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_set xt_mark xt_addrtype xt_multiport xt_conntrack nf_conntrack ip_set_hash_net ip_set iptable_filter softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass mgag200 ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_pcm drm_kms_helper snd_timer snd aes_x86_64 drm crypto_simd glue_helper cryptd soundcore i2c_algo_bit fb_sys_fops intel_cstate syscopyarea sysfillrect
[DATE59:15 2019] intel_rapl_perf pcspkr input_leds joydev sysimgblt shpchp mei_me ipmi_si mei ipmi_devintf lpc_ich ipmi_msghandler ioatdma wmi mac_hid acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx4_en raid1 hid_generic usbmouse usbkbd usbhid hid ses enclosure mlx4_core devlink i2c_i801 ahci libahci igb(O) dca mpt3sas ptp pps_core raid_class scsi_transport_sas
[DATE59:15 2019] CPU: 28 PID: 1354 Comm: txg_sync Tainted: P O 4.15.18-11-pve #1
[DATE59:15 2019] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.3 08/23/2018
[DATE59:15 2019] RIP: 0010:__slab_free+0x1a2/0x330
[DATE59:15 2019] RSP: 0018:ffffaccb0c2b3840 EFLAGS: 00010246
[DATE59:15 2019] RAX: ffff8988e52d6540 RBX: ffff8988e52d6540 RCX: 00000001002a0014
[DATE59:15 2019] RDX: ffff8988e52d6540 RSI: fffffbe83a94b580 RDI: ffff898a7f407600
[DATE59:15 2019] RBP: ffffaccb0c2b38e0 R08: 0000000000000001 R09: ffffffffc063da5c
[DATE59:15 2019] R10: ffffaccb0c2b38f8 R11: 0000000000000000 R12: ffff8988e52d6540
[DATE59:15 2019] R13: ffff898a7f407600 R14: fffffbe83a94b580 R15: ffff8988e52d6540
[DATE59:15 2019] FS: 0000000000000000(0000) GS:ffff898a7fc80000(0000) knlGS:0000000000000000
[DATE59:15 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[DATE59:15 2019] CR2: 0000000077775000 CR3: 0000000ebe80a006 CR4: 00000000001626e0
[DATE59:15 2019] Call Trace:
[DATE59:15 2019] ? _cond_resched+0x1a/0x50
[DATE59:15 2019] ? _cond_resched+0x1a/0x50
[DATE59:15 2019] kmem_cache_free+0x1af/0x1e0
[DATE59:15 2019] ? kmem_cache_free+0x1af/0x1e0
[DATE59:15 2019] spl_kmem_cache_free+0x13c/0x1c0 [spl]
[DATE59:15 2019] arc_hdr_destroy+0xa7/0x1b0 [zfs]
[DATE59:15 2019] arc_freed+0x69/0xc0 [zfs]
[DATE59:15 2019] zio_free_sync+0x41/0x100 [zfs]
[DATE59:15 2019] dsl_scan_free_block_cb+0x154/0x270 [zfs]
[DATE59:15 2019] bpobj_iterate_impl+0x194/0x780 [zfs]
[DATE59:15 2019] ? dbuf_read+0x718/0x930 [zfs]
[DATE59:15 2019] ? dsl_scan_zil_block+0x100/0x100 [zfs]
[DATE59:15 2019] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[DATE59:15 2019] ? dmu_bonus_hold+0xc6/0x1b0 [zfs]
[DATE59:15 2019] ? bpobj_open+0xa0/0x100 [zfs]
[DATE59:15 2019] ? _cond_resched+0x1a/0x50
[DATE59:15 2019] ? mutex_lock+0x12/0x40
[DATE59:15 2019] bpobj_iterate_impl+0x38e/0x780 [zfs]
[DATE59:15 2019] ? dsl_scan_zil_block+0x100/0x100 [zfs]
[DATE59:15 2019] bpobj_iterate+0x14/0x20 [zfs]
[DATE59:15 2019] dsl_scan_sync+0x562/0xbb0 [zfs]
[DATE59:15 2019] ? zio_destroy+0xbc/0xc0 [zfs]
[DATE59:15 2019] spa_sync+0x48d/0xd50 [zfs]
[DATE59:15 2019] txg_sync_thread+0x2d4/0x4a0 [zfs]
[DATE59:15 2019] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[DATE59:15 2019] thread_generic_wrapper+0x74/0x90 [spl]
[DATE59:15 2019] kthread+0x105/0x140
[DATE59:15 2019] ? __thread_exit+0x20/0x20 [spl]
[DATE59:15 2019] ? kthread_create_worker_on_cpu+0x70/0x70
[DATE59:15 2019] ? kthread_create_worker_on_cpu+0x70/0x70
[DATE59:15 2019] ret_from_fork+0x35/0x40
[DATE59:15 2019] Code: ff ff ff 75 5d 48 83 bd 68 ff ff ff 00 0f 84 b9 fe ff ff 48 8b b5 60 ff ff ff 48 8b bd 68 ff ff ff e8 13 54 78 00 e9 a1 fe ff ff <0f> 0b 80 4d ab 80 4c 8b 45 88 31 d2 4c 8b 4d a8 4c 89 f6 e8 66
[DATE59:15 2019] RIP: __slab_free+0x1a2/0x330 RSP: ffffaccb0c2b3840
[DATE59:15 2019] ---[ end trace 76e5b73db47c32ae ]---
It had to be hard rebooted.
After it came back, the symlinks for some of the VMs in /dev/zvol/rpool/data/ were missing, so those VMs could not be started, even though zfs list showed all the zvols:
Code:
file=/dev/zvol/rpool/data/vm-101-disk-0,if=none,id=drive-scsi0,format=raw,discard=on Could not open '/dev/zvol/rpool/data/vm-101-disk-0': No such file or directory
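In case it helps anyone hitting the same symptom: on ZFS on Linux the /dev/zvol symlinks are created by udev, so when they go missing while zfs list still shows the zvols, re-triggering the block device events may recreate them without a reboot. A minimal sketch of what I would try next time (untested on this exact failure):
Code:
# Device links still missing? zvols still present?
ls -l /dev/zvol/rpool/data/
zfs list -t volume

# Replay the "add" events for block devices so udev
# recreates the /dev/zvol/... symlinks
udevadm trigger --subsystem-match=block --action=add
udevadm settle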
After some time, while I was busy moving VMs off the host (and some back for testing), the symlinks came back and I could once again start every VM.
I also noticed this during bootup:
Code:
[DATE3:35 2019] ZFS: Loaded module v0.7.12-1, ZFS pool version 5000, ZFS filesystem version 5
[DATE3:35 2019] mlx4_en: enp131s0: Link Up
[DATE7:31 2019] INFO: task l2arc_feed:696 blocked for more than 120 seconds.
[DATE7:31 2019] Tainted: P O 4.15.18-11-pve #1
[DATE7:31 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[DATE7:31 2019] l2arc_feed D 0 696 2 0x80000000
[DATE7:31 2019] Call Trace:
[DATE7:31 2019] __schedule+0x3e0/0x870
[DATE7:31 2019] schedule+0x36/0x80
[DATE7:31 2019] schedule_preempt_disabled+0xe/0x10
[DATE7:31 2019] __mutex_lock.isra.2+0x2b1/0x4e0
[DATE7:31 2019] ? __cv_timedwait_common+0xf3/0x170 [spl]
[DATE7:31 2019] __mutex_lock_slowpath+0x13/0x20
[DATE7:31 2019] ? __mutex_lock_slowpath+0x13/0x20
[DATE7:31 2019] mutex_lock+0x2f/0x40
[DATE7:31 2019] l2arc_feed_thread+0x1a5/0xc00 [zfs]
[DATE7:31 2019] ? __switch_to_asm+0x40/0x70
[DATE7:31 2019] ? __switch_to+0xb2/0x4f0
[DATE7:31 2019] ? __switch_to_asm+0x40/0x70
[DATE7:31 2019] ? spl_kmem_free+0x33/0x40 [spl]
[DATE7:31 2019] ? kfree+0x165/0x180
[DATE7:31 2019] ? kfree+0x165/0x180
[DATE7:31 2019] ? l2arc_evict+0x340/0x340 [zfs]
[DATE7:31 2019] thread_generic_wrapper+0x74/0x90 [spl]
[DATE7:31 2019] kthread+0x105/0x140
[DATE7:31 2019] ? __thread_exit+0x20/0x20 [spl]
[DATE7:31 2019] ? kthread_create_worker_on_cpu+0x70/0x70
[DATE7:31 2019] ret_from_fork+0x35/0x40
[DATE8:41 2019] zd0: p1 p2 p3
[DATE8:41 2019] zd16: p1
[DATE8:41 2019] zd48: p1
p1: <bsd: p5 p6 >
[DATE8:41 2019] zd64: p1
[DATE8:42 2019] zd96: p1 p2
[DATE8:42 2019] zd112: p1
[DATE8:42 2019] zd128: p1 p2
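Regarding the l2arc_feed hang in that boot log: the L2ARC counters are visible in the ARC kstats on ZFS on Linux, so its state can at least be inspected read-only, e.g.:
Code:
# L2ARC counters (l2_hits, l2_size, l2_io_error, ...)
grep ^l2_ /proc/spl/kstat/zfs/arcstats

# Cache devices appear under the "cache" heading
zpool status rpool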
I think there might be a bug in ZFS's memory management.
How do you guys read this kernel panic?
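My own reading, for what it's worth: the trace runs through dsl_scan_sync -> bpobj_iterate -> dsl_scan_free_block_cb -> arc_freed, so the txg_sync thread seems to have been freeing blocks (scrub or async destroy) when the slab free hit the BUG in mm/slub.c. After the reboot I would check whether such work was in flight, roughly like this (standard commands, nothing host-specific):
Code:
# Was a scrub/resilver running?
zpool status rpool

# Bytes still queued for async destroy (non-zero = deletes in flight)
zpool get freeing rpool

# Exact module versions for a bug report upstream
cat /sys/module/zfs/version /sys/module/spl/version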
Obviously I will update the host to the latest Proxmox VE shortly.
I will also disable the L2ARC on the SSDs for this ZFS RAID 10 HDD pool.
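For completeness, dropping the L2ARC just means removing the cache vdevs, which is an online operation; something like this, where the device name is a placeholder for my SSD cache partition:
Code:
# Identify the cache devices under the "cache" heading
zpool status rpool

# Remove a cache vdev (placeholder name, adjust to the actual device)
zpool remove rpool sdX2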