ZFS zfs_send_corrupt_data parameter not working

davidindra · Jan 3, 2018

Tried it too - it fails, sometimes even with kernel panic (saying something with VERIFY3()).

fabian · Jan 4, 2018

if you get a kernel panic, then please post the full stack it dumps..

davidindra · Jan 4, 2018

I ran this:

Code:

dd if=/dev/zvol/rpool/data/vm-101-disk-1 bs=4096 | pv | dd bs=4096 of=/dev/null

And got this:

Code:

 109GiB 1:24:06 [21.7MiB/s] [                                                                                                                                                                             <=>                                  ]
Message from syslogd@prox2 at Jan  4 10:05:09 ...
 kernel:[255179.799188] VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1023)

Message from syslogd@prox2 at Jan  4 10:05:09 ...
 kernel:[255179.799344] PANIC at arc.c:3076:arc_buf_destroy()
 
 Message from syslogd@prox2 at Jan  4 10:05:09 ...
 kernel:[255179.799344] PANIC at arc.c:3076:arc_buf_destroy()

^C28727296+0 records in
28727296+0 records out
117667004416 bytes (118 GB, 110 GiB) copied, 8413.37 s, 14.0 MB/s

In dmesg I found this (first line repeated a lot):

Code:

[252193.764462] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252194.765870] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252195.779429] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252196.857450] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252197.919854] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252199.019102] Buffer I/O error on dev zd32, logical block 28727886, async page read
[252200.136994] Buffer I/O error on dev zd32, logical block 28727886, async page read
[255179.799188] VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1023)
[255179.799344] PANIC at arc.c:3076:arc_buf_destroy()
[255179.799516] Showing stack for process 392
[255179.799517] CPU: 7 PID: 392 Comm: z_rd_int_1 Tainted: P           O    4.13.13-2-pve #1
[255179.799518] Hardware name: LENOVO ThinkServer RS140/ThinkServer RS140, BIOS FBKTA2CUS 09/25/2017
[255179.799518] Call Trace:
[255179.799524]  dump_stack+0x63/0x8b
[255179.799530]  spl_dumpstack+0x42/0x50 [spl]
[255179.799532]  spl_panic+0xc8/0x110 [spl]
[255179.799534]  ? spl_kmem_cache_alloc+0x72/0x8d0 [spl]
[255179.799536]  ? __slab_alloc+0x20/0x40
[255179.799537]  ? kmem_cache_alloc+0xfc/0x1a0
[255179.799539]  ? spl_kmem_cache_alloc+0x72/0x8d0 [spl]
[255179.799564]  ? buf_cons+0x6a/0x70 [zfs]
[255179.799566]  ? spl_kmem_cache_alloc+0x116/0x8d0 [spl]
[255179.799568]  ? wait_woken+0x80/0x80
[255179.799580]  arc_buf_destroy+0x123/0x140 [zfs]
[255179.799593]  dbuf_read_done+0x91/0xf0 [zfs]
[255179.799604]  arc_read_done+0x18f/0x310 [zfs]
[255179.799624]  zio_done+0x32a/0xe30 [zfs]
[255179.799626]  ? spl_kmem_free+0x33/0x40 [spl]
[255179.799646]  ? vdev_mirror_map_free+0x25/0x30 [zfs]
[255179.799666]  zio_execute+0x8a/0xe0 [zfs]
[255179.799668]  taskq_thread+0x25e/0x460 [spl]
[255179.799670]  ? wake_up_q+0x80/0x80
[255179.799671]  kthread+0x109/0x140
[255179.799673]  ? taskq_thread_should_stop+0x70/0x70 [spl]
[255179.799674]  ? kthread_create_on_node+0x70/0x70
[255179.799675]  ret_from_fork+0x25/0x30

And also these (multiple times):

Code:

[255317.380664] INFO: task kswapd0:82 blocked for more than 120 seconds.
[255317.380850]       Tainted: P           O    4.13.13-2-pve #1
[255317.381033] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[255317.381249] kswapd0         D    0    82      2 0x00000000
[255317.381251] Call Trace:
[255317.381258]  __schedule+0x3cc/0x860
[255317.381259]  schedule+0x36/0x80
[255317.381261]  io_schedule+0x16/0x40
[255317.381267]  cv_wait_common+0xb2/0x140 [spl]
[255317.381269]  ? wait_woken+0x80/0x80
[255317.381272]  __cv_wait_io+0x18/0x20 [spl]
[255317.381302]  zio_wait+0xfd/0x1b0 [zfs]
[255317.381325]  zil_commit.part.14+0x4d6/0x8a0 [zfs]
[255317.381347]  zil_commit+0x17/0x20 [zfs]
[255317.381368]  zvol_write+0x599/0x620 [zfs]
[255317.381370]  ? avl_add+0x6f/0x90 [zavl]
[255317.381392]  zvol_request+0x24a/0x300 [zfs]
[255317.381393]  ? SyS_madvise+0x970/0x970
[255317.381395]  generic_make_request+0x125/0x300
[255317.381396]  submit_bio+0x73/0x150
[255317.381397]  ? submit_bio+0x73/0x150
[255317.381398]  ? map_swap_page+0x12/0x20
[255317.381400]  __swap_writepage+0x2ed/0x340
[255317.381401]  ? __frontswap_store+0x6d/0xf0
[255317.381402]  swap_writepage+0x34/0x90
[255317.381403]  pageout.isra.51+0x189/0x2b0
[255317.381405]  shrink_page_list+0x9ca/0xb20
[255317.381406]  shrink_inactive_list+0x240/0x5c0
[255317.381407]  ? find_first_bit+0x40/0x50
[255317.381408]  shrink_node_memcg+0x365/0x780
[255317.381410]  shrink_node+0xe1/0x310
[255317.381410]  ? shrink_node+0xe1/0x310
[255317.381412]  kswapd+0x386/0x760
[255317.381413]  kthread+0x109/0x140
[255317.381414]  ? mem_cgroup_shrink_node+0x170/0x170
[255317.381415]  ? kthread_create_on_node+0x70/0x70
[255317.381416]  ret_from_fork+0x25/0x30

Any further info I can supply?

David

edit:
Seems to me that it corresponds to this line, but it's not exactly the same as logged in that panic.

fabian · Jan 4, 2018

how's the memory situation on your host? what are your ARC limits? is swap on ZFS? do you see swapping occuring?

davidindra · Jan 4, 2018

I have some weird ARC cache problems longer time. It swaps a bit and arcstat says it usually uses max. 1GB of RAM (shortly after boot a lot is used, but then it drops in one second to around 1GB) out of 32GB.
My /sys/module/zfs/parameters/zfs_arc_min and zfs_arc_max equals zero. Swap is on ZFS.

fabian · Jan 4, 2018

might be https://github.com/zfsonlinux/zfs/pull/6989

davidindra · Jan 4, 2018

That looks possible. So, how can I workaround that problem please?

fabian · Jan 4, 2018

see if disabling swap and setting explicit arc min and max limits of a couple GB helps, and/or wait for the patches in question to be backported to 0.7.x (they are slated for 0.7.6, which is not yet released).

davidindra · Jan 4, 2018

Disabling ARC cache compression isn't going to help?

davidindra · Jan 4, 2018

I have:

disabled swap
set arc_min to 4GB
set arc_max to 8GB
set zfs_compressed_arc_enabled to 0

And the system freezed completely (IO delay 90%). This was logged into dmesg:

Code:

[ 7629.092954] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7629.095030] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7630.104689] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7631.280027] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7632.299289] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7633.317392] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7634.322105] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7635.381550] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7636.970181] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7637.998322] Buffer I/O error on dev zd64, logical block 28727887, async page read
[ 7638.003647] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7639.011515] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7640.024676] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7641.029876] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7642.378082] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7643.504731] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7644.710597] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7645.712085] Buffer I/O error on dev zd64, logical block 28727886, async page read
[ 7646.765862] Buffer I/O error on dev zd64, logical block 28727886, async page read
[10257.745877] general protection fault: 0000 [#1] SMP
[10257.746040] Modules linked in: ip_set ip6table_filter ip6_tables iptable_filter softdog nfnetlink_log nfnetlink intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_pcm kvm_intel snd_timer kvm snd irqbypass soundcore crct10dif_p
clmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 i915 crypto_simd glue_helper cryptd input_leds drm_kms_helper ppdev wmi_bmof intel_cstate intel_rapl_perf drm pcspkr ie31200_edac i2c_algo_bit mei_me mei fb_sys_fops syscopy
area sysfillrect sysimgblt vhost_net lpc_ich vhost shpchp parport_pc tap mac_hid video parport wmi ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 hid_generic usbkbd
 usbhid hid zfs(PO) zunicode(PO) zavl(PO) icp(PO) uas usb_storage zcommon(PO) znvpair(PO)
[10257.747813]  spl(O) btrfs xor raid6_pq ahci i2c_i801 libahci e1000e(O) igb(O) dca ptp pps_core
[10257.748116] CPU: 1 PID: 388 Comm: z_rd_int_5 Tainted: P           O    4.13.13-2-pve #1
[10257.748449] Hardware name: LENOVO ThinkServer RS140/ThinkServer RS140, BIOS FBKTA2CUS 09/25/2017
[10257.748799] task: ffff99973a15dd00 task.stack: ffffb3cb4c338000
[10257.749133] RIP: 0010:sg_next+0x0/0x30
[10257.749498] RSP: 0018:ffffb3cb4c33bcf8 EFLAGS: 00010202
[10257.749840] RAX: 3d00020f00083c13 RBX: 0000000000000001 RCX: 0000000000000000
[10257.750195] RDX: 00000000087e0000 RSI: 0000000000000000 RDI: 3d00020f00083c13
[10257.750587] RBP: ffffb3cb4c33bd40 R08: 0000000000008000 R09: 000000009c254000
[10257.750952] R10: ffff99972c9dc000 R11: 0000000000000181 R12: 0000000000080a13
[10257.751349] R13: ffffb3cb52120000 R14: 0000000000080a13 R15: 0000000000000000
[10257.751729] FS:  0000000000000000(0000) GS:ffff99975ea40000(0000) knlGS:0000000000000000
[10257.752117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10257.752606] CR2: 0000000000808ff8 CR3: 0000000424c09000 CR4: 00000000001426e0
[10257.753031] Call Trace:
[10257.753495]  ? abd_free+0xbd/0x1e0 [zfs]
[10257.753919]  ? spl_kmem_alloc+0x9b/0x170 [spl]
[10257.754390]  zio_pop_transforms+0x83/0x90 [zfs]
[10257.754837]  l2arc_read_done+0x354/0x4e0 [zfs]
[10257.755315]  zio_done+0x32a/0xe30 [zfs]
[10257.755776]  ? zio_wait_for_children+0x89/0xa0 [zfs]
[10257.756484]  zio_execute+0x8a/0xe0 [zfs]
[10257.757109]  taskq_thread+0x25e/0x460 [spl]
[10257.757629]  ? wake_up_q+0x80/0x80
[10257.758170]  kthread+0x109/0x140
[10257.758664]  ? taskq_thread_should_stop+0x70/0x70 [spl]
[10257.759158]  ? kthread_create_on_node+0x70/0x70
[10257.759655]  ret_from_fork+0x25/0x30
[10257.760154] Code: 41 5c 41 5d 41 5e 5d c3 90 90 90 55 c7 47 10 00 00 00 00 89 57 0c 48 89 37 48 89 e5 89 4f 08 5d c3 66 2e 0f 1f 84 00 00 00 00 00 <f6> 07 02 55 48 89 e5 75 18 48 8b 57 20 48 8d 47 20 5d 48 89 d1
[10257.761270] RIP: sg_next+0x0/0x30 RSP: ffffb3cb4c33bcf8
[10257.761837] ---[ end trace bd1bd66653c3e80a ]---

fabian · Jan 4, 2018

are you sure your RAM / disks are not broken somehow? is zd64 the zvol you are attempting to dump? have you attempted a full scrub yet?

davidindra · Jan 4, 2018

I have ECC RAMs (non-ECC was reason of the initial corruption) now. Yes, it is. I've done scrub multiple times.

davidindra · Jan 29, 2018

Hello,
please, do you have any other ideas? That Proxmox instance we were talking about behaves very unstable (random freezes etc.), I would be thankful for any ideas that might solve that problem by successfully moving broken data away.
Thanks a lot
David

Nemesiz · Jan 29, 2018

It may be relate to hardware. Check the cables and so on.

davidindra · Jan 29, 2018

What exactly can be HW problem?

Nemesiz · Jan 29, 2018

Not so long time ago I and my cousin had some problem with his ZFS pool. Pool is 8 disk raidz2 configuration. At the beginning one HDD started to show checksum, r/w error on ZFS pool status. Scrub show no error but disk error count still occurs. Suddenly another HDD showed smart errors and we replaced it. Later we continued investigating the first HDD. Badblock scan with non-destructing test show no error. Smart - no problem. He had no replacement 8086 cable with him so he just mix up connections. For now ZFS pool works with no error and we think it may be related to cables. Who know will the problem occurs again.

Search

Search

ZFS zfs_send_corrupt_data parameter not working

davidindra

New Member

fabian

Proxmox Staff Member

davidindra

New Member

fabian

Proxmox Staff Member

davidindra

New Member

fabian

Proxmox Staff Member

davidindra

New Member

fabian

Proxmox Staff Member

davidindra

New Member

davidindra

New Member

fabian

Proxmox Staff Member

davidindra

New Member

davidindra

New Member

Nemesiz

Renowned Member

davidindra

New Member

Nemesiz

Renowned Member

We value your privacy