[SOLVED] [z_null_int] with 99.99% IO load after 5.1 upgrade

@fabian I have applied the patch to one server in the pool, and modinfo seems to confirm it:

Code:
filename:       /lib/modules/4.13.13-5-pve/zfs/zfs.ko
version:        0.7.4-1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     E8EDB9B5FFA260178BA7DC9
depends:        spl,znvpair,zcommon,zunicode,zavl,icp
name:           zfs
vermagic:       4.13.13-5-pve SMP mod_unload modversions

According to iotop, I/O is still at 99.99% -- any hints?

Did you reboot after installing the kernel? (Alternatively, you could unload and reload the ZFS modules, but that requires exporting all ZFS pools first ;))
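For reference, the module route would look roughly like this (a sketch assuming a single pool named rpool whose datasets are not in use):
Code:
# all pools must be exported before the module can be unloaded
zpool export rpool
# unload zfs (modprobe -r also drops its now-unused dependencies, e.g. spl)
modprobe -r zfs
# load it again and re-import the pool
modprobe zfs
zpool import rpool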
 
Yes, I did reboot.

perf top shows:
Code:
Samples: 15K of event 'cycles:ppp', Event count (approx.): 3405002590
Overhead  Shared Object             Symbol
   3.32%  perl                      [.] Perl_yyparse
   2.24%  [kernel]                  [k] copy_page
   2.16%  [kernel]                  [k] native_queued_spin_lock_slowpath
   1.73%  perl                      [.] Perl_yylex
   1.42%  [kernel]                  [k] multilist_sublist_next
   1.40%  [kernel]                  [k] lz4_compress_zfs
   1.39%  [kernel]                  [k] vmx_vcpu_run
   1.03%  perl                      [.] Perl_hv_common
   0.98%  [kernel]                  [k] l2arc_feed_thread
   0.97%  [kernel]                  [k] _raw_spin_lock_irqsave
   0.94%  [kernel]                  [k] clear_page_erms
   0.92%  [kernel]                  [k] _raw_spin_lock
   0.85%  [kernel]                  [k] __schedule
   0.77%  perl                      [.] Perl_Slab_Alloc
   0.74%  perl                      [.] Perl_rpeep
   0.71%  [kernel]                  [k] menu_select
   0.64%  [kernel]                  [k] update_load_avg
   0.62%  [kernel]                  [k] osq_lock
   0.59%  libqb.so.0.18.1           [.] qb_array_index
   0.54%  perl                      [.] Perl_leave_scope
   0.52%  perl                      [.] Perl_gv_fetchpvn_flags
   0.51%  [kernel]                  [k] syscall_return_via_sysret
   0.51%  [kernel]                  [k] kvm_arch_vcpu_ioctl_run
   0.49%  [kernel]                  [k] unmap_page_range
   0.48%  perl                      [.] Perl_sv_clear
   0.47%  [kernel]                  [k] cpuidle_enter_state
   0.45%  [kernel]                  [k] switch_mm_irqs_off
   0.45%  perl                      [.] Perl_sv_gets
   0.45%  libc-2.24.so              [.] malloc
   0.42%  [kernel]                  [k] native_irq_return_iret
   0.41%  [kernel]                  [k] mutex_lock
   0.41%  [kernel]                  [k] kallsyms_expand_symbol.constprop.0
   0.41%  libc-2.24.so              [.] __libc_calloc
   0.39%  perl                      [.] Perl_runops_standard
   0.38%  perl                      [.] Perl_sv_upgrade
   0.37%  [kernel]                  [k] ktime_get
   0.37%  [kernel]                  [k] memcpy_erms
   0.37%  [kernel]                  [k] __fget
   0.37%  [kernel]                  [k] native_write_msr
   0.35%  [unknown]                 [k] 0xfffffe000055a01b
   0.35%  [kernel]                  [k] format_decode
   0.35%  [kernel]                  [k] _find_next_bit
   0.34%  libqb.so.0.18.1           [.] qb_poll_fds_usage_check_
   0.33%  [kernel]                  [k] number
   0.32%  [kernel]                  [k] module_get_kallsym
 
Thanks, but you have to run perf top while your I/O load is actually high. The output shows exactly where the CPU cycles are spent, and from there one can reason about what is happening in the kernel.
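perf top samples system-wide as root by default; if you also want call stacks in the live view, something like this should work (the flag choice is just a suggestion):
Code:
# sample with call-graph collection; run this while the load is high
perf top -g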

Here is an example showing that most of the time is spent in deflate_slow and longest_match, compressing the data stream:

(attached: perf.png -- example perf top output dominated by deflate_slow and longest_match)
 
You could also try adding a kprobe and collecting perf data including call stacks.

Set up the kprobe:
Code:
perf probe -m zfs -a zio_null

record one minute of perf data while you are experiencing the issue:
Code:
perf record -ag -e probe:zio_null -F 99 -- sleep 60

convert the perf traces to readable text ;)
Code:
perf script > out.perf

and attach that file here.
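When you are done, the kprobe can be removed again, e.g.:
Code:
# delete the dynamic probe added above
perf probe -d zio_null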
 
The file is too large for the forum; I uploaded it to my server: http://apolloner.eu/~apollo13/out.perf

So that does make it look like it is actually just waiting for completion of (zfs recv) I/O, which is probably just attributed to zio_null by iotop. You might want to attach it to the upstream GitHub issue nevertheless.
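You can watch that attribution happen with iotop in batch mode -- the options below are just one reasonable choice:
Code:
# batch mode, one-second interval, ten samples; the IO> column is the share of
# time spent waiting for I/O, which is what shows up against z_null_int
iotop -b -d 1 -n 10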
 
I've attached the perf output to the issue. Any idea why it is waiting that much? The server is literally doing nothing (that is, the running VMs aren't doing much I/O -- certainly not 99%).
 
@fabian Ok, I've shut down all VMs and gathered a new perf: http://apolloner.eu/~apollo13/out2.perf -- I still see z_null_int at 99% every 3-5 seconds, but it is much reduced. Do you see anything obvious there? I'll try to start the VMs one by one to see if one of them causes it.

The situation is still the same: iotop reports z_null_int because there is stuff waiting for I/O to complete. In this case, we have:
  • pmxcfs (periodically writes because of LRM watchdog)
  • ip6tables (firewall checks periodically whether changes need to be applied)
  • ls
  • perf
  • pickup (postfix)
  • pvedaemon and pveproxy workers (open GUI?)
  • sshd
  • txg_sync (ZFS syncing regular writes)
  • zfs / zpool (possibly pvestatd scanning the configured pools, or other requests over the API triggering such scans?)
This seems par for the course for an otherwise idle system and is nothing to worry about in general.
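If you want to see the periodic syncing behind txg_sync for yourself, the kstats expose a transaction group history (a sketch assuming rpool as the pool name; the history depth may need to be enabled first):
Code:
# keep stats for the last 100 transaction groups (0 disables the history)
echo 100 > /sys/module/zfs/parameters/zfs_txg_history
# one line per txg: open/quiesce/sync times and the amount of data written
cat /proc/spl/kstat/zfs/rpool/txgs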
 
@fabian: Got a new perf output while downloading a file (7 GB) in a VM over the internet at maybe 2 MB/s -- so not that much. The VM hangs completely: http://apolloner.eu/~apollo13/proxmox_zfs/out_wtf.perf

That mainly shows waiting for sync writes to finish (issued directly by the VM, or indirectly because it needs to swap stuff out). How is your pool set up? I assume you have swap on a ZVOL -- how is that configured? Could you re-run the test, but collect global perf data (e.g., "perf record -ag -F 99 -- sleep 60" for a one-minute sample)?
 
Yes, the system was set up with the default Proxmox installer: ZFS RAID 1 with three disks. Swap according to fstab (no special config, whatever the installer did):
Code:
/dev/zvol/rpool/swap none swap sw 0 0
The VM is also in the same pool.

I'll try to capture the data this evening
 
Could you also get "zfs get all rpool/swap"? It might also make sense to repeat the test a third time and collect some I/O data with "zpool iostat" (it has switches for request-size and latency histograms; I'd use those for a more in-depth view) and "iostat -dmx /dev/sd?" for data on the disks themselves.
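Something along these lines, assuming the pool is called rpool (the 5-second interval is just a suggestion):
Code:
zfs get all rpool/swap
zpool iostat -r rpool 5    # request size histograms every 5 seconds
zpool iostat -w rpool 5    # latency histograms every 5 seconds
iostat -dmx /dev/sd? 5     # per-disk throughput, utilization and await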
 
I'm experiencing this issue as well. It appears that ZFS 0.7.6 corrects it for at least one person. Using the pvetest repo and the associated backported ZFS patches didn't seem to do the trick.

@fabian do you know when/if we can expect to see zfs 0.7.6 included in pvetest?
 
Likely today or tomorrow (it's already in internal testing).
 
