ZFS, extremely slow writes, soft lockups

Hi,

I have a server with 4.4-22 installed (I updated it because I read somewhere that it should help, but it doesn't) and I sometimes get extremely slow writes in the guests. Sometimes everything works, but most of the time when a larger file (50 MB+) is written, the guest goes into a soft lockup.

When I run this in the guest:
[root@ada ~]# dd if=/dev/zero of=/root/testfile bs=500M count=1 oflag=direct
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 1.60407 s, 327 MB/s
[root@ada ~]# rm /root/testfile
rm: remove regular file '/root/testfile'? y
[root@ada ~]# dd if=/dev/zero of=/root/testfile bs=500M count=1 oflag=direct

Then the SSH connection drops, and in the guest's log I see:
Feb 7 14:28:41 ada kernel: CPU: 11 PID: 0 Comm: swapper/11 ve: 0 Tainted: G OEL ------------ 3.10.0-427.18.2.lve1.4.27.el7.x86_64 #1 15.2
Feb 7 14:28:41 ada kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Feb 7 14:28:41 ada kernel: task: ffff88042924b2c0 ti: ffff880429260000 task.ti: ffff880429260000
Feb 7 14:28:41 ada kernel: RIP: 0010:[<ffffffff81058ea6>] [<ffffffff81058ea6>] native_safe_halt+0x6/0x10
Feb 7 14:28:41 ada kernel: RSP: 0018:ffff880429263e98 EFLAGS: 00000286
Feb 7 14:28:41 ada kernel: RAX: 00000000ffffffed RBX: ffff880429260000 RCX: 0100000000000000
Feb 7 14:28:41 ada kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 7 14:28:41 ada kernel: RBP: ffff880429263e98 R08: 0000000000000000 R09: 0000000000000000
Feb 7 14:28:41 ada kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000b
Feb 7 14:28:41 ada kernel: R13: ffff880429260000 R14: ffff880429260000 R15: 0000000000000000
Feb 7 14:28:41 ada kernel: FS: 0000000000000000(0000) GS:ffff88042f4c0000(0000) knlGS:0000000000000000
Feb 7 14:28:41 ada kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 7 14:28:41 ada kernel: CR2: 00007fe33f2ba000 CR3: 00000004275ed000 CR4: 00000000000006e0
Feb 7 14:28:41 ada kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 7 14:28:41 ada kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 7 14:28:41 ada kernel: Stack:
Feb 7 14:28:41 ada kernel: ffff880429263eb8 ffffffff8101e12f ffff880429260000 ffffffff81a6a600
...

I have ZFS as the underlying storage, with one SSD providing the log and cache devices:
zpool status
  pool: zfs-pool
 state: ONLINE
  scan: scrub in progress since Wed Feb 7 10:22:17 2018
        197G scanned out of 896G at 13.3M/s, 14h53m to go
        0 repaired, 22.04% done
config:

        NAME        STATE     READ WRITE CKSUM
        zfs-pool    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
        logs
          sda4      ONLINE       0     0     0
        cache
          sda5      ONLINE       0     0     0

errors: No known data errors

I have run a scrub, but so far it hasn't found any issues with the physical hard drives. I have also checked them with smartctl and nothing unusual is reported.
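For reference, those checks amount to roughly the following (pool and device names as in the zpool status output above):

Code:
zpool scrub zfs-pool        # start a scrub
zpool status -v zfs-pool    # watch progress and per-device error counters
smartctl -a /dev/sdb        # SMART attributes for one raidz2 member; repeat for sdc..sdh
smartctl -a /dev/sda        # the SSD holding the log and cache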

Sometimes there is nothing in the host's kern.log when this happens in the guest, but in some cases I have found entries in the host's logs at the very same time the guest gets stuck:
Feb 6 20:10:07 prog7 kernel: [17761.712491] INFO: task kvm:3374 blocked for more than 120 seconds.
Feb 6 20:10:07 prog7 kernel: [17761.714306] Tainted: P IO 4.4.98-5-pve #1
Feb 6 20:10:07 prog7 kernel: [17761.716228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 20:10:07 prog7 kernel: [17761.718219] kvm D ffff8813afa1f680 0 3374 1 0x00000000
Feb 6 20:10:07 prog7 kernel: [17761.718223] ffff8813afa1f680 0000000000000000 ffff8813f16f6600 ffff8813ecad6600
Feb 6 20:10:07 prog7 kernel: [17761.718226] ffff8813afa20000 ffff8813f1326a10 ffff8813f1326a38 ffff8813f1326b58
Feb 6 20:10:07 prog7 kernel: [17761.718229] 0000000000000000 ffff8813afa1f698 ffffffff818651b5 ffff8813f1326b50
Feb 6 20:10:07 prog7 kernel: [17761.718232] Call Trace:
Feb 6 20:10:07 prog7 kernel: [17761.718236] [<ffffffff818651b5>] schedule+0x35/0x80
Feb 6 20:10:07 prog7 kernel: [17761.718245] [<ffffffffc0414cf4>] cv_wait_common+0xf4/0x130 [spl]
Feb 6 20:10:07 prog7 kernel: [17761.718248] [<ffffffff810c67b0>] ? wait_woken+0x90/0x90
Feb 6 20:10:07 prog7 kernel: [17761.718257] [<ffffffffc0414d45>] __cv_wait+0x15/0x20 [spl]
Feb 6 20:10:07 prog7 kernel: [17761.718303] [<ffffffffc0acadfa>] txg_wait_open+0xaa/0xf0 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718337] [<ffffffffc0a88990>] dmu_tx_wait+0x330/0x340 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718376] [<ffffffffc0a88a27>] dmu_tx_assign+0x87/0x490 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718424] [<ffffffffc0b1eac6>] zvol_request+0x2e6/0x670 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718428] [<ffffffff813cd913>] ? generic_make_request_checks+0x243/0x4f0

I have tried setting
vm.min_free_kbytes = 524288
vm.dirty_background_ratio = 5
vm.swappiness = 2

on the host, but the behaviour is still the same.

I have also increased the minimum ARC size to 1 GB (it was set to 32 MB before)
with "echo 1073741824 >> /sys/module/zfs/parameters/zfs_arc_min", but the situation is still the same.
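In case it matters, this is roughly how I made those settings persistent across reboots (a sketch, assuming the usual Debian/Proxmox file locations):

Code:
# /etc/sysctl.d/90-zfs-tuning.conf (or /etc/sysctl.conf)
vm.min_free_kbytes = 524288
vm.dirty_background_ratio = 5
vm.swappiness = 2

# /etc/modprobe.d/zfs.conf -- ZFS module option, applied at module load
options zfs zfs_arc_min=1073741824

# apply without rebooting:
#   sysctl --system
#   update-initramfs -u   (only needed if the ZFS module is loaded from the initrd)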

I have been pulling my hair out for a few days now, and any help would be greatly appreciated.

Thanks, Matej
 
Hi,

Please send the output of the following commands to get an overview.

Code:
cat /proc/spl/kstat/zfs/arcstats
lsblk
cat /etc/default/zfs
 
Hello, the outputs are:

cat /proc/spl/kstat/zfs/arcstats

name type data
hits 4 24667005
misses 4 4694563
demand_data_hits 4 20857923
demand_data_misses 4 755489
demand_metadata_hits 4 2717348
demand_metadata_misses 4 205571
prefetch_data_hits 4 608425
prefetch_data_misses 4 3617187
prefetch_metadata_hits 4 483309
prefetch_metadata_misses 4 116316
mru_hits 4 12026430
mru_ghost_hits 4 320856
mfu_hits 4 11789334
mfu_ghost_hits 4 39655
deleted 4 1122603
mutex_miss 4 141
evict_skip 4 228
evict_not_enough 4 14
evict_l2_cached 4 26603543040
evict_l2_eligible 4 7071998976
evict_l2_ineligible 4 23003555840
evict_l2_skip 4 0
hash_elements 4 5023447
hash_elements_max 4 5024006
hash_collisions 4 3260807
hash_chains 4 594350
hash_chain_max 4 6
p 4 17302904436
c 4 41331045541
c_min 4 1073741824
c_max 4 42229026816
size 4 41295471360
hdr_size 4 1615837088
data_size 4 37461082112
metadata_size 4 1880202752
other_size 4 272273424
anon_size 4 14297088
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 12995137536
mru_evictable_data 4 12770410496
mru_evictable_metadata 4 34287616
mru_ghost_size 4 25769231872
mru_ghost_evictable_data 4 25328959488
mru_ghost_evictable_metadata 4 440272384
mfu_size 4 26331850240
mfu_evictable_data 4 24680726528
mfu_evictable_metadata 4 1540871680
mfu_ghost_size 4 6693398016
mfu_ghost_evictable_data 4 6460178432
mfu_ghost_evictable_metadata 4 233219584
l2_hits 4 85855
l2_misses 4 4608645
l2_feeds 4 95788
l2_rw_clash 4 0
l2_read_bytes 4 1237059584
l2_write_bytes 4 52666335232
l2_writes_sent 4 79069
l2_writes_done 4 79069
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 455
l2_cdata_free_on_write 4 6
l2_abort_lowmem 4 3
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 25886919168
l2_asize 4 25485618688
l2_hdr_size 4 66075984
l2_compress_successes 4 58541
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 299
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 3834389248
arc_meta_limit 4 31671770112
arc_meta_max 4 3834435528
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 1319653376


lsblk


NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 186.3G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 127M 0 part
├─sda3 8:3 0 72.2G 0 part
│ ├─pve-root 251:0 0 24G 0 lvm /
│ ├─pve-swap 251:1 0 32G 0 lvm [SWAP]
│ ├─pve-data_tmeta 251:2 0 16M 0 lvm
│ │ └─pve-data-tpool 251:4 0 16G 0 lvm
│ │   └─pve-data 251:5 0 16G 0 lvm
│ └─pve-data_tdata 251:3 0 16G 0 lvm
│   └─pve-data-tpool 251:4 0 16G 0 lvm
│     └─pve-data 251:5 0 16G 0 lvm
├─sda4 8:4 0 20G 0 part
└─sda5 8:5 0 94G 0 part
sdb 8:16 0 279.4G 0 disk
├─sdb1 8:17 0 279.4G 0 part
└─sdb9 8:25 0 8M 0 part
sdc 8:32 0 279.4G 0 disk
├─sdc1 8:33 0 279.4G 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 0 136.7G 0 disk
├─sdd1 8:49 0 136.7G 0 part
└─sdd9 8:57 0 8M 0 part
sde 8:64 0 136.7G 0 disk
├─sde1 8:65 0 136.7G 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 0 136.7G 0 disk
├─sdf1 8:81 0 136.7G 0 part
└─sdf9 8:89 0 8M 0 part
sdg 8:96 0 136.7G 0 disk
├─sdg1 8:97 0 136.7G 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 136.7G 0 disk
├─sdh1 8:113 0 136.7G 0 part
└─sdh9 8:121 0 8M 0 part
zd0 230:0 0 10G 0 disk
├─zd0p1 230:1 0 243M 0 part
├─zd0p2 230:2 0 1K 0 part
└─zd0p5 230:5 0 9.8G 0 part
zd16 230:16 0 100G 0 disk
├─zd16p1 230:17 0 487M 0 part
├─zd16p2 230:18 0 1K 0 part
└─zd16p5 230:21 0 99.5G 0 part
zd32 230:32 0 10G 0 disk
├─zd32p1 230:33 0 500M 0 part
└─zd32p2 230:34 0 9.5G 0 part
zd48 230:48 0 50G 0 disk
├─zd48p1 230:49 0 487M 0 part
├─zd48p2 230:50 0 1K 0 part
└─zd48p5 230:53 0 49.5G 0 part
zd64 230:64 0 40G 0 disk
└─zd64p1 230:65 0 40G 0 part
zd80 230:80 0 96G 0 disk
├─zd80p1 230:81 0 83.8G 0 part
└─zd80p2 230:82 0 7.6G 0 part
zd96 230:96 0 32G 0 disk
└─zd96p1 230:97 0 32G 0 part
zd112 230:112 0 64G 0 disk
zd128 230:128 0 500G 0 disk
├─zd128p1 230:129 0 500M 0 part
└─zd128p2 230:130 0 499.5G 0 part
zd144 230:144 0 32G 0 disk
├─zd144p1 230:145 0 487M 0 part
├─zd144p2 230:146 0 1K 0 part
└─zd144p5 230:149 0 31.5G 0 part
zd160 230:160 0 10G 0 disk
└─zd160p1 230:161 0 10G 0 part



cat /etc/default/zfs


# ZoL userland configuration.

# To enable a boolean setting, set it to yes, on, true, or 1.
# Anything else will be interpreted as unset.

# Run `zfs mount -a` during system start?
ZFS_MOUNT='yes'

# Run `zfs unmount -a` during system stop?
ZFS_UNMOUNT='yes'

# Run `zfs share -a` during system start?
# nb: The shareiscsi, sharenfs, and sharesmb dataset properties.
ZFS_SHARE='yes'

# Run `zfs unshare -a` during system stop?
ZFS_UNSHARE='yes'

# By default, a verbatim import of all pools is performed at boot based on the
# contents of the default zpool cache file. The contents of the cache are
# managed automatically by the 'zpool import' and 'zpool export' commands.
#
# By setting this to 'yes', the system will instead search all devices for
# pools and attempt to import them all at boot, even those that have been
# exported. Under this mode, the search path can be controlled by the
# ZPOOL_IMPORT_PATH variable and a list of pools that should not be imported
# can be listed in the ZFS_POOL_EXCEPTIONS variable.
#
# Note that importing all visible pools may include pools that you don't
# expect, such as those on removable devices and SANs, and those pools may
# proceed to mount themselves in places you do not want them to. The results
# can be unpredictable and possibly dangerous. Only enable this option if you
# understand this risk and have complete physical control over your system and
# SAN to prevent the insertion of malicious pools.
ZPOOL_IMPORT_ALL_VISIBLE='no'

# Specify specific path(s) to look for device nodes and/or links for the
# pool import(s). See zpool(8) for more information about this variable.
# It supersedes the old USE_DISK_BY_ID which indicated that it would only
# try '/dev/disk/by-id'.
# The old variable will still work in the code, but is deprecated.
#ZPOOL_IMPORT_PATH="/dev/disk/by-vdev:/dev/disk/by-id"

# List of pools that should NOT be imported at boot
# when ZPOOL_IMPORT_ALL_VISIBLE is 'yes'.
# This is a space separated list.
#ZFS_POOL_EXCEPTIONS="test2"

# List of pools that SHOULD be imported at boot by the initramfs
# instead of trying to import all available pools. If this is set
# then ZFS_POOL_EXCEPTIONS is ignored.
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
# This is a semi-colon separated list.
#ZFS_POOL_IMPORT="pool1;pool2"

# Should the datasets be mounted verbosely?
# A mount counter will be used when mounting if set to 'yes'.
VERBOSE_MOUNT='no'

# Should we allow overlay mounts?
# This is standard in Linux, but not ZFS which comes from Solaris where this
# is not allowed).
DO_OVERLAY_MOUNTS='no'

# Any additional option to the 'zfs import' commandline?
# Include '-o' for each option wanted.
# You don't need to put '-f' in here, unless you want it ALL the time.
# Using the option 'zfsforce=1' on the grub/kernel command line will
# do the same, but on a case-to-case basis.
ZPOOL_IMPORT_OPTS=""

# Full path to the ZFS cache file?
# See "cachefile" in zpool(8).
# The default is "/etc/zfs/zpool.cache".
#ZPOOL_CACHE="/etc/zfs/zpool.cache"
#
# Setting ZPOOL_CACHE to an empty string ('') AND setting ZPOOL_IMPORT_OPTS to
# "-c /etc/zfs/zpool.cache" will _enforce_ the use of a cache file.
# This is needed in some cases (extreme amounts of VDEVs, multipath etc).
# Generally, the use of a cache file is usually not recommended on Linux
# because it sometimes is more trouble than it's worth (laptops with external
# devices or when/if device nodes changes names).
#ZPOOL_IMPORT_OPTS="-c /etc/zfs/zpool.cache"
#ZPOOL_CACHE=""

# Any additional option to the 'zfs mount' command line?
# Include '-o' for each option wanted.
MOUNT_EXTRA_OPTIONS=""

# Build kernel modules with the --enable-debug switch?
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
ZFS_DKMS_ENABLE_DEBUG='no'

# Build kernel modules with the --enable-debug-dmu-tx switch?
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
ZFS_DKMS_ENABLE_DEBUG_DMU_TX='no'

# Keep debugging symbols in kernel modules?
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
ZFS_DKMS_DISABLE_STRIP='no'

# Wait for this many seconds in the initrd pre_mountroot?
# This delays startup and should be '0' on most systems.
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
ZFS_INITRD_PRE_MOUNTROOT_SLEEP='0'

# Wait for this many seconds in the initrd mountroot?
# This delays startup and should be '0' on most systems. This might help on
# systems which have their ZFS root on a USB disk that takes just a little
# longer to be available
# Only applicable for Debian GNU/Linux {dkms,initramfs}.
ZFS_INITRD_POST_MODPROBE_SLEEP='0'

# List of additional datasets to mount after the root dataset is mounted?
#
# The init script will use the mountpoint specified in the 'mountpoint'
# property value in the dataset to determine where it should be mounted.
#
# This is a space separated list, and will be mounted in the order specified,
# so if one filesystem depends on a previous mountpoint, make sure to put
# them in the right order.
#
# It is not necessary to add filesystems below the root fs here. It is
# taken care of by the initrd script automatically. These are only for
# additional filesystems needed. Such as /opt, /usr/local which is not
# located under the root fs.
# Example: If root FS is 'rpool/ROOT/rootfs', this would make sense.
#ZFS_INITRD_ADDITIONAL_DATASETS="rpool/ROOT/usr rpool/ROOT/var"

# Optional arguments for the ZFS Event Daemon (ZED).
# See zed(8) for more information on available options.
#ZED_ARGS="-M"




Thanks a lot!
Matej
 
It is not recommended to put a ZIL device on a shared disk.
A ZIL should always be dedicated; only an L2ARC can reasonably reside on a shared disk.
ZFS assumes it has absolute control over the disk, which in your case is shared with LVM/ext4.

Also, your L2ARC makes little sense because you have very few hits and plenty of misses.
L2ARC is not free; it costs memory, since every cached block needs a header in the ARC.
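To put numbers on that from the arcstats posted above: l2_hits is 85855 and l2_misses is 4608645, so the L2ARC satisfies only about 1.8% of the reads that reach it, while its headers (l2_hdr_size) already occupy about 66 MB of ARC. A quick way to compute the ratio (field names as in /proc/spl/kstat/zfs/arcstats):

Code:
awk '/^l2_hits / {h=$3} /^l2_misses / {m=$3}
     END {printf "L2ARC hit rate: %.2f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats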
 
Hello, thank you!

Would you recommend adding an additional drive and dedicating it as a ZIL-only device? I would use an SSD; what size would be best for my setup?
Can I add an additional ZIL while the system is running?

Interestingly, this setup has been running for about a year and a half, and the problems only started occurring a month or so ago, even though the I/O load on the guests has been similar the whole time.

Thanks,
Matej
 
Would you recommend adding an additional drive and dedicating it as a ZIL-only device? I would use an SSD; what size would be best for my setup?
16 GB is enough.
So a small enterprise SSD like the Samsung SM863 is perfect.
Here is a list of SSDs and how they perform.
The list is for Ceph, but the write pattern is similar to the ZIL's.
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
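As a rough sanity check on the 16 GB figure (a common rule of thumb, not an official ZFS number): the SLOG only ever holds the synchronous writes of a couple of transaction groups, and a txg is committed every zfs_txg_timeout seconds (5 s by default), so even a fully saturated 10 GbE link needs only about 1.25 GB/s x 2 x 5 s ≈ 12.5 GB.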

Can I add an additional ZIL while the system is running?
Yes
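Assuming the pool and device names shown earlier in the thread, the swap could look roughly like this (the by-id path is a placeholder for your new SSD; both steps work online):

Code:
zpool remove zfs-pool sda4                         # detach the old log partition on the shared SSD
zpool add zfs-pool log /dev/disk/by-id/<new-ssd>   # add the dedicated SSD as the SLOG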

Interestingly, this setup has been running for about a year and a half, and the problems only started occurring a month or so ago, even though the I/O load on the guests has been similar the whole time.
I think you were already at the limit, and the Meltdown and Spectre mitigations, which slow down the I/O stack on Intel CPUs, pushed the system over the edge.
 
Thank you very much!

So basically the SSD I currently use for the ZIL and swap will then only be used for swap? Does that make sense, given that it is a good-quality Intel SSD?
 
There are two guests with quite a lot of writes (some web servers, with data constantly coming in).
 
Could the reason for the issues I am having be a "semi-faulty" SSD currently used for the ZIL?
 
Could the reason for the issues I am having be a "semi-faulty" SSD currently used for the ZIL?
A quick, easy way to check is to set sync=disabled on the pool and test again. This turns off all sync writes and takes the ZIL out of the picture entirely, so it confirms whether the ZIL is the bottleneck.

HOWEVER

It's not safe to do this in production without a huge amount of redundancy. If you suffered a power loss while sync was disabled and the UPS was unable to shut the equipment down cleanly, you would risk losing data.
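Concretely, the test would look something like this (pool name from the zpool status above; remember to switch sync back afterwards):

Code:
zfs get sync zfs-pool            # note the current value (normally 'standard')
zfs set sync=disabled zfs-pool   # bypasses the ZIL/SLOG entirely -- unsafe for important data
# ... repeat the dd test in the guest ...
zfs set sync=standard zfs-pool   # restore the default behaviour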
 
Now I have two SSDs, one dedicated to L2ARC and one dedicated to the ZIL, and it works perfectly!

Thanks.
 
