Insane load avg, disk timeouts w/ZFS

athompso

Renowned Member
Sep 13, 2013
I've set up a new PVE system (community license only at this time).
It's reasonably beefy (given that it's slightly trailing-edge technology), using local ZFS storage.
We are observing that during periods of heavy *write* activity, the system load average goes into orbit (I've got a screen capture of a 17k load avg!) and VMs start dropping like flies - specifically, /dev/sda journal writes inside the guests time out, so root gets remounted read-only and the VMs then need rebooting.

The setup:
* Dell C2100
* 2 x X5670 (2.93GHz Xeon, hex-core, with HT)
* 72GB RAM (ZFS ARC limited to 32GB)
* 10 x 2TB HDD
* 2 x "enterprise" 480GB SDD split into two equal-sized partitions, one for mirrored ZIL, one for striped L2ARC
* The disks are run through the standard C2100 SAS-expander backplane to a PERC H200 cross-flashed to be a Dell 6GBps SAS adapter.
* 2 x internal SSDs for boot+root
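
For completeness, the SLOG/L2ARC layout was attached roughly like this (disk IDs as they appear in the zpool output below; the exact commands I ran aren't preserved, so treat this as a sketch):
Code:
zpool add tank log mirror \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1
zpool add tank cache \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2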

Code:
root@pve5:~# pveversion  -v
proxmox-ve: 4.3-70 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-7 (running version: 4.3-7/db02a4de)
pve-kernel-4.4.21-1-pve: 4.4.21-70
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-76
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-67
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve12~bpo80

Code:
root@pve5:~# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
tank   18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
root@pve5:~# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
  mirror   149G  3.33G   146G         -    50%     2%
    sda2      -      -      -         -      -      -
    sdb2      -      -      -         -      -      -
tank  18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
  raidz3  18.1T  3.33T  14.8T         -    32%    18%
    ata-HUA723020ALA640_YFJLYA4C      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYLUC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYSYC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYTJC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYVAC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYX3C      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM13BC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5L0C      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5NXC      -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5RVC      -      -      -         -      -      -
log      -      -      -         -      -      -
  mirror   222G   420K   222G         -    88%     0%
    ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1      -      -      -         -      -      -
    ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1      -      -      -         -      -      -
cache      -      -      -         -      -      -
  ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2   224G   130G  93.9G         -     0%    57%
  ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2   224G   130G  93.7G         -     0%    58%

As long as the load average stays below ~10, everything works well. As soon as I start doing bulk writes into the main ZFS array, everything goes to hell: load average starts climbing, VMs start timing out, etc.

I've experimented with the sync and logbias settings, but it appears that I'm hitting the ZFS write throttle artificially early. I've tweaked zfs_dirty_data_max_percent and related parameters, with very little effect.
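For reference, the knobs I mean are roughly these - the per-dataset properties plus the ZoL write-throttle module parameters (values in the comments are just the stock defaults as I understand them, not my tweaks):
Code:
# per-dataset properties (defaults: sync=standard, logbias=latency)
zfs get sync,logbias tank
# write-throttle / dirty-data module parameters, readable (and writable) at runtime
cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent    # default 10 (% of RAM)
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent   # default 60
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent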

Under some unknown conditions, dstat(1) reports that this array has peaked at over 1GB/s of write activity... so, WTF? Even restoring a single VM basically kills my entire system.
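
(The per-vdev breakdown during one of these bursts can be watched with something like the following, if anyone wants to compare notes:)
Code:
# one-second samples of per-vdev bandwidth and IOPS
zpool iostat -v tank 1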

(Containers aren't as badly affected because, of course, they never detect a "dead" disk and remount themselves read-only.)

Ideas welcome.
 
That's the confusing part - the load average with ZFS running seems completely arbitrary and artificial. (Does it count kernel threads? Even then it's ridiculous.)

Right now there are 4 containers and 9 VMs on this system. Not a very heavy load for the hardware.

Also, I just realized the 17k number was from a different cluster running sheepdog (same VMs though - we just migrated everything). I *am* seeing load averages easily reach into the 20s and 30s on the new system.

Once the load average goes above 10, all the VMs are usually dead.

The IOWAIT percentage of CPU is usually in the 30-60% range at this point.
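
Partially answering my own question above: Linux's load average does count tasks stuck in uninterruptible (D-state) sleep, not just runnable ones, so blocked I/O - including kernel threads - inflates it. A quick way to see what's actually stuck:
Code:
# list tasks currently in uninterruptible sleep (these all count toward the load average)
ps -eo state,pid,comm | awk '$1 == "D"'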
 
You've managed to confuse me completely ;-) What are we talking about here - sheepdog or ZFS?

The 17k load average was from the old cluster, running sheepdog. That number is irrelevant, my brain is just fixated on it because it was so astonishing.

The new system is what I'm having problems with; it is not clustered.

Uh oh... now I can't even reboot it: the zpool import takes too long (>5min) and the system reboots itself?!? This is going from bad to worse.
 
Rebooting again: importing the ZFS pools took 9 minutes, and mounting the ZFS filesystems didn't finish until the 12-minute mark (according to the systemd timers). I fear I have done something very wrong with ZFS, but I don't know what.
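
(Those timings are read off the ZFS systemd units; something like the following shows them, assuming the usual zfs-import-cache / zfs-mount unit names:)
Code:
# which units dominated the boot
systemd-analyze blame | head -n 20
systemctl status zfs-import-cache.service zfs-mount.service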

arcstat.py shows that, immediately after boot, the ARC is only 2.1GB. I'm not even sure what numbers I should be looking for :-(
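
For my own notes, the columns that seem worth watching are arcsz (current ARC size) and c (the size the ARC is currently targeting), e.g.:
Code:
# 5-second samples; arcsz should grow toward c, and c should stay <= zfs_arc_max
arcstat.py -f time,read,miss,miss%,arcsz,c 5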
 
I cannot find the memory-related settings. You say the ZFS ARC is limited to 32GB, but ZFS uses more than that (e.g. write buffers).
You say 9 VMs, but give no information about their memory allocation. No word on the containers either.

The reason I'm interested in this is that I've hit the same issue: ZFS write threads (z_wr_iss) each eating 100% of a core. I have the same CPUs, so 24 threads in total (2x6x2).

I've solved my issue by limiting the ARC to 16GB on a 60GB host. Of course, the exact value won't carry over, because we have different usage patterns in our VMs and CTs.
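
A quick way to confirm the limit has actually taken effect is to compare the live ARC size against c_max in the kernel stats:
Code:
# c_max should match the zfs_arc_max you configured; size is the current ARC footprint
awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats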
 
Ah, even worse: when the system finally boots, most of the entries are missing from /dev/zvol, so the VMs cannot start. Currently looking for workarounds... it seems renaming each zvol does the trick, but - ugh, not a nice workaround. Exporting the data pool and re-importing it works, too. (This sounds exactly like https://github.com/zfsonlinux/zfs/issues/599 & 441, from four years ago.)
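
For the record, the rename workaround looks roughly like this per zvol (the dataset name below is just a placeholder; renaming it and renaming it back makes the /dev/zvol node reappear):
Code:
# placeholder zvol name - substitute the real vm-<id>-disk-<n> dataset
zfs rename tank/vm-100-disk-1 tank/vm-100-disk-1-tmp
zfs rename tank/vm-100-disk-1-tmp tank/vm-100-disk-1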

Just doing a "zpool export" takes the loadavg up to 25, BTW. iotop shows the kernel thread txg_sync is doing 99% of all I/O.
 
I cannot find the memory-related settings. You say the ZFS ARC is limited to 32GB, but ZFS uses more than that (e.g. write buffers).
You say 9 VMs, but give no information about their memory allocation. No word on the containers either.

The reason I'm interested in this is that I've hit the same issue: ZFS write threads (z_wr_iss) each eating 100% of a core. I have the same CPUs, so 24 threads in total (2x6x2).

I've solved my issue by limiting the ARC to 16GB on a 60GB host. Of course, the exact value won't carry over, because we have different usage patterns in our VMs and CTs.


The ARC is hard-limited to 32GB via module options.
A ZIL on a mirrored SLOG vdev exists, 240GB in size (yes, vastly overprovisioned).
An L2ARC on striped vdevs exists, 240GB in size.
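
Specifically, the ARC limit is set via /etc/modprobe.d/zfs.conf, roughly like this (32GB expressed in bytes), followed by an initramfs rebuild so it applies at boot:
Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368
# then, so the option is picked up at the next boot:
update-initramfs -u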

Each of the VMs has between 1 and 4GB allocated to it, and nothing's actually running in the containers yet besides a bare OS, so VM+CT memory usage is fairly minimal; I always show many GB of free memory.

In journalctl output, I now also see stuff like:
Code:
Oct 30 06:05:42 pve5 systemd-udevd[2684]: worker [11771] terminated by signal 9 (Killed)
Oct 30 06:05:42 pve5 systemd-udevd[2684]: worker [11770] terminated by signal 9 (Killed)
Oct 30 06:05:42 pve5 systemd-udevd[2684]: worker [11772] terminated by signal 9 (Killed)
Oct 30 06:05:45 pve5 systemd[1]: Starting LVM2 PV scan on device 230:144...
Oct 30 06:05:45 pve5 systemd[1]: Started LVM2 PV scan on device 230:144.
Oct 30 06:05:46 pve5 kernel:  zd160: p1 p2 < p5 >
 
Hmm... I'm also trying the 'zpool export tank' / 'zpool import tank' that many others in the ZFS community suggest; exporting this zpool is taking 15+ minutes. A zpool export should be nearly instantaneous, no?
 
OK, seriously WTF... as I'm exporting the pool, the zd* devices are starting to show up in /sys/block and udev is trying to deal with them... and getting nowhere as they're vanishing out from under udev's feet.

Wishing I'd picked the H700 RAID controller and just run qcow2 on ext4 or xfs now...
 
Try an iostat -kxz 1 during these operations. Maybe a drive has gone bad?
Doesn't *seem* to be an issue.
There are consistent reads from between 4 and 6 of the 12 devices, which correlates with the "zpool iostat -v" output. SMART reports nothing wrong, either.
 
Based on a whole lot of reading (including from Matt Ahrens, pretty much the horse's mouth himself), my fundamental problem appears to be that I chose a topology of 10 x HDD + 2 x SSD in RAIDZ3 (with the SSDs split between SLOG and L2ARC), plus a whole lot of sub-optimal code in ZFS-on-Linux that makes large pools really, really slow to manage.
It appears - although this is damned hard to confirm - that if I had created my pool as 5 striped mirror vdevs, I would get significantly better performance. And since my workload is write-mostly, the L2ARCs are utterly useless; there's less data stored in the L2ARC than there is in the ARC itself!
Still unsure about the ZIL, though...
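
For reference, the alternative layout would be created roughly like this (disk IDs taken from the existing pool; obviously this destroys the current pool, so it's only a sketch of what I'd do differently):
Code:
zpool create tank \
  mirror /dev/disk/by-id/ata-HUA723020ALA640_YFJLYA4C /dev/disk/by-id/ata-HUA723020ALA640_YFJLYLUC \
  mirror /dev/disk/by-id/ata-HUA723020ALA640_YFJLYSYC /dev/disk/by-id/ata-HUA723020ALA640_YFJLYTJC \
  mirror /dev/disk/by-id/ata-HUA723020ALA640_YFJLYVAC /dev/disk/by-id/ata-HUA723020ALA640_YFJLYX3C \
  mirror /dev/disk/by-id/ata-HUA723020ALA640_YFJM13BC /dev/disk/by-id/ata-HUA723020ALA640_YFJM5L0C \
  mirror /dev/disk/by-id/ata-HUA723020ALA640_YFJM5NXC /dev/disk/by-id/ata-HUA723020ALA640_YFJM5RVC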
 
Oh, I didn't notice the Z3. With a RAID-Z vdev you get roughly the random-I/O performance of its single slowest disk. It's all about capacity & redundancy, not performance.
 
Based on a whole lot of reading (including from Matt Ahrens, pretty much the horse's mouth himself), my fundamental problem appears to be that I chose a topology of 10 x HDD + 2 x SSD in RAIDZ3 (with the SSDs split between SLOG and L2ARC), plus a whole lot of sub-optimal code in ZFS-on-Linux that makes large pools really, really slow to manage.
It appears - although this is damned hard to confirm - that if I had created my pool as 5 striped mirror vdevs, I would get significantly better performance. And since my workload is write-mostly, the L2ARCs are utterly useless; there's less data stored in the L2ARC than there is in the ARC itself!
Still unsure about the ZIL, though...

A RAID-Z3 is very different from 5 mirrored vdevs. In a RAID-Z3 you can lose 3 disks before losing redundancy, and with a 4th you lose data; in a RAID10 you can lose only 1 disk per mirror (losing 2 disks from the same mirror loses data).
 
