I've set up a new PVE system (community license only at this time).
It's reasonably beefy (albeit slightly trailing-edge technology) and uses local ZFS storage.
We are observing that during periods of heavy *write* activity, the system load average goes into orbit (I've got a screen capture of a 17k load average!) and VMs start dropping like flies - specifically, journal writes to /dev/sda inside the guests time out, so root gets remounted read-only and the VMs then require rebooting.
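For context on the failure mode inside the guests: root is (presumably) mounted with errors=remount-ro, so once journal writes to the virtual disk exceed the SCSI timeout the kernel flips the filesystem read-only. The guest-side knob involved looks like this (just to illustrate the mechanism - raising it only hides the host-side stall, it's not a fix):
Code:
# inside an affected Linux guest
cat /sys/block/sda/device/timeout           # block-layer timeout in seconds, typically 30
echo 180 > /sys/block/sda/device/timeout    # papers over the stall, doesn't cure it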
The setup:
* Dell C2100
* 2 x X5670 (2.93GHz Xeon, hex-core, with HT)
* 72GB RAM (ZFS ARC limited to 32GB; see the snippet right after this list)
* 10 x 2TB HDD
* 2 x "enterprise" 480GB SDD split into two equal-sized partitions, one for mirrored ZIL, one for striped L2ARC
* The disks are run through the standard C2100 SAS-expander backplane to a PERC H200 cross-flashed to be a Dell 6Gbps SAS adapter.
* 2 x internal SSDs for boot+root
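For the record, the ARC cap and the SSD partitioning were done along these lines (the zfs.conf path is the standard ZoL mechanism; the zpool add commands are reconstructed from memory, with the device names as they appear in the zpool output below):
Code:
# /etc/modprobe.d/zfs.conf - cap the ARC at 32GB (value in bytes)
options zfs zfs_arc_max=34359738368

# verify the live value
cat /sys/module/zfs/parameters/zfs_arc_max

# first partition of each SSD -> mirrored SLOG, second partition of each -> striped L2ARC
zpool add tank log mirror \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1
zpool add tank cache \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2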
Code:
root@pve5:~# pveversion -v
proxmox-ve: 4.3-70 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-7 (running version: 4.3-7/db02a4de)
pve-kernel-4.4.21-1-pve: 4.4.21-70
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-76
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-67
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve12~bpo80
Code:
root@pve5:~# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
tank   18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
root@pve5:~# zpool list -v
NAME                                               SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool                                               149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
  mirror                                            149G  3.33G   146G         -    50%     2%
    sda2                                               -      -      -         -      -      -
    sdb2                                               -      -      -         -      -      -
tank                                              18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
  raidz3                                          18.1T  3.33T  14.8T         -    32%    18%
    ata-HUA723020ALA640_YFJLYA4C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYLUC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYSYC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYTJC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYVAC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYX3C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM13BC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5L0C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5NXC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5RVC                       -      -      -         -      -      -
log                                                    -      -      -         -      -      -
  mirror                                            222G   420K   222G         -    88%     0%
    ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1       -      -      -         -      -      -
    ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1       -      -      -         -      -      -
cache                                                  -      -      -         -      -      -
  ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2      224G   130G  93.9G         -     0%    57%
  ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2      224G   130G  93.7G         -     0%    58%
As long as the load average stays below ~10, everything works well. As soon as I start doing bulk writes into the main ZFS array, everything goes to hell: load average starts climbing, VMs start timing out, etc.
I've experimented with the sync and logbias settings, but it appears that I'm hitting the ZFS write throttle artificially early. I've also tweaked zfs_dirty_data_max_percent and the related dirty-data parameters, with very little effect.
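For concreteness, this is the sort of thing I've been poking at - dataset properties plus the ZoL 0.6.5 dirty-data / write-throttle module parameters (tank/vmdata stands in for the actual VM dataset name):
Code:
# dataset-level experiments (cycled through the supported sync/logbias values)
zfs get sync,logbias tank/vmdata
zfs set sync=standard tank/vmdata
zfs set logbias=latency tank/vmdata

# write-throttle / dirty-data tunables, live values
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent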
Under some unknown conditions, dstat(1) reports that this array has peaked at >1GB/s of write activity... so, WTF? Even restoring a single VM basically kills my entire system.
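For reference, that throughput figure came from watching something along these lines (not the exact invocation that produced the screen capture):
Code:
# total disk read/write plus per-disk utilisation, 5-second samples
dstat -d --disk-util 5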
(Containers aren't as badly affected because, of course, they have no virtual disk to declare dead, so they never drop into read-only safe mode.)
Ideas welcome.