I've set up a new PVE system (community license only at this time).
It's reasonably beefy (albeit slightly trailing-edge technology) and uses local ZFS storage.
We are observing that during periods of heavy *write* activity, the system load average goes into orbit (I've got a screen capture of a 17k load average!) and VMs start dropping like flies - specifically, journal writes to /dev/sda inside the guests time out, so root gets remounted read-only and the VMs then require rebooting.
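For context on the failure mode inside the guests: root is (presumably) mounted with errors=remount-ro, so once journal writes to the virtual disk exceed the SCSI timeout the kernel flips the filesystem read-only. The guest-side knob involved looks like this (just to illustrate the mechanism - raising it only hides the host-side stall, it's not a fix):
Code:
# inside an affected Linux guest
cat /sys/block/sda/device/timeout           # block-layer timeout in seconds, typically 30
echo 180 > /sys/block/sda/device/timeout    # papers over the stall, doesn't cure it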
The setup:
* Dell C2100
* 2 x X5670 (2.93GHz Xeon, hex-core, with HT)
* 72GB RAM (ZFS ARC limited to 32GB; see the snippet right after this list)
* 10 x 2TB HDD
* 2 x "enterprise" 480GB SDD split into two equal-sized partitions, one for mirrored ZIL, one for striped L2ARC
* The disks are run through the standard C2100 SAS-expander backplane to a PERC H200 cross-flashed to be a Dell 6Gbps SAS adapter.
* 2 x internal SSDs for boot+root
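For the record, the ARC cap and the SSD partitioning were done along these lines (the zfs.conf path is the standard ZoL mechanism; the zpool add commands are reconstructed from memory, with the device names as they appear in the zpool output below):
Code:
# /etc/modprobe.d/zfs.conf - cap the ARC at 32GB (value in bytes)
options zfs zfs_arc_max=34359738368

# verify the live value
cat /sys/module/zfs/parameters/zfs_arc_max

# first partition of each SSD -> mirrored SLOG, second partition of each -> striped L2ARC
zpool add tank log mirror \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1
zpool add tank cache \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2 \
  /dev/disk/by-id/ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2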
Code:
root@pve5:~# pveversion -v
proxmox-ve: 4.3-70 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-7 (running version: 4.3-7/db02a4de)
pve-kernel-4.4.21-1-pve: 4.4.21-70
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-76
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-67
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve12~bpo80
Code:
root@pve5:~# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
tank   18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
root@pve5:~# zpool list -v
NAME                                               SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool                                               149G  3.33G   146G         -    50%     2%  1.00x  ONLINE  -
  mirror                                            149G  3.33G   146G         -    50%     2%
    sda2                                               -      -      -         -      -      -
    sdb2                                               -      -      -         -      -      -
tank                                              18.1T  3.33T  14.8T         -    32%    18%  1.21x  ONLINE  -
  raidz3                                          18.1T  3.33T  14.8T         -    32%    18%
    ata-HUA723020ALA640_YFJLYA4C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYLUC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYSYC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYTJC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYVAC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJLYX3C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM13BC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5L0C                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5NXC                       -      -      -         -      -      -
    ata-HUA723020ALA640_YFJM5RVC                       -      -      -         -      -      -
log                                                    -      -      -         -      -      -
  mirror                                            222G   420K   222G         -    88%     0%
    ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part1       -      -      -         -      -      -
    ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part1       -      -      -         -      -      -
cache                                                  -      -      -         -      -      -
  ata-TOSHIBA_THNSNJ480PCS3_15IS105OTFLW-part2      224G   130G  93.9G         -     0%    57%
  ata-TOSHIBA_THNSNJ480PCS3_15IS107CTFLW-part2      224G   130G  93.7G         -     0%    58%
As long as the load average stays below ~10, everything works well. As soon as I start doing bulk writes into the main ZFS array, everything goes to hell: load average starts climbing, VMs start timing out, etc.
I've experimented with the sync and logbias settings, but it appears that I'm hitting the ZFS write throttle artificially early. I've also tweaked zfs_dirty_data_max_percent and the related dirty-data parameters, with very little effect.
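For concreteness, this is the sort of thing I've been poking at - dataset properties plus the ZoL 0.6.5 dirty-data / write-throttle module parameters (tank/vmdata stands in for the actual VM dataset name):
Code:
# dataset-level experiments (cycled through the supported sync/logbias values)
zfs get sync,logbias tank/vmdata
zfs set sync=standard tank/vmdata
zfs set logbias=latency tank/vmdata

# write-throttle / dirty-data tunables, live values
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent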
Under some unknown conditions, dstat(1) reports that this array has peaked at >1GB/s of write activity... so, WTF? Even restoring a single VM basically kills my entire system.
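For reference, that throughput figure came from watching something along these lines (not the exact invocation that produced the screen capture):
Code:
# total disk read/write plus per-disk utilisation, 5-second samples
dstat -d --disk-util 5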
(Containers aren't as badly affected because, of course, they have no virtual disk to declare dead, so they never drop into read-only safe mode.)
Ideas welcome.