ZFS over ISCSI (High Load on Move Disk)

ran · Jan 12, 2020

Hi

We have a ZFS over iSCSI setup using LIO on Ubuntu 18, and we see high IO load whenever we move disks larger than 100GB.

Once the move starts, the load stays low until about half of the transfer is done, and then it gets extremely high.

Our setup is very high end, so the load seems unreasonable: the pool is a raidz1-0 of 8 x 8TB NVMe disks,

with 512GB of RAM and dual Xeon Gold 3.3GHz CPUs, and that is just for the storage. atime is disabled on ZFS.

Do you have any clue what could cause high load on moves of large disks?

By the way, setting a limit in the Proxmox cluster options doesn't help at all.

Thanks.

wolfgang · Jan 14, 2020

Hi,

what PVE version do you use?

Code:

pveversion -v

ran · Jan 14, 2020

The latest on all servers; the minimum version in our cluster is 6.1-3.

proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-15
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

wolfgang · Jan 14, 2020

I guess it has something to do with the AVX-512 instruction set.
Please try switching the RAID-Z and fletcher implementations to ssse3.

https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_fletcher_4_impl
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_vdev_raidz_impl

ran · Jan 14, 2020

Hi,

Thanks a lot for the help. How can I do that, though? Do I change it with the command "zfs set checksum..."?

Can the change you are suggesting be done on a live production ZFS volume?

Thanks so much.

wolfgang · Jan 14, 2020

You can change this at runtime.

Code:

# select the SSSE3 RAID-Z parity implementation (not persistent across reboots)
echo ssse3 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
zfs set checksum=sha256 <pool>
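
For reference, a minimal sketch of how the two module parameters linked earlier could be inspected before and after such a change. It assumes the ZFS sysfs files list the available implementations and mark the active one with square brackets; the helper name active_impl is made up for this example, and the script only reads.

```shell
#!/bin/sh
# Hypothetical helper: print the bracketed (active) entry of a ZFS
# implementation selector file, e.g. "cycle [fastest] original scalar".
active_impl() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

for param in zfs_vdev_raidz_impl zfs_fletcher_4_impl; do
    file="/sys/module/zfs/parameters/$param"
    if [ -r "$file" ]; then
        echo "$param: $(active_impl "$file")"
    else
        echo "$param: not available (zfs module not loaded?)"
    fi
done
```

The fletcher implementation can be switched with the same echo pattern by writing ssse3 to /sys/module/zfs/parameters/zfs_fletcher_4_impl. Like the RAID-Z setting, a value written this way does not survive a reboot.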

ran · Jan 27, 2020

Unfortunately it didn't help at all. For some reason, during any major operation like a disk move, clone or restore, it's the same story: very high load, apparently from too many process threads being spawned, causing major IO delay on the system. Is there anything else you can suggest?

We are truly lost with this situation. We have multiple VMs relying on that storage, and they can all crash on any disk move, even though the hardware is really top of the line, as mentioned above.

Thanks.

wolfgang · Jan 28, 2020

Please send me the output of these commands.

Code:

arc_summary
lsblk
swapon
zpool get all
zfs get all
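
A small sketch of how the output of these commands could be captured into one file per command, ready to attach to a reply. The function name collect_diagnostics and the output directory are assumptions for illustration; commands that are not installed are skipped rather than treated as errors.

```shell
#!/bin/sh
# Run each diagnostic command and store its output in its own file.
# Commands that are not installed are skipped instead of aborting the run.
collect_diagnostics() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    for cmd in "arc_summary" "lsblk" "swapon" "zpool get all" "zfs get all"; do
        bin=${cmd%% *}                      # first word = executable name
        name=$(echo "$cmd" | tr ' ' '_')    # "zpool get all" -> zpool_get_all
        if command -v "$bin" >/dev/null 2>&1; then
            $cmd > "$outdir/$name.txt" 2>&1 || true
        else
            echo "skipped: $bin not found" > "$outdir/$name.txt"
        fi
    done
}

# Hypothetical output location; adjust as needed.
collect_diagnostics "${TMPDIR:-/tmp}/zfs-diag"
ls "${TMPDIR:-/tmp}/zfs-diag"
```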

ran · Jan 28, 2020

Thanks, the output of each command is attached as a separate file.

ran · Feb 10, 2020

Hi Wolfgang,

Do you have any idea why this happens?
Do you have enough info from my side?

Thanks.

wolfgang · Feb 11, 2020

This is a general NVMe problem.
It looks like the disks are too fast ;-)
Also, many HW vendors use switches to extend the PCIe lanes for more devices, which causes serious problems.

Please send me the output of these commands to verify how your NVMe devices are connected.

Code:

lspci -tv
lspci

Meanwhile, you can try the following to increase performance.

Set <Cores> to the real core count, without Hyper-Threading:
echo <Cores> > /sys/module/nvme/parameters/poll_queues

Check whether the governor is set to performance:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set the polling delay (-1 = classic polling, 0 = hybrid polling, values above 0 = fixed delay in microseconds):
echo 0 > /sys/block/nvme0n1/queue/io_poll_delay
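
The core count for the first step has to exclude Hyper-Threading siblings, which nproc does not do because it counts logical CPUs. A minimal sketch of one way to get the physical count, assuming lscpu from util-linux is available; the function name count_physical_cores is hypothetical, and the actual write is left commented out.

```shell
#!/bin/sh
# Count physical cores from "lscpu -p=CORE,SOCKET" style input on stdin:
# one "core,socket" line per logical CPU, comment lines start with '#'.
# Hyper-Threading siblings share the same pair, so unique pairs = real cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

if command -v lscpu >/dev/null 2>&1; then
    cores=$(lscpu -p=CORE,SOCKET | count_physical_cores)
    echo "physical cores: $cores"
    # The write itself is left commented out on purpose:
    # echo "$cores" > /sys/module/nvme/parameters/poll_queues
fi
```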

ran · Feb 11, 2020

Hi,

Thank you, I have attached the output of both commands.

currently this file does not exist " /sys/module/nvme/parameters/poll_queues "

Should I just create it with the echo command, using the output of the nproc command?

output of scaling_governor:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand
ondemand

output of io_poll_delay:

cat /sys/block/nvme0n1/queue/io_poll_delay

-1

should i do all the changes you suggested live on active zfs server?

Thanks for the help

wolfgang · Feb 14, 2020

Please check whether your BIOS has a setting to put the CPU into performance mode, and disable all power-saving settings.
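
Separately from the BIOS, the kernel governor can usually be switched at runtime through the same sysfs files that were read above. The following is only a sketch: the function name set_governor is made up, it assumes "performance" is among the governors the cpufreq driver offers, and the change does not persist across a reboot.

```shell
#!/bin/sh
# Write the requested governor to every cpufreq policy below a base
# directory (normally /sys/devices/system/cpu) and print how many CPUs
# were changed. Files that are missing or not writable are skipped.
set_governor() {
    base=$1
    governor=$2
    changed=0
    for f in "$base"/cpu*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        echo "$governor" > "$f" && changed=$((changed + 1))
    done
    echo "$changed"
}

# Example call (needs root; commented out so nothing changes by accident):
# set_governor /sys/devices/system/cpu performance
```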

ran said:
currently this file does not exist " /sys/module/nvme/parameters/poll_queues "

This is not a regular file; it is a sysfs entry.
It should exist in the current PVE kernel.

ran said:
should i do all the changes you suggested live on active zfs server?

This is normally no problem.

But as your lspci output shows, the NVMe devices are not balanced across the bridges.
You have 4 bridges:
Bridge 3B:00.0 has 7 NVMe
Bridge 18:00.0 has 1 NVMe
Bridge 86:00.0 has 2 NVMe
Bridge af:00.0 has 1 NVMe

I would make sure the NVMe devices are balanced across the bridges.
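
A rough way to re-check the distribution after moving devices, without reading the lspci tree by hand. This sketch assumes the NVMe controllers appear under /sys/class/nvme and groups them by the first PCI device below the root complex; on systems with PCIe switches the relevant bridge may sit one level deeper, so the grouping level is an assumption to adjust. The function name count_per_root_port is hypothetical.

```shell
#!/bin/sh
# Read resolved sysfs device paths on stdin, for example
#   /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/nvme/nvme0
# and count controllers per first-level PCI device below the root complex.
count_per_root_port() {
    awk -F/ 'NF >= 5 { n[$4 "/" $5]++ } END { for (k in n) print n[k], k }' | sort -rn
}

# Assumption: NVMe controllers show up under /sys/class/nvme.
for dev in /sys/class/nvme/nvme*; do
    [ -e "$dev" ] && readlink -f "$dev"
done | count_per_root_port
```

Each output line is a count followed by the root complex and port it belongs to, with the most heavily loaded port first.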

ran · Feb 14, 2020

Hi Wolfgang, thank you.

I should mention that NFS sharing on the same ZFS server and the same NVMe disks is much, much faster.

It's only when we use zvols on that server and move or clone disks that we get a very high load.

So I'm not sure about the bridges solution.

What do you think?

JamesT · Apr 27, 2021

ran said:
Hi Wolfgang, thank you.

I should mention that NFS sharing on the same ZFS server and the same NVMe disks is much, much faster.

It's only when we use zvols on that server and move or clone disks that we get a very high load.

So I'm not sure about the bridges solution.

What do you think?

Hello,
I'm observing a very similar, if not the same, issue on our setup. Things appear to work fine, but when migrating a VM between two hosts, or even moving a disk from one storage to another, everything grinds to a halt: operations are incredibly slow, VMs become unresponsive, and CPU usage goes up.
This happens even when migrating from NVMe storage on one host to NVMe storage on the other.
Did you ever find a solution?

P.S.
I was unable to find out how to update the polling setting mentioned.
Running echo 1 > /sys/block/nvme0n1/queue/io_poll didn't work; it failed with "write error: Invalid argument". The same happened when using a text editor like nano.
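
One possible explanation, not confirmed anywhere in this thread: the block layer refuses to enable io_poll with "Invalid argument" when the device has no polling queues, and the nvme driver's poll_queues parameter defaults to 0. The sketch below only reports the current state; the function name poll_queue_state is made up for this example.

```shell
#!/bin/sh
# Report whether the nvme driver was given any polling queues.
# Prints "missing" if the parameter file is absent, otherwise
# "disabled" for 0 and "enabled" for any positive value.
poll_queue_state() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "missing"
    elif [ "$(cat "$file")" -gt 0 ] 2>/dev/null; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

poll_queue_state /sys/module/nvme/parameters/poll_queues
```

If this prints "disabled", a commonly used way to set a module parameter at load time is a modprobe option such as "options nvme poll_queues=<Cores>" in a file under /etc/modprobe.d/, followed by an initramfs rebuild and a reboot. Treat that as an assumption to verify on a test host first.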
