Write speeds fall off / VM lag during file copies with ZFS

homelabber

New Member
Sep 14, 2019
Proxmox 6.0-7 used for 1 Windows 2016 Server VM and a couple of Ubuntu containers
Intel Core i7-6700K
32 GB non-ECC RAM
2x ZFS 128K block size mirrored 960GB Seagate Nytro SSDs - OS install and VMs/containers
2x ZFS 128K block size mirrored 8TB HGST 7200RPM NAS Drives - 4TB File storage drive in RAW format used for Windows 2016 Server formatted NTFS
1x ZFS 128K block size 10TB Seagate Ironwolf 7200 RPM Drive - 5TB backup storage drive in RAW format used for Windows 2016 Server formatted NTFS

When I copy files from the SSD drives to the 7200 RPM drives it usually starts off strong (caching?), but then the speed drops hard and I get very high IO delay, around 60-70%, which causes lag on the Proxmox server and makes remote desktop to the Windows Server 2016 VM time out briefly. The storage drives become inaccessible for a minute or two when I try to open them in Windows Explorer. I am using VirtIO SCSI single with IO thread enabled, with no improvement. I tried the writeback cache mode on these drives but it did not help, so I am back to No cache. ARC is configured to use at most 16 GB of RAM.
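For reference, a 16 GB ARC cap like this is typically set via the zfs_arc_max module option - a minimal sketch (values in bytes; this may not match the exact config used here):

# /etc/modprobe.d/zfs.conf -- cap ARC at 16 GiB (16 * 1024^3 bytes)
options zfs zfs_arc_max=17179869184

# refresh the initramfs so the option is applied at boot (root is on ZFS)
update-initramfs -u -k all

# or change it immediately without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max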

Tests of file transfers with iostat running:
Copy 5 GB ISO from HDD ZFS mirror to SSD ZFS mirror: https://imgur.com/8rC180F - The copy from the HDD to the SSD starts very fast, as if it is coming from a cache, then drops to around 250 MB/s until it finishes. Once the transfer status window closes, the iostat window shows the drive utilization dropping soon after, and there is no lag in Windows.
Copy 5 GB ISO back from SSD ZFS mirror to HDD ZFS mirror: https://imgur.com/6w6z4N2 - The copy from the SSD to the HDD also starts very fast (impossibly fast, around 1 gigabyte per second) and the file copy progress window closes, but then the HDD drives become inaccessible for a minute or two, and iostat still shows the storage drives transferring at around 120 MB/s.

It seems like Windows thinks the transfer is complete, but in reality the transfer is still going on behind the scenes and the storage drive becomes inaccessible for a few minutes. I should also point out that when I was originally passing the hard drives through directly to Windows Server 2016 (NTFS mirror) there were no performance problems. Any idea what could be causing this issue? Thank you.
 
You say ARC is set to use up to 16GB, so that leaves you with only 16GB for everything else.

What do you have assigned to the VMs?

May be worth pasting the full .conf output for the Windows VM
 
Windows Server 2016 has 4-8GB assigned with balloon turned on. The other two containers have 3GB and 512MB assigned respectively. My PVE system is currently using 25GB out of 31.32GB.

Windows VM .conf:
agent: 1
balloon: 4096
bios: ovmf
boot: cdn
bootdisk: scsi0
cores: 3
cpu: host
efidisk0: local-zfs:vm-102-disk-0,size=4M
hotplug: disk,network,usb
machine: q35
memory: 8192
name: *****.**********.local
net0: virtio=FA:1A:DF:DA:E9:DB,bridge=vmbr0
numa: 0
onboot: 1
ostype: win10
sata0: none,media=cdrom
scsi0: local-zfs:vm-102-disk-1,discard=on,size=100G,ssd=1
scsi1: storage:vm-102-disk-1,backup=0,discard=on,iothread=1,size=4T
scsi2: backups-zfs:vm-102-disk-0,backup=0,discard=on,iothread=1,size=5T
scsihw: virtio-scsi-single
smbios1: uuid=a581c5f9-45e0-4a70-bcab-fe2fd4721324
sockets: 1
usb0: host=01f0:1347
usb1: host=1048:15da,usb3=1
vga: virtio
 
Have you monitored free -m output on the host node whilst you're running one of the transfers that causes the issue?
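For example, something like this in a second shell while the copy is running (arcstat only if your ZFS tools happen to ship it):

# refresh memory usage every second during the transfer
watch -n 1 free -m

# optionally watch ARC size and hit rates at the same time
arcstat 1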
 
Have you monitored free -m output on the host node whilst you're running one of the transfers that causes the issue?
I just did some of the offending transfers and kept refreshing free -m; free memory ranged between 3600 and 6600 MB.
 
Hi @homelabber ,


Can you start/boot your system and then start your VM, wait 5-10 minutes, and run your tests again as you described in this thread, but before/after each test run this as root:

arc_summary

Good luck!
 
Copying to the ZFS HDD mirror while Windows is also using it will always result in the symptoms you describe, because two HDD disks are simply too slow (IOPS-wise) for what you want them to do.

The only thing that might help is to disable syncing (ZIL) on the Windows HDD (storage:vm-102-disk-1), but ...
... listen carefully ...
disabling the ZIL means losing data (from RAM, not yet written to disk) in case of a power cut or system crash, while the data already on the disk will still be consistent.
Please read more about it here:
https://forum.proxmox.com/threads/p...-ssd-drives-sync-parameter.31130/#post-155543
and elsewhere on the net.

Also, you could use those SSD disks as a SLOG device (external ZIL) for your HDDs, and the OS's requests for synced writes will be faster.
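To make that concrete, a rough sketch of both options; the dataset name is guessed from the storage:vm-102-disk-1 entry in the VM config and the pool name is assumed to be "storage", so adjust to your actual layout:

# option 1: disable sync writes on the zvol backing the Windows data disk
zfs set sync=disabled storage/vm-102-disk-1
# revert later with
zfs set sync=standard storage/vm-102-disk-1

# option 2: add an SSD partition as SLOG (separate log device) to the HDD pool
zpool add storage log /dev/disk/by-id/<ssd-partition>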

Good luck!
 
... I wrote a reply but some security feature(?) deleted my post, and I do not have the time to write it again!

Anyway, this new forum skin is very bad on my smartphone and for my old eyes, and it is very time-consuming to write any post, at least for me! Most of the time I cannot see what I am writing (the interface jumps up and down - "hopa mitica" in my own language).

Good luck anyway!
 
@guletz - I have attached the results:
1) I shut down the server, started it up and waited about 8 minutes.
2) Ran arc_summary as root - file 1-afterreboot.txt
3) Copied a 5 GB ISO from the RAID 1 spinning-rust ZFS drives to the SSD, then renamed the ISO and copied it from the SSD ZFS back to the RAID 1 spinning rust - 2-afterfullcopy.txt
4) Finally I did one more rename and copy from SSD to spinning-rust ZFS, but I took the arc_summary near the end of the copy, when things slow down - 3-secondcopyduringtransfer.txt
 

Attachments

  • 1-afterreboot.txt (21.4 KB)
  • 2-afterfullcopy.txt (21.3 KB)
  • 3-secondcopyduringtransfer.txt (21.3 KB)
Copying to the ZFS HDD mirror while Windows is also using it will always result in the symptoms you describe, because two HDD disks are simply too slow (IOPS-wise) for what you want them to do.
[...]
Also, you could use those SSD disks as a SLOG device (external ZIL) for your HDDs, and the OS's requests for synced writes will be faster.

Is the reason it gets laggy that ZFS has more overhead / works differently? Because when I passed the drives through to Windows Server and formatted them NTFS there was no problem.

I have never set up a SLOG device before. I already use the two SSDs as a RAID 1 ZFS mirror for the Proxmox OS and VMs, so all of their space is allocated. Can I still use these SSDs as a SLOG device to improve write performance for the RAID 1 spinning-rust mirror, or do I need empty drives?

Thanks.
 
Hi,

Can you set up on the SSD pool

zfs_vdev_scheduler = none instead of noop, and repeat your tests again (without posting arc_summary)?
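In case it helps, setting that parameter usually looks something like this (this assumes ZFS on Linux 0.8.x, where zfs_vdev_scheduler still exists; it is deprecated in later releases):

# runtime, if the parameter is writable on your version
echo none > /sys/module/zfs/parameters/zfs_vdev_scheduler

# persistent across reboots
echo "options zfs zfs_vdev_scheduler=none" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k all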

Also, some other info would be useful:

zpool status -v

zpool get all ${pool-name}
- for both the SSD and the 7.2k pool
 
@guletz

I changed the scheduler to none only on my mirrored spinning drives and it had no effect on my problem. Do I need to do anything to force this change to take effect immediately, like restarting a service?

root@pve1:~# echo "none" > /sys/block/sdc/queue/scheduler
root@pve1:~# echo "none" > /sys/block/sdd/queue/scheduler
root@pve1:~# cat /sys/block/sdc/queue/scheduler
[none] mq-deadline
root@pve1:~# cat /sys/block/sdd/queue/scheduler
[none] mq-deadline
 
Disable the Discard option on your guest and activate the compression feature on ZFS. For the scheduler, do not change anything; that setting is for SSD disks, not for spinning disks. Also, if you use SATA, your limit is a queue depth of 32 commands at a time, that is all...

Everyone says ZFS needs one gigabyte of RAM per TB of data; it is a big lie. ZFS needs that RAM to manage data, not for its cache system, so limit your cache, otherwise ZFS will try to use all of your RAM for the level 1 cache. A ZFS LOG disk only helps if your kernel cannot give more RAM to ZFS, and a ZFS CACHE disk only helps with random IO under heavy workloads, so do not use one unless many, many people access your system at the same time...


So:

1. Disable discard in the guest config.
2. Enable compression on the ZFS pool; otherwise ZFS cannot deal with zeros efficiently, which means you cannot use your level 1 cache effectively. This is also a better solution than discard in the guest; use ZFS-level trim for your SSDs instead.
3. Limit your ARC usage (you can find how in the PVE docs); for 16 GB of RAM, 2 GB is enough.
4. Use the KSM system effectively, for example (see the ksmtuned sketch after this list):

KSM_NPAGES_MAX=10000
KSM_THRES_COEF=80

With these options KSM starts as soon as free memory drops below 80% and scans 10,000 memory pages each cycle. That costs CPU power, but the CPU is there for this type of job...

5. Use zram (VMware uses it too, so why don't you?).
6. Do not use a LOG or CACHE disk unless many, many people access your system at the same time...
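A sketch of where those two KSM values typically go on a PVE host, assuming KSM is managed by ksmtuned (which reads /etc/ksmtuned.conf):

# /etc/ksmtuned.conf (excerpt) -- more aggressive KSM settings
KSM_NPAGES_MAX=10000
KSM_THRES_COEF=80

# restart the daemon so the new values are picked up
systemctl restart ksmtuned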
 
One more thing I forgot to write in my last message: Proxmox uses the same swap area for all guests, so put your swap area on the SSD to handle unexpected RAM requests from guests, because KSM and memory ballooning are not gods; they cannot create more RAM for your system out of nothing...
 
@ertanerbek threw in good pointers, but be aware that they do not apply to everything that PVE has to offer:

1) Only applicable to KVM VMs, and it will waste storage space, so I would also suggest trimming e.g. daily rather than on every delete.
2) ZFS compression is always a good solution and is the default if PVE creates your pool.
3) Depends on your problem. A better way is to control what is cached and what is not (e.g. via the primarycache ZFS attribute). Optimizing ZFS for the best cache options is hard work and extremely problem-oriented; a database needs different settings than message log files, and other settings than the general OS.
4) Also only applicable to KVM VMs, due to a special syscall that only KVM/QEMU uses.
5) Always a good suggestion; zram is extremely useful. I replaced all physical swap disks/partitions with zram on my systems.
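For item 5, a minimal manual zram swap sketch; the size and compression algorithm are only examples, and packaged solutions such as zram-tools do the same thing via a config file:

# load the zram module with a single device
modprobe zram num_devices=1

# choose compressor and size (order matters: algorithm before disksize)
echo lz4 > /sys/block/zram0/comp_algorithm
echo 8G > /sys/block/zram0/disksize

# format and enable it as high-priority swap
mkswap /dev/zram0
swapon -p 100 /dev/zram0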

Just FYI: there is a compressed ARC for OpenZFS, but it is currently not merged into ZoL, so we have to wait a bit more, but it'll be huge!

6. Do not use a LOG or CACHE disk unless many, many people access your system at the same time...

SLOG will improve performance for any sync writes, if you have them. It depends heavily on your problem.

Proxmox uses the same swap area for all guests

Applies only to containers, not KVM VMs
 
SLOG is for fsync. Another problem: if you do not have an SLC or MLC based SSD, your SSD cannot give you enough IO. You can test that: please change the sync option on your pool from standard to always. You can also change the primary and secondary cache from all to metadata; then you will see that without the ZIL, ZFS cannot work, the SLOG cannot handle the incoming transactions, and a TLC, 3D-TLC or QLC based SSD cannot deliver enough IO...
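Roughly, that test would look like this; the pool name is a placeholder, and both properties should be reverted afterwards:

# force every write through the ZIL/SLOG path
zfs set sync=always <pool>

# cache only metadata so data reads hit the disks
zfs set primarycache=metadata <pool>
zfs set secondarycache=metadata <pool>

# revert when the test is done
zfs set sync=standard <pool>
zfs set primarycache=all <pool>
zfs set secondarycache=all <pool>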


VMware creates one swap file for each guest and lets you select where to store it (if you do not select a special area, it uses the same storage area as the guest). On the KVM side, Linux uses only one swap area for everything. If your guests start using all of their promised memory (meaning memory ballooning is useless) and KSM cannot find identical memory pages in physical RAM (meaning all guests are doing unique work), then you will see where KVM stores guest memory.
 
On the KVM side, Linux uses only one swap area for everything.

You cannot compare VMware internals with Linux with respect to swap. On Linux, swap does not discriminate; it swaps out whatever is necessary and has no idea whether it belongs to a VM, a container, or any other process for that matter. In general, the best advice is to build your system so that it does not require swap. If it swaps a lot, you did something wrong with your memory planning and your machine will get slow, really slow, depending on your storage backend. This is also true for guests.
 
