ZFS IO issues/limitations inside VMs

Jun 8, 2018
Hello guys,

I hope you can help me with this, since I need to go into production soon.

tl;dr:
Write speed on both ZFS pools is fine on the Proxmox host, but inside VMs there are major issues with "larger writes".
What is the correct disk configuration for VMs on a ZFS pool, for Windows 10 and Linux?


I recently built the first of our two future servers with ZFS:
https://forum.proxmox.com/threads/server-config-zfs-encryption-cpu-load-questions.71620/#post-323204

Some VMs are installed and working fine; the plan was to test the IO of the HDD mirror (for storage applications) and then order a second server.

When copying a lot of files onto a VM (tested with Linux and Windows 10), the copy speed drops to 0 after a few seconds.
The same goes for copying files inside a VM.
I then found this: https://dannyda.com/2020/05/24/how-to-fix-proxmox-ve-zfs-pool-extremely-slow-write/
and added a SLOG via USB 3 (~400 MB/s writes, tested), which only delayed the issue by a few seconds.

I went on to test write performance on the ZFS mirror with SSDs:

On the host (/hddmirror/encrypted/backup or /vmdata/encrypted) there is no issue; I can write at 80-100 MB/s+ (gigabit is the limit).
On a VM disk located on the ZFS SSD mirror, writes drop to 50 MB/s and I notice short-term CPU load spikes on the host.

When copying the 17 GB file inside the Linux VM via cp, the VM becomes unresponsive and takes forever.
A quick look with iotop reveals that the copy process starts at 300 MB/s and then drops to 0 while IO % stays at 99.99.
The copy resumes at some two-digit MB/s figure and drops back to 0 repeatedly. CPU load on the host sometimes rises to 70% with 16 cores.
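
For reference, a write test of the kind described above could look like this (a sketch with assumed paths, not the exact commands from my tests):

# on the host, writing into the dataset (note: /dev/zero compresses away with compression=on,
# and /dev/urandom can itself be a bottleneck, so treat this only as a rough check)
dd if=/dev/urandom of=/vmdata/encrypted/ddtest bs=1M count=4096 conv=fsync
# the same test inside the guest, against its own filesystem
dd if=/dev/urandom of=/root/ddtest bs=1M count=4096 conv=fsync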


Did I miss some major ZFS rule like "only use disk type X on a ZFS pool"?
I hoped I could blame the ZIL and SLOG, but as mentioned I can write fine on the host; it never acts up.
I will dump configuration info in the next post.

Thank you for your assistance!
 
root@pve-d:/etc/pve/local/qemu-server# zpool status
  pool: hddmirror
 state: ONLINE
  scan: scrub repaired 0B in 0 days 04:31:55 with 0 errors on Sun Oct 11 04:55:56 2020
config:

        NAME                                   STATE     READ WRITE CKSUM
        hddmirror                              ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-HGST_HUS726T6TALE6L4_V8KAK51R  ONLINE       0     0     0
            ata-HGST_HUS726T6TALE6L4_V8KA8PBR  ONLINE       0     0     0
        logs
          sdi                                  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:04:57 with 0 errors on Sun Oct 11 00:28:59 2020
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-Micron_5300_MTFDDAK480TDS_20102717FC7F-part3  ONLINE       0     0     0
            ata-Micron_5300_MTFDDAK480TDS_201027180B3F-part3  ONLINE       0     0     0

errors: No known data errors

  pool: vmdata
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:22:14 with 0 errors on Sun Oct 11 00:46:17 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        vmdata                                          ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF848601QD960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF907200UN960CGN  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF9072018V960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF907300R7960CGN  ONLINE       0     0     0




zfs list
NAME USED AVAIL REFER MOUNTPOINT
hddmirror 2.85T 2.43T 96K /hddmirror
hddmirror/data 40.1G 2.43T 40.1G /mnt/hdd
hddmirror/encrypted 2.81T 2.43T 200K /hddmirror/encrypted
hddmirror/encrypted/backup 69.0G 2.43T 69.0G /hddmirror/encrypted/backup
hddmirror/encrypted/vm-100-disk-0 3.13G 2.43T 3.13G -
hddmirror/encrypted/vm-100-disk-1 2.74T 2.43T 2.74T -
hddmirror/encrypted/vm-223-disk-1 33.4M 2.43T 33.4M -
rpool 111G 319G 104K /rpool
rpool/ROOT 111G 319G 96K /rpool/ROOT
rpool/ROOT/pve-1 111G 319G 111G /
rpool/data 96K 319G 96K /rpool/data
vmdata 512G 1.18T 96K /vmdata
vmdata/encrypted 512G 1.18T 16.1G /vmdata/encrypted
vmdata/encrypted/vm-222-disk-0 43.8G 1.18T 43.8G -
vmdata/encrypted/vm-223-disk-0 17.6G 1.18T 17.6G -
vmdata/encrypted/vm-223-disk-1 88K 1.18T 88K -
vmdata/encrypted/vm-223-disk-2 20.1G 1.18T 20.1G -
vmdata/encrypted/vm-308-disk-0 13.6G 1.18T 13.6G -
vmdata/encrypted/vm-309-disk-0 239G 1.18T 229G -
vmdata/encrypted/vm-309-state-b4_password_env 9.41G 1.18T 9.41G -
vmdata/encrypted/vm-310-disk-0 6.10G 1.18T 6.10G -
vmdata/encrypted/vm-311-disk-0 7.29G 1.18T 5.06G -
vmdata/encrypted/vm-311-state-snappy1_0 1.30G 1.18T 1.30G -
vmdata/encrypted/vm-701-disk-0 21.3G 1.18T 21.3G -
vmdata/encrypted/vm-703-disk-0 20.8G 1.18T 20.8G -
vmdata/encrypted/vm-801-disk-0 41.6G 1.18T 41.6G -
vmdata/encrypted/vm-802-disk-0 5.90G 1.18T 5.90G -
vmdata/encrypted/vm-802-disk-1 34.0G 1.18T 34.0G -
vmdata/encrypted/vm-903-disk-0 9.42G 1.18T 9.42G -
vmdata/encrypted/vm-991-disk-0 4.26G 1.18T 4.26G -


Dummy Linux VM:

root@pve-d:/etc/pve/local/qemu-server# cat 222.conf
balloon: 5120
bootdisk: scsi0
cores: 8
cpu: host,flags=+aes
memory: 8032
name: linux-dummy
net0: virtio=7A:A0:FC:DA:9E:10,bridge=vmbr0
numa: 0
ostype: l26
scsi0: vmdata:vm-222-disk-0,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=e644db7d-8d8f-4cb8-84f3-be7083c4c45b
sockets: 1
vmgenid: 00e82f9c-4cd1-4a8c-93e6-d9ea445a3f26

zfs get:

vmdata/encrypted encryption aes-256-ccm -
vmdata/encrypted compression on local
 
Hi,

Can you show the output from (run on the PVE host):

arcstat

I guess that you use the default 8K volblocksize; it would also be interesting to see:


zfs get all ${your_zfs_pool_path-to-your-VM}

... and your hardware (especially controller details: is it hardware RAID, JBOD, ...?)

Good luck / Bafta!
 
It's the onboard SATA connectors of this board:
Supermicro mainboard H11SSL-i
AMD EPYC 7282 (2.80 GHz, 16-core, 64 MB)
128 GB (4x 32 GB) ECC Reg DDR4 RAM, 2 Rank
4x 960 GB SATA III Intel SSD 3D-NAND TLC 2.5" (D3 S4510) => RAID-Z1 vmdata
2x 6 TB SATA III Western Digital Ultrastar 3.5" 7.2k (512e) => ZFS mirror for vzdumps and data storage


Edit: added arcstat output.
 

Attachments

  • zfs get all.txt
  • arc.txt
root@pve-d:/etc/pve/local/qemu-server# zfs get all vmdata/encrypted/vm-223-disk-0
NAME PROPERTY VALUE SOURCE
vmdata/encrypted/vm-223-disk-0 type volume -
vmdata/encrypted/vm-223-disk-0 creation Thu Sep 3 11:21 2020 -
vmdata/encrypted/vm-223-disk-0 used 20.2G -
vmdata/encrypted/vm-223-disk-0 available 1.15T -
vmdata/encrypted/vm-223-disk-0 referenced 20.2G -
vmdata/encrypted/vm-223-disk-0 compressratio 1.16x -
vmdata/encrypted/vm-223-disk-0 reservation none default
vmdata/encrypted/vm-223-disk-0 volsize 50G local
vmdata/encrypted/vm-223-disk-0 volblocksize 8K default
vmdata/encrypted/vm-223-disk-0 checksum on default
vmdata/encrypted/vm-223-disk-0 compression on inherited from vmdata/encrypted
vmdata/encrypted/vm-223-disk-0 readonly off default
vmdata/encrypted/vm-223-disk-0 createtxg 1076176 -
vmdata/encrypted/vm-223-disk-0 copies 1 default
vmdata/encrypted/vm-223-disk-0 refreservation none default
vmdata/encrypted/vm-223-disk-0 guid 1271442347337907345 -
vmdata/encrypted/vm-223-disk-0 primarycache all default
vmdata/encrypted/vm-223-disk-0 secondarycache all default
vmdata/encrypted/vm-223-disk-0 usedbysnapshots 0B -
vmdata/encrypted/vm-223-disk-0 usedbydataset 20.2G -
vmdata/encrypted/vm-223-disk-0 usedbychildren 0B -
vmdata/encrypted/vm-223-disk-0 usedbyrefreservation 0B -
vmdata/encrypted/vm-223-disk-0 logbias latency default
vmdata/encrypted/vm-223-disk-0 objsetid 3205 -
vmdata/encrypted/vm-223-disk-0 dedup off default
vmdata/encrypted/vm-223-disk-0 mlslabel none default
vmdata/encrypted/vm-223-disk-0 sync standard default
vmdata/encrypted/vm-223-disk-0 refcompressratio 1.16x -
vmdata/encrypted/vm-223-disk-0 written 20.2G -
vmdata/encrypted/vm-223-disk-0 logicalused 23.3G -
vmdata/encrypted/vm-223-disk-0 logicalreferenced 23.3G -
vmdata/encrypted/vm-223-disk-0 volmode default default
vmdata/encrypted/vm-223-disk-0 snapshot_limit none default
vmdata/encrypted/vm-223-disk-0 snapshot_count none default
vmdata/encrypted/vm-223-disk-0 snapdev hidden default
vmdata/encrypted/vm-223-disk-0 context none default
vmdata/encrypted/vm-223-disk-0 fscontext none default
vmdata/encrypted/vm-223-disk-0 defcontext none default
vmdata/encrypted/vm-223-disk-0 rootcontext none default
vmdata/encrypted/vm-223-disk-0 redundant_metadata all default
vmdata/encrypted/vm-223-disk-0 encryption aes-256-ccm -
vmdata/encrypted/vm-223-disk-0 keylocation none default
vmdata/encrypted/vm-223-disk-0 keyformat passphrase -
vmdata/encrypted/vm-223-disk-0 pbkdf2iters 342K -
vmdata/encrypted/vm-223-disk-0 encryptionroot vmdata/encrypted -
vmdata/encrypted/vm-223-disk-0 keystatus available -
root@pve-d:/etc/pve/local/qemu-server#
 
This is the behaviour on this disk, as an example, in Windows 10.
Thanks for your help btw!
Edit: it writes ~5 GB at 100 MB/s, then drops to 0, then continues writing.

[Screenshot: writing to ssd.png]
 

Attachments

  • win10 disk223-1.txt
Hi,

OK, let's focus only on the Linux guest for now, if you agree!

- Because you use the default 8K volblocksize ("vmdata/encrypted/vm-223-disk-0 volblocksize 8K default"), the filesystem inside the VM should use a matching block size.
- In the case of ext4, you should create the ext4 filesystem with "-b 4096 -E stripe-width=2".
- Because compression on an encrypted ZFS dataset gains very little ("compressratio 1.16x"), I would disable compression.
- You also need to run from crontab (daily):

fstrim -a

- At the zpool level you also need atime=off, and likewise inside your Linux VM (in /etc/fstab); see the command sketch below.
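
A minimal command sketch of the steps above (device names and the dataset path are assumptions, adjust to your setup):

# inside the Linux VM: create ext4 with 4K blocks and the suggested stripe-width
# (assuming the virtual disk shows up as /dev/sdb)
mkfs.ext4 -b 4096 -E stripe-width=2 /dev/sdb
# mount it with atime disabled, e.g. in /etc/fstab:
# /dev/sdb   /data   ext4   defaults,noatime,discard   0 2

# inside the VM: trim daily from cron, e.g. in root's crontab:
# 0 3 * * * /sbin/fstrim -a

# on the Proxmox host: disable atime and compression on the dataset holding the VM disks
zfs set atime=off vmdata/encrypted
zfs set compression=off vmdata/encrypted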

Good luck / Bafta !
 
Thanks, I will try to add a second volume and observe.

Btw: I could still redo the pools, if it saves me the trouble of altering many VMs.
Especially because it is planned to transfer existing VMs from other Proxmox hosts.
Adding to this: I am running Cisco VMs there, where I cannot alter the block size.
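
For reference, if the pools/storage get redone, the default volblocksize for newly created disks can be set on the Proxmox storage definition; existing zvols keep their blocksize. A sketch, assuming a zfspool storage named "vmdata" pointing at the encrypted dataset:

# /etc/pve/storage.cfg (sketch)
zfspool: vmdata
        pool vmdata/encrypted
        content images,rootdir
        blocksize 16k

# or via the CLI:
pvesm set vmdata --blocksize 16k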
 
After I found some time for testing, I followed your tip and used .raw disks on the directory storage with discard=on (a sketch of that setup follows below).

Inside the Windows VM I can now reach proper speeds (200 MB/s+ file copies etc.).

@guletz apologies, I don't have the time right now to read more into ZFS, so may I ask you:

Is it generally wrong to have all VMs use .raw disks inside a directory on a ZFS pool?
What would be the implication? I will lose ZFS features like snapshots and deduplication (I guess).
What about compression?
I might just do this for IO-intensive VMs?

I might skip the ZFS extra features and just use it as a "software RAID" for now?

Thank you very much for your input!

Edit: also, browsing the forum for this, it seems qcow2 on a ZFS dir isn't an option.
Edit 2: even if it's not optimal and my issues could be fixed, I would use it as a start, since I need to know soon whether I need to order a RAID controller with the second server.
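
For reference, a sketch of what such a directory-storage setup can look like (storage name, path, and VM ID here are placeholders, not my real values):

# /etc/pve/storage.cfg: a directory storage on top of a ZFS dataset (sketch)
dir: vmdir
        path /vmdata/encrypted/images
        content images

# allocate a new 32G raw disk for a VM (hypothetical VM ID 222)
qm set 222 --scsi1 vmdir:32,format=raw,discard=on,ssd=1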
 
If you want performance with the current setup and do not need (live) VM migration, quick migration, or replication, just drop ZFS and go with LVM-thin. It is much, much faster.
Otherwise, you can look into ZFS a bit more. Maybe add a mirrored special vdev and a striped log device via a PCIe U.2 or M.2 pair of disks.
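
A rough sketch of what that could look like (device and partition names are placeholders; both NVMe disks would be partitioned so the special vdev is mirrored and the small log vdev is striped across them):

# hypothetical devices nvme0n1 / nvme1n1, each with a large partition (p1) and a small one (p2)
zpool add vmdata special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
zpool add vmdata log /dev/nvme0n1p2 /dev/nvme1n1p2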
 
If you want performance with the current setup and do not need (live) VM migration, quick migration, or replication, just drop ZFS and go with LVM-thin. It is much, much faster.
Otherwise, you can look into ZFS a bit more. Maybe add a mirrored special vdev and a striped log device via a PCIe U.2 or M.2 pair of disks.
Thanks for your reply! The thing is, I don't have a RAID controller and I need encryption, which is why I use ZFS.
True, I could drop it, get a RAID controller and use LVM-thin.
The idea was to skip the expensive RAID controller by using some RAM ^^
 
RAID controllers are not expensive nowadays. Well, at least not the ones I buy, on the second-hand market.
 
What about compression?
Compression on encrypted data does not give any significant benefit!

I might just do this for IO-intensive VMs?
As you like.
Edit: also, browsing the forum for this, it seems qcow2 on a ZFS dir isn't an option.

Indeed!

Even if it's not optimal and my issues could be fixed, I would use it as a start

At any time you could change from raw to any other format!
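
For example, an offline conversion with qemu-img could look like this (file names are placeholders; the target would then live on a non-ZFS directory storage, since qcow2 on a ZFS dir isn't an option, as noted above):

qemu-img convert -p -f raw -O qcow2 vm-222-disk-1.raw vm-222-disk-1.qcow2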


Good luck!
 
In a Linux VM, using a .raw disk on a dir storage with discard on and mkfs.ext4 -b 4096 -E stripe-width=2, I just copied a 17 GB file from the HDD mirror to itself at 100 MB/s.
From there to a .raw disk on the SSD mirror, the copy ran at 270 MB/s (timed the copy).


Maybe I'm fine with this!
If I understood correctly, without a SLOG I'm writing everything twice to my SSD mirror, so ~200 MB/s should be good?
 
