Diagnose ZFS Slow Random 4k Read/Write Speed, nvme SSD, RAID-Z2

jena

Hi all,

PVE: Latest 6.4-5 (PVE 5.4.106-1)
Total system RAM is 256GB
SSD is WD SN750 2T x 6 in RAID-Z2, compression=lz4

VM : cache=writeback discard=on, size=501G, ssd=1 (see attached code)

In the VM, CrystalDiskMark's RND 4K results and IOPS are much lower than those of an individual drive in a bare-metal system.
Other real-world usage confirms that 4K random read/write is really bad (program completion time is 5 times longer than on an individual drive).

How can I diagnose?
Thank you very much!

in Windows VM, RAID-Z2
pve6.4-5_vm_ssd_speed.PNG

Individual drive
WD750_2TB_nvmeSSD_CrystalDiskMark_p02.png

Code:
zfs get compression nvmepool
NAME      PROPERTY     VALUE           SOURCE
nvmepool  compression  lz4             local

zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
hddpool      892G  10.0T      1     13  32.3K   236K
nvmepool     473G  10.4T    698  2.91K  2.27M  54.8M
rpool       3.23G   925G      1     40  28.6K   690K
----------  -----  -----  -----  -----  -----  -----

VM
Code:
agent: 1
args: -machine type=q35,kernel_irqchip=on
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 16
cpu: host,hidden=1,flags=-pcid;+ibpb,hv-vendor-id=whatever
efidisk0: local-nvme:vm-500-disk-1,size=1M
hostpci0: 21:00,pcie=1,romfile=rtx2080ti.rom,x-vga=1
ide2: none,media=cdrom
machine: q35
memory: 32768
name: XXXXX
net0: virtio=XXXX,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
parent: Basic
scsi0: local-nvme:vm-500-disk-0,cache=writeback,discard=on,size=501G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXXXXX
sockets: 1
usb0: host=11-2,usb3=1
usb1: host=11-1.2,usb3=1
vga: none
vmgenid: XXXXXXX
 
First, you are using consumer SSDs. To quote the staff's ZFS NVMe benchmark paper:
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSD?
No. Never. These SSDs wont provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
Consumer SSDs are mostly really bad at sync writes and at continuous random 4K writes.

The second point is that you use raidz2. Raidz has a lot of overhead because of all the additional parity calculations, and write speed doesn't scale with the number of drives the way read speed does. If you want to use that pool as VM storage or to store DBs (which I think is your plan if you paid more for faster NVMe SSDs instead of just using SAS/SATA), a striped mirror (raid10) is always recommended. If you want more performance you could stripe 3 mirrors of 2 SSDs. If you want more reliability you could stripe 2 mirrors of 3 SSDs (that way is even more secure than raidz2 but with half of the capacity).
Moreover, ZFS and virtualization create additional write amplification. If you have high write amplification and, for example, your writes are amplified by a factor of 10, you only get 1/10 of the performance.

Did you increase the volblocksize before creating your first zvol? If you are using an ashift of 12 and the default block size of 8K rather than 32K or more, you are wasting a lot of capacity due to bad padding, which also causes write amplification.

And I think cache=none would be more useful. Your ARC is already caching in RAM. As far as I know, with writeback you force Linux to cache again in the host page cache, so everything is cached twice.
Disabling atime for the pool could also help a bit.
And a pool should never be filled up more than 80% or ZFS will get slow.
Same with the SSDs. The fast write speeds are achieved by running TLC cells in SLC mode for caching. The more you store on the SSDs, the less space is unused and the smaller your SLC cache gets. So a full SSD will be way slower than an empty one.
And are you using heatsinks for your SSDs? If not, M.2 SSDs easily get over 70-80 °C under heavy usage and start to thermal throttle.

And lastly, CrystalDiskMark isn't really useful for comparison. You are just measuring your RAM and not your SSDs. You should try fio with caching disabled (you will find an example in the linked paper).
 
If you want more reliability you could stripe 2 mirrors of 3 SSDs
Like stripe Disk 1 2 3 <-mirror-> stripe Disk 4 5 6?

Did you increase the volblocksize before creating your first zvol? If you are using an ashift of 12 and the default block size of 8K rather than 32K or more, you are wasting a lot of capacity due to bad padding, which also causes write amplification.
I probably didn't (I used all defaults), is there a way to check it now?

And I think cache=none would be more useful. Your ARC is already caching in RAM. As far as I know, with writeback you force Linux to cache again in the host page cache, so everything is cached twice.
Disabling atime for the pool could also help a bit.

And a pool should never be filled up more than 80% or ZFS will get slow.
Same with the SSDs. The fast write speeds are achieved by running TLC cells in SLC mode for caching. The more you store on the SSDs, the less space is unused and the smaller your SLC cache gets. So a full SSD will be way slower than an empty one.
And are you using heatsinks for your SSDs? If not, M.2 SSDs easily get over 70-80 °C under heavy usage and start to thermal throttle.
Yes, these I know. My total capacity utilization is 11%.
Four of the SSDs are in an Asus quad card with a fan, at 35 °C. The other two are under the motherboard heatsink, which is a bit warmer at around 55 °C.
And lastly, CrystalDiskMark isn't really useful for comparison. You are just measuring your RAM and not your SSDs. You should try fio with caching disabled (you will find an example in the linked paper).
I will try that. Thanks!

Thank you very much for this detailed suggestion.
 
Like stripe Disk 1 2 3 <-mirror-> stripe Disk 4 5 6?
No. A normal raid10 with 6 drives would look like: ( (A mirror B) stripe (C mirror D) stripe (E mirror F) )
In that case you could use 6TB and you would get 3x write speed, but in the worst case you lose all data as soon as the second drive dies. In the best case it could survive 3 failing drives, but as soon as both drives of the same mirror die, all is lost.

What I mean is: ( (A mirror B mirror C) stripe (D mirror E mirror F) )
With that you only get 4TB of capacity and 2x write speed, but it survives any 2 failing drives and, in the best case, up to 4. So you get the same worst case of 2 failing disks as with raidz2 but way better performance, because simple mirroring doesn't need the complex parity calculations that keep your CPU busy and slow everything down. And you get more write performance because of the striping, and less write amplification because you don't need to increase the volblocksize that much.
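
To put the three layouts side by side, here is a rough Python sketch of the numbers above. It is only my own illustration: the `Layout` class and both helper functions are made-up names, the "write scale" is just the rule of thumb of one unit per top-level vdev, and raidz capacity ignores padding overhead.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    name: str
    usable_tb: float     # capacity left after redundancy (padding ignored)
    any_failures: int    # drives that can ALWAYS fail without data loss
    best_failures: int   # drives that can fail if the "right" ones die
    write_scale: int     # rule of thumb: one unit per top-level vdev


def striped_mirrors(vdevs: int, way: int, disk_tb: float) -> Layout:
    """`vdevs` mirrors striped together, each mirror built from `way` drives."""
    return Layout(
        name=f"{vdevs} x {way}-way mirror",
        usable_tb=vdevs * disk_tb,
        any_failures=way - 1,            # one mirror losing all drives kills the pool
        best_failures=vdevs * (way - 1), # every mirror keeps exactly one drive
        write_scale=vdevs,
    )


def raidz(ndisks: int, parity: int, disk_tb: float) -> Layout:
    """A single raidz vdev with `parity` parity drives."""
    return Layout(
        name=f"raidz{parity} of {ndisks}",
        usable_tb=(ndisks - parity) * disk_tb,
        any_failures=parity,
        best_failures=parity,
        write_scale=1,                   # assumption: one vdev, so no striping gain
    )


layouts = [
    striped_mirrors(3, 2, 2.0),  # normal raid10 with 6 drives
    striped_mirrors(2, 3, 2.0),  # two striped triple mirrors
    raidz(6, 2, 2.0),            # the current pool
]

for lay in layouts:
    print(f"{lay.name:18} {lay.usable_tb:4.0f} TB  "
          f"survives any {lay.any_failures}, up to {lay.best_failures}  "
          f"write ~{lay.write_scale}x")
```

With 2TB drives this gives 6TB for the 3 x 2 layout, 4TB for the 2 x 3 layout and 8TB for raidz2, matching the figures discussed above.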

A striped mirror also has the benefit that you can add more drives later, which isn't possible with raidz2.

If I remember right, the Proxmox installer doesn't support triple mirroring, which matters if you also want to boot from that pool.
If it is just a secondary pool that you don't need to boot from, it is easy to create using the CLI:
Code:
zpool create YourPoolName mirror /dev/disk/by-id/A /dev/disk/by-id/B /dev/disk/by-id/C mirror /dev/disk/by-id/D /dev/disk/by-id/E /dev/disk/by-id/F

I probably didn't (I used all defaults), is there a way to check it now?
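Datacenter -> Storage -> Your ZFS storage -> Block size
You can change the block size there, but that won't help with the virtual disks you have already created, because the volblocksize can only be set when a zvol is created. To see what an existing virtual disk actually uses, you can run zfs get volblocksize on its zvol. If you want to change the volblocksize of your existing virtual disks you need to back them up, destroy them and recreate them.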

Edit:
Here is a table for the volblocksize. On paper, a 16K volblocksize would be the optimum (33% capacity loss) for an ashift=12 raidz2 pool consisting of 6 disks. With 8K you lose 67% of capacity, but you can't see it directly: the pool size stays the same, but with a volblocksize of 8K everything you store on a virtual disk should consume 200% of the space. Keep in mind that the table doesn't take compression into account, so real-life values may differ.
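
If you want to reproduce those percentages yourself, the following Python sketch shows the arithmetic as I understand it. The function names are my own, and the rounding rule (data sectors plus parity per stripe row, rounded up to a multiple of parity + 1 sectors) is my reading of how raidz allocates space, so treat it as an approximation. Like the table, it ignores compression.

```python
import math


def raidz_alloc_sectors(volblocksize, ashift=12, ndisks=6, parity=2):
    """Sectors allocated on a raidz vdev for one block of `volblocksize` bytes."""
    sector = 1 << ashift                        # ashift=12 means 4K sectors
    data = math.ceil(volblocksize / sector)     # data sectors needed
    rows = math.ceil(data / (ndisks - parity))  # stripe rows the data spans
    total = data + rows * parity                # each row carries its own parity
    pad_unit = parity + 1                       # allocations are padded to this multiple
    return math.ceil(total / pad_unit) * pad_unit


def capacity_loss(volblocksize, ashift=12, ndisks=6, parity=2):
    """Fraction of raw space that does not hold user data (parity + padding)."""
    allocated = raidz_alloc_sectors(volblocksize, ashift, ndisks, parity) << ashift
    return 1 - volblocksize / allocated


def space_vs_ideal(volblocksize, ashift=12, ndisks=6, parity=2):
    """Space consumed relative to a padding-free raidz (1.0 means 100%)."""
    ideal_loss = parity / ndisks
    return (1 - ideal_loss) / (1 - capacity_loss(volblocksize, ashift, ndisks, parity))


for kib in (4, 8, 16, 32, 64, 128):
    vbs = kib * 1024
    print(f"{kib:>3}K  loss {capacity_loss(vbs):.0%}  "
          f"uses {space_vs_ideal(vbs):.0%} of expected space")
```

For the 6-disk raidz2 with ashift=12 this gives 67% loss and 200% space usage at 8K, and 33% loss at 16K, which agrees with the table values quoted above.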
 