ZFS disk benchmark causes crash

badincite

New Member
Aug 3, 2022
New to Proxmox, doing some testing on an old PC before I move my ESXi server over. When running a benchmark on a 4-drive RAIDZ I'm seeing really high speeds, so I assume it's using the RAM as a cache. However, every time I run a benchmark on the ZFS disk it crashes. What cache setting should I use for primarily read operations? My server volume will be used for Plex. I'm assuming my crashing issues are just due to the test hardware.

Current Test hardware
CPU: i5-4570
MEM: 16GB

Current Server Hardware
MB: EP2C602-4L/D16
CPU: Dual E5-2680's
GPU: GTX 1060 Passthrough to VM
MEM: 128GB ECC
HBA: Dell PERC H200 ~ I will be flashing this to IT mode
 
The whole system locks up; I guess it's resetting, since there is no text in the web console. The VM is shut down when it comes back. I was using CrystalDiskMark and ATTO Disk Benchmark.
 
Look for OOM in the syslog: cat /var/log/syslog | grep OOM to see if you are running out of RAM. ZFS by default will use up to 64GB for caching in your case. And if you use a cache mode other than "none", there will be additional RAM used by KVM for caching. So in case you gave your VMs/LXCs something like 64GB of RAM, you could simply run out of RAM and the OOM killer kills the VM to free some RAM.
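A quick way to check both things (OOM kills and how big the ARC actually is) could look like this; a minimal sketch, assuming a stock PVE install (journalctl is just an alternative in case the syslog file has already rotated):

Code:
# look for out-of-memory kills in the kernel log
journalctl -k | grep -i "out of memory"
# current ARC size vs. its limits, in bytes (size = current, c = target, c_max = hard limit)
grep -E "^(size|c|c_max) " /proc/spl/kstat/zfs/arcstats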
 
I seem to have fixed it. I recreated the Windows 10 VM and used the recommended settings; I think that was causing the issue.

Does no-cache mean no RAM cache? I'm wondering how I'm getting these speeds with RAIDZ and 4 spinning drives. It seems like no-cache gives better speeds.

NoCache.PNG

WriteThough.PNG
 
ZFS will already cache in RAM when cache mode "none" is selected (read about the "ARC"). If you select anything else, like "writethrough" or "writeback", KVM will additionally cache in the host's RAM, so everything will be double cached in RAM. And then your guest OS will cache in its virtual RAM too. So you might cache the same data 2 or 3 times, wasting memory bandwidth and RAM capacity. That's why cache mode "none" is recommended when using ZFS.
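For reference, the cache mode can also be changed on an existing VM disk from the CLI; a minimal sketch, assuming the disk is scsi0 on a storage called local-zfs (check qm config <vmid> for the real names):

Code:
# show the current disk line, e.g. "scsi0: local-zfs:vm-100-disk-0,size=32G"
qm config 100 | grep scsi0
# re-attach it with cache mode none (re-add any other options the line had)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none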

@Neobin:
So according to your link, consumer SSDs and HDDs with cache mode "none" would lose async writes on a power outage, while enterprise SSDs should be safe, as their physical cache is kind of persistent because of PLP. So in theory it would be safer to use cache mode "writethrough" with consumer SSDs and HDDs, as the volatile RAM cache will be skipped?
Such a diagram including ZFS and its caches would also be interesting.
 
Thanks, I get it now; I didn't see that no-cache is recommended for ZFS. Since ZFS caches in RAM, is a zram disk even beneficial for something like a transcoding drive for Plex?

I have 3 8TB Western Digital Golds on the way to add to the 2 8TB Red Pros I was planning on putting into a RAIDZ array. I was just looking into how to limit the ARC on my test hardware, since I only have 16GB of memory. I'm also noticing the speed dropping and pausing when transferring files on my test RAIDZ 4x1TB array.
 
A rule of thumb would be 2-4GB RAM + 0.25-1GB RAM per 1TB of raw capacity of your disks. And if you want to use deduplication, it's another 5GB RAM per 1TB of deduplicated capacity. So with 5x 8TB you've got 40TB of raw storage, which would mean something between 12 and 44GB of RAM for your ARC would be good to have. That 16GB isn't really much, as PVE itself already needs 2GB.
By default ZFS will use up to 50% of your total RAM for its ARC, so that should be up to 8GB right now, which is already very low for all those disks.
You can use arc_summary to see some ARC statistics. If you shrink your ARC too much, things like hit rates will get lower and the pool becomes very slow.
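If you do want to cap the ARC on the 16GB test box, a minimal sketch (the 4GiB value is only an example; sizes are in bytes):

Code:
# persist the limit across reboots (note: this overwrites an existing zfs.conf)
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all
# apply the same limit on the fly without rebooting
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max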
 
Thanks, yeah, I just set this up to play with it. I wanted to get familiar with it before transferring my ESXi server, which has 128GB of RAM.
 
Doesn't look like a RAM issue. Any idea what could be causing the transfer speed to drop off like this?

1660074416473.png

1660074296587.png


Code:
ZFS Subsystem Report                            Tue Aug 09 15:45:04 2022
Linux 5.15.30-2-pve                                           2.1.4-pve1
Machine: proxmox (x86_64)                                     2.1.4-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    30.3 %    2.3 GiB
        Target size (adaptive):                        30.8 %    2.4 GiB
        Min size (hard limit):                          6.2 %  490.3 MiB
        Max size (high water):                           16:1    7.7 GiB
        Most Frequently Used (MFU) cache size:          1.9 %   12.2 MiB
        Most Recently Used (MRU) cache size:           98.1 %  625.9 MiB
        Metadata cache size (hard limit):              75.0 %    5.7 GiB
        Metadata cache size (current):                  3.5 %  206.1 MiB
        Dnode cache size (hard limit):                 10.0 %  588.3 MiB
        Dnode cache size (current):                   < 0.1 %  183.1 KiB

ARC hash breakdown:
        Elements max:                                              72.0k
        Elements current:                             100.0 %      72.0k
        Collisions:                                                 1.2k
        Chain max:                                                     2
        Chains:                                                     1.2k

ARC misc:
        Deleted:                                                      18
        Mutex misses:                                                  0
        Eviction skips:                                                1
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                 326.5 KiB
        L2 eligible MFU evictions:                     14.7 %   48.0 KiB
        L2 eligible MRU evictions:                     85.3 %  278.5 KiB
        L2 ineligible evictions:                                 4.0 KiB

ARC total accesses (hits + misses):                                10.4k
        Cache hit ratio:                               94.9 %       9.9k
        Cache miss ratio:                               5.1 %        530
        Actual hit ratio (MFU + MRU hits):             94.8 %       9.9k
        Data demand efficiency:                        87.9 %       2.2k
        Data prefetch efficiency:                       5.0 %        141

Cache hits by cache type:
        Most frequently used (MFU):                    49.4 %       4.9k
        Most recently used (MRU):                      50.5 %       5.0k
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0
        Anonymously used:                               0.1 %         11

Cache hits by data type:
        Demand data:                                   19.9 %       2.0k
        Demand prefetch data:                           0.1 %          7
        Demand metadata:                               80.0 %       7.9k
        Demand prefetch metadata:                     < 0.1 %          4

Cache misses by data type:
        Demand data:                                   51.1 %        271
        Demand prefetch data:                          25.3 %        134
        Demand metadata:                               14.2 %         75
        Demand prefetch metadata:                       9.4 %         50

DMU prefetch efficiency:                                            2.4k
        Hit ratio:                                     92.3 %       2.2k
        Miss ratio:                                     7.7 %        184

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4
 

You didn't tell us which disks you are now using for testing. Maybe SMR HDDs or QLC SSDs that get terribly slow after the SLC/RAM/CMR cache gets full?
 
My fault. MODEL: SEAGATE EXOS ST1000NX0443

They have a Dell label on them though.

Is there something I'm missing? I created a ZFS RAIDZ with the 4 drives under Disks, then added a 32GB hard drive with no-cache to the VM using that ZFS volume. I'm transferring a 6GB file above, which makes it come to a crawl.
 
So according to your link, consumer SSDs and HDDs with cache mode "none" would lose async writes on a power outage, while enterprise SSDs should be safe, as their physical cache is kind of persistent because of PLP. So in theory it would be safer to use cache mode "writethrough" with consumer SSDs and HDDs, as the volatile RAM cache will be skipped?

I'm no expert, but how I understand it is:
With cache=none the guest OS is responsible for sending flush commands, because:
the guest's virtual storage adapter is informed that there is a writeback cache, so the guest would be expected to send down flush commands as needed to manage data integrity.
Whereas with cache=writethrough, KVM forces fsyncs (flushes) from the outside all the time.

[X] So to be on the really safe side and not rely on the guest OS, I would understand it the same way you said.
[Y] I'm wondering how big of a topic this (the responsibility to send flush commands) still is nowadays with current OSes.

Such a diagram including ZFS and its caches would also be interesting.

[Z] I would also be really interested in this. I couldn't find any write-up with a quick search.

Maybe @aaron can/wants to shed some light on [X], [Y] and [Z] here. :)

SEAGATE EXOS ST1000NX0443

From what I could find, these should be CMR drives.
Even though these drives were most likely never a "burner" :D and they could by now be up to 7 years old (the release of this model seems to have been in 2015), with the equivalent power-on hours/usage and all the circumstances, a dip to only 12.5 MB/s (funnily enough, that is exactly 100 Mbit/s) on a sequential write seems too much to me.

I would be interested in more tests, like setting up a Samba share directly on the host (or in an LXC) on the ZFS pool, to skip all the virtualization and the different block sizes, and copying a big file to it again.
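A minimal sketch of such a test share, assuming Samba is installed on the host and the pool is mounted at /zzzz (the share name is just a placeholder):

Code:
# /etc/samba/smb.conf - add a test share pointing directly at the ZFS pool
[zfstest]
   path = /zzzz
   read only = no

# then set a Samba password for an existing user and restart the service
smbpasswd -a root
systemctl restart smbd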

What is the controller for the disks? A simple one on the mainboard? In AHCI mode? Hopefully no fancy RAID controller?!
 
I only glanced over the thread, so if I misunderstood or missed something, let me know.

If all you have on the disks is some test data, I would recommend starting the benchmarks from the bottom layers: first a single disk, writing with 4k and 4M block sizes, and potentially both sequential and random, to get an idea of how they perform. Have a look at the ZFS benchmark paper to see which fio commands to use: https://forum.proxmox.com/threads/proxmox-ve-zfs-benchmark-with-nvme.80744/

Then continue with the next layer, ZFS, and then from within a VM, to see what each layer is costing you.
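A rough sketch of that bottom-layer test (WARNING: writing to the raw device destroys the data on it; /dev/sdX is a placeholder for one of the test disks):

Code:
# 4M sequential writes directly to one raw disk (destructive!)
fio --name=seq4m --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=write --bs=4M --runtime=60 --time_based
# 4k random writes, closer to what a VM workload produces (destructive!)
fio --name=rand4k --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --runtime=60 --time_based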

The ARC is a read cache; therefore, unless you want to do read performance benchmarks, you should not have to worry about it. For completeness' sake: you would need to disable the cache, or at least restrict it to metadata, for the ZFS dataset in question: zfs set primarycache=metadata <pool/dataset>.
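As a small example (the dataset name is a placeholder; remember to switch it back after benchmarking):

Code:
# cache only metadata for the dataset/zvol being benchmarked
zfs set primarycache=metadata zzzz/testvol
# verify, and restore the default afterwards
zfs get primarycache zzzz/testvol
zfs set primarycache=all zzzz/testvol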

Do not forget that ZFS is copy-on-write, which will lead to fragmentation over time. This is where, even for sequential data, you could end up with a lot of random IO, which will be much worse for HDDs than for SSDs.

For example, one of my local pools, used to store dev and test VMs, shows the following:
Code:
tank01  fragmentation                  82%                            -
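That output looks like it comes from zpool get; to check your own pool it would be something like:

Code:
# fragmentation of a single pool (zzzz is the test pool from earlier posts)
zpool get fragmentation zzzz
# or as part of a wider overview
zpool list -o name,size,allocated,free,fragmentation,capacity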

Before you do any of it, switch the firmware to IT mode :)
And for VM storage, consider using a RAID10-like setup with striped mirrors of 2 disks each in the pool. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_raid_considerations for why.
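A sketch of what such a pool looks like when created by hand (device paths are placeholders; the PVE GUI should offer the same layout as "RAID10" under Disks -> ZFS):

Code:
# two mirrored pairs striped together ("RAID10"), using stable by-id device paths
zpool create tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
                  mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4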
 
Just to clarify, this is all being done on my test system that I set up just to get acquainted with Proxmox.

CPU: i5-4570
MB: Asus Z87-Plus
MEM: 16-32GB of memory
HDS: ST1000NX0443 Exos 7 drives - I have a bunch

These drives are brand new; they've been sitting on my desk for a few years.

The controller is just the SATA ports on the mainboard (Asus Z87-Plus) in AHCI mode.

Okay, I doubled the memory (I had borrowed some from my other system), so it's 32GB now. Still the same result.



1660134030724.png

1660134053332.png
 
Okay, running a fio benchmark the speeds are good directly on the pool. It's just inside the VM where the speeds are slow.

Code:
root@proxmox:/zzzz# fio --randrepeat=1 --ioengine=libaio --direct=1 --name=zzzz --filename=/zzzz/test --bs=4M --size=4G --readwrite=write --ramp_time=4
zzzz: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
zzzz: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [W(1)][32.4%][w=124MiB/s][w=31 IOPS][eta 00m:25s]
zzzz: (groupid=0, jobs=1): err= 0: pid=8293: Wed Aug 10 09:13:08 2022
  write: IOPS=33, BW=135MiB/s (141MB/s)(1044MiB/7759msec); 0 zone resets
    slat (usec): min=22179, max=35169, avg=29801.78, stdev=3180.03
    clat (nsec): min=2479, max=15488, avg=3875.34, stdev=1151.38
     lat (usec): min=22184, max=35175, avg=29836.07, stdev=3152.02
    clat percentiles (nsec):
     |  1.00th=[ 2544],  5.00th=[ 2768], 10.00th=[ 3056], 20.00th=[ 3408],
     | 30.00th=[ 3600], 40.00th=[ 3728], 50.00th=[ 3792], 60.00th=[ 3888],
     | 70.00th=[ 3984], 80.00th=[ 4128], 90.00th=[ 4320], 95.00th=[ 4576],
     | 99.00th=[ 9024], 99.50th=[13376], 99.90th=[15552], 99.95th=[15552],
     | 99.99th=[15552]
   bw (  KiB/s): min=122880, max=180585, per=99.92%, avg=137667.13, stdev=17625.68, samples=15
   iops        : min=   30, max=   44, avg=33.60, stdev= 4.29, samples=15
  lat (usec)   : 4=71.15%, 10=28.08%, 20=0.77%
  cpu          : usr=0.46%, sys=3.70%, ctx=8525, majf=0, minf=59
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,260,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=135MiB/s (141MB/s), 135MiB/s-135MiB/s (141MB/s-141MB/s), io=1044MiB (1095MB), run=7759-7759msec
root@proxmox:/zzzz# ls -l
total 2321630
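To narrow down where the slowdown happens, it might help to repeat the test with small sync writes, which is much closer to what the virtual disk generates, first on the host and then inside the VM (the file name is a placeholder):

Code:
# 4k random sync writes on the pool - compare with the 4M sequential result above
fio --randrepeat=1 --ioengine=libaio --direct=1 --sync=1 --name=rand4k --filename=/zzzz/test4k --bs=4k --size=1G --readwrite=randwrite --runtime=60 --time_based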

Also, any idea why I can't add a 2TB disk?

1660137625373.png

1660137798521.png
1660137814577.png
 
Yes, you're out of space, as the error message indicates. RAIDZ does that, and free space is unpredictable... please search the forum; this is a problem people run into regularly.
So where does it actually show what I can use? It shows 3.62T here.

Code:
root@proxmox:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zzzz  3.62T  5.51G  3.62T        -         -     0%     0%  1.00x    ONLINE  -
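Note that zpool list reports the raw pool size before RAIDZ parity and padding; the space actually usable for datasets and zvols is what zfs list reports, e.g.:

Code:
# AVAIL here is the usable space after parity, not the raw 3.62T
zfs list zzzz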
 
