ZFS disk benchmark causes crash

badincite

New Member
Aug 3, 2022
New to Proxmox, doing some testing on an old PC before I move my ESXi server over. When running a benchmark on a 4-drive RAIDZ I'm seeing really high speeds, so I assume it's using the RAM as a cache. However, every time I run a benchmark on the ZFS disk it crashes. What cache setting should I use for primarily read operations? My server volume will be used for Plex. I'm assuming my crashing issues are just due to the test hardware.

Current Test hardware
CPU: i5-4570
MEM: 16GB

Current Server Hardware
MB: EP2C602-4L/D16
CPU: Dual E5-2680's
GPU: GTX 1060 Passthrough to VM
MEM: 128GB ECC
HBA: Dell PERC H200 ~ I will be flashing this to IT mode
 
The whole system locks up; I guess it's resetting, since there is no text in the web console. The VM is shut down when it comes back. I was using CrystalDiskMark and ATTO Disk Benchmark.
 
Look for OOM in the syslog: cat /var/log/syslog | grep OOM to see if you are running out of RAM. ZFS by default will use up to 64GB for caching in your case. And if you use a cache mode other than "none", there will be additional RAM used by KVM for caching. So in case you gave your VMs/LXCs something like 64GB of RAM, you could simply run out of RAM and the OOM killer kills the VM to free some RAM.
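A quick way to check both things (OOM kills and how big the ARC actually is) could look like this; a minimal sketch, assuming a stock PVE install (journalctl is just an alternative in case the syslog file has already rotated):

Code:
# look for out-of-memory kills in the kernel log
journalctl -k | grep -i "out of memory"
# current ARC size vs. its limits, in bytes (size = current, c = target, c_max = hard limit)
grep -E "^(size|c|c_max) " /proc/spl/kstat/zfs/arcstats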
 
I seem to have fixed it. I recreated the Windows 10 VM and used the recommended settings; I think that was causing the issue.

Does no-cache mean no RAM cache? I'm wondering how I'm getting these speeds with RAIDZ and 4 spinning drives. It seems like no-cache gives better speeds.

NoCache.PNG

WriteThough.PNG
 
ZFS will already cache in RAM when cache mode "none" is selected (read about the "ARC"). If you select anything else, like "writethrough" or "writeback", KVM will additionally cache in the host's RAM, so everything will be double cached in RAM. And then your guest OS will cache in its virtual RAM too. So you might cache the same data 2 or 3 times, wasting memory bandwidth and RAM capacity. That's why cache mode "none" is recommended when using ZFS.
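For reference, the cache mode can also be changed on an existing VM disk from the CLI; a minimal sketch, assuming the disk is scsi0 on a storage called local-zfs (check qm config <vmid> for the real names):

Code:
# show the current disk line, e.g. "scsi0: local-zfs:vm-100-disk-0,size=32G"
qm config 100 | grep scsi0
# re-attach it with cache mode none (re-add any other options the line had)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none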

@Neobin:
So according to your link, consumer SSDs and HDDs with cache mode "none" would lose async writes on a power outage, while enterprise SSDs should be safe, as their physical cache is kind of persistent because of PLP. So in theory it would be safer to use cache mode "writethrough" with consumer SSDs and HDDs, as the volatile RAM cache will be skipped?
Such a diagram including ZFS and its caches would also be interesting.
 
Thanks, I get it now; I didn't see that no-cache is recommended for ZFS. Since ZFS caches in RAM, is a zram disk even beneficial for something like a transcoding drive for Plex?

I have 3 8TB Western Digital Golds on the way to add to the 2 8TB Red Pros I was planning on putting into a RAIDZ array. I was just looking into how to limit the ARC on my test hardware, since I only have 16GB of memory. I'm also noticing the speed dropping and pausing when transferring files on my test RAIDZ 4x1TB array.
 
A rule of thumb would be 2-4GB RAM + 0.25-1GB RAM per 1TB of raw capacity of your disks. And if you want to use deduplication, it's another 5GB RAM per 1TB of deduplicated capacity. So with 5x 8TB you've got 40TB of raw storage, which would mean something between 12 and 44GB of RAM for your ARC would be good to have. That 16GB isn't really much, as PVE itself already needs 2GB.
By default ZFS will use up to 50% of your total RAM for its ARC, so that should be up to 8GB right now, which is already very low for all those disks.
You can use arc_summary to see some ARC statistics. If you shrink your ARC too much, things like hit rates will get lower and the pool becomes very slow.
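If you do want to cap the ARC on the 16GB test box, a minimal sketch (the 4GiB value is only an example; sizes are in bytes):

Code:
# persist the limit across reboots (note: this overwrites an existing zfs.conf)
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all
# apply the same limit on the fly without rebooting
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max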
 
Thanks, yeah, I just set this up to play with it. I wanted to get familiar with it before transferring my ESXi server, which has 128GB of RAM.
 
Doesn't look like a RAM issue. Any idea what could be causing the transfer speed to drop off like this?

1660074416473.png

1660074296587.png


Code:
ZFS Subsystem Report                            Tue Aug 09 15:45:04 2022
Linux 5.15.30-2-pve                                           2.1.4-pve1
Machine: proxmox (x86_64)                                     2.1.4-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    30.3 %    2.3 GiB
        Target size (adaptive):                        30.8 %    2.4 GiB
        Min size (hard limit):                          6.2 %  490.3 MiB
        Max size (high water):                           16:1    7.7 GiB
        Most Frequently Used (MFU) cache size:          1.9 %   12.2 MiB
        Most Recently Used (MRU) cache size:           98.1 %  625.9 MiB
        Metadata cache size (hard limit):              75.0 %    5.7 GiB
        Metadata cache size (current):                  3.5 %  206.1 MiB
        Dnode cache size (hard limit):                 10.0 %  588.3 MiB
        Dnode cache size (current):                   < 0.1 %  183.1 KiB

ARC hash breakdown:
        Elements max:                                              72.0k
        Elements current:                             100.0 %      72.0k
        Collisions:                                                 1.2k
        Chain max:                                                     2
        Chains:                                                     1.2k

ARC misc:
        Deleted:                                                      18
        Mutex misses:                                                  0
        Eviction skips:                                                1
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                 326.5 KiB
        L2 eligible MFU evictions:                     14.7 %   48.0 KiB
        L2 eligible MRU evictions:                     85.3 %  278.5 KiB
        L2 ineligible evictions:                                 4.0 KiB

ARC total accesses (hits + misses):                                10.4k
        Cache hit ratio:                               94.9 %       9.9k
        Cache miss ratio:                               5.1 %        530
        Actual hit ratio (MFU + MRU hits):             94.8 %       9.9k
        Data demand efficiency:                        87.9 %       2.2k
        Data prefetch efficiency:                       5.0 %        141

Cache hits by cache type:
        Most frequently used (MFU):                    49.4 %       4.9k
        Most recently used (MRU):                      50.5 %       5.0k
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0
        Anonymously used:                               0.1 %         11

Cache hits by data type:
        Demand data:                                   19.9 %       2.0k
        Demand prefetch data:                           0.1 %          7
        Demand metadata:                               80.0 %       7.9k
        Demand prefetch metadata:                     < 0.1 %          4

Cache misses by data type:
        Demand data:                                   51.1 %        271
        Demand prefetch data:                          25.3 %        134
        Demand metadata:                               14.2 %         75
        Demand prefetch metadata:                       9.4 %         50

DMU prefetch efficiency:                                            2.4k
        Hit ratio:                                     92.3 %       2.2k
        Miss ratio:                                     7.7 %        184

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4
 

You didn't tell us which disks you are now using for testing. Maybe SMR HDDs or QLC SSDs that get terribly slow after the SLC/RAM/CMR cache gets full?
 
My fault. MODEL: SEAGATE EXOS ST1000NX0443

They have a Dell label on them though.

Is there something I'm missing? I created a ZFS RAIDZ with the 4 drives under Disks, then added a 32GB hard drive with no-cache to the VM using that ZFS volume. I'm transferring a 6GB file above, which makes it come to a crawl.
 
So according to your link, consumer SSDs and HDDs with cache mode "none" would lose async writes on a power outage, while enterprise SSDs should be safe, as their physical cache is kind of persistent because of PLP. So in theory it would be safer to use cache mode "writethrough" with consumer SSDs and HDDs, as the volatile RAM cache will be skipped?

I'm no expert, but how I understand it is:
With cache=none the guest OS is responsible for sending flush commands, because:
the guest's virtual storage adapter is informed that there is a writeback cache, so the guest would be expected to send down flush commands as needed to manage data integrity.
Whereas with cache=writethrough, KVM forces fsyncs (flushes) from the outside all the time.

[X] So to be on the really safe side and not rely on the guest OS, I would understand it the same way you said.
[Y] I'm wondering how big of a topic this (the responsibility to send flush commands) still is nowadays with current OSes.

Such a diagram including ZFS and its caches would also be interesting.

[Z] I would also be really interested in this. I couldn't find any write-up with a quick search.

Maybe @aaron can/wants to shed some light on [X], [Y] and [Z] here. :)

SEAGATE EXOS ST1000NX0443

From what I could find, these should be CMR drives.
Even though these drives were most likely never a "burner" :D and they could by now be up to 7 years old (the release of this model seems to have been in 2015), with the equivalent power-on hours/usage and all the circumstances, a dip to only 12.5 MB/s (funnily enough, that is exactly 100 Mbit/s) on a sequential write seems too much to me.

I would be interested in more tests, like setting up a Samba share directly on the host (or in an LXC) on the ZFS pool, to skip all the virtualization and the different block sizes, and copying a big file to it again.
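A minimal sketch of such a test share, assuming Samba is installed on the host and the pool is mounted at /zzzz (the share name is just a placeholder):

Code:
# /etc/samba/smb.conf - add a test share pointing directly at the ZFS pool
[zfstest]
   path = /zzzz
   read only = no

# then set a Samba password for an existing user and restart the service
smbpasswd -a root
systemctl restart smbd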

What is the controller for the disks? A simple one on the mainboard? In AHCI mode? Hopefully no fancy RAID controller?!
 
I only glanced over the thread, so if I misunderstood or missed something, let me know.

If all you have on the disks is some test data, I would recommend starting the benchmarks from the bottom layers: first a single disk, writing with 4k and 4M block sizes, and potentially both sequential and random, to get an idea of how they perform. Have a look at the ZFS benchmark paper to see which fio commands to use: https://forum.proxmox.com/threads/proxmox-ve-zfs-benchmark-with-nvme.80744/

Then continue with the next layer, ZFS, and then from within a VM, to see what each layer is costing you.
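A rough sketch of that bottom-layer test (WARNING: writing to the raw device destroys the data on it; /dev/sdX is a placeholder for one of the test disks):

Code:
# 4M sequential writes directly to one raw disk (destructive!)
fio --name=seq4m --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=write --bs=4M --runtime=60 --time_based
# 4k random writes, closer to what a VM workload produces (destructive!)
fio --name=rand4k --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --runtime=60 --time_based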

The ARC is a read cache; therefore, unless you want to do read performance benchmarks, you should not have to worry about it. For completeness' sake: you would need to disable the cache, or at least restrict it to metadata, for the ZFS dataset in question: zfs set primarycache=metadata <pool/dataset>.
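As a small example (the dataset name is a placeholder; remember to switch it back after benchmarking):

Code:
# cache only metadata for the dataset/zvol being benchmarked
zfs set primarycache=metadata zzzz/testvol
# verify, and restore the default afterwards
zfs get primarycache zzzz/testvol
zfs set primarycache=all zzzz/testvol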

Do not forget that ZFS is copy-on-write, which will lead to fragmentation over time. This is where, even for sequential data, you could end up with a lot of random IO, which will be much worse for HDDs than for SSDs.

For example, one of my local pools, used to store dev and test VMs, shows the following:
Code:
tank01  fragmentation                  82%                            -
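That output looks like it comes from zpool get; to check your own pool it would be something like:

Code:
# fragmentation of a single pool (zzzz is the test pool from earlier posts)
zpool get fragmentation zzzz
# or as part of a wider overview
zpool list -o name,size,allocated,free,fragmentation,capacity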

Before you do any of it, switch the firmware to IT mode :)
And for VM storage, consider using a RAID10-like setup with striped mirrors of 2 disks each in the pool. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_raid_considerations for why.
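A sketch of what such a pool looks like when created by hand (device paths are placeholders; the PVE GUI should offer the same layout as "RAID10" under Disks -> ZFS):

Code:
# two mirrored pairs striped together ("RAID10"), using stable by-id device paths
zpool create tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
                  mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4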
 
Just to clarify, this is all being done on my test system that I set up just to get acquainted with Proxmox.

CPU: i5-4570
MB: Asus Z87-Plus
MEM: 16-32GB of memory
HDS: ST1000NX0443 Exos 7 drives - I have a bunch

These drives are brand new; they've been sitting on my desk for a few years.

The controller is just the SATA ports on the mainboard (Asus Z87-Plus) in AHCI mode.

Okay, I doubled the memory (I had borrowed some from my other system), so it's 32GB now. Still the same result.



1660134030724.png

1660134053332.png
 
Okay, running a fio benchmark the speeds are good directly on the pool. It's just inside the VM where the speeds are slow.

Code:
root@proxmox:/zzzz# fio --randrepeat=1 --ioengine=libaio --direct=1 --name=zzzz --filename=/zzzz/test --bs=4M --size=4G --readwrite=write --ramp_time=4
zzzz: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
zzzz: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [W(1)][32.4%][w=124MiB/s][w=31 IOPS][eta 00m:25s]
zzzz: (groupid=0, jobs=1): err= 0: pid=8293: Wed Aug 10 09:13:08 2022
  write: IOPS=33, BW=135MiB/s (141MB/s)(1044MiB/7759msec); 0 zone resets
    slat (usec): min=22179, max=35169, avg=29801.78, stdev=3180.03
    clat (nsec): min=2479, max=15488, avg=3875.34, stdev=1151.38
     lat (usec): min=22184, max=35175, avg=29836.07, stdev=3152.02
    clat percentiles (nsec):
     |  1.00th=[ 2544],  5.00th=[ 2768], 10.00th=[ 3056], 20.00th=[ 3408],
     | 30.00th=[ 3600], 40.00th=[ 3728], 50.00th=[ 3792], 60.00th=[ 3888],
     | 70.00th=[ 3984], 80.00th=[ 4128], 90.00th=[ 4320], 95.00th=[ 4576],
     | 99.00th=[ 9024], 99.50th=[13376], 99.90th=[15552], 99.95th=[15552],
     | 99.99th=[15552]
   bw (  KiB/s): min=122880, max=180585, per=99.92%, avg=137667.13, stdev=17625.68, samples=15
   iops        : min=   30, max=   44, avg=33.60, stdev= 4.29, samples=15
  lat (usec)   : 4=71.15%, 10=28.08%, 20=0.77%
  cpu          : usr=0.46%, sys=3.70%, ctx=8525, majf=0, minf=59
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,260,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=135MiB/s (141MB/s), 135MiB/s-135MiB/s (141MB/s-141MB/s), io=1044MiB (1095MB), run=7759-7759msec
root@proxmox:/zzzz# ls -l
total 2321630
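To narrow down where the slowdown happens, it might help to repeat the test with small sync writes, which is much closer to what the virtual disk generates, first on the host and then inside the VM (the file name is a placeholder):

Code:
# 4k random sync writes on the pool - compare with the 4M sequential result above
fio --randrepeat=1 --ioengine=libaio --direct=1 --sync=1 --name=rand4k --filename=/zzzz/test4k --bs=4k --size=1G --readwrite=randwrite --runtime=60 --time_based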

Also, any idea why I can't add a 2TB disk?

1660137625373.png

1660137798521.png
1660137814577.png
 
Yes, you're out of space, as the error message indicates. RAIDZ does that, and free space is unpredictable... please search the forum; this is a problem people run into regularly.
So where does it actually show what I can use? It shows 3.62T here.

Code:
root@proxmox:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zzzz  3.62T  5.51G  3.62T        -         -     0%     0%  1.00x    ONLINE  -
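Note that zpool list reports the raw pool size before RAIDZ parity and padding; the space actually usable for datasets and zvols is what zfs list reports, e.g.:

Code:
# AVAIL here is the usable space after parity, not the raw 3.62T
zfs list zzzz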
 
