ZFS "problem"

Nov 25, 2023
Hi,

I have a small problem with my ZFS pool. It is a 120TB pool of HDDs with downloaded ISOs on it. I assigned 256GB of RAM to it, and everything works fine until the RAM cache is "full". After that I get a lot of drops in download speed. When I restart the server and the RAM cache is empty, the speed is back to full again.
Is there any solution for this?
 
Unfortunately, this is very unspecific. ZFS does not have a write cache itself, so the problem has to lie somewhere else.

First, please post the output of zpool status -v in CODE tags.
 
Here is the output from the pool:

Code:
pool: Plex
 state: ONLINE
  scan: scrub repaired 0B in 22:10:52 with 0 errors on Sun Feb 11 22:35:05 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Plex                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_91J0A0FDFJDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_91J0A09GFJDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_53F0A2PJFJDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_53U0A0B3FJDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_71H0A08TFQDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_71F0A1JJFQDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_71F0A1KYFQDH  ONLINE       0     0     0
            ata-TOSHIBA_MG09ACA18TE_71H0A08UFQDH  ONLINE       0     0     0

errors: No known data errors
 
In general, this pool will not be very fast, and concurrent access will be even slower. You're limited to one vdev, so to roughly the IOPS of one disk. You will also waste some space due to padding with 8 disks in a raidz1. Please refer to this excellent post.

Nevertheless, how fast was the download? Normally, downloading stuff from the internet is not a good test: there can be a lot of reasons limiting the throughput, and it usually should not be the local disks ... depending on your internet connection, of course.

Maybe do a "real" test with fio? Have a look here.
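
For reference, a minimal sequential-read job could look like this (a sketch only; /Plex is assumed to be the pool's mount point, so adjust the directory and size to your setup):

Code:
# 1M sequential reads for 60 seconds against a file on the pool
fio --name=seq_read --directory=/Plex --rw=read --bs=1M --size=10G \
    --ioengine=libaio --iodepth=1 --runtime=60 --time_based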
 
You're limited to one vdev, so the IOPS of one disk.
Actually even less, because of the overhead of raidz. Whenever I benchmarked a single raidz vdev against a single disk, the single disk got better IOPS.

I guess you don't care that much about your data if it's simply 120TB for storing your torrented "Linux ISOs". But I personally wouldn't be able to sleep well using a raidz1 with that many big HDDs in a single vdev, especially when it's so much data that you probably can't afford proper backups for everything. Once a disk starts causing problems it might take days or even weeks to replace it, and while waiting for it to resilver there is a not-so-small chance that you will lose the whole pool or at least some files, since the pool is then in a degraded state and every single error will cause unfixable corruption.
I personally would at least have created that pool as a single 8-HDD raidz2 vdev + 3 SSDs as a special vdev mirror.
 
The download speed dropped from 30 MB/s to 5 MB/s.

So the best option is to create 2 raidz1 pools? If I mount them on the system, I'll have 2 different folders, one for each pool, or is there a way to combine them? Because it's one big download folder for my media.


I have already tested with pveperf and it shows me this:

Code:
CPU BOGOMIPS:      511971.84
REGEX/SECOND:      3782175
HD SIZE:           59558.46 GB (Plex)
FSYNCS/SECOND:     358.32
DNS EXT:           43.03 ms
DNS INT:           42.27 ms


Here is fio:

Code:
seq_read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
seq_read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=132MiB/s][r=132 IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=204044: Wed Feb 21 13:50:29 2024
  read: IOPS=200, BW=201MiB/s (211MB/s)(11.8GiB/60046msec)
    slat (usec): min=46, max=732, avg=61.44, stdev=10.19
    clat (msec): min=2, max=781, avg= 4.91, stdev=12.85
     lat (msec): min=2, max=781, avg= 4.98, stdev=12.86
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[   11],
     | 99.00th=[   35], 99.50th=[   58], 99.90th=[  128], 99.95th=[  209],
     | 99.99th=[  567]
   bw (  KiB/s): min= 2048, max=313344, per=100.00%, avg=207622.45, stdev=90411.68, samples=119
   iops        : min=    2, max=  306, avg=202.76, stdev=88.29, samples=119
  lat (msec)   : 4=81.25%, 10=13.61%, 20=2.72%, 50=1.81%, 100=0.46%
  lat (msec)   : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.08%, sys=1.12%, ctx=24132, majf=7, minf=269
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12065,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=201MiB/s (211MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s), io=11.8GiB (12.7GB), run=60046-60046msec

Disk stats (read/write):
  sde: ios=28402/9838, merge=1/130, ticks=161085/49376, in_queue=212525, util=98.86%


And here is the second one with 4K:

Code:
seq_read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
seq_read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=37.0MiB/s][r=37 IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=208478: Wed Feb 21 13:52:15 2024
  read: IOPS=42, BW=42.8MiB/s (44.9MB/s)(2571MiB/60020msec)
    slat (usec): min=42, max=642, avg=67.51, stdev=14.93
    clat (msec): min=2, max=428, avg=23.28, stdev=35.44
     lat (msec): min=2, max=428, avg=23.34, stdev=35.44
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    9], 60.00th=[   16],
     | 70.00th=[   21], 80.00th=[   36], 90.00th=[   63], 95.00th=[   96],
     | 99.00th=[  165], 99.50th=[  199], 99.90th=[  279], 99.95th=[  376],
     | 99.99th=[  430]
   bw (  KiB/s): min= 4096, max=147456, per=100.00%, avg=43971.76, stdev=27728.97, samples=119
   iops        : min=    4, max=  144, avg=42.94, stdev=27.08, samples=119
  lat (msec)   : 4=42.71%, 10=9.14%, 20=14.66%, 50=19.80%, 100=9.18%
  lat (msec)   : 250=4.28%, 500=0.23%
  cpu          : usr=0.01%, sys=0.27%, ctx=5140, majf=0, minf=267
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2571,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=42.8MiB/s (44.9MB/s), 42.8MiB/s-42.8MiB/s (44.9MB/s-44.9MB/s), io=2571MiB (2696MB), run=60020-60020msec

Disk stats (read/write):
  sde: ios=12772/17363, merge=2/517, ticks=448095/137263, in_queue=586539, util=92.71%
 
Actually even less, because of the overhead of raidz. Whenever I benchmarked a single raidz vdev against a single disk, the single disk got better IOPS.

I guess you don't care that much about your data if it's simply 120TB for storing your torrented "Linux ISOs". But I personally wouldn't be able to sleep well using a raidz1 with that many big HDDs in a single vdev, especially when it's so much data that you probably can't afford proper backups for everything. Once a disk starts causing problems it might take days or even weeks to replace it, and while waiting for it to resilver there is a not-so-small chance that you will lose the whole pool or at least some files, since the pool is then in a degraded state and every single error will cause unfixable corruption.

The data is not that important, because I can mostly re-download with 1 gig, and an 18TB HDD is quickly replaced.

What would be your solution for it?
 
I personally would at least have created that pool as a single 8-HDD raidz2 vdev + 3 SSDs as a special vdev mirror.

But it takes days/weeks to replace that single 18TB HDD, and in that time all disks will be 100% busy, hammered by the resilvering, making the pool nearly unusable. It's not unlikely that a second disk will fail under that stress, and then you would have to re-download all 120TB.
 
I personally would at least have created that pool as a single 8-HDD raidz2 vdev + 3 SSDs as a special vdev mirror.

But it takes days/weeks to replace that single 18TB HDD, and in that time all disks will be 100% busy, hammered by the resilvering, making the pool nearly unusable. It's not unlikely that a second disk will fail under that stress, and then you would have to re-download all 120TB.

And what are the 3 SSDs used for?
What if I download/unpack etc. on a datacenter SSD and move the files to the pool afterwards?


Before I created this pool, I was thinking of RAID10, but it's way too expensive for 120TB of storage.
 
So the best option is to create 2 raidz1 pools? If I mount them on the system, I'll have 2 different folders, one for each pool, or is there a way to combine them? Because it's one big download folder for my media.
For IOPS performance you want to stripe multiple small raidz1/2 vdevs or mirrors in a single pool. But I'm not sure you actually need all that IOPS performance for torrenting. My guess would be that it should primarily be big sequential async IO.
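
As a rough sketch of what striping two raidz1 vdevs in a single pool would look like at creation time (placeholder disk names, and this would of course mean destroying and rebuilding the pool):

Code:
# one pool, two raidz1 vdevs of 4 disks each -- ZFS stripes across both vdevs
zpool create tank \
    raidz1 ata-DISK1 ata-DISK2 ata-DISK3 ata-DISK4 \
    raidz1 ata-DISK5 ata-DISK6 ata-DISK7 ata-DISK8

Both vdevs end up in the same pool, so you still get a single filesystem/folder rather than two.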
And what are the 3 SSDs used for?
For storing metadata only, or metadata + small files. Right now you are storing data + metadata on the HDDs. With special device SSDs you would only need to store data on the HDDs and metadata on the SSDs. The HDDs are then hit by less small random IO and the pool becomes faster, because the HDDs' horrible IOPS performance becomes the bottleneck later.
 
For IOPS performance you want to stripe multiple small raidz1/2 vdevs or mirrors in a single pool. But I'm not sure you actually need all that IOPS performance for torrenting. My guess would be that it should primarily be big sequential async IO.

For storing metadata only, or metadata + small files. Right now you are storing data + metadata on the HDDs. With special device SSDs you would only need to store data on the HDDs and metadata on the SSDs. The HDDs are then hit by less small random IO and the pool becomes faster, because the HDDs' horrible IOPS performance becomes the bottleneck later.


I see the most drops with big files, like 80-120GB. Smaller files, 4-10GB, unpack and download at the same time at full performance.

Is it possible to add the 3 SSDs now and store the metadata there? Or do I need to rebuild again?



What about doing all the tasks on a datacenter SSD and only copying the finished file to the pool afterwards?
 
Is it possible to add the 3 SSDs now and store the metadata there? Or do I need to rebuild again?
You can add them later, yet you need to write the data once more so that the metadata all goes to the special device.

What about doing all the tasks on a datacenter SSD and only copying the finished file to the pool afterwards?
That'll work too, but your pool will be fastest if you do it like @Dunuin described. You can even improve sync write performance by adding two 16 GB Optane NVMe disks if you have the slots available.
 
You can add them later, yet you need to write the data once more so that the metadata all goes to the special device.


That'll work too, but your pool will be fastest if you do it like @Dunuin described. You can even improve sync write performance by adding two 16 GB Optane NVMe disks if you have the slots available.

How can I put the metadata on the SSDs? How big should the SSDs be?

If I add Optane to the pool, do I need to configure anything?
 
How can I put the metadata on the SSDs? How big should the SSDs be?
You just add the drives as a mirror of special devices.
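
A sketch of what that could look like with three placeholder SSD names (use the stable /dev/disk/by-id paths of your actual drives):

Code:
# add a 3-way mirrored special vdev to the existing pool;
# zpool may ask for -f because the mirror's replication level differs from the raidz1 data vdev
zpool add Plex special mirror \
    /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2 /dev/disk/by-id/ata-SSD_3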

If I add Optane to the pool, do I need to configure anything?
You need to add them as an SLOG device. This is the easiest step and only one command. General introduction.

You only need this if you plan to have a lot of sync writes, but then the improvement for sync writes will be significant. The costs are also not that big.
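
For reference, a sketch of that one command, assuming the two Optane drives show up under these placeholder names:

Code:
# add a mirrored SLOG (log) vdev to the existing pool
zpool add Plex log mirror \
    /dev/disk/by-id/nvme-OPTANE_1 /dev/disk/by-id/nvme-OPTANE_2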
 
You just add the drives as a mirror of special devices.


You need to add them as an SLOG device. This is the easiest step and only one command. General introduction.

You only need this if you plan to have a lot of sync writes, but then the improvement for sync writes will be significant. The costs are also not that big.

OK, sounds good. How big do the 3 SSDs need to be for the big pool? Or should I go with 4 SSDs and RAID1?
I don't fully understand it yet.


Does the SLOG device need to be Optane, or can it be any fast NVMe?
 
So 0.3% of the pool should be the size of the special device.
This depends heavily on your data and the metadata usage.

Here are two examples of my VM data:

Code:
root@proxmox-zfs-storage-vm ~ > zpool list -v
NAME                                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                   1.62T  1001G   658G        -         -    67%    60%  1.00x    ONLINE  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi3  1.50T   976G   556G        -         -    67%  63.7%      -  ONLINE
special                                     -      -      -        -         -      -      -      -  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi2   127G  25.2G   102G        -         -    27%  19.8%      -  ONLINE
logs                                        -      -      -        -         -      -      -      -  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi1  7.50G  20.6M  7.48G        -         -     0%  0.26%      -  ONLINE

and

Code:
root@proxmox-zfs-storage-hardware ~ > zpool list -v
NAME                                                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                              4.02T  2.39T  1.63T        -         -    30%    59%  1.00x    ONLINE  -
  mirror-0                                          556G   340G   216G        -         -    32%  61.1%      -    ONLINE
    scsi-35000cca07d0c6a94-part3                    558G      -      -        -         -      -      -      -    ONLINE
    scsi-35000cca07d1b8b24-part3                    558G      -      -        -         -      -      -      -    ONLINE
  mirror-2                                          556G   312G   244G        -         -    27%  56.2%      -    ONLINE
    scsi-35000039498116b54                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000cca01636c444                          559G      -      -        -         -      -      -      -    ONLINE
  mirror-3                                          556G   337G   219G        -         -    29%  60.6%      -    ONLINE
    scsi-35000cca07d0f6e0c                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000cca07d1cdc84                          559G      -      -        -         -      -      -      -    ONLINE
  mirror-4                                          556G   349G   207G        -         -    32%  62.7%      -    ONLINE
    scsi-35000c50089374073                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000c500889fc837                          559G      -      -        -         -      -      -      -    ONLINE
  mirror-5                                          556G   343G   213G        -         -    31%  61.7%      -    ONLINE
    scsi-35000c500892c604f                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000c50088a34637                          559G      -      -        -         -      -      -      -    ONLINE
  mirror-6                                          556G   354G   202G        -         -    30%  63.7%      -    ONLINE
    scsi-35000c50088a00e2b                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000c500893cffb3                          559G      -      -        -         -      -      -      -    ONLINE
  mirror-7                                          556G   351G   205G        -         -    32%  63.1%      -    ONLINE
    scsi-35000c50088a0c1bb                          559G      -      -        -         -      -      -      -    ONLINE
    scsi-35000c50088a35233                          559G      -      -        -         -      -      -      -    ONLINE
special                                                -      -      -        -         -      -      -      -  -
  mirror-1                                          222G  60.5G   161G        -         -    21%  27.3%      -    ONLINE
    ata-SAMSUNG_MZ7LM240HCGR-00003_S1YFNX0H700687   224G      -      -        -         -      -      -      -    ONLINE
    ata-SAMSUNG_MZ7LM240HCGR-00003_S1YFNX0H700905   224G      -      -        -         -      -      -      -    ONLINE
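
In both of these examples (VM storage, so lots of small blocks) the special vdev ends up holding roughly 2.5% of the allocated data: 25.2G of ~1T in the first pool and 60.5G of ~2.4T in the second. A pool full of large media files carries far less metadata per TB, so the fraction there should be noticeably lower.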

That means if I buy 4x 480GB as a mirror I should be on the safe side?
Or 2x 1TB?
More disks - or more precisely more vdevs - is generally always faster. If you have the space/slots in the server, go with more disks, yet keep potential expandability of the data part of your pool in mind.
 
More disks - or more precisely more vdevs - is generally always faster. If you have the space/slots in the server, go with more disks, yet keep potential expandability of the data part of your pool in mind.

If I start with 2x 1TB, can I add 2 more disks later?

What about this setting? What would be the recommendation for it?
zfs set special_small_blocks=
 
If I start with 2x 1TB, can I add 2 more disks later?
Yes. But as already said, ZFS won't automatically move any old data/metadata from the HDDs to the SSDs. You would need to write the whole 120TB again.
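
One way to force that rewrite, sketched here with a hypothetical dataset name, is to replicate each dataset inside the pool and swap the names afterwards (a plain copy to a new dataset with rsync/cp works just as well):

Code:
zfs snapshot Plex/media@migrate
zfs send Plex/media@migrate | zfs receive Plex/media_new
zfs rename Plex/media Plex/media_old
zfs rename Plex/media_new Plex/media
# verify the data, then: zfs destroy -r Plex/media_old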

What about this setting? What would be the recommendation for it?
zfs set special_small_blocks=
That depends on your available space. The more you increase it, the more data will be stored on the SSDs instead of on the HDDs. With only 1TB SSDs and 120TB of HDDs, you probably want it quite low so the SSDs won't spill over.
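
For example, a conservative setting could look like this (64K is only an illustrative value; every data block of that size or smaller then goes to the special vdev):

Code:
zfs set special_small_blocks=64K Plex
zfs get special_small_blocks Plex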
 
