[Suggestions] 8 HDDs + 5 NVMes (Z2 or RAID10 or something else)

Ramalama

Hey Boys & Girls,

I'm having a hard time deciding what the best configuration would be as a trade-off between performance and capacity.

- In short, I need to upgrade the storage in my server because I'm running out of space (50TB, almost full), so I've ordered:
8x 20TB WD HDDs
1x bifurcation card
(PCIe 4.0 x16 -> 4x4)
3x 990 Pro 1TB
2x 990 Pro 2TB

- I have 128GB ECC RAM & the mainboard is an ASRock Rack X570D4I-2T, which has:
1x M.2 (PCIe 4.0 x4) slot
1x PCIe 4.0 x16 slot (for the bifurcation card)
2x OCuLink (that's 8x SATA ports)

- My idea is (roughly sketched as commands further below):
-- OCuLink SATA ports: all 8 HDDs in a ZFS RAID-Z2 array
-- M.2 mainboard slot + 1 bifurcation card slot: 2x (2TB 990 Pro) in RAID1 for OS + VMs/containers
-- 2 bifurcation slots: 2x (1TB 990 Pro) in RAID1 as ZIL/SLOG
-- Last bifurcation slot: 1TB 990 Pro as L2ARC read cache

Now it would additionally be nice to have a metadata vdev, but I have no NVMe slots left...
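Roughly, that layout would be created like this (disk/NVMe paths are just placeholders, not my real ones):

Code:
# 8 HDDs as one RAID-Z2 vdev
zpool create DATA raidz2 \
    /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 /dev/disk/by-id/ata-HDD_3 /dev/disk/by-id/ata-HDD_4 \
    /dev/disk/by-id/ata-HDD_5 /dev/disk/by-id/ata-HDD_6 /dev/disk/by-id/ata-HDD_7 /dev/disk/by-id/ata-HDD_8

# the two 1TB 990 Pros as a mirrored SLOG
zpool add DATA log mirror /dev/disk/by-id/nvme-990PRO_1TB_A /dev/disk/by-id/nvme-990PRO_1TB_B

# the third 1TB 990 Pro as L2ARC
zpool add DATA cache /dev/disk/by-id/nvme-990PRO_1TB_C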

Additionally, RAID-Z2 gives me a lot more storage compared to RAID10, but RAID-Z2 somehow gives me headaches, because:
- The ZFS guys don't recommend RAID-Z1 (which is 33% parity with 3 disks)
- But RAID-Z2 with 8 disks is even less parity (25% with 8 disks)
- As a side note, from what I've found here on the forums, RAID-Z2 and RAID10 are extremely similar in performance.
- Keep in mind I write RAID10, but if I go that route, I will do a ZFS RAID10 (a stripe of 4 two-way mirror vdevs), because of the ZFS benefits (scrubbing etc.)


What would you guys do differently with that storage?
Basically I'm collecting opinions.
A RAID10 would give me 80TB of storage, a RAID-Z2 would give me ~120TB, which is a LOT more...

Cheers
 
You want opinions ... you get my 2 cents:

Your hardware is entry-level server hardware (SOHO-type), and neither the disks nor the NVMes are enterprise grade, so that'll limit your achievable performance.

As a side note, from what I've found here on the forums, RAID-Z2 and RAID10 are extremely similar in performance.
Not in general ... performance scales with the number of vdevs, so the more you have, the better it performs. The performance can be equivalent, but in general it is not. Write performance is in almost all cases much better in a striped mirror than in any raidz* setup with the same number of disks.

-- 2 bifurcation slots: 2x (1TB 990 Pro) in RAID1 as ZIL/SLOG
Two Intel Optane 16 GB will be enough and much, much faster (and cheaper).

-- Last bifurcation slot: 1TB 990 Pro as L2ARC read cache
L2ARC is 99% useless in most environments in comparison to using special devices. You can achieve much better overall performance if you just use one pool with special devices (the NVMes) and store your VMs (via special_small_blocks) and the metadata for your spinners on them.
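Roughly like this (device paths are placeholders, and the 64K threshold is just an example you'd tune to your data):

Code:
# one pool: spinners as raidz2 for bulk data, an NVMe mirror as special vdev
zpool create tank \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh \
    special mirror /dev/nvme0n1 /dev/nvme1n1

# pool metadata now lands on the NVMe mirror automatically.
# blocks <= special_small_blocks land on the special vdev as well;
# with recordsize=64K that is effectively the whole VM dataset:
zfs create -o recordsize=64K -o special_small_blocks=64K tank/vms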

Do you have statistics about your L2ARC usage? We also tried it for years and the benefit was negligible in long-term analytics.
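If not, the raw counters are easy to pull on a standard ZFS-on-Linux install (a rough sketch; arc_summary/arcstat show the same kstats nicely formatted):

Code:
# lifetime ARC / L2ARC hit and miss counters since boot
awk '$1 ~ /^(hits|misses|l2_hits|l2_misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats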

2x OCuLink (that's 8x SATA ports)
Never heard of them. I'd buy SAS adapters (used 6 GBit or new 12 GBit), e.g. MegaRAID-based.
 
You want opinions ... you get my 2 cents:

Your hardware is entry-level server hardware (SOHO-type), and neither the disks nor the NVMes are enterprise grade, so that'll limit your achievable performance.
It's not even server hardware, it's AM4 (X570), just with the benefit of a BMC/IPMI. X570 was a simple decision here because of the ECC memory support, while being cheap and very power efficient. The mainboard is paired with a 5800X, which tbh has more than enough CPU power; the only downsides I actually see are dual-channel instead of quad/octa-channel memory and the limited number of PCIe lanes compared to server hardware.
Otherwise it has 2x 10Gb NICs which do SR-IOV, MMIO works very well, and bifurcation is available.
I'm absolutely not regretting choosing this platform, or this mainboard in particular.

Not in general ... performance scales with the number of vdevs, so the more you have, the better it performs. The performance can be equivalent, but in general it is not. Write performance is in almost all cases much better in a striped mirror than in any raidz* setup with the same number of disks.
- Like using a pool of 2 vdevs, where each vdev is a RAID-Z1 made out of 4 HDDs (see the two create commands below)?
-- Instead of one vdev based on RAID-Z2 out of 8 disks?

- A striped mirror is just hard to swallow here, because of the space loss (only 80TB left).
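Just to make sure we mean the same thing, the two variants would be built roughly like this (disk names are placeholders):

Code:
# variant A: two RAID-Z1 vdevs of 4 disks each, striped together (2 vdevs)
zpool create DATA \
    raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    raidz1 /dev/sde /dev/sdf /dev/sdg /dev/sdh

# variant B: one RAID-Z2 vdev over all 8 disks (1 vdev)
zpool create DATA raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh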

Two Intel Optane 16 GB will be enough and much, much faster (and cheaper).
I didn't think of Optane tbh :-(
However, wouldn't it be cleverer if I split that RAID1 into 2 partitions, using 80GB for the ZIL/SLOG and ~900GB for small files + metadata?
L2ARC is 99% useless in most environments in comparison to using special devices. You can achieve much better overall performance if you just use one pool with special devices (the NVMes) and store your VMs (via special_small_blocks) and the metadata for your spinners on them.
You're right about the L2ARC, but for special small blocks and metadata I would need RAID1 storage; the problem is that I don't have any PCIe slots left.
Or how would you realize OS + VMs/containers + ZIL + small files + metadata with only 5 NVMe slots?
Maybe I should point out that I need ext4 storage anyway, because I have some containers running Docker inside (one Docker LXC container for the internal network and a separate one for the public network).

Do you have statistics about your L2ARC usage? We also tried it for years and the benefit was negligible in long-term analytics.
Sure, but my current setup is pretty crappy; I don't know if it's actually useful at all, but 36GB is currently cached in the L2ARC:
Code:
pool                                                        alloc   free   read  write   read  write
----------------------------------------------------------  -----  -----  -----  -----  -----  -----
DATA                                                        2.23T  14.1T     13      7  2.50M  1.50M
  raidz1-0                                                  2.23T  14.1T     13      7  2.50M  1.49M
    ata-ST6000DM003-2CY186_WCT3E958                             -      -      5      2   854K   507K
    ata-ST6000DM003-2CY186_WCT3J7LG                             -      -      2      2   853K   507K
    ata-ST6000DM003-2CY186_ZCT2JX0G                             -      -      5      2   850K   507K
logs                                                            -      -      -      -      -      -
  ata-Seagate_FireCuda_120_SSD_ZA500GM10001_7SV003E7-part2   260K  63.5G      0      0      0  13.2K
cache                                                           -      -      -      -      -      -
  ata-Seagate_FireCuda_120_SSD_ZA500GM10001_7SV003E7-part1  36.0G   324G      0      3    480   464K
----------------------------------------------------------  -----  -----  -----  -----  -----  -----
SSD-1TB-Mirror                                               269G   683G     17     75   595K  1.22M
  mirror-0                                                   269G   683G     17     75   595K  1.22M
    ata-SPCC_Solid_State_Disk_AA000000000000010735              -      -      8     37   294K   625K
    ata-SPCC_Solid_State_Disk_AA000000000000010730              -      -      8     37   301K   625K
----------------------------------------------------------  -----  -----  -----  -----  -----  -----
root@proxmox:~# uptime
 19:01:22 up 15 days,  9:04,  3 users,  load average: 0.32, 0.32, 0.36
I think that's more useful than arcstat, because you can see the actual pools/drives there and how much is consumed for the L2ARC.
But it's pretty useless anyway xD

Never heard of them. I'd buy SAS adapters (used 6 GBit or new 12 GBit), e.g. MegaRAID-based.
OCuLink is simply PCIe 4.0 x4 over an SFF-8611 connector.
You can use each OCuLink port either for M.2 NVMe drives or switch it to SATA mode and split it up into 4 SATA ports.
You can even get OCuLink to Mini-SAS HD cables (SFF-8611 to SFF-8643).

That's what I'm using those ports for.
I have 2x IB-564SAS-12G bays which are connected via SFF-8643 to the OCuLink ports on the mainboard.
In the end it's just 12 Gbit Mini-SAS.

However, it's an ITX board, so I have just one PCIe 4.0 x16 slot xD
Which in my opinion is better used for a bifurcation card (where I can actually get the speed out) than for another Mini-SAS adapter.

Thanks a Lot for your Opinion btw!
 
- Like using a pool of 2 vdevs, where each vdev is a RAID-Z1 made out of 4 HDDs?
-- Instead of one vdev based on RAID-Z2 out of 8 disks?
Yes, it'll have twice the random IOPS performance.

I think that's more useful than arcstat, because you can see the actual pools/drives there and how much is consumed for the L2ARC.
But it's pretty useless anyway xD
Yes ;)
We had arcstat statistics in Telegraf, e.g. cache hit rates over time, and it was not that good in the end.

However, it's an ITX board, so I have just one PCIe 4.0 x16 slot xD
Oh ... yeah, that is restricting of course.

However, wouldn't it be cleverer if I split that RAID1 into 2 partitions, using 80GB for the ZIL/SLOG and ~900GB for small files + metadata?
Yes, you can do that, but it won't be that fast. Do you have a lot of sync writes?
 
As long as you don't use the HDD storage for virtual disks, raidz2 is probably the best overall config for your spinners. I would suggest 10 disks instead of 8 for stripe alignment. If you DO intend to, striped mirrors will yield 4x the IOPS of a raidz2 (and possibly more). Mind you, it would still be slow as molasses for any non-sequential IO.

For your virtual disk store, striped mirrors with your NVMes. I wouldn't really bother with any SLOG/L2ARC at all; a special device could be useful for your slow pool if you want to dedicate the resources. You want 2 (mirrored), each approx. 4% of your pool capacity.
 
Yes, it'll have twice the random IOPS performance.


Yes ;)
We had arcstat statistics in Telegraf, e.g. cache hit rates over time, and it was not that good in the end.


Oh ... yeah, that is restricting of course.


Yes, you can do that, but it won't be that fast. Do you have a lot of sync writes?
That's a hard question, because the big pool (80 or ~120TB, depending on what I do in the end) is almost entirely for Samba...
So yeah sync writes...
- But on the other hand, ~120TB (one dataset) of that will just be videos, each with a file size of ~60GB... and they will only be read, not written.

- The rest, ~20TB (a dataset), will be a lot of small files where we store our constructions/office files etc... which will be both written and read.
-- That dataset will have shadow copies, but only a snapshot roughly every hour (not every 5 minutes like others do); it will still have a lot of snapshots, since I need to keep them for almost a year, which typically ends up being around ~100 snapshots in total (exposed via Samba roughly like in the smb.conf sketch below).
-- For that dataset it would be nice to have a metadata special device, because there are a lot of Windows Explorer searches.

But it's all Samba.
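(For reference, the shadow copy part is just the stock shadow_copy2 VFS module pointing at ZFS's .zfs/snapshot directory; share name, path and especially the snapshot name format are only examples and have to match whatever the snapshot tool actually creates:)

Code:
[office]
    path = /DATA/office
    vfs objects = shadow_copy2
    shadow:snapdir = .zfs/snapshot
    shadow:sort = desc
    shadow:localtime = yes
    # must match the snapshot naming scheme, e.g. zfs-auto-snapshot's hourly snaps:
    shadow:format = zfs-auto-snap_hourly-%Y-%m-%d-%H%M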
For the VMs and containers I actually don't need that much storage, even 500GB is enough.
 
As long as you don't use the HDD storage for virtual disks, raidz2 is probably the best overall config for your spinners. I would suggest 10 disks instead of 8 for stripe alignment. If you DO intend to, striped mirrors will yield 4x the IOPS of a raidz2 (and possibly more). Mind you, it would still be slow as molasses for any non-sequential IO.

For your virtual disk store, striped mirrors with your NVMes. I wouldn't really bother with any SLOG/L2ARC at all; a special device could be useful for your slow pool if you want to dedicate the resources. You want 2 (mirrored), each approx. 4% of your pool capacity.
The problem is just that for 2 more drives I would need an M.2 to Mini-SAS HD adapter. That's doable, but ugh, I don't even know if such adapters exist.
And whether they would be compatible with the bifurcation card.
 
However, wouldn't it be cleverer if I split that RAID1 into 2 partitions, using 80GB for the ZIL/SLOG and ~900GB for small files + metadata?
I personally wouldn't do that. Special devices are not a cache: lose those special devices and all data on the HDDs is lost too. And what primarily kills SSDs are sync writes, and that's the only thing the SLOG is doing. So by putting the SLOG on the same disks as your special devices, those special devices will fail way sooner.
On top of that, those SSDs are consumer grade without power-loss protection. So they have really crappy sync write performance (not much better than a HDD), because without PLP the sync writes can't be cached in the SSD's volatile DRAM cache. The SSDs will also wear out much quicker without PLP, because writes can't be cached, so many small sync writes can't be combined into a few big write operations, which would otherwise reduce wear.
I would say either get a proper enterprise-grade SSD with PLP dedicated as a SLOG, or skip the SLOG and L2ARC thing and just use two of them as special devices and the other two as VM storage. Or use all 4 as special devices in a raid10 and work with the special_small_blocks option.
 
Lose those special devices and all data on the HDDs is lost too. And what primarily kills SSDs are sync writes, and that's the only thing the SLOG is doing. So by putting the SLOG on the same disks as your special devices, those special devices will fail way sooner.
Good point!
 
So in short, I think I'm gonna forget RAID-Z2 with 8 disks.

- I only get read benefits, but writes and IOPS are gonna stay at the speed of 1 drive or worse.
- I'll get 120TB of storage space, but because of the non-optimal disk count (8 instead of 6) there will be ~5.5% overhead + some additional space loss, so in the end I'm looking at around 105-110TB of usable space.

So I can swallow the pill and take a RAID10, which will lead to around 80TB of usable space, but with the benefit of 8x read speed, 4x write speed and faster IOPS (dunno if IOPS are actually gonna increase).
But then I don't need any ZIL/SLOG device anymore, since the write speeds are going to be faster anyway.

Or rather a stripe of 4 mirror vdevs in ZFS, which is in theory a RAID10, just with the benefit of scrubbing.

Then I'll make a RAID1 out of 2x 1TB for a metadata + small files vdev.

Then one additional RAID1 with ext4 for the OS and VMs + containers.

And one spare 1TB NVMe is left, which I could use as L2ARC, since I don't have any other use for that drive.
Dunno what to do with that drive anyway.

Thanks guys for all your feedback, very appreciated:)
 
- I only get read benefits, but writes and IOPS are gonna stay at the speed of 1 drive or worse.
Read+write throughput performance scales with the number of disks. Read+write IOPS performance scales with the number of vdevs. In theory, an 8-disk raidz2 (raid6) should give you 6x read+write throughput performance + 1x read+write IOPS performance. An 8-disk striped mirror (raid10) should give you 8x read + 4x write throughput performance as well as 4x read+write IOPS performance.
So it really depends on your workload. When only storing big videos you primarily need throughput, as you get sequential reads/writes. When storing virtual disks or small files you get a lot of small random IO, so you want IOPS performance, where HDDs really suck and where you really want a striped mirror to squeeze out every bit of IOPS performance you can get.

- I'll get 120TB of storage space, but because of the non-optimal disk count (8 instead of 6) there will be ~5.5% overhead + some additional space loss, so in the end I'm looking at around 105-110TB of usable space.
Also keep in mind that a ZFS pool will get slower when filling it up too much. Recommendations are usually to only fill it up to 80 or 90%. Filling it up too much also increases fragmentation, and you can't defrag a ZFS pool because of copy-on-write. The only option to reduce fragmentation would be to move all 120TB of data off the pool and transfer it back.
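You can keep an eye on both of those with a one-liner like this (shows fill level and fragmentation per pool):

Code:
zpool list -o name,size,allocated,free,capacity,fragmentation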
 
Read+write throughput performance scales with the number of disks. Read+write IOPS performance scales with the number of vdevs. In theory, an 8-disk raidz2 (raid6) should give you 6x read+write throughput performance + 1x read+write IOPS performance. An 8-disk striped mirror (raid10) should give you 8x read + 4x write throughput performance as well as 4x read+write IOPS performance.
So it really depends on your workload. When only storing big videos you primarily need throughput, as you get sequential reads/writes. When storing virtual disks or small files you get a lot of small random IO, so you want IOPS performance, where HDDs really suck and where you really want a striped mirror to squeeze out every bit of IOPS performance you can get.


Also keep in mind that a ZFS pool will get slower when filling it up too much. Recommendations are usually to only fill it up to 80 or 90%. Filling it up too much also increases fragmentation, and you can't defrag a ZFS pool because of copy-on-write. The only option to reduce fragmentation would be to move all 120TB of data off the pool and transfer it back.
I guess when the disks arrive I will simply do some benchmarks first.
If the difference isn't that big, then I'll go with a Z2.

Thx Dunuin, good to see that you're still active here in the forums:)
 
[Image: RaidSonic Icy Box IB-564SAS-12G]
I'm using 2 of those bays.

- I've decided to use a striped mirror (ZFS RAID10) over RAID-Z2

Positives:
- 8x read / 4x write throughput & roughly 4x the IOPS
- One of those Icy Box bays (see picture) can fail entirely, e.g. if the cable fails...
- I can start with 4 drives, which makes the migration a lot easier: I'll have 4 spare drives to copy the existing data onto, then I move the data to the striped pool and attach the mirrors afterwards.
- Less CPU overhead.

Negatives:
- Only 80TB instead of ~115TB
- If 1 drive fails in both bays, I'm fucked.
- Sequential read/write speeds would be almost identical with RAID-Z2 anyway.

It's a super hard decision, since I have 5 NVMe slots where I can use RAID1 storage for the small files/metadata (special_small_blocks=128K, recordsize=1M), so the IOPS of the HDDs are not going to be that important.
I'm still deciding what makes the most sense for the 5 NVMe slots anyway, but that gets easier because I need RAID1 ext4 storage for the OS and the Docker containers. Running Docker on ZFS is a pain in the ass and the performance is horrible too.
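Roughly what I have in mind for the datasets on the HDD pool (dataset names are only examples):

Code:
# large records for the big video files; only blocks <= 128K (mostly metadata)
# would land on the NVMe special vdev
zfs create -o recordsize=1M -o special_small_blocks=128K DATA/videos

# same thresholds for the office/small-file dataset
zfs create -o recordsize=1M -o special_small_blocks=128K DATA/office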

However, thanks everyone again!
I think for everyone else who is looking for a solution for 8 HDDs, has the ability to set up an NVMe RAID1 for metadata, and doesn't use those 2 bays like me, RAID-Z2 is probably the better choice.

Cheers
 
If 1 drive fails in both bays, I'm fucked
Between 1 and 4 disks would be allowed to fail, just not 2 disks of the same vdev.
But at least the resilvering will be way, way faster with a striped mirror, so it's not that likely that a second disk will fail while you are resilvering. But I would get a ninth disk as a cold spare, so you can replace a failed one as fast as possible. And you should set up some monitoring so you instantly get notified in case of a disk failure. One option for that would be to set up postfix + zfs-zed so your PVE server can send notification mails.
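The zed part is basically just two lines in its config (path as on a default PVE/Debian install, the address is an example):

Code:
# /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="root"      # or a real mailbox; needs a working postfix/sendmail setup
ZED_NOTIFY_VERBOSE=1       # also mail on finished scrubs/resilvers, not only on faults
# afterwards: systemctl restart zfs-zed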
 
Between 1 and 4 disks would be allowed to fail, just not 2 disks of the same vdev.
But at least the resilvering will be way, way faster with a striped mirror, so it's not that likely that a second disk will fail while you are resilvering. But I would get a ninth disk as a cold spare, so you can replace a failed one as fast as possible. And you should set up some monitoring so you instantly get notified in case of a disk failure. One option for that would be to set up postfix + zfs-zed so your PVE server can send notification mails.
I already set up mail alerts; I recently saw Techno Tim's video on YouTube about it and did it :)

I know, up to 4, but if you check the picture from the post above: I have 2 bays with 4 drives each; one bay holds one side of each mirror, and the other bay holds the mirror partners of those drives.
That's why I said one bay can fully fail (4 drives), but if one drive fails in each bay (a mirror pair), I'm fucked :)

And for the cold spare you're right, I definitely need to buy one more drive and put it on the shelf.
Didn't think about that, actually.
 
Hi to all,

One side note about the whole discussion...

With 8x 20TB HDDs, any raidz1 is very risky: during a resilver there is a good chance of encountering a block read error (the risk is higher the more data has to be resilvered), and all HDDs will be stressed for a long time.

Details about this can be read at:

https://www.zdnet.com/article/why-raid-5-stops-working-in-2009/

In my own opinion, I would use raidz1 only with HDDs of max. 4TB.

For such big HDDs, I would only use a dRAID with at least double parity.

My opinion is about the usage of raidzX, not about raidzX versus striped mirrors...

Good luck / Bafta !
 
[Images: photo of the finished build and a screenshot]

I have finally migrated. I printed a case for the "Icy Box IB-564SAS-12G" that goes on top of my NR200P; the airflow is amazing too, since the NR200P sucks air in from the bottom and pushes it out through the top, through the drives in the Icy Boxes.

However, everything works.

I have now made a speed comparison between 8 drives in RAID10 & 8 drives in RAID-Z2:


Code:
ZFS RAID10 -> Write

root@proxmox:/DATA# fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_write --filename=/DATA/test
seq_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1157KiB/s][w=289 IOPS][eta 00m:00s]
seq_write: (groupid=0, jobs=1): err= 0: pid=339977: Sat Jun 10 12:38:11 2023
  write: IOPS=376, BW=1505KiB/s (1541kB/s)(88.2MiB/60001msec); 0 zone resets
    slat (msec): min=2, max=215, avg= 2.66, stdev= 3.12
    clat (nsec): min=240, max=6672, avg=493.69, stdev=463.46
     lat (msec): min=2, max=215, avg= 2.66, stdev= 3.13
    clat percentiles (nsec):
     |  1.00th=[  302],  5.00th=[  310], 10.00th=[  322], 20.00th=[  342],
     | 30.00th=[  350], 40.00th=[  370], 50.00th=[  390], 60.00th=[  430],
     | 70.00th=[  482], 80.00th=[  548], 90.00th=[  652], 95.00th=[  732],
     | 99.00th=[ 3568], 99.50th=[ 4704], 99.90th=[ 5344], 99.95th=[ 5664],
     | 99.99th=[ 6176]
   bw (  KiB/s): min=   48, max= 1848, per=100.00%, avg=1510.52, stdev=310.99, samples=119
   iops        : min=   12, max=  462, avg=377.63, stdev=77.75, samples=119
  lat (nsec)   : 250=0.01%, 500=72.29%, 750=23.05%, 1000=2.07%
  lat (usec)   : 2=1.55%, 4=0.09%, 10=0.94%
  cpu          : usr=0.06%, sys=0.51%, ctx=45156, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22577,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1505KiB/s (1541kB/s), 1505KiB/s-1505KiB/s (1541kB/s-1541kB/s), io=88.2MiB (92.5MB), run=60001-60001msec

Code:
ZFS RAID10 -> Read

root@proxmox:/DATA# fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/DATA/test
seq_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2147MiB/s][r=550k IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=342133: Sat Jun 10 12:39:42 2023
  read: IOPS=547k, BW=2136MiB/s (2239MB/s)(125GiB/60001msec)
    slat (nsec): min=1021, max=137150, avg=1485.40, stdev=2214.37
    clat (nsec): min=150, max=24055, avg=173.76, stdev=23.01
     lat (nsec): min=1202, max=137561, avg=1686.19, stdev=2225.97
    clat percentiles (nsec):
     |  1.00th=[  161],  5.00th=[  161], 10.00th=[  161], 20.00th=[  171],
     | 30.00th=[  171], 40.00th=[  171], 50.00th=[  171], 60.00th=[  171],
     | 70.00th=[  171], 80.00th=[  181], 90.00th=[  191], 95.00th=[  191],
     | 99.00th=[  221], 99.50th=[  221], 99.90th=[  241], 99.95th=[  251],
     | 99.99th=[  342]
   bw (  MiB/s): min= 2114, max= 2178, per=100.00%, avg=2136.20, stdev=11.38, samples=119
   iops        : min=541348, max=557620, avg=546866.69, stdev=2912.00, samples=119
  lat (nsec)   : 250=99.95%, 500=0.04%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=17.62%, sys=82.37%, ctx=165, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=32803198,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2136MiB/s (2239MB/s), 2136MiB/s-2136MiB/s (2239MB/s-2239MB/s), io=125GiB (134GB), run=60001-60001msec

Code:
ZFS-Z2 -> Write

root@proxmox:/DATA2# fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_write --filename=/DATA2/test
seq_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=460KiB/s][w=115 IOPS][eta 00m:00s]
seq_write: (groupid=0, jobs=1): err= 0: pid=361645: Sat Jun 10 12:43:28 2023
  write: IOPS=115, BW=461KiB/s (472kB/s)(27.0MiB/60001msec); 0 zone resets
    slat (msec): min=2, max=117, avg= 8.67, stdev= 4.85
    clat (nsec): min=351, max=6632, avg=963.28, stdev=556.64
     lat (msec): min=2, max=117, avg= 8.67, stdev= 4.85
    clat percentiles (nsec):
     |  1.00th=[  390],  5.00th=[  442], 10.00th=[  490], 20.00th=[  540],
     | 30.00th=[  612], 40.00th=[  684], 50.00th=[  788], 60.00th=[ 1272],
     | 70.00th=[ 1288], 80.00th=[ 1288], 90.00th=[ 1304], 95.00th=[ 1320],
     | 99.00th=[ 4048], 99.50th=[ 4768], 99.90th=[ 5856], 99.95th=[ 6240],
     | 99.99th=[ 6624]
   bw (  KiB/s): min=  248, max=  768, per=99.99%, avg=461.18, stdev=89.98, samples=119
   iops        : min=   62, max=  192, avg=115.29, stdev=22.49, samples=119
  lat (nsec)   : 500=11.31%, 750=36.08%, 1000=5.91%
  lat (usec)   : 2=45.42%, 4=0.22%, 10=1.07%
  cpu          : usr=0.03%, sys=0.28%, ctx=13834, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,6916,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=461KiB/s (472kB/s), 461KiB/s-461KiB/s (472kB/s-472kB/s), io=27.0MiB (28.3MB), run=60001-60001msec

Code:
ZFS-Z2 -> Read

root@proxmox:/DATA2# fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/DATA2/test
seq_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2120MiB/s][r=543k IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=365542: Sat Jun 10 12:48:19 2023
  read: IOPS=540k, BW=2110MiB/s (2212MB/s)(124GiB/60001msec)
    slat (nsec): min=1031, max=46708, avg=1495.18, stdev=2182.03
    clat (nsec): min=150, max=28685, avg=180.17, stdev=23.59
     lat (nsec): min=1222, max=47169, avg=1702.59, stdev=2195.24
    clat percentiles (nsec):
     |  1.00th=[  161],  5.00th=[  171], 10.00th=[  171], 20.00th=[  171],
     | 30.00th=[  171], 40.00th=[  171], 50.00th=[  181], 60.00th=[  181],
     | 70.00th=[  181], 80.00th=[  191], 90.00th=[  191], 95.00th=[  201],
     | 99.00th=[  231], 99.50th=[  241], 99.90th=[  262], 99.95th=[  270],
     | 99.99th=[  382]
   bw (  MiB/s): min= 2062, max= 2144, per=100.00%, avg=2110.29, stdev=16.52, samples=119
   iops        : min=527918, max=548864, avg=540235.29, stdev=4228.13, samples=119
  lat (nsec)   : 250=99.70%, 500=0.29%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=17.84%, sys=82.16%, ctx=95, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=32405581,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2110MiB/s (2212MB/s), 2110MiB/s-2110MiB/s (2212MB/s-2212MB/s), io=124GiB (133GB), run=60001-60001msec

Available storage difference:
RAID-Z2 -> 104TB
RAID10 -> 73TB

Still not sure if I should go with RAID10 or Z2. The write performance isn't that great, but there's more storage available.
I think Z2 should be okay for Samba storage only.
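(Though with bs=4K, sync=1 and iodepth=1 those runs mostly show single-threaded sync latency; a more sequential test for the Samba use case would look roughly like this:)

Code:
fio --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=16 --numjobs=1 \
    --size=20G --runtime=60 --time_based --name=seq_write_1m --filename=/DATA/test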

Cheers
 
Can anyone help me with this:

Code:
zpool status
  pool: HDD_Z2
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
    Expect reduced performance.
action: Replace affected devices with devices that support the
    configured block size, or migrate data to a properly configured
    pool.
config:

    NAME                                  STATE     READ WRITE CKSUM
    HDD_Z2                                ONLINE       0     0     0
      raidz2-0                            ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG64ZHE  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7WHTE  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7Y79A  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7VRXA  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7XKNA  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7KSTA  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7ER9A  ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LG7M8RA  ONLINE       0     0     0
    special
      hdd_metadata                        ONLINE       0     0     0  block size: 4096B configured, 8192B native

errors: No known data errors

  pool: rpool
 state: ONLINE
config:

    NAME                                 STATE     READ WRITE CKSUM
    rpool                                ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        nvme-eui.0025384431415a66-part3  ONLINE       0     0     0
        nvme-eui.0025384431414875-part3  ONLINE       0     0     0
      mirror-1                           ONLINE       0     0     0
        nvme-eui.00253844314008e0-part3  ONLINE       0     0     0
        nvme-eui.00253844314008c6-part3  ONLINE       0     0     0

errors: No known data errors

The idea was: I install Proxmox with a ZFS RAID10 NVMe array,
- afterwards create a 500GB zvol out of that NVMe array: zfs create -V 500G rpool/hdd_metadata
- create a RAID-Z2 pool out of the 8x 20TB disks (104TB array)
- add the 500GB NVMe zvol to the HDD_Z2 pool as a special device: zpool add -f HDD_Z2 special /dev/zvol/rpool/hdd_metadata

Now I'm getting a volblocksize mismatch; I'm pretty sure I could recreate the zvol with a 4K volblocksize, but the problem is...
I don't know if my approach/idea is even that good.

So would it be better to have left free space on all 4 NVMe devices earlier, partition them, and use those partitions directly as the special device?
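Something like this is what I mean, assuming a spare partition had been left on each NVMe during the install (names and partition numbers are just examples):

Code:
# add the spare partitions of the 4 NVMes directly as special vdevs (2 mirrors)
zpool add HDD_Z2 special \
    mirror /dev/disk/by-id/nvme-SSD_A-part4 /dev/disk/by-id/nvme-SSD_B-part4 \
    mirror /dev/disk/by-id/nvme-SSD_C-part4 /dev/disk/by-id/nvme-SSD_D-part4

# (the volblocksize warning alone could also be silenced by recreating the zvol with
#  -o volblocksize=4k, but using a zvol from one pool as a vdev of another pool is
#  generally considered fragile)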
 
