ZFS topology and speeds

Hello folks.

I've been looking at some of my zfs disk builds. Trying to optimize for speed.

This is an array of 8 identical spinners in a 13th-gen Dell PowerEdge.

zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                             2.26G  4.36T     71     34   989K   351K
  raidz1-0                        2.26G  4.36T     71     34   989K   351K
    scsi-350000397480bd199-part3      -      -     18      8   261K  88.9K
    scsi-350000397480bd259-part3      -      -     17      8   238K  87.5K
    scsi-350000397480b8b79-part3      -      -     17      8   251K  88.3K
    scsi-350000397480b8c01-part3      -      -     17      8   239K  86.5K

So that's read 989K and write 351K.

And then I add another vdev and the read speed goes down?

zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                             46.4G  8.67T      0     57  3.69K   508K
  raidz1-0                        23.4G  4.34T      0     28  1.96K   253K
    scsi-350000397480bd199-part3      -      -      0      7    518  64.9K
    scsi-350000397480bd259-part3      -      -      0      6    462  60.3K
    scsi-350000397480b8b79-part3      -      -      0      7    549  64.9K
    scsi-350000397480b8c01-part3      -      -      0      7    479  62.7K
  raidz1-1                        22.9G  4.34T      0     28  1.73K   255K
    scsi-350000397480ba0f9-part3      -      -      0      7    476  65.5K
    scsi-350000397480b6271-part3      -      -      0      6    413  61.1K
    scsi-350000397480bd34d-part3      -      -      0      7    471  65.4K
    scsi-35000039aa818b5c9-part3      -      -      0      7    406  62.8K

read 3.69K and write 508K

All of the disks are reporting lower speeds.
Total combined write goes up, as you'd think it should, but the disks are slower.
Read speeds dropped to the byte range.
Those numbers aren't just bad, they are awful.

I have the storage controller set to HBA mode.
Is there anything else here that I may have overlooked?
Why on earth would it do this?
 
Try a 3-way ZFS mirror of SSDs as a ZFS special device.
After setup, all your new metadata will be located on these devices.
If you want this for your existing data, you have to rewrite that data.
But only use enterprise SSDs with PLP. You can figure on roughly 2% of your raw HDD space for the ZFS special device size.
You can also set atime=off on your ZFS pool.
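Roughly, the commands would look like this (the SSD device IDs below are only placeholders, not disks from this thread):

# add a mirrored special vdev for metadata (placeholder SSD IDs)
zpool add rpool special mirror /dev/disk/by-id/ata-ENTERPRISE_SSD_1 /dev/disk/by-id/ata-ENTERPRISE_SSD_2 /dev/disk/by-id/ata-ENTERPRISE_SSD_3

# disable access-time updates on the pool
zfs set atime=off rpool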
 
And the Oscar goes to ... an unknown CPU and so on.
Please write more about / all of your setup.
Run a test with fio and check your ZFS block size.

Your data blocks now go 4K x 3 to raidz1-0 and 4K x 3 to raidz1-1 in parallel; if data blocks are missing, what happens then?
You have to write 24K to the ZFS pool at once.
Draw your alignment out on paper, with the default block size of 16K and the (default) ZFS recordsize of 128K, and align the 24K.

# https://pve.proxmox.com/wiki/Benchmarking_Storage
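Something along these lines, for example (paths and sizes are only placeholders, not a tuned benchmark):

# sequential write test against a file on the pool
fio --name=seqwrite --filename=/rpool/fio-test --size=4G --rw=write --bs=128k --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting

# check the block-size related settings
zpool get ashift rpool
zfs get recordsize rpool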
 
I appreciate the reply. I don't entirely understand you, but there's enough there that I can go look it up.

The hardware info provided here is sufficient. CPU info would be a distraction. We want to talk about drives.
It's an enterprise server with 8 bays of spinners. They are identical drives. Enough said.

The PVE installation wizard created the first vdev.
I added the second vdev manually via the CLI.
The speed went way way down after the second (identical!) vdev was joined.
That is not what I expected.

- OK, I'll check on 'test with fio and check your zfs blocksize'.
- And I'll try reading about alignment. That sounds like a good lead, but I'm pretty fuzzy about the details right now.
- The link doesn't work. I'll see if I can find that article. ... (Err ... well, the site is just lousy. The link works.)

And thanks again. We may not have communicated well here, but I had no leads at all. Now I have things to go look up. Groovy.
 
I read your link and a good deal of additional material. Thanks.

I used fio and did some testing. That's a hell of a tool, but I'm not very good with it yet.
Still, it gave me numbers that I can record and compare.

I used
zdb
zpool get all
zfs get all
And that produces a LOT of data.
I did my best to compare the two vdevs. They seem identically configured to me.
In particular, I looked at ashift, which appears to be the same in both vdevs
ashift: 12

I checked the sector size of my AL14SEB120N spinners.
Toshiba says: 512 to 528 B (fixed length).
So ashift=12 should be good, right?
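For the record, a quick way to cross-check what the drives themselves report (point it at whichever disk; /dev/sda here is just the one I used as the model):

lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC
smartctl -a /dev/sda | grep -iE 'sector|block'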

I partitioned the second set of disks using /dev/sda as a model and the sgdisk tool.
As far as I can tell, that produces a partition config identical to the installer's.
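Roughly like this, if memory serves (device names are examples; -R replicates the table onto the new disk, -G then gives it its own GUIDs):

sgdisk /dev/sda -R /dev/sde
sgdisk -G /dev/sde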


... But ...
I see in zpool history that I didn't set the ashift on the second vdev when I added it.


This is the installer making the first vdev.
History for 'rpool':
2025-11-16.18:35:00 zpool create -f -o cachefile=none -o ashift=12 rpool raidz1 /dev/disk/by-id/scsi-350000397480bd199-part3 /dev/disk/by-id/scsi-350000397480bd259-part3 /dev/disk/by-id/scsi-350000397480b8b79-part3 /dev/disk/by-id/scsi-350000397480b8c01-part3

And this is me adding the second vdev.
2025-11-16.19:30:04 zpool add rpool raidz1 scsi-350000397480ba0f9-part3 scsi-350000397480b6271-part3 scsi-350000397480bd34d-part3 scsi-35000039aa818b5c9-part3


Maybe I screwed up by not setting ashift?
Maybe it's wrong somehow, even though zdb says it's ashift 12 for both vdevs? Maybe it defaulted to something else?
This is my only lead.
I'm gonna burn it down and try again. More testing before I add the second vdev.
If this doesn't go well, I should consider your suggestion to run an array of 2 disk mirrors.
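When I redo it, I think the add should look more like this (same disk IDs as the history above, with ashift pinned explicitly and then verified per vdev; the grep target is just how zdb labels it as far as I can tell):

zpool add -o ashift=12 rpool raidz1 scsi-350000397480ba0f9-part3 scsi-350000397480b6271-part3 scsi-350000397480bd34d-part3 scsi-35000039aa818b5c9-part3

zpool get ashift rpool
zdb -l /dev/disk/by-id/scsi-350000397480ba0f9-part3 | grep ashift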
 
When you set up a ZFS pool with only HDDs, the total IOPS matter, and so does the ZFS pool layout.

You also need a ZFS special device for all the ZFS metadata, as a 2-way or 3-way ZFS mirror of SATA3 SSDs. Here in Germany we can easily get the Kingston DC600M 480G or 960G, with TLC, PLP, DRAM and more.

So work backwards from the default ZFS recordsize=128K and the sector size (512B or 4096B/4K) of your HDDs.

When you write 512B to a ZFS dataset on your current disk layout, you actually write a 128K record to it.
That is 256 x the 512B HDD sector size, spread across 8 disks, so the 512B blocks can end up in 43 places across all 8 disks.
If you use a 4K HDD sector size, then 8 x 512B sectors are one block on the HDD.

Case: 512B HDD sector size, write.
ZFS pool with 2 raidz1 vdevs of 4 disks each; remember that all the ZFS metadata is placed on the 8 HDDs too and must be read/written as well.

VDEV0: < data 512B | zero 512B | zero 512B | Parity >
VDEV1: < zero 512B | zero 512B | zero 512B | Parity >
Sum : 3072B

The system has to write 131072B in total (recordsize=128K).
So after that, the system writes 42 more stripes of zero-filled 512B sectors to the HDDs:
VDEV0: < zero 512B | zero 512B | zero 512B | Parity >
VDEV1: < zero 512B | zero 512B | zero 512B | Parity >
In total: 129024B
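Checking the arithmetic (assuming 2 x 4-disk raidz1 with 512B sectors, so 6 x 512B = 3072B of data per full stripe across both vdevs):

echo $(( (131072 + 3071) / 3072 ))    # 43 stripes for one 128K record, i.e. 42 more after the first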

Case: 512B HDD sector size, read.
The data is spread over the 8 disks. All disks will be read, all the 512B sectors will be arranged, and the system gets a 128K record back.

So a 4096B HDD Sector size will be better.

Some people change the zfs recordsize to 64kB.

When you have a ZFS volume (zvol), then the ZFS volblocksize=16K is what matters for the calculation.
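On a Proxmox pool that can be checked per zvol, for example (the zvol name below is only a made-up example of the usual rpool/data naming):

zfs get volblocksize rpool/data/vm-100-disk-0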
 
BTW (meant to post this yesterday)
"The command zpool iostat will display statistics for the specified interval (in seconds) for a given number of times, for example, zpool iostat 2 3 runs for 6 seconds (2 seconds, 3 times). Without these arguments, the command shows an average since the system booted."

so copy a huge file to the pool or start a scrub, then check iostat [with some numbers]
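For example (the interval and count here are arbitrary):

zpool scrub rpool
zpool iostat -v rpool 2 5    # sample every 2 seconds, 5 times, while the scrub is running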
 
Thanks all. I learned some things here.

I've learned fio is a good evaluation tool for the back-end disks.
I've learned more about zpool iostat numbers. They're only useful under load.
I've come back to relying on my ATTO disk tester. It's a free Windows app, and it gives solid data about the disk performance the VMs themselves experience.
I've experimented with the atime setting. Not much difference.
And I did a lot of reading about sector size, and boy, it sounds like in the end you just want ashift=12. I'm sure I have more to learn there.

For this set of spinners, I finally opted for ZFS RAID10 (striped mirrors).
The drives are large enough that it was an easy choice in the end. There's enough disk space to give up 4 disks to redundancy.
And that's the fastest topology I can do with 8 disks and be redundant.
And it tests out ok. Really significant improvement.
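For anyone who lands here later, the equivalent zpool create for that layout looks roughly like this (placeholder disk IDs and pool name; as far as I understand, the PVE installer builds the same thing when you pick RAID10):

zpool create -o ashift=12 tank mirror DISK1 DISK2 mirror DISK3 DISK4 mirror DISK5 DISK6 mirror DISK7 DISK8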

My struggle is far from over. I have a number of machines with sub-optimal ZFS arrays.
As I write, I'm ordering more disks for some different servers.
I'm sure I'll be back ...
 