Yet Another "Poor ZFS performance issue"

anichang

New Member
Jan 27, 2024
Edit: Problem solved. Updating the firmware on both the HBA and the SAS expander fixed the performance.

The system is an AMD Threadripper 1900X with 64GB of ECC RAM, a Supermicro 9300/3008 HBA (mpt3sas) connected to an Adaptec SAS expander over a single SAS cable (i.e. max 12Gbps per lane), and 6 WD disks (some 5TB 5700rpm Greens and some 4TB 7200rpm Blacks) attached to the SAS expander.

Read speed is about 70MB/s and the CPU gets clogged with I/O wait.
If I disable caching (i.e. "zfs set primarycache=metadata tank") read speed drops to 2.5MB/s.
Scrub goes up to 800-1000MB/s.
If I use ext4 on a single disk I get full read speed (i.e. 130-180MB/s).
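
For the record, those read numbers come from simple sequential reads of large files, along these lines (a sketch; the file path is illustrative):
Code:
# bypass the ARC for file data so reads hit the disks directly
zfs set primarycache=metadata tank
# sequential read of a large file (path is illustrative)
dd if=/tank/bigfile of=/dev/null bs=1M status=progress
# restore the default caching behaviour afterwards
zfs set primarycache=all tank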

As a reference I'm using the benchmarks here. So, considering my 6-disk raidz1, I should get something better than:
Code:
5x 4TB, raidz1 (raid5),       15.0 TB,  w=469MB/s , rw=79MB/s  , r=598MB/s

I created the raidz1 pool with ashift=9 (as some disks don't support a 4096-byte block size) and compression=off (for benchmarking purposes); no other options were given.
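For reference, the creation command was essentially this (a sketch; the device paths are placeholders):
Code:
# pool creation as described above; device paths are placeholders
zpool create -o ashift=9 -O compression=off tank raidz1 \
    /dev/disk/by-id/wwn-disk1 /dev/disk/by-id/wwn-disk2 /dev/disk/by-id/wwn-disk3 \
    /dev/disk/by-id/wwn-disk4 /dev/disk/by-id/wwn-disk5 /dev/disk/by-id/wwn-disk6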
I've been toying with this setup for a while: hdparm/smartctl to enable the disk cache on all disks, disabling ZFS sync, and so on, without any good result.
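Roughly the per-disk commands used (a sketch; /dev/sdX stands for each member disk, and sync=disabled is for benchmarking only):
Code:
# enable the on-disk volatile write cache (ATA disks)
hdparm -W1 /dev/sdX
# same thing via smartmontools (also works for SAS/SCSI disks)
smartctl -s wcache,on /dev/sdX
# disable ZFS sync writes for benchmarking only; unsafe for real data
zfs set sync=disabled tank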
So I tried passing the HBA controller through to a TrueNAS VM.

Code:
root@truenas[/mnt/tank]# zpool status
  pool: boot-pool
 state: ONLINE
config:


        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdb3      ONLINE       0     0     0


errors: No known data errors


  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:01:30 with 0 errors on Sat Jan 27 14:31:54 2024
config:


        NAME                                      STATE     READ WRITE CKSUM
        tank                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            6ba7075b-b593-4c24-8707-fb9a0c9260fa  ONLINE       0     0     0
            437619e9-3e2e-4916-8533-faad27999189  ONLINE       0     0     0
            791a5f3e-2e78-4d14-9b86-6512f5fccad9  ONLINE       0     0     0
            4d54f693-9c86-4e62-bae2-43a5b6778165  ONLINE       0     0     0
            a3d5a21c-0c29-4521-ac0d-d00561bfe8ff  ONLINE       0     0     0
            77c02b25-403e-497b-abef-81bb94c91590  ONLINE       0     0     0


errors: No known data errors

Read speed is a bit better, but still underachieving.

Code:
root@truenas[/mnt/tank]# dd if=srand | pv | dd of=/dev/null
^C.3GiB 0:01:35 [ 118MiB/s] [        <=>                                                                             ]
23820368+0 records in
23820367+0 records out
23820373+0 records in
23820372+0 records out
12196027904 bytes (12 GB, 11 GiB) copied, 95.5508 s, 128 MB/s
12196030464 bytes (12 GB, 11 GiB) copied, 95.5511 s, 128 MB/s

root@truenas[/mnt/tank]#

(the file "srand" is just a 48GB file pre-filled with random bytes using openssl)

And I have no idea how to debug this. The hardware seems fine, since read speeds for both a single ext4 disk and a ZFS scrub are maxed out. It must be something in software. Any clues?

Regards
 
Don't use ashift=9 unless ALL disks are 512B native. Performance drops terribly when doing 512B I/O to an HDD that actually uses a 4K physical sector size. So: ashift=9 for 512B/512B logical/physical sector disks, ashift=12 for 512B/4K or 4K/4K logical/physical sector disks.
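
A quick way to check what each disk reports before choosing ashift (a sketch; lsblk ships with util-linux):
Code:
# logical/physical sector size per disk
lsblk -d -o NAME,LOG-SEC,PHY-SEC
# if any member disk reports a 4K physical sector, create with ashift=12
zpool create -o ashift=12 tank raidz1 /dev/disk/by-id/...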
 
Don't use ashift=9 unless ALL disks are 512B native. Performance drops terribly when doing 512B I/O to an HDD that actually uses a 4K physical sector size. So: ashift=9 for 512B/512B logical/physical sector disks, ashift=12 for 512B/4K or 4K/4K logical/physical sector disks.

ack'ed, thanks.
 
Mixing disks with different performance lowers the whole pool's performance to that of the slowest disk.
Yeah, I figured; it's a waste, but it's all I have. I'd like to drop the 7200rpm ones for energy consumption reasons and get Advanced Format ones, but it's tricky to figure out (before buying) which models support AF. Example: my 2 WD Blacks are apparently identical, yet one supports both 512 and 4096 byte block sizes and the other doesn't.

Overall, the market says 2TB NVMe drives go for 150EUR on Amazon... I don't want to spend that much on 4TB spinning disks, or the whole system is a huge fail, since NVMe would cost the same while being much more performant and far less power hungry. I'm checking a few websites daily for the right lot of second-hand >5TB SATA disks, but no luck yet.
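
For what it's worth, for disks already in hand the reported sector sizes show up in the SMART identity data (a sketch; /dev/sdX is a placeholder):
Code:
# AF drives report e.g. "512 bytes logical, 4096 bytes physical"
smartctl -i /dev/sdX | grep -i 'sector size'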

To give you the big picture: the system is 2x EATX motherboards (both 1st/2nd gen Threadrippers) in a Thermaltake W200 "consumer rack" (a dual-system case, about the size of a half-height 19" rack), powered by a Phanteks Revolt X PSU (a rebranded Seasonic 1200W Platinum with dual ATX motherboard connectors and a silent mode). Both systems are water cooled with two independent custom loops and 200mm fans to keep them quiet. The system is almost silent, but I might pad the interior with foam to quiet it further, since 24 disks need some fans anyway (and 24 water-cooled HDD cages cost an arm and a leg; way too much). Sometimes I sleep in this room :)
I built it from my old workstation parts, adding only the missing pieces (case, HBA, pumps, ...). It was born from the idea of reusing my old gear AND not getting involved with second-hand enterprise stuff: enterprise gear is power hungry and noisy, it adds complexity that's unnecessary in a home environment, it comes in awkward form factors, and getting the missing nuts and bolts can hurt the wallet (because they're proprietary, and because docs are missing if you don't have access to vendor support sites and can't call their support lines). Yeah, you can find an old Xeon rack system for 100EUR on eBay, but putting it to work is a pain, and you can't mix it with consumer gear. Consider that I'm in the EU; the second-hand market (and the power bill, and the size of houses and rooms) is very different from the US.
So, all in all, my system is an anomaly. A bit of waste (e.g. mixing 7200 and 5700rpm disks) is... unavoidable.
 
second-hand enterprise stuff: enterprise gear is power hungry and noisy
Not all of it. For example, have a look at Supermicro's lineup, which offers normal uATX/ATX/EATX boards with 92mm tower coolers and no proprietary parts. No problem combining those with silent consumer ATX PSUs, consumer uATX/ATX/EATX cases, 140mm fans and so on. I've got second-hand servers here that are actually quieter and consume far less power at idle than my gaming PC.
 
Not all of it. For example, have a look at Supermicro's lineup, which offers normal uATX/ATX/EATX boards with 92mm tower coolers and no proprietary parts. No problem combining those with silent consumer ATX PSUs, consumer uATX/ATX/EATX cases, 140mm fans and so on. I've got second-hand servers here that are actually quieter and consume far less power at idle than my gaming PC.
I agree. And truth be told, I remember being a kid drooling over Supermicro's and Tyan's products of the day...

Last year, while planning this machine, I spent 2-3 days digging to evaluate getting a Supermicro EPYC system (with registered ECC DDR4), as cost-wise it was VERY attractive and it had partial compatibility with Threadripper gear (e.g. heatsinks; waterblocks in my case). But at the end of the day I opted for doubling my existing Threadripper (with ECC, NOT registered/buffered DDR4). When a system has hardware problems I can swap RAM modules, CPU, heatsink and so on to debug; with two different platforms I'd have no spares to test with in case of trouble, nor could I decommission one system and use its parts to service the other. And that's just one of my reasons for not going down the business/enterprise hardware route. I also investigated getting some old Xeons, since Intel stuff usually has higher-quality details (and core density, and less obnoxious NUMA quirks), but the differences from my existing gear were even bigger than with the EPYC systems.
As for power consumption: an underclocked and undervolted Threadripper is... an EPYC. More generally, these Threadripper boards have fewer PCIe lanes, and the CPUs have fewer maximum cores, less L2/L3 cache, and fewer RAM channels, but they're good enough for my needs (each system tops out at 32 cores, 128GB of RAM, and 5 PCIe slots, expandable with PCIe bifurcation).
 
