Why Does ZFS Hate my Server

jwsl224

Member
Apr 6, 2024
i have this bizarre situation where using ZFS on a Dell R730xd leads to something completely useless. i have spent most of an entire week trying to troubleshoot it in every manner possible. it simply. doesn't. budge.
system topology:
Dell R730xd
it has two 2.5" sata ports in the back connected directly to the motherboard
12x 3.5" SAS drives on the front, connected to the system through the backplane; i've tried both the PERC in HBA mode and an actual Dell HBA

so firstly, no matter what system topology i use, both windows and proxmox on EXT4 will perfectly saturate the SATA bus in the back, pegging out at 450+ MB/s on the metal, and about 250 MB/s in a virtual machine. (for writes. reads in a vm are also 450 MB/s)

as soon as i switch to ZFS, things completely fall apart. i have tried an ashift of 9, 12, and 13. i have tried turning compression off and on. i have tried ARC sizes from 64MB to 16GB. i have tried zfs_dirty_data_max/zfs_dirty_data_max_max from 0 to 32GB. absolutely NOTHING helps.
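(for reference, a minimal sketch of how these module parameters are typically adjusted on proxmox; the values below are placeholders, not the exact ones from every run:)

Code:
# runtime changes (take effect immediately, lost on reboot)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max         # 16 GiB ARC cap
echo 8589934592  > /sys/module/zfs/parameters/zfs_dirty_data_max  # 8 GiB dirty data
# to make them persistent, put the same values in /etc/modprobe.d/zfs.conf as
#   options zfs zfs_arc_max=17179869184 zfs_dirty_data_max=8589934592
# and run "update-initramfs -u" before rebooting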

symptoms: using zfs, the 2.5" sata drives will completely peg out at about 60 MB/s, and the 12Gbps SAS drives on the front backplane will peg out at 5 MB/s (5,000 KB/s). this is on sequential 1M writes using 20 threads.

with a large zfs_dirty_data pool you can see the RAM just eating those disk performance tests, and they will shoot up to 16-17 GB/s (16-17,000 MB/s). as soon as the dirty data pool gets full, though, the vm will essentially lock up while ZFS slowly writes the now massive pool of dirty data to the ssd's at 60 MB/s. after quite a few minutes, the vm comes back to life and disk utilization finally moves off 100%.

what the heck is this. it has eaten most of a week for me. and people are installing ZFS on much worse systems than an r730xd with 256gb of ram and 2x xeon e5-2678 v3 CPU's.

also, i'm using proxmox ve 8.2 for these test installs.
 

First, ZFS is a LOT slower than Ext4 or anything else.
So when i need really fast storage, i prefer lvm.

Second, it depends on what the use case is. some tunings that i use that are independent of the use case and independent of the drive type (HDD/NVMe) are (see the example commands after this list):
xattr=sa
atime=off (don't use this if the dataset serves as PBS backup storage)
recordsize=128k (128k is the default and is pretty good for mixed use cases; 2M is hugely recommended for PBS backup storage)
acltype=posix (if you don't use samba shares with special permissions on that storage itself)
dnodesize=auto (always recommended, but it has to be set from the start; it doesn't help applied afterwards)
dedup=off (default is off; never turn it on, deduplication is terrible for speed in almost all situations)
redundant_metadata=most (good for performance, but bad if you need the highest possible integrity. whether it's safe depends on where the metadata lives: if you use a dedicated special vdev for metadata it has to be mirrored, and in a default setup with mirrored disks it is safe either way. it's really only risky if your special vdev, or the storage itself, isn't mirrored. with for example a raid10 of 8 disks it's absolutely safe to set this to "most")
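
If it helps, here is a minimal sketch of how those properties get applied; the pool/dataset name "tank/vmdata" is just a placeholder, and dnodesize (ideally xattr too) belongs at dataset creation time:

Code:
# "tank/vmdata" is a placeholder dataset name; adjust to your layout
zfs create -o dnodesize=auto -o xattr=sa -o atime=off tank/vmdata
zfs set recordsize=128k tank/vmdata
zfs set acltype=posix tank/vmdata
zfs set redundant_metadata=most tank/vmdata
zfs get dedup tank/vmdata    # should already report "off" (the default)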

If SSD storage (no restrictions):
logbias=throughput (recommended only for datasets used primarily for Postgres/MySQL)

On really fast nvme drives in a ZFS Raid 10 (Like 12x Micron 7450 Max) i do additionally:
primarycache=metadata (it's faster without caching everything to memory here)

----------

Next pro tip: on nvme drives, and even on some scsi drives, you can check with "smartctl -a <disk>" which blocksize is the most performant:
Supported LBA Sizes (NSID 0x1)
Code:
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         2
 1 +    4096       0         0

minus = not in use
plus = in use
Rel_Perf: the lower the number, the better the performance

So you may have to reformat the drive, with either hdparm (scsi drives) or "nvme format" (nvme drives).
Then set the appropriate ashift when you create the zfs pool so it matches your new blocksize.
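
Roughly like this for an nvme drive; the device names and pool layout below are just placeholders, and "nvme format" wipes the namespace:

Code:
# DESTRUCTIVE: "nvme format" erases all data on the namespace
nvme format /dev/nvme0n1 --lbaf=1          # index 1 = the 4096-byte format from the table above
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1   # ashift=12 -> 4k sectors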

------------------

That's all the tips i can give you. But yeah, zfs is definitely slow compared to everything else.
However, i still use zfs on all 12 servers here, ranging from homelab boxes to max-specced Genoa 9374F servers, because of the features.
Only on about 2 servers am i using LVM, because zfs is so utterly slow and i don't need the integrity or compression or tons of snapshots etc. there.

Another tip is to disable primarycache for benchmarking only and re-enable it afterwards.
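For example (dataset name is a placeholder):

Code:
zfs set primarycache=none tank/vmdata    # benchmark without the ARC
zfs set primarycache=all tank/vmdata     # restore the default afterwards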

Hope that helps, cheers
 
What is the configuration of your ZFS pool? Also note that EXT4 on a single disk will have different characteristics than writing to a RAIDZ2 pool, as with the latter ALL disks need to respond. ZFS also actually writes the data to disk before returning. If things work on an individual disk but not on a pool, check with iostat whether any disk(s) are very busy; that may indicate faulty disks, connections, or other hardware problems.
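Something along these lines while a write test is running (pool name is a placeholder, replace it with yours):

Code:
iostat -xm 2                # per-device %util and await; one outlier disk usually points at a hardware problem
zpool iostat -v yourpool 2  # the same picture from the ZFS side, per vdev/disk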

That being said, 60MB/s is the true speed of random IOPS on a spinning disk, and if you're aggregating 20 threads that is pretty good; 12 spinners, even at 10k RPM, won't be very fast in modern terms if you care about the data. If you're using SSDs: this is an older server, the Dell SSDs of that generation were not very fast, and consumer-grade SSDs are also hit and miss. Again, EXT4 may not be syncing your data to disk right away, so you may really be testing the throughput of the SATA bus to the disk cache.
 
Hi,

How does this look:

zpool status -v

Good luck / Bafta !
first of all, jeepers. thanks to all three of you for stopping by. i really appreciate it after almost a week of not getting anywhere with this.

ok, i installed two different ssd's i had in a drawer, just to test whether the drives i was originally trying to use (which had previously been run behind a hardware raid controller) had some issue where the raid card changed something to optimize them for hardware raid. turns out, the issue is the same either way. (and yes, those are cheap ssd's, but the issue here is more than cheap ssd's.) so here's the command you requested:

Code:
root@r730xd-1:~# zpool status -v
  pool: rpool
 state: ONLINE
config:

        NAME                                                        STATE     READ WRITE CKSUM
        rpool                                                       ONLINE       0     0     0
          mirror-0                                                  ONLINE       0     0     0
            ata-Patriot_Burst_Elite_120GB_PBEIIBB23122503052-part3  ONLINE       0     0     0
            ata-Patriot_Burst_Elite_120GB_PBEIIBB23122508793-part3  ONLINE       0     0     0

errors: No known data errors
 
That being said, 60MB/s is the true speed of random IOPS on a spinning disk
i am talking about sequential writes here, to an SSD. even without caching, it's CERTAINLY supposed to get more than 60 MB/s. the spinning SAS drives get 5 MB/s. something is certainly horrendously off.
 
Second, it depends on what the use case is. some tunings that i use that are independent of the use case and independent of the drive type (HDD/NVMe) are:
xattr=sa, atime=off, recordsize=128k, acltype=posix, dnodesize=auto, dedup=off, redundant_metadata=most [...]
jeepers. that's more optimizations than i even thought existed. thank you for taking the time to type that out. i will certainly be coming back to that often in the future. but first, i need to get this issue solved so that i have a baseline functional system.

So you may have to reformat the drive, with either hdparm (scsi drives) or "nvme format" (nvme drives).
Then set the appropriate ashift when you create the zfs pool so it matches your new blocksize.
this is one thing that stuck out to me. i wonder if i need to format these drives first before they can be used optimally with zfs. could you help me check that? running your suggested command, here is the output:


Code:
root@r730xd-1:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.4-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Patriot Burst Elite 120GB
Serial Number:    PBEIIBB23122508793
LU WWN Device Id: 0 000000 000000000
Firmware Version: HT3618C1
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Sep  7 12:02:06 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   4) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       2
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       8804
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       2
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       44
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       22256
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       2
194 Temperature_Celsius     0x0032   100   100   050    Old_age   Always       -       50
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0032   100   100   050    Old_age   Always       -       1901
242 Total_LBAs_Read         0x0032   100   100   050    Old_age   Always       -       5

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
post the actual benchmark command and result. it's also instructive to know WHERE the benchmark is performed (e.g., on the host, or in a guest, and what type of guest)
absolutely. and thank you for dropping by.
first of all, here is the output of fdisk -l on this test system so you can see what we are working with:


Code:
root@r730xd-1:~# fdisk -l
Disk /dev/sda: 29.11 TiB, 32001801322496 bytes, 62503518208 sectors
Disk model: PERC H730 Mini 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5DB50750-D882-411E-B91B-993B1CC33115


Disk /dev/sdb: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: Patriot Burst El
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BF781D07-0451-4281-BB53-D44EA29AAF38

Device       Start       End   Sectors   Size Type
/dev/sdb1       34      2047      2014  1007K BIOS boot
/dev/sdb2     2048   2099199   2097152     1G EFI System
/dev/sdb3  2099200 234441614 232342415 110.8G Solaris /usr & Apple ZFS


Disk /dev/sdc: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: Patriot Burst El
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 07E8E706-318F-48A0-B66D-6977100F9EE2

Device       Start       End   Sectors   Size Type
/dev/sdc1       34      2047      2014  1007K BIOS boot
/dev/sdc2     2048   2099199   2097152     1G EFI System
/dev/sdc3  2099200 234441614 232342415 110.8G Solaris /usr & Apple ZFS


Disk /dev/zd0: 60 GiB, 64424509440 bytes, 125829120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: dos
Disk identifier: 0x7ac443b2

Device     Boot     Start       End   Sectors  Size Id Type
/dev/zd0p1 *         2048    104447    102400   50M  7 HPFS/NTFS/exFAT
/dev/zd0p2         104448 124781196 124676749 59.5G  7 HPFS/NTFS/exFAT
/dev/zd0p3      124782592 125825023   1042432  509M 27 Hidden NTFS WinRE
The backup GPT table is corrupt, but the primary appears OK, so that will be used.


Disk /dev/sdd: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: Samsung SSD 860
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 92975E7F-9EFE-4A79-96D4-4962F04DBB5D


so, the benchmark we are running is going to be in the shell of the host proxmox system, using fio, and here is the command we will execute:

Code:
fio --filename=/usr/test/testfile --rw=write --bs=4k --ioengine=io_uring --runtime=240 --time_based --numjobs=1 --iodepth=8 --name=speed_test --size=30G

i am using a 30G test file to make sure we get past the zfs cache to where we are actually writing to the disk. the result of this benchmark is this:

Code:
root@r730xd-1:~# fio --filename=/usr/test/testfile --rw=write --bs=4k --ioengine=io_uring --runtime=240 --time_based --numjobs=1 --iodepth=8 --name=speed_test --size=30G
speed_test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=8
fio-3.33
Starting 1 process
speed_test: Laying out IO file (1 file / 30720MiB)
Jobs: 1 (f=1): [W(1)][100.0%][eta 00m:00s]                         
speed_test: (groupid=0, jobs=1): err= 0: pid=30450: Sat Sep  7 17:18:29 2024
  write: IOPS=35.9k, BW=140MiB/s (147MB/s)(33.5GiB/244694msec); 0 zone resets
    slat (nsec): min=780, max=2836.4k, avg=1535.41, stdev=1265.61
    clat (usec): min=9, max=20226k, avg=220.91, stdev=32924.68
     lat (usec): min=10, max=20226k, avg=222.44, stdev=32924.69
    clat percentiles (usec):
     |  1.00th=[   34],  5.00th=[   35], 10.00th=[   35], 20.00th=[   35],
     | 30.00th=[   36], 40.00th=[   38], 50.00th=[   40], 60.00th=[   42],
     | 70.00th=[   56], 80.00th=[   64], 90.00th=[   98], 95.00th=[  469],
     | 99.00th=[  865], 99.50th=[  898], 99.90th=[24511], 99.95th=[28443],
     | 99.99th=[31065]
   bw (  KiB/s): min= 2808, max=726400, per=100.00%, avg=235126.48, stdev=266421.99, samples=299
   iops        : min=  702, max=181600, avg=58781.65, stdev=66605.54, samples=299
  lat (usec)   : 10=0.01%, 20=0.05%, 50=67.76%, 100=22.38%, 250=2.02%
  lat (usec)   : 500=3.25%, 750=2.44%, 1000=1.83%
  lat (msec)   : 2=0.11%, 4=0.01%, 10=0.01%, 20=0.05%, 50=0.11%
  lat (msec)   : 100=0.01%, 250=0.01%, 2000=0.01%, >=2000=0.01%
  cpu          : usr=5.50%, sys=8.24%, ctx=5137281, majf=0, minf=103
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8787945,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=33.5GiB (36.0GB), run=244694-244694msec


and, while the test was running, here is an "iostat -xs 1" output to show the drives being written to:

[attached screenshot: iostat output captured during the test]
 
These cheap ssd disks will die in a few months, as ZFS writes more data than the actual payload for integrity, plus there is additional write amplification for zvols.
Don't expect to get better performance with those.
 
this is one thing that stuck out to me. i wonder if i need to format these drives first before they can be used optimally with zfs. could you help me check that? running your suggested command, here is the output:

[smartctl -a /dev/sdc output as posted above: Sector Size reported as 512 bytes logical/physical, with no "Supported LBA Sizes" section]
When that LBA section isn't shown, your drive doesn't support other sizes. As far as i know it shows up only on some SAS HDDs and almost all nvme drives.
But i added it for the sake of completeness.

Tbh, keep in mind that even with my optimizations, zfs is slower than any other filesystem. And well, yeah, your drives aren't the best for zfs either.
There are actually no good "recommended" drives for zfs, but because ZFS is slow, good drives like Micron 7450/7500/9300/9400, Samsung PM9a3 and so on will perform very well, because they offer tons of iops.
But with ext4/lvm they will still be faster, almost twice as fast. (Except when the drives are very fast, like 12x 7450 Max; then it's almost a tie.)

However, before all the speculation:
1MB blocks x 20 threads and just 60 MB/s (your initial post) is really pretty bad. (i think the drives are the problem, but anyway)
What does pveperf tell you?
You can even execute "pveperf /targetstorage"

Cheers.
 
When that LBA section isn't shown, your drive doesn't support other sizes. As far as i know it shows up only on some SAS HDDs and almost all nvme drives. [...]
1MB blocks x 20 threads and just 60 MB/s (your initial post) is really pretty bad. What does pveperf tell you? You can even execute "pveperf /targetstorage"
the reason i swapped in those cheap drives to run tests with is that, in the event you had told me they need some kind of low-level format to work correctly, doing that on 120gb drives would have been faster than on the 1tb ones that are actually going to be used. there's no reason they shouldn't get to at least 75% of the 450 MB/s sata bus speed, not the 25 MB/s the test shows. speaking of the sas drives, here is what they look like when a vm is trying to write to them:

[attached screenshot: iostat output for the SAS drives while the VM writes]
they're pegging out at 5 MB/s. so something is definitely foundationally wrong with this server's execution of zfs.

here is the command you had me execute:

Code:
root@r730xd-1:~# pveperf /usr/test
CPU BOGOMIPS:      120000.24
REGEX/SECOND:      3017756
HD SIZE:           84.97 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     6790.02
DNS EXT:           53.18 ms
DNS INT:           34.42 ms (local)
 
So you are getting really good performance for that brand at 140 MB/s. A review of that series reports average write transfer rates of about 71 MB/s and 56 MB/s.

Those SSDs are pure garbage; they may work for booting, they may eat your data, YMMV.
 
So you are getting really good performance for that brand at 140 MB/s. A review of that series reports average write transfer rates of about 71 MB/s and 56 MB/s.

Those SSDs are pure garbage; they may work for booting, they may eat your data, YMMV.
again: they are only for testing. also, 140 MB/s is the average over the whole test, including the fast zfs-cache-backed beginning. what's more, does your review explain why the 12 Gbps SAS drives are capping out at 5 MB/s? and even more, those same drives have no problem doing sustained writes at 400+ MB/s under windows or EXT4. the moral of the story is that i have some kind of problem running ZFS on this server, and i'm trying to find out what it is. hence the 2 test ssd's.
 
Yes, they are for testing, fine, but you can’t expect full bandwidth performance.

You posted the performance analysis, but I don't see which test it came from. I don't immediately see anything wrong based on the test you ran: you are maxing out IOPS, not testing bandwidth/throughput. And that is indeed the right test for VM servers; it was not unusual to have 24+ 15k RPM spinning disks per VMware server back in the day, more if you wanted databases.

This has nothing to do with ZFS but with the limits of spinning disks; those IOPS across 12 disks are pretty good.
 
don't report tests with your cheap ssd's, as they aren't suitable ZFS members.

what is your HDD ZFS configuration?
the fastest layout is striped mirrors (=RAID10); a sketch follows below.
with 12 HDD disks, the max thread count for a sequential test should be 6.
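
A minimal sketch of that 12-disk striped-mirror layout (device names are placeholders; use /dev/disk/by-id paths in practice):

Code:
# six mirrored pairs striped together = RAID10-style pool
zpool create -o ashift=12 tank \
  mirror sda sdb   mirror sdc sdd   mirror sde sdf \
  mirror sdg sdh   mirror sdi sdj   mirror sdk sdl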
 
don't report tests with your cheap ssd's, as they aren't suitable ZFS members.

what is your HDD ZFS configuration?
the fastest layout is striped mirrors (=RAID10).
with 12 HDD disks, the max thread count for a sequential test should be 6.
[attached screenshot: iostat output for the SAS pool during the VM's sequential writes]

this test is not the cheap ssd's. as mentioned, those are Dell 12Gbps SAS drives, and they are pegging out at 5 MB/s. guys. jeepers. i KNOW what a hard drive can do at a basic level. that is not what's happening here.

this configuration is raidz, and this is a vm asking for sequential writes.
 
yes, the SAS drives will not provide more than 500 IOPS per disk worth of small (12kB) sync writes.

The metric you are testing is the wrong one: you are maxing out IOPS, which is generally what we care about for shared storage in VM situations. But if you care about throughput, do large (1MB) single-thread sequential unsynced block writes, make sure your VM's disk isn't set to force sync writes (which is the default) but to write back, and you should see ~600-1200 MB/s. Also make sure you are using virtio.
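
For example, something along these lines run against the pool (the path and exact flags are only an illustration of that kind of throughput test, not a prescribed command):

Code:
fio --name=seq_write --filename=/tank/test/fio_seq --size=30G \
    --rw=write --bs=1M --ioengine=io_uring --iodepth=8 --numjobs=1 \
    --end_fsync=1 --runtime=120 --time_based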
 
