3 x 4TB Samsung SSD in ZFS raidz1 => poor performance

That all really depends on your workload and setup. My home server is running 20 VMs and these write 900GB per day while idling, where most of the writes are just logs/metrics created by the VMs themselves. Sum that up and a 1TB consumer SSD's TBW will be exceeded within a year.
If you are not using any DBs doing a lot of small sync writes, and if you skip RAID/ZFS and just use a single SSD with LVM, it might survive for many years.
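If you want to check how much your own SSDs are actually writing and wearing out, smartmontools can show the relevant SMART attributes. A rough sketch (assuming the SSD shows up as /dev/sda; attribute names differ between vendors):
Code:
apt install smartmontools
smartctl -A /dev/sda
# On Samsung SATA SSDs look at "Wear_Leveling_Count" and "Total_LBAs_Written";
# Total_LBAs_Written * 512 bytes gives a rough estimate of the total written volume.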
Absolutely, but I think the context is sometimes forgotten, and often the ones trying to help live in different worlds and speak a different language than the newbies looking for help, who have different requirements and risk profiles. I make this mistake all the time myself when trying to help people in other domains where I am more of the expert.
Again, it's all about the workload. Consumer SSDs are great if you need small bursts of sequential async reads/writes, but the performance will drop massively as soon as the cache fills up under long sustained loads. In such a case an enterprise SSD will be much faster because its performance won't drop that hard. And most consumer SSDs won't be able to use caching at all for sync writes, so there the performance is always horrible.
That right there is what I'm talking about. Most people who just want to set this up for a small-scale home server have no clue whether they will be running stuff that only generates "small bursts of sequential async reads/writes". They just want a couple of VMs for Windows 10 and maybe a Home Assistant instance or whatnot, and maybe some LXCs for a few services. They don't know what sequential async reads/writes are or what generates them.

And please don't misunderstand... I think everyone appreciates the time people like yourself put into replying and assisting with issues here. It's just that the context is really important to recognize at both ends.
 
I think it's always a good idea to get enterprise-grade hardware if you want to run a server, unless it is just for testing.
ECC RAM and enterprise SSDs aren't that much more expensive, and even if you don't need those features now, it is future-proof as you get more experienced over time. Maybe you later want more data reliability and need to switch from LVM to ZFS for that. ECC and durable SSDs are recommended there, so you would otherwise need to buy everything twice.
And even if you don't want to do more complex stuff later, you will always benefit from better support, longer security updates / bug fixes, better reliability and longer warranties. And your server will be more stable and corrupt less data.
 
Most ZFS performance complaints I have seen relate to ZFS config, using hardware RAID with ZFS, or slow drives.
I haven't changed any ZFS configuration.
I don't have a hardware RAID.
Are Samsung 860 EVOs slow?

I do not think it depends on the disks.
iotop shows random 500 MiB/s writes not related to any VM but mostly related to ZFS tasks.
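To see on which pool/vdev those writes actually land, zpool iostat can be run alongside iotop, for example:
Code:
# show per-vdev read/write throughput, refreshed every second
zpool iostat -v 1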
 
Try random 4k writes with a queue depth of 1, or try sync writes. Writes should drop really hard. The other point is that you need to disable compression if you benchmark your ZFS pool, or you need to make sure that the test data is created from /dev/urandom.
And you need to write a lot of data at once (100GB for example), otherwise you are just benchmarking the cache and not the flash of that SSD.
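On a ZFS pool that could, for example, look roughly like this (just a sketch; assuming your pool is called rpool and mounted at /rpool, adjust the names to your setup):
Code:
# create a test dataset with compression disabled so the benchmark isn't skewed
zfs create -o compression=off rpool/fiotest
# optionally pre-create incompressible test data directly from /dev/urandom
dd if=/dev/urandom of=/rpool/fiotest/random.bin bs=1M count=102400 status=progress
# clean up after benchmarking
zfs destroy rpool/fiotest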

My 970 Evo dropped from 2000MB/s to 30MB/s after I did all of that.
 
OK, where and how do I apply these settings?
 
A good tool for testing is "fio". You need to install it first (apt install fio).

A demanding benchmark would, for example, be this:
fio --rw=randwrite --name=test --size=100G --direct=1 --refill_buffers --bs=4k --numjobs=1 --iodepth=1 --runtime=600 --ioengine=psync --filename=/mnt/yourDir/test.file

Make sure you have 100GB of free space and that the filename points to a file under the mountpoint of the ZFS pool and not to a device. If you choose a device like /dev/sda, fio will destroy all data on that drive.
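If you also want to see the sync write case (the worst case for consumer SSDs without power loss protection), you could, for example, run the same job with an fsync after every write:
Code:
fio --rw=randwrite --name=synctest --size=100G --direct=1 --refill_buffers --bs=4k --numjobs=1 --iodepth=1 --runtime=600 --fsync=1 --filename=/mnt/yourDir/synctest.file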

The Proxmox staff also did a ZFS SSD benchmark using fio. You can run the same test and compare your results with the ones in the paper.
Compare the consumer drives (Samsung 850 Evo and Crucial MX100) to the enterprise drives (the rest of them except the Seagate HDD at the bottom):
[Image: benchmark table from the Proxmox ZFS SSD benchmark paper]
 
That sounds like it is throttling/overheating. Add a heatsink and you'll get 200MB/s.
It has an actively cooled heatsink. It's just super slow if you try 4k sync writes, or 4k random async writes with an iodepth of 1 that can't be compressed.
Look at the benchmark above. There the 850 Evo dropped to 1.3 MB/s (and it should be able to write at 520MB/s if you, for example, use it in a Windows machine with sequential async writes).

Edit:
And there is a big difference in performance if you run the same benchmark on an SSD that is new and empty versus one that is already 90% full. Depending on the model this makes a huge difference. My Evos, for example, handled this much better than my Crucial BX or Patriot Bursts.
 
@Dunuin
I just ran the test on my desktop on a 970 EVO with heatsink:
Filesystem is XFS on LUKS.
Code:
fio --rw=randwrite --name=test --size=100G --direct=1 --refill_buffers --bs=4k --numjobs=1 --iodepth=1  --runtime=600 --filename=test.file
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.26
Starting 1 process
test: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=632KiB/s][w=158 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=27271: Mon Apr 26 19:07:17 2021
  write: IOPS=5650, BW=22.1MiB/s (23.1MB/s)(12.9GiB/600001msec); 0 zone resets
    clat (usec): min=18, max=8646.2k, avg=175.42, stdev=11710.85
     lat (usec): min=18, max=8646.2k, avg=175.49, stdev=11710.86
    clat percentiles (usec):
     |  1.00th=[   24],  5.00th=[   26], 10.00th=[   27], 20.00th=[   29],
     | 30.00th=[   30], 40.00th=[   31], 50.00th=[   32], 60.00th=[   33],
     | 70.00th=[   34], 80.00th=[   35], 90.00th=[   37], 95.00th=[   41],
     | 99.00th=[ 5276], 99.50th=[ 6456], 99.90th=[11600], 99.95th=[13960],
     | 99.99th=[16909]
   bw (  KiB/s): min=  192, max=126408, per=100.00%, avg=26311.29, stdev=37922.55, samples=1031
   iops        : min=   48, max=31602, avg=6577.79, stdev=9480.65, samples=1031
  lat (usec)   : 20=0.01%, 50=97.01%, 100=0.87%, 250=0.10%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.04%, 4=0.75%, 10=1.03%, 20=0.18%, 50=0.01%
  lat (msec)   : 100=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%, 2000=0.01%
  lat (msec)   : >=2000=0.01%
  cpu          : usr=1.03%, sys=7.25%, ctx=3417944, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3390507,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s), io=12.9GiB (13.9GB), run=600001-600001msec

Disk stats (read/write):
    dm-0: ios=1/3913695, merge=0/0, ticks=10/4846990, in_queue=4847000, util=100.00%, aggrios=35/3907244, aggrmerge=60/6757, aggrticks=168/4769860, aggrin_queue=5359135, aggrutil=100.00%
  nvme0n1: ios=35/3907244, merge=60/6757, ticks=168/4769860, in_queue=5359135, util=100.00%
Temperature at the end was: 58C

So I can confirm your test result: 23MB/s.


For comparison, the "datacenter" SM983 M.2 without heatsink in my Proxmox server:
Filesystem = ZFS, no encryption.
Code:
fio --rw=randwrite --name=test --size=100G --direct=1 --refill_buffers --bs=4k --numjobs=1 --iodepth=1  --runtime=600 --filename=test.file
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=27.2MiB/s][w=6961 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=11634: Mon Apr 26 19:26:56 2021
  write: IOPS=12.3k, BW=47.0MiB/s (50.3MB/s)(28.1GiB/600001msec); 0 zone resets
    clat (usec): min=3, max=9720, avg=80.05, stdev=126.11
     lat (usec): min=3, max=9720, avg=80.11, stdev=126.11
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   35], 10.00th=[   36], 20.00th=[   39],
     | 30.00th=[   41], 40.00th=[   44], 50.00th=[   47], 60.00th=[   50],
     | 70.00th=[   56], 80.00th=[   67], 90.00th=[  188], 95.00th=[  241],
     | 99.00th=[  619], 99.50th=[ 1004], 99.90th=[ 1532], 99.95th=[ 1778],
     | 99.99th=[ 2073]
   bw (  KiB/s): min=20720, max=97600, per=100.00%, avg=49139.80, stdev=21026.05, samples=1199
   iops        : min= 5180, max=24400, avg=12284.92, stdev=5256.51, samples=1199
  lat (usec)   : 4=0.01%, 10=0.92%, 20=0.08%, 50=58.66%, 100=26.52%
  lat (usec)   : 250=9.60%, 500=3.04%, 750=0.35%, 1000=0.32%
  lat (msec)   : 2=0.49%, 4=0.01%, 10=0.01%
  cpu          : usr=2.05%, sys=53.21%, ctx=1017474, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,7369279,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=47.0MiB/s (50.3MB/s), 47.0MiB/s-47.0MiB/s (50.3MB/s-50.3MB/s), io=28.1GiB (30.2GB), run=600001-600001msec

Might be throttling at the end: 78C.
Much higher CPU load, ~12%.
Twice as fast at the same price: the SM983 costs about the same as the 970 Evo.
 
Could you clarify for a newbie, please?

Is the SM983 or the 970 Evo better in your test?
 
The SM983 is a datacenter SSD with power loss protection (PLP) but no power-saving modes; it uses ~5-7W and is 110mm long!
The 970 Evo is a consumer SSD with no PLP, but it has power-saving modes (e.g. ~50mW when idle) and is 80mm long.

Both overheat easily after a few minutes of full load, and then they throttle. I think it is extremely difficult to cool them sufficiently when mounted on the motherboard. SSDs in external cases are easier to cool, e.g. with a U.2 interface.
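If you want to verify whether a drive is really throttling, you can watch its temperature while the benchmark runs, for example with smartmontools or nvme-cli (assuming the drive is /dev/nvme0):
Code:
apt install nvme-cli smartmontools
# via nvme-cli
nvme smart-log /dev/nvme0 | grep -i temperature
# or via smartmontools
smartctl -a /dev/nvme0 | grep -i temperature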

IMHO, ZFS is not the filesystem for maximum performance with NVMe, since the software stack is too heavy. But it scales extremely well, and often you want the features, like data protection. Then good performance is good enough, and is all you need.
 
