ZFS slow write performance

Haithabu84

Well-Known Member
Oct 19, 2016
Hello,

A few days ago I already described my problem in the German forum, but I can't get any further; maybe someone can help me here.

The overall write performance is horrible: vzdump of a 22 GB LXC container runs at ~82 MB/s, and I get the same results when I upload files to a file-server container.

Setup:

Xeon E5-2620 v3
32 GB RAM
2x 960 GB Samsung SM863
Mirror
compression=on
ashift=12

Proxmox is installed with root on ZFS RAID1, standard procedure.
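For reference, the relevant pool and dataset settings can be double-checked like this (assuming the default rpool layout):

Code:
zpool get ashift rpool
zfs get compression,sync,atime,recordsize rpool/data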

fio --filename=/rpool/data/test/testus --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=120 --size=1G --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
journal-test: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=9.84MiB/s][w=2518 IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=25958: Thu Sep 5 13:55:14 2019
write: IOPS=2563, BW=10.0MiB/s (10.5MB/s)(1202MiB/120001msec); 0 zone resets
clat (usec): min=235, max=20752, avg=388.32, stdev=168.71
lat (usec): min=235, max=20752, avg=388.58, stdev=168.74
clat percentiles (usec):
| 1.00th=[ 314], 5.00th=[ 330], 10.00th=[ 334], 20.00th=[ 343],
| 30.00th=[ 355], 40.00th=[ 363], 50.00th=[ 375], 60.00th=[ 383],
| 70.00th=[ 396], 80.00th=[ 416], 90.00th=[ 474], 95.00th=[ 502],
| 99.00th=[ 545], 99.50th=[ 553], 99.90th=[ 586], 99.95th=[ 709],
| 99.99th=[ 2409]
bw ( KiB/s): min= 8528, max=11992, per=99.98%, avg=10252.11, stdev=644.11, samples=239
iops : min= 2132, max= 2998, avg=2563.01, stdev=161.03, samples=239
lat (usec) : 250=0.01%, 500=94.61%, 750=5.33%, 1000=0.02%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.02%, sys=9.37%, ctx=615314, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,307641,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=10.0MiB/s (10.5MB/s), 10.0MiB/s-10.0MiB/s (10.5MB/s-10.5MB/s), io=1202MiB (1260MB), run=120001-120001msec
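The ~2,500 sync-write IOPS here are in the same ballpark as the FSYNCS/SECOND that pveperf reports further down, so the sync path looks like the limit. As a diagnostic only (unsafe for real data), sync can be disabled temporarily on the test dataset and the same job repeated:

Code:
zfs set sync=disabled rpool/data/test
fio --filename=/rpool/data/test/testus --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=120 --size=1G --time_based --group_reporting --name=journal-test
zfs set sync=standard rpool/data/test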

fio --filename=/rpool/data/test/testus --sync=1 --rw=read --bs=4k --numjobs=1 --iodepth=1 --runtime=120 --size=1G --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1257MiB/s][r=322k IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=9803: Thu Sep 5 13:58:26 2019
read: IOPS=317k, BW=1238MiB/s (1298MB/s)(145GiB/120001msec)
clat (nsec): min=1990, max=688829, avg=2932.38, stdev=4310.83
lat (usec): min=2, max=688, avg= 2.96, stdev= 4.31
clat percentiles (nsec):
| 1.00th=[ 2064], 5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
| 30.00th=[ 2096], 40.00th=[ 2096], 50.00th=[ 2096], 60.00th=[ 2096],
| 70.00th=[ 2128], 80.00th=[ 2288], 90.00th=[ 2416], 95.00th=[ 2576],
| 99.00th=[26496], 99.50th=[27520], 99.90th=[30080], 99.95th=[32384],
| 99.99th=[46336]
bw ( MiB/s): min= 1081, max= 1303, per=99.98%, avg=1237.86, stdev=23.00, samples=239
iops : min=276812, max=333644, avg=316892.29, stdev=5886.78, samples=239
lat (usec) : 2=0.01%, 4=96.67%, 10=0.14%, 20=0.09%, 50=3.09%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
cpu : usr=20.01%, sys=79.97%, ctx=848, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=38034713,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=1238MiB/s (1298MB/s), 1238MiB/s-1238MiB/s (1298MB/s-1298MB/s), io=145GiB (156GB), run=120001-120001msec

fio --name=/rpool/data/test/randfile --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --group_reporting
/rpool/data/test/randfile: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
/rpool/data/test/randfile: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=143MiB/s][w=36.7k IOPS][eta 00m:00s]
/rpool/data/test/randfile: (groupid=0, jobs=1): err= 0: pid=28555: Thu Sep 5 14:03:01 2019
write: IOPS=49.3k, BW=193MiB/s (202MB/s)(1024MiB/5312msec); 0 zone resets
slat (usec): min=5, max=104127, avg=18.59, stdev=451.38
clat (nsec): min=1854, max=105420k, avg=628986.43, stdev=2591309.26
lat (usec): min=9, max=105432, avg=647.71, stdev=2634.70
clat percentiles (usec):
| 1.00th=[ 273], 5.00th=[ 285], 10.00th=[ 289], 20.00th=[ 302],
| 30.00th=[ 314], 40.00th=[ 334], 50.00th=[ 359], 60.00th=[ 408],
| 70.00th=[ 498], 80.00th=[ 685], 90.00th=[ 1012], 95.00th=[ 1369],
| 99.00th=[ 3326], 99.50th=[ 5211], 99.90th=[ 13042], 99.95th=[101188],
| 99.99th=[105382]
bw ( KiB/s): min=35216, max=317072, per=93.23%, avg=184042.40, stdev=83970.31, samples=10
iops : min= 8804, max=79268, avg=46010.40, stdev=20992.39, samples=10
lat (usec) : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.05%
lat (usec) : 500=70.08%, 750=12.31%, 1000=7.28%
lat (msec) : 2=8.17%, 4=1.33%, 10=0.62%, 20=0.09%, 250=0.06%
cpu : usr=8.92%, sys=67.78%, ctx=10759, majf=0, minf=10
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=193MiB/s (202MB/s), 193MiB/s-193MiB/s (202MB/s-202MB/s), io=1024MiB (1074MB), run=5312-5312msec


zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 0 days 00:00:03 with 0 errors on Tue Aug 20 13:47:23 2019
config:

NAME                       STATE     READ WRITE CKSUM
rpool                      ONLINE       0     0     0
  mirror-0                 ONLINE       0     0     0
    ata-SAMSUNG_sda-part3  ONLINE       0     0     0
    ata-SAMSUNG_sdb-part3  ONLINE       0     0     0

errors: No known data errors

pveperf
CPU BOGOMIPS: 57595.56
REGEX/SECOND: 3123298
HD SIZE: 760.00 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 2071.76
DNS EXT: 50.04 ms
DNS INT: 35.46 ms

pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

For comparison, I have tested an identical machine with root on ext4 and an LVM/mdadm RAID1: vzdump of the same container runs at ~220 MB/s. Another machine with root on ZFS, but with consumer hardware (i7-3770, 8 GB RAM, 2x 256 GB Samsung 850 Pro), gets the same fio read results and does vzdump of the 22 GB container at ~110 MB/s. Another 2-node cluster with storage replication and root on ext4, with the containers on a separate ZFS mirror, reaches ~170 MB/s.

I can't understand that.

My theory: wrong ashift, or the hardware is too bad. I have no ideas anymore.

What's wrong?
 
Well, let me tell you that you won't get the same write performance as with ext4 at all (at least with the same hardware specs)... that is by "design" for any CoW filesystem. You're the only person who can tell if that is suitable for your use case.

What's the physical connection to the hard disks? HBA? RAID controller?

Also, please post the output of arc_summary.
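For reference, the controller model and how the disks are attached can usually be seen with something like:

Code:
lspci -nn | grep -i -e sas -e raid
lsblk -o NAME,MODEL,TRAN,SIZE
arc_summary > arc_summary.txt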
 
I can live with somewhat worse performance, but not with a server being outperformed by a consumer PC. In contrast to ext4, you get better features: backup and data replication, copy-on-write.

The mainboard is a Supermicro X10SRH-CLN4F. The connection is via the onboard HBA; the chip is an LSI 3008.
 

Attachments

  • arc_summary.pdf
Not sure of a solution, but I am curious whether that is enough RAM. My servers with ZFS will use 40 GB of RAM with no CTs or VMs running.
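As a side note, ZFS on Linux by default lets the ARC grow to roughly half of the installed RAM. If memory is a concern, it can be capped with a module option; a sketch with an example 8 GiB limit (pick a value to suit the workload):

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592

# apply and reboot
update-initramfs -u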
 
After a long time of testing with no problems except the low write rates, I found out that I have some data corruption in one of my routine containers. No idea how that's possible.

In this routine container I have some C-ISAM databases, and there are now duplicate entries with corrupt data... oh my god.
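At minimum, a scrub will show whether ZFS itself sees any checksum errors on the pool:

Code:
zpool scrub rpool
zpool status -v rpool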
 
I think the problem is the root-on-ZFS installation. I'm going to rebuild the system: 2x SSD with ext4 and RAID1 over mdadm for root, and ZFS RAID10 with 4 SSDs for VMs/containers. Another node in a similar configuration runs much better.
 
There seem to be more problems since this weekend with ZFS on Linux. I noticed something else: I have configured my command history with timestamps and up to 1000 entries, and some of the older entries now show today's date. Is that right?

Code:
    1  2019-09-09 09:08:12  zpool status
...
  248  2019-09-05 16:04:20  apt get install iotop

I would say that is not a good sign on Linux either.
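For what it's worth, this is likely normal bash behaviour rather than corruption: bash only saves timestamps for commands entered after HISTTIMEFORMAT is set, and older history entries saved without timestamps are displayed with the time the history file was loaded. A typical setup (assuming bash) looks like:

Code:
# ~/.bashrc
export HISTTIMEFORMAT="%F %T  "
export HISTSIZE=1000
export HISTFILESIZE=1000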
 
That's weird. If the cache is not full, I get ~100 MB/s with vzdump on the old configuration. A small improvement.

With all containers and VMs off on that setup, I get 220 MB/s with randwrite. I mean, these containers run no write-intensive operations. Weird.

Code:
fio --name=/rpool/data/testfile --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --group_reporting
/rpool/data/testfile: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
/rpool/data/testfile: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][80.0%][w=238MiB/s][w=60.9k IOPS][eta 00m:01s]
/rpool/data/testfile: (groupid=0, jobs=1): err= 0: pid=29581: Wed Sep 11 21:04:49 2019
  write: IOPS=54.3k, BW=212MiB/s (222MB/s)(1024MiB/4830msec); 0 zone resets
    slat (usec): min=5, max=102373, avg=16.80, stdev=214.43
    clat (nsec): min=1850, max=106828k, avg=571987.83, stdev=1333913.51
     lat (usec): min=9, max=106870, avg=588.92, stdev=1358.91
    clat percentiles (usec):
     |  1.00th=[   265],  5.00th=[   281], 10.00th=[   289], 20.00th=[   302],
     | 30.00th=[   314], 40.00th=[   334], 50.00th=[   359], 60.00th=[   408],
     | 70.00th=[   498], 80.00th=[   668], 90.00th=[   979], 95.00th=[  1303],
     | 99.00th=[  3130], 99.50th=[  4752], 99.90th=[ 10814], 99.95th=[ 12911],
     | 99.99th=[103285]
   bw (  KiB/s): min=131664, max=306208, per=93.43%, avg=202835.33, stdev=60937.64, samples=9
   iops        : min=32916, max=76552, avg=50708.78, stdev=15234.35, samples=9
  lat (usec)   : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.21%
  lat (usec)   : 500=69.89%, 750=12.82%, 1000=7.60%
  lat (msec)   : 2=7.53%, 4=1.27%, 10=0.58%, 20=0.10%, 250=0.01%
  cpu          : usr=9.40%, sys=73.78%, ctx=11609, majf=6, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=212MiB/s (222MB/s), 212MiB/s-212MiB/s (222MB/s-222MB/s), io=1024MiB (1074MB), run=4830-4830msec

Lastly... I have reinstalled Proxmox on an mdadm RAID1 of two separate SSDs with LVM and ext4. Four other SSDs in a ZFS RAID10, created via the GUI:

Code:
fio --name=/rpool/testfile --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --group_reporting
/rpool/testfile: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
/rpool/testfile: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=216MiB/s][w=55.3k IOPS][eta 00m:00s]
/rpool/testfile: (groupid=0, jobs=1): err= 0: pid=25570: Thu Sep 12 00:43:10 2019
  write: IOPS=37.0k, BW=145MiB/s (152MB/s)(1024MiB/7081msec); 0 zone resets
    slat (usec): min=5, max=103661, avg=25.07, stdev=533.98
    clat (usec): min=2, max=108397, avg=838.16, stdev=3240.21
     lat (usec): min=12, max=108437, avg=863.39, stdev=3297.83
    clat percentiles (usec):
     |  1.00th=[   265],  5.00th=[   277], 10.00th=[   289], 20.00th=[   314],
     | 30.00th=[   351], 40.00th=[   400], 50.00th=[   465], 60.00th=[   545],
     | 70.00th=[   635], 80.00th=[   766], 90.00th=[  1090], 95.00th=[  1549],
     | 99.00th=[  7570], 99.50th=[ 12780], 99.90th=[ 39060], 99.95th=[103285],
     | 99.99th=[106431]
   bw (  KiB/s): min=72181, max=248040, per=98.63%, avg=146056.57, stdev=56510.07, samples=14
   iops        : min=18045, max=62010, avg=36514.07, stdev=14127.57, samples=14
  lat (usec)   : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.15%
  lat (usec)   : 500=54.52%, 750=24.22%, 1000=9.41%
  lat (msec)   : 2=8.15%, 4=1.61%, 10=1.23%, 20=0.49%, 50=0.16%
  lat (msec)   : 100=0.01%, 250=0.07%
  cpu          : usr=7.08%, sys=58.63%, ctx=16917, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=145MiB/s (152MB/s), 145MiB/s-145MiB/s (152MB/s-152MB/s), io=1024MiB (1074MB), run=7081-7081msec

It seems that more is not possible. Interesting: restoring 20 GB from ext4 to ZFS runs at 72 MB/s... I have no ideas anymore.
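To see where the time goes during such a restore, the per-vdev throughput can be watched live while it runs (one-second intervals):

Code:
zpool iostat -v rpool 1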
 