Poor performance on ZFS compared to LVM (same hardware)

javildn

New Member
Sep 1, 2017
Hi, I’m trying to configure a host with Proxmox 5 and ZFS because of its great features, but I am not able to get good IO performance compared to a Proxmox 5 host on LVM with the same specs.

At the moment I’m testing with two hosts with these specs:

· Supermicro Server
· Xeon E3-1270v6
· 2x Intel SSD SC3520 240GB
· 16 GB DDR4


I have installed Proxmox 5 on both servers. On the first server I configured the disks with ext4 (no RAID); on the second server I configured the disks with ZFS RAID 1.

Both servers are up to date and running the latest kernel.

On both servers I have created a CentOS 7 VM with this configuration:

Server1:
bootdisk: scsi0
cores: 8
ide2: none,media=cdrom
memory: 4096
net0: virtio=7A:70:B7:AC:23:5C,bridge=vmbr0
numa: 1
ostype: l26
scsi0: local-lvm:vm-100-disk-1,size=32G
scsihw: virtio-scsi-pci
sockets: 1

Server2:
bootdisk: scsi0
cores: 8
ide2: none,media=cdrom
memory: 4096
net0: virtio=1A:84:03:89:37:4E,bridge=vmbr0
numa: 1
ostype: l26
scsi0: local-zfs:vm-101-disk-1,size=32G
scsihw: virtio-scsi-pci
sockets: 1


I ran fio in both VMs:

fio --name=randfile --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=8 --group_reporting


In the CentOS 7 VM on the LVM host, I get 18981 IOPS:

randfile: (groupid=0, jobs=8): err= 0: pid=7071: Fri Sep 1 11:42:11 2017
write: io=8192.0MB, bw=75925KB/s, iops=18981, runt=110485msec
slat (usec): min=1, max=161671, avg=341.20, stdev=3200.12
clat (usec): min=196, max=207266, avg=13059.06, stdev=20009.97
lat (usec): min=199, max=207273, avg=13400.46, stdev=20256.69
clat percentiles (usec):
| 1.00th=[ 1096], 5.00th=[ 2192], 10.00th=[ 3664], 20.00th=[ 4960],
| 30.00th=[ 5728], 40.00th=[ 6496], 50.00th=[ 7200], 60.00th=[ 8256],
| 70.00th=[ 9664], 80.00th=[11840], 90.00th=[22400], 95.00th=[60672],
| 99.00th=[105984], 99.50th=[120320], 99.90th=[152576], 99.95th=[164864],
| 99.99th=[189440]
bw (KB /s): min= 5282, max=117656, per=12.60%, avg=9563.05, stdev=3815.41
lat (usec) : 250=0.01%, 500=0.04%, 750=0.22%, 1000=0.50%
lat (msec) : 2=3.53%, 4=7.59%, 10=60.33%, 20=16.83%, 50=5.15%
lat (msec) : 100=4.44%, 250=1.37%
cpu : usr=0.50%, sys=1.93%, ctx=437658, majf=0, minf=242
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=8192.0MB, aggrb=75925KB/s, minb=75925KB/s, maxb=75925KB/s, mint=110485msec, maxt=110485msec

Disk stats (read/write):
dm-0: ios=0/2614403, merge=0/0, ticks=0/11926981, in_queue=11947665, util=94.13%, aggrios=0/2655908, aggrmerge=0/68964, aggrticks=0/11176185, aggrin_queue=11197603, aggrutil=94.05%
sda: ios=0/2655908, merge=0/68964, ticks=0/11176185, in_queue=11197603, util=94.05%


However, in the CentOS 7 VM on the ZFS host, I only get 4989 IOPS:

randfile: (groupid=0, jobs=8): err= 0: pid=7184: Fri Sep 1 11:47:26 2017
write: io=8192.0MB, bw=19959KB/s, iops=4989, runt=420285msec
slat (usec): min=1, max=1389.7K, avg=1092.88, stdev=11087.91
clat (usec): min=237, max=1428.7K, avg=50085.97, stdev=79131.65
lat (usec): min=327, max=1429.8K, avg=51179.15, stdev=79835.78
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 10], 10.00th=[ 13], 20.00th=[ 16],
| 30.00th=[ 20], 40.00th=[ 23], 50.00th=[ 26], 60.00th=[ 29],
| 70.00th=[ 34], 80.00th=[ 52], 90.00th=[ 120], 95.00th=[ 192],
| 99.00th=[ 400], 99.50th=[ 486], 99.90th=[ 840], 99.95th=[ 955],
| 99.99th=[ 1401]
bw (KB /s): min= 5, max=24960, per=12.81%, avg=2556.49, stdev=1227.99
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.08%, 4=0.19%, 10=5.01%, 20=27.76%, 50=46.48%
lat (msec) : 100=7.91%, 250=9.65%, 500=2.45%, 750=0.34%, 1000=0.08%
lat (msec) : 2000=0.04%
cpu : usr=0.20%, sys=1.38%, ctx=563382, majf=0, minf=243
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=8192.0MB, aggrb=19959KB/s, minb=19959KB/s, maxb=19959KB/s, mint=420285msec, maxt=420285msec

Disk stats (read/write):
dm-0: ios=59/2623582, merge=0/0, ticks=1306/53167732, in_queue=53194483, util=99.20%, aggrios=59/2646661, aggrmerge=0/84569, aggrticks=844/50802459, aggrin_queue=50842281, aggrutil=99.18%
sda: ios=59/2646661, merge=0/84569, ticks=844/50802459, in_queue=50842281, util=99.18%

Is there anything I can do here? Most of the configuration is at its defaults.

I would like to use ZFS, but it seems to be very slow.

Kind regards.
 
You are comparing totally different setups. What did you expect? Do you know the differences between ext4 and ZFS, and between non-RAID and RAID 1, in terms of performance and drawbacks? You mixed...

1] nonraid vs raid1
2] ext4 vs zfs

ext4 without RAID will be a lot faster than ZFS on RAID 1.
 
Hi,

what filesystem do you use inside the CentOS VM?
 
You are comparing totally different setups. What did you expect? Do you know the differences between ext4 and ZFS, and between non-RAID and RAID 1, in terms of performance and drawbacks? You mixed...

1] nonraid vs raid1
2] ext4 vs zfs

ext4 without RAID will be a lot faster than ZFS on RAID 1.

Both setups are similar; the only difference is that one server is ZFS-backed and the other is LVM.

1] According to Rhinox, ZFS RAID 1 should have the same write performance as a single drive.
2] I know; that is exactly what I am testing. I expected some write penalty with ZFS, but my benchmarks show it is 4x slower.

Hi,

what filesystem do you use inside the CentOS VM?

I am using XFS (the default in CentOS 7). Should I try another filesystem?

[root@localhost ~]# xfs_info /
meta-data=/dev/mapper/cl-root isize=512 agcount=4, agsize=1821440 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=7285760, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=3557, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
 
Yes, please try it with ext4.
 
Similar result, 4590 IOPS :(

[root@localhost ~]# mount | grep root
/dev/mapper/cl-root on / type ext4 (rw,relatime,data=ordered)

Fio:
randfile: (groupid=0, jobs=8): err= 0: pid=10228: Mon Sep 4 09:54:14 2017
write: io=8192.0MB, bw=18364KB/s, iops=4590, runt=456801msec
slat (usec): min=2, max=2833.3K, avg=1291.16, stdev=17908.35
clat (usec): min=84, max=3242.7K, avg=54272.79, stdev=131163.71
lat (usec): min=128, max=3242.8K, avg=55564.21, stdev=132860.59
clat percentiles (msec):
| 1.00th=[ 6], 5.00th=[ 8], 10.00th=[ 10], 20.00th=[ 12],
| 30.00th=[ 14], 40.00th=[ 16], 50.00th=[ 19], 60.00th=[ 23],
| 70.00th=[ 28], 80.00th=[ 44], 90.00th=[ 149], 95.00th=[ 227],
| 99.00th=[ 453], 99.50th=[ 857], 99.90th=[ 1893], 99.95th=[ 2212],
| 99.99th=[ 3064]
bw (KB /s): min= 1, max=17373, per=13.92%, avg=2555.63, stdev=2076.35
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.04%
lat (msec) : 2=0.25%, 4=0.18%, 10=13.01%, 20=41.49%, 50=26.36%
lat (msec) : 100=4.70%, 250=9.93%, 500=3.12%, 750=0.27%, 1000=0.22%
lat (msec) : 2000=0.32%, >=2000=0.09%
cpu : usr=0.17%, sys=1.23%, ctx=532424, majf=0, minf=245
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=8192.0MB, aggrb=18363KB/s, minb=18363KB/s, maxb=18363KB/s, mint=456801msec, maxt=456801msec

Disk stats (read/write):
dm-0: ios=0/2462515, merge=0/0, ticks=0/90495335, in_queue=90530630, util=100.00%, aggrios=0/2108211, aggrmerge=0/355892, aggrticks=0/66119062, aggrin_queue=66148660, aggrutil=100.00%
sda: ios=0/2108211, merge=0/355892, ticks=0/66119062, in_queue=66148660, util=100.00%
 
Do a simple test: turn off sync on the ZFS pool and run the test again. If you see a difference, you need a good ZIL (SLOG device) for sync writes.
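For reference, a minimal sketch of how the sync property can be toggled, assuming the default pool name rpool (which matches the output further down):

Code:
# temporarily disable synchronous writes on the pool (child datasets inherit it)
zfs set sync=disabled rpool

# check the resulting value
zfs get sync rpool

# restore the default behaviour once testing is done
zfs set sync=standard rpool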
 
I have just tried it; the fio results inside the VM are very similar with sync disabled: 4900-5000 IOPS.

# zfs get all | grep sync
rpool                     sync  disabled  local
rpool/ROOT                sync  disabled  inherited from rpool
rpool/ROOT/pve-1          sync  disabled  inherited from rpool
rpool/data                sync  disabled  inherited from rpool
rpool/data/vm-101-disk-1  sync  disabled  inherited from rpool
rpool/swap                sync  always    local

Any idea?
 
Can you repeat the fio test on the host with a zvol?

Then we can see whether the problem is ZFS or the VM layer.
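One way to do that (a sketch; the test zvol name and size are just examples):

Code:
# create a throwaway test zvol on the pool
zfs create -V 32G rpool/fiotest

# run the same fio job directly against the zvol on the host
fio --name=randfile --filename=/dev/zvol/rpool/fiotest --ioengine=libaio \
    --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=8 --group_reporting

# clean up afterwards
zfs destroy rpool/fiotest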
 
You can try to update the Intel SSD firmware with the Intel datacenter tools.
 
Can you repeat the fio test on the host with a zvol?

Then we can see whether the problem is ZFS or the VM layer.

Running the fio test on a zvol directly on the host gives much higher values, about 2x-3x faster! Not as fast as LVM, but I think it would be enough.

Why can't the VM reach these IOPS?

I noticed that while running the fio test directly on the host, iowait rises up to 50%, but when running fio in the VM, iowait on the host only reaches 12-15%.

Thank you!

You can try to update the Intel SSD firmware with the Intel datacenter tools.
Firmware is already updated.

Regards
 
Hi,

Maybe I have a clue :)

By default, ZFS on Proxmox creates zvols with an 8k block size. In a default VM with ext4, the block size is 4k. Now you are testing inside the VM with 4k blocks (fio is not my friend, so maybe I am wrong) in random mode.

Results:
With LVM/ext4 the IOPS are 2x compared with ZFS because:

- fio writes one 4k block (ext4 on LVM) and ext4 writes that same single block (on a single SSD)
- in the case of ZFS, fio writes the same 4k block, but because the zvol default is 8k, ZFS has to write 2x4k = 8k (on 2 SSDs)

In my opinion you are comparing apples with oranges. So I would recommend you try fio with an 8k block size instead of 4k.
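For example, the same fio job with only the block size changed (everything else kept as in the original command):

Code:
fio --name=randfile --ioengine=libaio --iodepth=32 --rw=randwrite --bs=8k \
    --direct=1 --size=1G --numjobs=8 --group_reporting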

As a side note, each SSD model/firmware has an optimum block size. Run several tests with different block sizes and choose what is best for your case.
Also consider that you ran one test against your storage (fio) and concluded that LVM is your best option. That is a mistake, because fio or any other synthetic test will not tell you for sure that the results will be the same in production, with real data under real load.
As an example (maybe I am a dummy guy), I test my own storage only with real data and real load; no fio/dd or other synthetic tests.

By the way, why do you need the best IOPS? Do you want to run some database?


Good luck with fio ;)
 
By default, ZFS on Proxmox creates zvols with an 8k block size.
The default is ashift=12 (2^12 = 4096 = 4k).
But you can check what block size you have:
Code:
zpool get ashift rpool
 
The default is ashift=12 (2^12 = 4096 = 4k)

Code:
zpool get ashift rpool
NAME   PROPERTY  VALUE   SOURCE
rpool  ashift    12      local

and ....

Code:
zfs create -V 8388608K rpool/test

zfs get all rpool/test
NAME        PROPERTY              VALUE                  SOURCE
rpool/test  type                  volume                 -
rpool/test  creation              Wed Sep  6  8:52 2017  -
rpool/test  used                  8.25G                  -
rpool/test  available             3.50T                  -
rpool/test  referenced            64K                    -
rpool/test  compressratio         1.00x                  -
rpool/test  reservation           none                   default
rpool/test  volsize               8G                     local
rpool/test  volblocksize          8K                     -

So, by default volblocksize is 8K, and it has nothing to do with ashift, at least on Linux. In the last 5 years I have only seen the 8k value! And it is 8K for a good reason (think about how raidz1 would behave with 4k).
 
Hi,

Maybe I have a clue :)

By default, ZFS on Proxmox creates zvols with an 8k block size. In a default VM with ext4, the block size is 4k. Now you are testing inside the VM with 4k blocks (fio is not my friend, so maybe I am wrong) in random mode.

Results:
With LVM/ext4 the IOPS are 2x compared with ZFS because:

- fio writes one 4k block (ext4 on LVM) and ext4 writes that same single block (on a single SSD)
- in the case of ZFS, fio writes the same 4k block, but because the zvol default is 8k, ZFS has to write 2x4k = 8k (on 2 SSDs)

In my opinion you are comparing apples with oranges. So I would recommend you try fio with an 8k block size instead of 4k.

As a side note, each SSD model/firmware has an optimum block size. Run several tests with different block sizes and choose what is best for your case.
Also consider that you ran one test against your storage (fio) and concluded that LVM is your best option. That is a mistake, because fio or any other synthetic test will not tell you for sure that the results will be the same in production, with real data under real load.
As an example (maybe I am a dummy guy), I test my own storage only with real data and real load; no fio/dd or other synthetic tests.

By the way, why do you need the best IOPS? Do you want to run some database?


Good luck with fio ;)

Hi! I understand your point; however, I made a new test:

I created a new VM and installed CentOS 7 without LVM, using ext4. If I run the fio test inside the VM, the results are the same, ~5000 IOPS. However, if I stop the VM and mount the zvol directly on the host (mount /dev/mapper/vm-102-disk-1p2 /temp2) and run the fio test, I get about ~15000 IOPS.

So it seems the VM is losing IOPS somewhere, because the same zvol, with an 8k volblocksize, performs better on the host.

I think I need better IOPS because I am having some performance problems on a production server (web server and DB), and I think I could solve these problems by getting more IOPS (maybe I'm wrong).

BTW I also have ashift 12 and 8k volblocksize.

# zpool get ashift rpool
NAME   PROPERTY  VALUE   SOURCE
rpool  ashift    12      local

# zfs get all | grep blocksize
rpool/data/vm-101-disk-1  volblocksize  8K  -
rpool/data/vm-102-disk-1  volblocksize  8K  -

Should I try with 4k volblocksize?
 
I created a new VM and installed CentOS 7 without LVM, using ext4. If I run the fio test inside the VM, the results are the same, ~5000 IOPS. However, if I stop the VM and mount the zvol directly on the host (mount /dev/mapper/vm-102-disk-1p2 /temp2) and run the fio test, I get about ~15000 IOPS.

Hi,
Try different combinations of controller type and disk type for the VM.
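For example, a sketch of how such combinations could be switched from the CLI (VM ID 101 and the specific options are just illustrations, not settings recommended in this thread):

Code:
# enable a dedicated IO thread for the disk (requires the virtio-scsi-single controller)
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 local-zfs:vm-101-disk-1,iothread=1

# or try a different cache mode for comparison
qm set 101 --scsi0 local-zfs:vm-101-disk-1,cache=writeback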

I think I need better IOPS because I am having some performance problems on a production server (web server and DB), and I think I could solve these problems by getting more IOPS (maybe I'm wrong).

A web server does not need very high IOPS. But indeed, any DB can perform very badly if you have low IOPS. My recommendation is to use 2 different virtual disks for this (one dedicated to the DB). Also use the proper volblocksize for that zvol, as recommended by the DB vendor (e.g. 8k for PostgreSQL if I remember correctly, 16k for MySQL/Percona/MariaDB). Also, for the same zvol, set only metadata to be cached (primarycache) if you want to avoid double-caching the same data at the DB level and at the ZFS level.
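A minimal sketch of those ZFS-side settings, assuming a MySQL-style 16k block size; the zvol name and size are only examples, and volblocksize can only be set at creation time:

Code:
# dedicated zvol for the database with a 16k block size (e.g. MySQL/MariaDB)
zfs create -V 50G -o volblocksize=16k rpool/data/vm-101-disk-2

# cache only metadata for this zvol, so data is not cached twice (DB buffer pool + ARC)
zfs set primarycache=metadata rpool/data/vm-101-disk-2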
 
I have created a new zvol with volblocksize=4k, assigned it to the VM, cloned CentOS to the new disk with dd, and rebooted the VM from the new disk.
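Roughly what that looks like (a sketch; the zvol names and size are illustrative, not the exact ones used here):

Code:
# create the new zvol with a 4k block size
zfs create -V 32G -o volblocksize=4k rpool/data/vm-102-disk-2

# copy the old disk onto the new zvol block for block (VM powered off)
dd if=/dev/zvol/rpool/data/vm-102-disk-1 of=/dev/zvol/rpool/data/vm-102-disk-2 bs=1M status=progress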

The performance increased dramatically! Now I get the expected IOPS.

Maybe it would be a good idea to be able to change the volblocksize when creating a new disk in the Proxmox GUI, as it has been shown that it can affect performance, at least in my setup.
 
Maybe it would be a good idea to be able to change the volblocksize when creating a new disk in the Proxmox GUI, as it has been shown that it can affect performance, at least in my setup.
You can set it in the storage config.
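For reference, a sketch of what that looks like in /etc/pve/storage.cfg; the blocksize value here is just an example and applies to zvols created on that storage afterwards:

Code:
zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        blocksize 4k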
 