[SOLVED] Slow ZFS performance

xtavras

Hi,

this post is part solution and part question to the developers/community.

We have some small servers with ZFS. The setup is simple: two SSDs in a ZFS mirror for the OS and VM data.

Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 3h58m with 0 errors on Sun Feb 10 04:22:39 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

The problem was really slow performance and high IO delay; the fix was to disable "sync".

Code:
zfs set sync=disabled rpool
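For reference, the counterpart command shows what is currently in effect (zfs get is standard ZFS tooling; -r lists every child dataset as well):

Code:
zfs get sync rpool
zfs get -r sync rpool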

Now performance is almost 10x better. Since we have a UPS this shouldn't be a problem. Nevertheless, my question is: why is "sync=standard" so slow? I would understand some performance boost, but 10x? Some more details:

Code:
pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve2~bpo1

We use Micron 1100 and Crucial MX200 SSDs. Here are some performance tests with "dbench" in a VM with virtio. On Proxmox itself the performance is similar, just a bit faster (1000 MB/s vs. 100 MB/s).

With sync=standard

Code:
Throughput 40.5139 MB/sec  2 clients  2 procs  max_latency=138.780 ms

with sync=disabled

Code:
Throughput 408.308 MB/sec  2 clients  2 procs  max_latency=30.781 ms
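In case anyone wants to reproduce these numbers: they come from dbench run inside the VM. The options below are a sketch rather than the precise invocation used here, so adjust the target directory, runtime and client count; depending on the distribution you may also need to point -c at a client.txt load file.

Code:
# 2 clients writing into a directory that lives on the ZFS-backed disk
dbench -D /mnt/test -t 60 2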
 
With sync=disabled, writes are buffered to RAM and flushed every 5 seconds in the background (non-blocking unless it takes longer than 5s to flush). With sync=standard, writes must be flushed to disk anytime software issues a sync request and sync operations block until the disk has acknowledged the successful write. If dbench does random read/write IO, you're not far off these benchmarks (IO Meter 4K random transfer read/write):
https://www.storagereview.com/crucial_mx200_ssd_review
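If you want to reproduce that gap without dbench, a small fio comparison makes the effect visible (just a sketch: the file path and sizes are arbitrary, pick a location on the pool you want to test):

Code:
# sync writes: fsync after every 4k write, the way a database behaves
fio --name=sync-test --filename=/rpool/fio-testfile --size=1G --bs=4k \
    --rw=randwrite --ioengine=psync --fsync=1 --runtime=60 --time_based

# async writes: no fsync, ZFS batches them into transaction groups
fio --name=async-test --filename=/rpool/fio-testfile --size=1G --bs=4k \
    --rw=randwrite --ioengine=psync --runtime=60 --time_based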

You should understand that as a result of sync=disabled, you could lose ~5s of data if the system unexpectedly stops (power outage, kernel crash, etc.). There isn't a risk of data corruption, only the loss of pending writes in RAM. Whether that is acceptable depends on your use case. For example, it wouldn't be acceptable if your system processes credit card transactions (eg. 5 seconds of successful credit card transactions disappear).
 
Thanks for the explanation. For me it still doesn't look right that the system had so much IO wait with "sync=standard", which is the default. I would understand it on spinning drives, but SSDs, even consumer grade, should be a bit faster. So I'm not sure if it's a bug or a "feature".
 
I would understand it on spinning drives, but SSDs, even consumer grade, should be a bit faster.

That is where you are wrong. Consumer SSDs, even prosumer ones, are total rubbish for sync random writes. This article sheds some light on that and has a ton of performance data for various SSDs:

http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/

(I think Sebastien's blog is the most cited external URL in the PVE forums)

Please also monitor your SMART values, especially the wearout. On some consumer SSDs with ZFS and PVE you can see the values run down almost daily (from 100% left).
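A quick way to keep an eye on that (smartctl ships with smartmontools, which is already in the package list above; /dev/sda is a placeholder and the attribute names differ between vendors, so the grep pattern is only an example):

Code:
# full SMART report for one disk
smartctl -a /dev/sda
# wear-related attributes on many SATA SSDs
smartctl -a /dev/sda | grep -iE 'wear|percent|lifetime'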
 
I'm not sure where I'm wrong, SSDs are faster than spinning drives, even the consumer ones (I'm not talking about 15k SAS enterprise drives of course). I know the link, because I've built Ceph clusters before, and I agree: you don't want to use cheap SSDs in Ceph clusters (I had good experience with Samsung SM/PM863). But ZFS is not Ceph, and you don't need enterprise SSDs for hosting 6-7 Linux VMs without much IO load.
 
I'm not sure where I'm wrong, SSDs are faster than spinning drives, even the consumer ones (I'm not talking about 15k SAS enterprise drives of course).

I just wanted to say that there are SSDs that are in fact slower at random writes than ordinary spinners, that's all.
 
Hi @LnxBil, I am experiencing slow performance on a ZFS RAID1 pool:
Bash:
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:19:18 with 0 errors on Sun Jun 14 00:43:19 2020
config:

    NAME                                                  STATE     READ WRITE CKSUM
    rpool                                                 ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        nvme-eui.XXXXXXXXXXXXXXXX-part3                   ONLINE       0     0     0
        ata-KINGSTON_SA400S37960G_XXXXXXXXXXXXXXXX-part3  ONLINE       0     0     0

About 2-3 months ago, when I tried to investigate it, I found that, much to my surprise, the SATA disk was showing a SMART wearout of 93%! I was told there was surely something wrong with this SSD, and since then I've been considering removing it and replacing it with a new one. But now, 2-3 months later, it is only reporting 88% wearout.

Today I ran into this post where you say this:
Please also monitor your SMART values, especially the wearout. On some consumer SSDs with ZFS and PVE you can see the values run down almost daily (from 100% left).

Does that mean that this SSD is not really damaged and this is an expected behavior and I should just relax?

Thanks in advance,
Nuno
 
Thank you @xtavras for this tip:
Code:
zfs set sync=disabled rpool
My write performance is now suddenly 2x better and my IO delay significantly smaller!

But how does one manage to get these measurements?
Code:
Throughput 408.308 MB/sec  2 clients  2 procs  max_latency=30.781 ms

And thank you @LnxBil for your insights on this. It's reassuring to know that disabling sync doesn't increase the risk of corruption.
 
My write performance is now suddenly 2x better and my IO delay significantly smaller!
Because now ZFS doesn't wait for the write to be acknowledged! If you like your data, enable it again! Otherwise, you will lose data sooner or later.

Your mirror vdev is made up of an NVMe and a SATA disk. The slower device sets the speed limit, and ~400 MB/s is actually quite good.

How fast is it with sync enabled?
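To see which leg of the mirror is holding things back, the per-vdev statistics help; this is standard zpool tooling, the 1-second interval is arbitrary:

Code:
# per-device operations and bandwidth, refreshed every second
zpool iostat -v rpool 1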
 
@aaron, hum... when you talk about losing data, do you mean the last few seconds before an unexpected stop like @LnxBil said above:
You should understand that as a result of sync=disabled, you could lose ~5s of data if the system unexpectedly stops (power outage, kernel crash, etc.). There isn't a risk of data corruption, only the loss of pending writes in RAM. Whether that is acceptable depends on your use case. For example, it wouldn't be acceptable if your system processes credit card transactions (eg. 5 seconds of successful credit card transactions disappear).

Or are you talking data corruption?

I can live with the former but I definitely don't want the latter.

Those 400 MB/s are not mine. Mine are probably much slower, precisely because one of the SSDs is SATA. I just quoted a post above to ask how I could measure mine. Do you know how?

Thanks!
Nuno
 
Does that mean that this SSD is not really damaged and this is an expected behavior and I should just relax?

No, your SSD is worn out, so that is a problem and it may fail soon. Monitor the SMART values and act accordingly: replace SSDs, or make sure that you will not end up with two identically worn-out disks that may fail in short order and destroy your data. You can go with consumer-grade SSDs, but for your data's safety, monitor the wearout and replace accordingly, or have regular backups.

Or are you talking data corruption? I can live with the former but I definitely don't want the latter.

There is no data corruption with ZFS (or at least there shouldn't be ... there could always be hardware faults and software bugs). Everything that ZFS writes to the disk is always fully written (everything or nothing in an IO) and working. That is how ZFS is built. Disabling sync will increase the performance, because important data is not forcefully flushed to disk, but only kept in RAM, like @aaron said.

There is a lot of software, databases for example, that relies on this flush-to-disk technique to ensure that the data is consistent, so disabling sync will not corrupt the data as an entity on disk, but it may break your application's view of the data. If you can live with this, good; most people cannot.
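One possible middle ground, since sync is a per-dataset property: leave it at the default for the guests that hold databases and only relax it elsewhere. The dataset name below is just an example of a typical PVE zvol, not something from this thread:

Code:
# keep safe sync behaviour for one specific VM disk
zfs set sync=standard rpool/data/vm-100-disk-0
# review what is in effect across the pool
zfs get -r sync rpool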
 
Thank you @LnxBil for the explanation on data corruption in ZFS. It's perfectly clear now. It is indeed more dangerous than I was led to think.

Regarding my SSD, I'm confused. I thought I had understood you to say that some drives start at 100% and go down as they wear out. Can you please explain why you think my particular SSD is worn out? Since reading this I found that, while Proxmox reports 88%, there is another SMART attribute, the normalized value, which is 12, and maybe it should be reporting that one instead. What you said, together with knowing that this normalized value exists, made me think that my drive is OK after all...
 
Regarding my SSD, I'm confused. I thought I had understood you to say that some drives start at 100% and go down as they wear out. Can you please explain why you think my particular SSD is worn out? Since reading this I found that, while Proxmox reports 88%, there is another SMART attribute, the normalized value, which is 12, and maybe it should be reporting that one instead. What you said, together with knowing that this normalized value exists, made me think that my drive is OK after all...

88% is totally fine and you understood it correctly. The "worn out" part was implied on my part by your "that this SSD is not really damaged". It was not meant as a direct response to your SSD, but about "damaged" SSDs in general. Damage often occurs after the disk is worn out, so that was the implied part on my side. Sorry for the confusion.
 
Now performance is almost 10x better. Since we have a UPS this shouldn't be a problem


Hi,

Sometimes a UPS can itself be a big problem. The issue is how many milliseconds the UPS needs before it notices ... Houston, we have a problem! Usually that is about 8-12 ms, so if your electricity problem is shorter than that ... you can say ... Houston, we have a crash :) And even more dangerous than the UPS case is a kernel crash, where no UPS can save you and you will for sure lose all the sync and async data that is still in RAM / the ZFS ARC. By default that is 5 huge seconds. In 5 seconds a lot of sync data can accumulate, depending on your RAM size and so on. So use, let's say, 3 seconds like I do (but your ARC figures will not be as good), or, much better, also use an L2ARC device!
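For reference, that 5-second interval is the ZFS transaction group timeout. If I remember correctly it is exposed as the zfs_txg_timeout module parameter, so lowering it looks roughly like this; treat it as a sketch and check the OpenZFS module-parameter docs before relying on it:

Code:
# current value (seconds between forced txg flushes)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# lower it to 3 seconds until the next reboot
echo 3 > /sys/module/zfs/parameters/zfs_txg_timeout
# to persist it across reboots, add a line like this to /etc/modprobe.d/zfs.conf:
# options zfs zfs_txg_timeout=3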

Good luck with UPS and sync=disabled :)
 
Understood. I guess I'll re-enable it then :)

Thank you all for the detailed explanations.
 
What actually helped me a lot with the performance of my 8-disk raidz2 was to turn on each individual disk's write cache. It was off by default, so we turned it on via hdparm -W1 /dev/sdX.

You can check the disk cache via hdparm -W /dev/sdX.
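For a whole pool it is quicker in a loop. The device names below are placeholders for the raidz2 members, and on some drives the setting does not survive a power cycle, so it is worth re-checking after a reboot:

Code:
# show the current write-cache state of each member disk
for d in /dev/sd[a-h]; do hdparm -W "$d"; done
# enable the write cache on each of them
for d in /dev/sd[a-h]; do hdparm -W1 "$d"; done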

It changed writing and reading 4K blocks (tested via fio) from around 1 MB/s to around 14 MB/s.

Btw, do you think that is a normal speed for small blocks on new hardware? When I do the same fio test with 1024K blocks, I get around 700 MB/s, which is absolutely fine, I think.
 
The problem was really slow performance and high IO delay; the fix was to disable "sync".
zfs set sync=disabled rpool

2024 and this solution still holds true. I've gone from absolutely terrible IO delay, making the whole system unusable, to having just 20-30% IO delay under load. Nice. Thanks @xtavras
 
@hadus You're aware of the trade-offs of using that option though, yeah? :)
Partially; I read through this thread, so that's the extent of my knowledge.

I'm not super psyched about it, I must admit. But I'm not sure what else I could do at this time. I have a ZFS mirror that holds VMs and LXC containers. Initially I figured I could break it up and create a thin LVM instead, but that's as far as I've gotten :)
I don't know if I can break a ZFS mirror without losing data and I don't know if a thin LVM can be mirrored.
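For what it's worth, a two-way ZFS mirror can normally be split without data loss by detaching one leg; the remaining disk keeps a complete copy, you just lose the redundancy. The names below follow the zpool output from the first post purely as an example, so check zpool status against your own pool before running anything:

Code:
# see which devices make up the mirror
zpool status rpool
# detach one leg; the pool stays online on the remaining disk
zpool detach rpool sdb2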
 
Oh. That kind of sounds like your ZFS storage doesn't have "Thin provision" enabled?

If you take a look at Datacenter → Storage → [storage pool], it should be a checkbox option in there:

[screenshot: the storage edit dialog with the "Thin provision" checkbox]

Or are you thinking about LVM with thin provisioning for a different reason? :)
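For reference, that checkbox corresponds to the sparse flag of the zfspool storage definition. A typical entry in /etc/pve/storage.cfg looks roughly like this; the storage and dataset names are the usual defaults, not something taken from this thread:

Code:
zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1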
 
