Disk Cache wiki documentation

IsThisThingOn

Nov 26, 2021
Hi everyone

I was reading through the wiki at
https://pve.proxmox.com/wiki/Performance_Tweaks#Disk_Cache
and have some questions, and maybe even suggestions for improvement. In places it also contradicts the article it references.

I'm no expert by any means and not a native speaker, so please don't take these suggestions as an insult. I could be very wrong on some basic points :)
Because it is a complex topic, I'd like to number the different points.

1. Maybe start with a disclaimer that for writeback to be safe, you need the barrier-passing feature. That feature is enabled by default on ext4.
2. Do NTFS or ZFS guests also support write barriers by default?
3. The wiki says this only applies to RAW, but I was unable to find good documentation on how these caching modes behave for qcow2. Does anyone have some input?
4. The table is a little bit confusing to newcomers. I think it would be simpler to understand if the Host Page Cache column were split into read and write, together with what data can be lost in a power failure. I would also replace "power loss" with "unexpected shutdown", because that covers more incidents.
5. Rename the mode "none" to "no cache" or "Default (no cache)", because that is how the GUI shows it.
6. The description for cache=none says: "Because the actual storage device may report a write as completed when placed in its write queue only". For what device is that true? Are we talking about the drive firmware lying? I had this discussion in a TrueNAS forum, where I asked how we can trust HDDs not to have writes sitting in a queue, and the general consensus was that no drive lies about having written data while it is still in its cache, no matter whether SSD or HDD. Otherwise the outrage would be even bigger than the SMR fiasco from WD. That is why I would ditch the "in case of power failure you can lose data".
7. "cache=none seems to be the best performance": can that really be true? Compared to other modes it just skips the cache and talks directly to the disk, so how can that be the best performance?
8. Reorder the modes to match the order of the GUI.
9. According to the notes, directsync is the safest option. But why? Write through behaves the same but with a read cache, so how is that less safe?


Here is how I, as a beginner, would write that wiki entry:



Mode | Proxmox page cache (read) | Proxmox page cache (write) | Disk write | Data loss from unexpected shutdowns? | Notes
default (No Cache) | disabled | disabled | normal behavior | only async writes | Normal read performance. Normal write performance. Safe with enabled host barrier support.
Direct Sync | disabled | disabled | forces sync | no | Normal read performance. Slow write performance, because even async writes are written as sync writes. Safe even without host barrier support.
Write through | enabled | disabled | forces sync | no | Good read performance, because of the cache. Slow write performance, because even async writes are written as sync writes. Safe even without host barrier support.
Write back | enabled | enabled | normal behavior | only async writes | Good read performance. Normal write performance. Safe with enabled barrier support.
Write back (unsafe) | enabled | enabled | ignores flush | yes | Good read performance. Good write performance for sync writes. Very unsafe! Not recommended!

default (No Cache) is the default since Proxmox 2.X.
  • host page cache is not used
  • In case of a power failure, you could lose async writes
  • You need to use the barrier option in your Linux guest to avoid FS corruption in case of power failure.
This mode causes qemu-kvm to interact with the disk image file or block device with O_DIRECT semantics, so the host page cache is bypassed and I/O happens directly between the qemu-kvm userspace buffers and the storage device. The guest is expected to send down flush commands as needed to manage data integrity. Performance-wise, it is equivalent to direct access to your host's disk.
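
To illustrate what those O_DIRECT semantics mean in practice, here is a small, hedged example (not taken from QEMU); the file path and the 4096-byte alignment are assumptions for the sketch:

Code:
/* Hedged example (not QEMU source): a single O_DIRECT write from userspace.
 * The file path and the 4096-byte alignment are assumptions; O_DIRECT is not
 * supported on every filesystem (tmpfs, for example, refuses it). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT bypasses the host page cache, which is what cache=none does. */
    int fd = open("/var/tmp/testfile.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs buffers (and sizes/offsets) aligned to the logical
     * block size; 4096 bytes is a common value. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { perror("posix_memalign"); return 1; }
    memset(buf, 'x', 4096);

    if (write(fd, buf, 4096) != 4096) { perror("write"); return 1; }

    /* Even with O_DIRECT, durability is only guaranteed after a flush:
     * this is the part where the guest has to send flush commands. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}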

Direct Sync
  • host page cache is not used
  • guest disk cache mode is writethrough
  • similar to Write through, an fsync is made for each write.
This mode causes qemu-kvm to interact with the disk image file or block device with both O_DSYNC and O_DIRECT semantics, where writes are reported as completed only when the data has been committed to the storage device, and when it is also desirable to bypass the host page cache. Like cache=writethrough, it is helpful to guests that do not send flushes when needed. It was the last cache mode added, completing the possible combinations of caching and direct access semantics.

Write through
  • host page cache is used as read cache
  • guest disk cache mode is writethrough
  • similar to Direct Sync, an fsync is made for each write.
This mode causes qemu-kvm to interact with the disk image file or block device with O_DSYNC semantics, where writes are reported as completed only when the data has been committed to the storage device. The host page cache is used in what can be termed a writethrough caching mode. The guest's virtual storage adapter is informed that there is no writeback cache, so the guest does not need to send down flush commands to manage data integrity. The storage behaves as if there is a writethrough cache.

Write Back
  • host page cache is used as read & write cache
  • guest disk cache mode is writeback
  • In case of a power failure, you could lose async writes
  • You need to use the barrier option in your Linux guest to avoid FS corruption in case of power failure.
This mode causes qemu-kvm to interact with the disk image file or block device with neither O_DSYNC nor O_DIRECT semantics, so the host page cache is used and writes are reported to the guest as completed when placed in the host page cache, and the normal page cache management will handle commitment to the storage device. Additionally, the guest's virtual storage adapter is informed of the writeback cache, so the guest would be expected to send down flush commands as needed to manage data integrity.

Write Back (unsafe)
  • as Write Back, but ignores flush commands from the guest!
  • Warning: No data integrity even if the guest is sending flush commands. Not recommended for production use.
 
Thanks for the feedback. It is always good if someone reads it critically. I'll see how I can incorporate your suggestions :)

Some answers to your questions:

First off, QEMU might not interact with a disk directly, separated only by a thin file system layer. There can be quite a few other storage layers in between, such as ZFS, Ceph, Gluster, iSCSI, … you name it, each with its own quirks in how it implements access semantics.

The different cache modes are all possible combinations of O_DIRECT and O_DSYNC semantics (see man 2 open).
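
As a rough sketch of that mapping (the helper function and the mode strings below are made up for illustration; QEMU's actual I/O layer is more involved):

Code:
/* Illustration only: how the cache modes map onto open(2) flags, following
 * the descriptions above. The helper name and mode strings are invented. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>

int open_for_cache_mode(const char *path, const char *mode)
{
    int flags = O_RDWR;

    if (strcmp(mode, "none") == 0)              /* bypass host page cache */
        flags |= O_DIRECT;
    else if (strcmp(mode, "directsync") == 0)   /* bypass page cache, sync every write */
        flags |= O_DIRECT | O_DSYNC;
    else if (strcmp(mode, "writethrough") == 0) /* page cache for reads, sync every write */
        flags |= O_DSYNC;
    /* "writeback" sets neither flag: the host page cache absorbs writes.
     * "unsafe" is writeback with guest flush requests ignored (not an open(2) flag). */

    return open(path, flags);
}

Direct Sync is the only combination that sets both flags, which lines up with the man 2 open quote further down in this thread.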

6. The description for cache=none says: "Because the actual storage device may report a write as completed when placed in its write queue only". For what device is that true? Are we talking about the drive firmware lying? I had this discussion in a TrueNAS forum, where I asked how we can trust HDDs not to have writes sitting in a queue, and the general consensus was that no drive lies about having written data while it is still in its cache, no matter whether SSD or HDD. Otherwise the outrage would be even bigger than the SMR fiasco from WD. That is why I would ditch the "in case of power failure you can lose data".
Do you have a link to that TrueNAS forum thread?

This thread on twitter is quite interesting in that regard: https://twitter.com/xenadu02/status/1495693475584557056
Scrolling through the responses is worth it ;)

Consumer SSDs are definitely lying about data being written in a way that survives the power being cut at that exact moment. For example, if you benchmark them for 10 minutes without sync=1, they will, as expected, show a performance that is close to the specs. If you enable sync, then some will drop considerably in their bandwidth performance, while others will keep the same speeds… Without real power loss protection (capacitors), where the RAM contents can still be written down to non-volatile memory if power is cut, they have no way to actually achieve these sustained speeds. In my small test sample, the more reputable vendors were the ones dropping in speed once sync was enabled.
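
For reference, a minimal C sketch of that kind of test (file name, block size, and write count are arbitrary choices; fio does this far more thoroughly):

Code:
/* Hypothetical micro-benchmark: time 1000 x 4 KiB writes, once buffered and
 * once with a flush after every write. Numbers and file name are made up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double run(int do_sync)
{
    int fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char buf[4096];
    memset(buf, 'x', sizeof buf);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++) {
        if (write(fd, buf, sizeof buf) < 0)  /* error handling kept minimal */
            break;
        if (do_sync)
            fdatasync(fd);  /* ask the device to really persist this write */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("buffered: %.3f s\n", run(0));
    printf("synced:   %.3f s\n", run(1));
    return 0;
}

If the synced run on a consumer drive without capacitors is nearly as fast as the buffered one, that is at least a hint that the drive ACKs flushes while the data is still in its volatile cache.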


7. "cache=none seems to be the best performance", can that really be true? Compared to other modes, it is just not using cache and directly talking to the disk, how can that be the best performance?
The guest OS usually has its own page cache, in which it stores non-sync writes until they are written down to disk. Doing that twice comes with a cost. But, as almost always in life, it depends. There are situations where you might see an improvement if you use the host's page cache.

9. According to the notes, directsync is the safest option. But why? Write through behaves the same but with a read cache, so how is that less safe?
With directsync you actually read the data from the disk/storage and not from the cache. It is not 100% out of the question that the data gets corrupted on its way from the kernel down to the disk. In that case the cache reflects what you wanted the data to be, but not what is actually on the storage.
 
Hi aaron, thanks for taking the time to read it and write a response.
The different cache modes are all possible combinations of O_DIRECT and O_DSYNC semantics (see man 2 open).
Not sure if I really understand the man page.
It says:

Code:
To guarantee synchronous I/O,
O_SYNC must be used in addition to O_DIRECT

So to be 100% sure that all data gets written and nothing is lost in an outage, I have to use Direct Sync as the cache mode, because it is the only one that uses both O_SYNC and O_DIRECT?
I have a hard time believing this. Does that mean that to get the same behavior as a bare metal machine (a guaranteed write for sync writes), I have to use Direct Sync and force my async writes to be sync?


Do you have a link to that TrueNAS forum thread?
Sure. If you don't wanna read the whole thread, here is the gist of it: https://www.truenas.com/community/t...-pool-power-loss-protection.87100/post-721914

This thread on twitter is quite interesting in that regard: https://twitter.com/xenadu02/status/1495693475584557056
Scrolling through the responses is worth it ;)
Wow, not surprised by the Sabrent but SK Hynix looks bad.

Consumer SSDs are definitely lying about data being written in a way that survives the power being cut at that exact moment. For example, if you benchmark them for 10 minutes without sync=1, they will, as expected, show a performance that is close to the specs. If you enable sync, then some will drop considerably in their bandwidth performance, while others will keep the same speeds… Without real power loss protection (capacitors), where the RAM contents can still be written down to non-volatile memory if power is cut, they have no way to actually achieve these sustained speeds. In my small test sample, the more reputable vendors were the ones dropping in speed once sync was enabled.
In my experience, most consumer SSDs are in fact way slower and will drop in performance if you enable sync because they don't lie. Even your Twitter link I read as:
"1 out of 7 reputable consumer(!) SSDs (Samsung, WD, Intel, Kingston, Seagate, Crucial, SK Hynix) doesn't lose data"

The guest OS usually has its own page cache, in which it stores non-sync writes until they are written down to disk. Doing that twice comes with a cost. But, as almost always in life, it depends. There are situations where you might see an improvement if you use the host's page cache.
That is a great point I have not thought about. So would you say that most of the time, it is not really worth bothering with cache? Can you explain a scenario where there are benefits of using cache?
 
So to be 100% sure that all data gets written and nothing is lost in an outage, I have to use Direct Sync as the cache mode, because it is the only one that uses both O_SYNC and O_DIRECT?
I have a hard time believing this. Does that mean that to get the same behavior as a bare metal machine (a guaranteed write for sync writes), I have to use Direct Sync and force my async writes to be sync?
Even on a bare metal machine, you would need to issue writes with direct & sync to only get the ACK once the data is fully written down. Since that would mean terrible performance, caching is used all the time. Typically, the OS flushes writes down periodically, every few seconds.

Applications that need good write assurances, for example databases, issue their writes with sync. For everything else, "we" have settled that cached writes are a good balance between write assurances and performance.
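
To make that concrete, here is a minimal sketch of the pattern (the function, its arguments, and the file handling are invented for this example): the bulk writes only land in the page cache, and a single flush at the commit point provides the durability guarantee.

Code:
/* Hypothetical example: bulk writes stay cached, only the commit is synced. */
#include <string.h>
#include <unistd.h>

int append_records_and_commit(int fd, const char *records[], int n)
{
    /* These writes only land in the page cache: fast, but not yet durable. */
    for (int i = 0; i < n; i++)
        if (write(fd, records[i], strlen(records[i])) < 0)
            return -1;

    /* The "sync" part: only after this returns is the data supposed to be on
     * stable storage (assuming the device honours flush commands). */
    return fdatasync(fd);
}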

In my experience, most consumer SSDs are in fact way slower and will drop in performance if you enable sync because they don't lie.
I got my hands on some NVMEs last year, a Samsung 980 Pro, a WD Black SN 770 and a Crucial P5 Plus, all 2 TiB.

The Samsung was the only one that dropped to a few hundred IOPS once sync was enabled for the FIO 4k benchmarks. The other two were still at a few thousand IOPS with sync. I suspect that they lie about the speed and are still using a cache when ACKing the sync writes. The way the graphs oscillate kinda looks like it. Unfortunately, I did not have any means to connect them in a way where I could detach them right after a flush / sync write test to see whether the data was actually written.


That is a great point I have not thought about. So would you say that most of the time, it is not really worth bothering with cache? Can you explain a scenario where there are benefits of using cache?
Unless you have performance issues, leave the defaults (no cache). If you do experience issues, writeback is the one I would try. Ceph (RBD) would be an example where it can improve the overall performance. RBD doesn't perform too well with many tiny writes, so using the writeback cache to additionally bunch them into fewer but larger writes can help.

While we are at it, Ceph is really happy if it gets SSDs with power loss protection, as they will ACK once the data is in their cache. That is okay because the drive has enough power stored to write everything from the cache to the non-volatile memory.
 
Even on a bare metal machine, you would need to issue writes with direct & sync to only get the ACK once the data is fully written down. Since that would mean terrible performance, caching is used all the time. Typically, the OS flushes writes down periodically, every few seconds.

Applications that need good write assurances, for example databases, issue their writes with sync. For everything else, "we" have settled that cached writes are a good balance between write assurances and performance.
That is how I understand it to be handled. The default behavior is to only write stuff sync that the application requests to be sync, and to write everything else async. But now I get what this doc means by "safest"! In this context it means "even safer than bare metal, because unlike bare metal, even stuff that did not ask for sync gets synced and thus will not be lost". To play devil's advocate here, one could argue that this is not really safer, because these writes did not expect to be safely written to begin with. But I think I get the argument.

So for me, no cache basically acts the same as bare metal.

The Samsung was the only one that dropped to a few hundred IOPS once sync was enabled for the FIO 4k benchmarks.
Strange, the Twitter thread explicitly states that your Crucial P5 Plus and the previous SN 750 handle it correctly. Maybe this should be a point of discussion here and in the TrueNAS forums; this could be pretty dangerous.

Unless you have performance issues, leave the defaults (no cache).
Will do. Is the host page cache also in RAM, and because of that, will ARC do some read caching anyway?

Thank you guys for your inputs!
 
That is how I understand it to be handled. The default behavior is to only write stuff sync that the application requests to be sync, and to write everything else async. But now I get what this doc means by "safest"! In this context it means "even safer than bare metal, because unlike bare metal, even stuff that did not ask for sync gets synced and thus will not be lost". To play devil's advocate here, one could argue that this is not really safer, because these writes did not expect to be safely written to begin with. But I think I get the argument.
Yep, depending on the cache mode selected, you can force stricter / safer behavior than the guest OS / application might choose.

Strange, the Twitter thread explicitly states that your Crucial P5 Plus and the previous SN 750 handle it correctly.
Again, take it with a grain of salt, as I could not actually test it. It is just my suspicion! And please keep in mind that consumer SSDs are known to change over time while keeping their name. Different firmware, different controllers, different memory chips: it has all happened, unfortunately. This goes so far that some review sites put a disclaimer on older reviews, because the same model that can currently be bought has barely anything in common with the one reviewed.
Differently sized SSDs of the same model might also differ. The TL;DR is that in the consumer SSD space it is unfortunately really easy to buy something that is not what you expected if you don't pay close attention.

Will do. Is the host page cache also in RAM, and because of that, will ARC do some read caching anyway?
If you have the disk images on a ZFS pool, then yeah, using writeback (or writethrough) for a disk could probably lead to double caching.
 