ZFS sync=disabled safeness

decibel83

Hi,

Is it safe to set ZFS sync to disabled on Proxmox 6 if I have working UPS protection?
I see that fsyncs increase a lot with sync disabled; will this corrupt any snapshots?

Thanks!
 
power loss is not the only thing that can cause your system to drop dead. I'd only set sync to disabled if you don't care about losing the last few X of writes, where X depends on your work load.
 
Hi dB's ;)

With sync disabled, in the worst-case scenario you will lose the data that is in the ZFS cache (5 seconds by default). But power loss is not the only
thing that can cause data loss (a UPS will prevent that); a kernel crash can too (in that case a UPS will not help).
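
For reference, the 5-second figure corresponds to the OpenZFS `zfs_txg_timeout` module parameter. A minimal Python sketch for checking it on a Linux host, assuming the usual sysfs location (the path constant and the helper name `read_txg_timeout` are assumptions for illustration, not something from this thread):

```python
from pathlib import Path
from typing import Optional

# Assumed location of the OpenZFS module parameter on Linux.
TXG_TIMEOUT_PATH = "/sys/module/zfs/parameters/zfs_txg_timeout"


def read_txg_timeout(path: str = TXG_TIMEOUT_PATH) -> Optional[int]:
    """Return the txg timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (zfs module not loaded) or unexpected content.
        return None


if __name__ == "__main__":
    timeout = read_txg_timeout()
    if timeout is None:
        print("zfs_txg_timeout not readable (is the zfs module loaded?)")
    else:
        print(f"txg timeout: {timeout} s")
```

Treat the value as a rough indication of the loss window rather than a guarantee: a transaction group can also be committed earlier under heavy write load.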
 
Thanks.
Could you help me understand which events can cause the last writes to be lost?
  • Power interruption without UPS
  • Hard shutdown
  • Kernel crash on Proxmox host
  • Kernel crash on virtual machine?
  • Qm process or container kill from Proxmox host?
  • Snapshot?
  • Other?
Are snapshots safe with sync=disabled?
 
Because I'm getting 10x the fsyncs with sync disabled, and virtual machines are much, much faster.

well, disabling sync makes all I/O asynchronous, regardless of the protocol.
I don't know what the default txg commit interval is, but that is the timeframe of data you will basically lose.
depending on the application this can/will result in inconsistent data; for instance, Linux will probably boot into recovery mode so that you can fix it with fsck. with databases you will have different problems.
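
To illustrate what a "sync" write looks like from the application's side, here is a minimal Python sketch (the helper name `durable_write` is made up for illustration). The application writes, calls `fsync()`, and only then treats the data as durable. With sync=standard, ZFS returns from `fsync()` once the data is on stable storage; with sync=disabled it returns right away and the data only reaches the disk with the next txg commit.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has returned."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # The application relies on this call for durability.
        # With sync=disabled, ZFS no longer waits for stable storage here.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Anything the application does after `durable_write()` returns (replying to a client, updating a journal) assumes the data is safe, and that assumption is exactly what sync=disabled removes.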

rather than disabling it completely you should investigate why it is so slow
I guess you don't have a SLOG, do you?
 
No, I don't have a SLOG. Do I need one even when using SSD and NVMe drives?

A simple enterprise level SSD with e.g. 32 GB is totally sufficient for this and will increase your throughput tremendously.
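
As a rough sketch of the steps involved, the following Python helper builds the `zfs`/`zpool` commands for checking the sync property, adding a log device, and setting sync back to standard. The pool, dataset, and device names are placeholders, and the helper functions are hypothetical; by default the commands are only printed, not executed.

```python
import subprocess
from typing import List

SYNC_VALUES = ("standard", "always", "disabled")


def get_sync_cmd(dataset: str) -> List[str]:
    """Command that prints the current sync setting of a dataset."""
    return ["zfs", "get", "-H", "-o", "value", "sync", dataset]


def set_sync_cmd(dataset: str, value: str) -> List[str]:
    """Command that sets the sync property of a dataset."""
    if value not in SYNC_VALUES:
        raise ValueError(f"sync must be one of {SYNC_VALUES}, got {value!r}")
    return ["zfs", "set", f"sync={value}", dataset]


def add_slog_cmd(pool: str, device: str) -> List[str]:
    """Command that adds a separate log (SLOG) device to a pool."""
    return ["zpool", "add", pool, "log", device]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder names: replace with your own pool, dataset and device.
    run(get_sync_cmd("rpool/data"))
    run(add_slog_cmd("rpool", "/dev/disk/by-id/EXAMPLE-SSD-part1"))
    run(set_sync_cmd("rpool/data", "standard"))
```

Double-check the device path before running anything with `dry_run=False`; adding the wrong device to a pool is not something you want to undo under pressure.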

The qm process kill thing does not apply, but everything else will. If you create a snapshot and press the reset button, you could end up with a corrupt snapshot, if it has been written/registered at all. Your overall likelihood of data loss increases in general.
 
It's interesting to read. Has anybody actually experienced data corruption of any kind, or a corrupted snapshot, after a power loss with sync=disabled?
To my understanding, the consequences will be exactly the same as if the power loss had happened ~5 seconds earlier with sync=standard. Am I wrong?
 
yes, you are. you cannot just look at the disk state, the issue is that
- application writes to disk with sync, hands out reply corresponding to the persisted state (or does something else that has side-effects)
- crash

with sync, the on-disk and the replied-with state are in agreement
with sync=disabled, the on-disk state and the replied-with state are in disagreement

a basic example: the reply contains some kind of (auto-incremented) ID, the system starts up again, assigns that ID again, the client that got it before the crash will be rather confused since its local state and the server state are not referring to the same thing under that ID.

when you disable sync semantics, you basically break an invariant of the application logic since you take away one mechanism that is there to ensure consistency. this might not matter, or it might matter a lot - it all depends on what is using those semantics.
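
The auto-increment example can be played through with a toy model. This is only a simulation in Python under simplified assumptions (the `CounterStore` class and its methods are invented for illustration and do not touch ZFS): `on_disk` is what survives a crash, `in_memory` is what has been acknowledged but possibly not persisted.

```python
from typing import List


class CounterStore:
    """Toy 'database' that hands out auto-incremented IDs."""

    def __init__(self, honor_sync: bool) -> None:
        self.honor_sync = honor_sync
        self.on_disk = 0    # state that survives a crash
        self.in_memory = 0  # state the application has replied with

    def next_id(self) -> int:
        self.in_memory += 1
        if self.honor_sync:
            # fsync is honored: persisted before the reply goes out.
            self.on_disk = self.in_memory
        return self.in_memory

    def txg_commit(self) -> None:
        """Periodic flush; the only way to disk when sync is disabled."""
        self.on_disk = self.in_memory

    def crash_and_restart(self) -> None:
        """Everything not on disk is gone."""
        self.in_memory = self.on_disk


def ids_seen_by_clients(honor_sync: bool) -> List[int]:
    store = CounterStore(honor_sync)
    before = [store.next_id() for _ in range(3)]
    store.crash_and_restart()  # crash before any txg commit
    after = [store.next_id() for _ in range(2)]
    return before + after


if __name__ == "__main__":
    print(ids_seen_by_clients(honor_sync=True))   # [1, 2, 3, 4, 5]
    print(ids_seen_by_clients(honor_sync=False))  # [1, 2, 3, 1, 2]
```

In the second run, IDs 1 and 2 are handed out twice: whoever received them before the crash now holds references that mean something different on the server.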
 
I did lose a whole pool once because of a power loss while an SSD mirror was trimming (and this was with sync not disabled). I recently bought my first enterprise-ish SSD with PLP and the sync writes are much faster (from 400 to 17500 fsyncs/second in pveperf). They can be cached because of the power loss protection, and I don't have to worry about unsafe settings, power loss, or poorly-timed reboots anymore. No wonder they are so often recommended on this forum.
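
If you want a rough fsync rate for a given directory without pveperf, something like the following Python sketch can help. It is not how pveperf measures, so the numbers are only comparable with each other, not with pveperf output; the function name `fsyncs_per_second` and the 4 KiB block size are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory: str, duration: float = 1.0) -> float:
    """Repeatedly write a 4 KiB block and fsync it; return fsyncs per second."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    block = b"x" * 4096
    fd, path = tempfile.mkstemp(dir=directory)
    count = 0
    try:
        start = time.monotonic()
        deadline = start + duration
        while True:
            os.pwrite(fd, block, 0)  # overwrite the same block each time
            os.fsync(fd)
            count += 1
            if time.monotonic() >= deadline:
                break
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed


if __name__ == "__main__":
    # Point this at a directory on the pool you want to test.
    print(f"{fsyncs_per_second(tempfile.gettempdir()):.0f} fsyncs/second")
```

Run it once with sync=standard and once with sync=disabled on the same dataset to see the difference discussed in this thread.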
 
I disabled sync years ago and never faced any problem.
But mind you, this is a home server. In production I wouldn't disable it.
 
I burned my fingers disabling sync in a workstation setup where the pool only had a single disk. At some point, the data was corrupted, and I created the pool from scratch again, leaving sync enabled. ;)

recently bought my first enterprise-ish SSD with PLP and the sync writes are much faster
Yeah, and they usually deliver on the specs or can handle even more, unlike many consumer SSDs where you can get close to the specs only in very defined circumstances ;)
 
Thanks for your answer, Fabian. Sorry, I didn't make it clear. I understand that when we are talking about a distributed system it totally depends on the application. If the app can't handle that situation, then of course one node may be confused about the state of the other node. Totally agree.
What I wanted to know is the case of one independent node. Does `sync` really matter in terms of data corruption, or is it just a loss of ~5 seconds? Again, the database is only on one node.

I did lose a whole pool once because of a power loss while an SSD mirror was trimming (and this was with sync not disabled).
Thanks for the info! Unfortunately, it's unclear what caused the corruption in this situation.

I burned my fingers disabling sync in a workstation setup where the pool only had a single disk. At some point, the data was corrupted, and I created the pool from scratch again, leaving sync enabled. ;)
Thank you for the info. Did the disk survive, no bad blocks? Or was it entirely software fault?
 
Thanks for your info! unfortunately It's unclear what caused the corruption in this situation.
Trimming is another thing that can go wrong if the drives have no power loss protection (enterprise SSDs have it, which is also why they can safely cache sync writes). I managed for years without PLP but the speed and peace of mind are amazing.
 
Thank you for the info. Did the disk survive, no bad blocks? Or was it entirely software fault?
Switching off power at the wrong time was mainly the reason AFAIR. The disk is still working, just with sync not disabled after the pool was recreated ;)
 
Switching off power at the wrong time was mainly the reason AFAIR. The disk is still working, just with sync not disabled after the pool was recreated ;)
That shouldn't have happened, according to the documentation, but probably something went wrong.. Thanks!
 
