ZFS Replication scheduling within 3 node cluster

rauxon · Apr 16, 2021

Hi,

I have a 3-node cluster (not particularly high-spec nodes) that keeps giving me ZFS snapshot timeout errors (according to my Grafana graphs, about 1 failure every 30 minutes). My configuration is set to replicate a subset (about 15) of the VMs/containers every 5 minutes, so there are often jobs running between nodes at the same time.

E.g. 'vmhost3' can be syncing a VM to 'vmhost4' at the same time as a different VM is being synced the other way.

Changing the configuration so that each VM/container is synced at a different time from every other one has removed the snapshot timeout errors; however, I'm now using schedules 0/15 to 14/15 to get them distributed.
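
For anyone wanting to reproduce the workaround, here is a minimal sketch of how the staggered schedule strings could be generated rather than typed by hand. The job IDs and the function name `staggered_schedules` are made up for illustration, and the script only prints the strings; applying them to the replication jobs is left out.

```python
def staggered_schedules(job_ids, period_minutes):
    """Give every job its own minute offset, written as 'offset/period'."""
    if len(job_ids) > period_minutes:
        # With one-minute slots there cannot be more jobs than minutes.
        raise ValueError("more jobs than one-minute slots in the period")
    return {job: f"{offset}/{period_minutes}"
            for offset, job in enumerate(job_ids)}


# Hypothetical job IDs for 15 guests; substitute the real ones.
jobs = [f"{vmid}-0" for vmid in range(100, 115)]
schedules = staggered_schedules(jobs, 15)

for job, schedule in schedules.items():
    print(job, schedule)
```

The obvious cost of this approach is that the number of jobs sets the period: 15 jobs with one-minute slots means each one syncs every 15 minutes rather than every 5.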

Although the spec of the systems is low, the I/O load is not high when the timeouts occur (I've added a ZIL to try to rule out any issues with spinning rust and ZFS), so it looks to be a problem with jobs running simultaneously rather than raw I/O.

The total time taken for all the VMs to sync is well below 5 minutes, so I'm wondering whether there is any way to put something into the sync schedule to prevent jobs running simultaneously. Thoughts that occur are:

'n/5' so that it runs at a node ID offset. This would mean that node1 runs at 1/5, node2 at 2/5, node3 at 3/5. This would solve my problem.
'<5' so that the sync is run such that the replica is never more than 5 minutes old (with an implicit 'only run one job at a time').

I do not believe either actually exists; I'm just trying to express what I would like to be able to do.
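
To make the 'n/5' idea concrete, here is a rough sketch that expands the per-node schedules into the minutes of the hour at which each node would start its jobs. It assumes 'offset/step' means "start at minute offset, then repeat every step minutes"; the node names and the helper `fire_minutes` are invented for the example.

```python
def fire_minutes(offset, step, horizon=60):
    """Minutes within one hour at which an 'offset/step' schedule would fire."""
    return list(range(offset, horizon, step))


# Hypothetical mapping of node name to node ID offset.
node_offsets = {"node1": 1, "node2": 2, "node3": 3}
per_node = {node: fire_minutes(offset, 5)
            for node, offset in node_offsets.items()}

# No minute should appear for more than one node.
all_minutes = [m for minutes in per_node.values() for m in minutes]
collision_free = len(all_minutes) == len(set(all_minutes))

for node, minutes in per_node.items():
    print(node, minutes)
print("collision free:", collision_free)
```

This only shows that the start minutes differ between nodes. A node whose jobs take longer than a minute could still overlap with the next node's start, which is why the second idea (one job at a time) would be the stronger guarantee.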

Is anyone else experiencing this, and if so do they have a solution?

Thanks
