Hi,
I have a 3-node cluster (not particularly high-spec nodes) that keeps giving me ZFS snapshot timeout errors (according to my Grafana graphs, about 1 failure every 30 minutes). My configuration replicates a subset of about 15 VMs/containers every 5 minutes, so there are often jobs running between nodes at the same time.
E.g. 'vmhost3' can be syncing a VM to 'vmhost4' at the same time as a different VM is being synced the other way.
Changing the configuration so that each VM/container is synced at a different time from every other one has removed the snapshot timeout error. However, to get them distributed I'm now using schedules 0/15 to 14/15, i.e. one minute slot per guest.
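To make that workaround concrete, here is a minimal sketch of how the offsets are assigned. It is purely illustrative: the guest IDs 100 to 114 are made up, and the function name is just something I picked for the example. It only builds the schedule strings and does not touch any replication job.

```python
def staggered_schedules(guest_ids, period=15):
    """Give each guest its own minute slot, written as 'offset/period'.

    guest_ids: hypothetical VM/container IDs (not my real ones).
    period:    schedule period in minutes; 15 yields '0/15' .. '14/15'.
    """
    if len(guest_ids) > period:
        # With more guests than minute slots, two jobs would share a start minute.
        raise ValueError("more guests than minute slots; some would overlap")
    return {gid: f"{slot}/{period}" for slot, gid in enumerate(guest_ids)}


if __name__ == "__main__":
    guests = list(range(100, 115))  # 15 made-up guest IDs
    for gid, schedule in staggered_schedules(guests).items():
        print(gid, schedule)
```

Each guest gets a distinct start minute, which is what stopped the timeouts for me.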
Although the spec of the systems is low, the I/O load is not high when the timeout occurs (I've added a ZIL device to try to resolve any issues with spinning rust and ZFS), so it looks like a problem with jobs running simultaneously rather than raw I/O.
The total time taken for all the VMs to sync is well below 5 minutes, so I'm wondering whether there is any way to put something into the sync schedule to prevent jobs from running simultaneously. Thoughts that occur are:
- 'n/5' so that it runs at a node ID offset, meaning node1 runs at 1/5, node2 at 2/5 and node3 at 3/5. This would solve my problem.
- '<5' so that the sync is scheduled such that the replica is never more than 5 minutes old (with an implicit 'only run one job at a time').
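To illustrate what I mean by the first idea, here is a small sketch of the start minutes it would produce. Note that 'n/5' is my proposed syntax, not something that exists today, and the function names are invented for the example. It only checks start minutes, so a job that runs longer than a minute could still overlap with the next node's job.

```python
def node_minutes(node_id, period=5):
    """Start minutes within an hour for a node under the proposed 'n/period' rule."""
    return list(range(node_id % period, 60, period))


def clashes(node_ids, period=5):
    """Return (first_node, second_node, minute) for every shared start minute."""
    owner = {}   # minute -> node that claimed it first
    found = []
    for node in node_ids:
        for minute in node_minutes(node, period):
            if minute in owner:
                found.append((owner[minute], node, minute))
            else:
                owner[minute] = node
    return found


if __name__ == "__main__":
    for node in (1, 2, 3):
        print(f"node{node}:", node_minutes(node))
    print("clashes:", clashes([1, 2, 3]))
```

With three nodes and a 5 minute period no two nodes ever start in the same minute; clashes only appear once there are more nodes than minutes in the period.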
Is anyone else experiencing this, and if so do they have a solution?
Thanks
 
	