I have replication running for most of my LXC hosts, but recently I found one that was "stuck" on the receiver side, in a zfs rollback. (The only way to see that this is happening is to click "Replication" on every single host in the cluster.)
Consequences were:
1. All jobs on the sender side were stopped (for several months)
2. Neither side timed out or retried, even for jobs that were *not* the stuck job
The sender was easily fixed by killing the "zfs send" instances, after which all the other jobs on the sender resumed. The stalled "zfs rollback" required a reboot of the receiver host.
Can there be some monitoring for this, so I get email if a replication job takes more than a day, or if it makes no progress? Or, a daily list of "replica health" with warnings if things are wedged this badly?
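In the meantime, a crude watchdog along these lines might catch the situation described above. This is only a sketch, not anything Proxmox provides: it scans the process table for zfs send/receive/rollback processes whose elapsed time exceeds a threshold (one day here, an arbitrary choice), so it could be run from cron on each node and the output piped to mail.

```shell
# find_stuck: read "PID ETIMES ARGS..." lines on stdin and print those where
# the command is a zfs send/receive/rollback older than $1 seconds.
# The field layout matches `ps -eo pid=,etimes=,args=`.
find_stuck() {
    awk -v max="$1" '$2 > max && $3 == "zfs" && $4 ~ /^(send|receive|recv|rollback)$/'
}

# Example cron usage (hypothetical; threshold 86400 s = one day):
#   ps -eo pid=,etimes=,args= | find_stuck 86400 \
#       | mail -E -s "possibly stuck zfs replication" root
```

This only detects "running too long", not "making no progress"; a fancier version could also compare the receiver dataset's `written` property between runs.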