I have replication running for most of my LXC hosts, but recently I found one that was "stuck" on the receiver side, in a zfs rollback. (The only way to see that this is happening is to click "Replication" on every single host in the cluster.)
Consequences were:
1. All jobs on the sender side were stopped (for several months)
2. Neither side timed out or retried, even for jobs that were *not* the stuck job
The sender was easily fixed by killing the "zfs send" instances, after which all the other jobs on the sender resumed. The stalled "zfs rollback" required a reboot of the receiver host.
Can there be some monitoring for this, so I get email if a replication job takes more than a day, or if it makes no progress? Or, a daily list of "replica health" with warnings if things are wedged this badly?
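In the meantime, a crude watchdog along these lines might catch the situation described above. This is only a sketch, not anything Proxmox provides: it scans the process table for zfs send/receive/rollback processes whose elapsed time exceeds a threshold (one day here, an arbitrary choice), so it could be run from cron on each node and the output piped to mail.

```shell
# find_stuck: read "PID ETIMES ARGS..." lines on stdin and print those where
# the command is a zfs send/receive/rollback older than $1 seconds.
# The field layout matches `ps -eo pid=,etimes=,args=`.
find_stuck() {
    awk -v max="$1" '$2 > max && $3 == "zfs" && $4 ~ /^(send|receive|recv|rollback)$/'
}

# Example cron usage (hypothetical; threshold 86400 s = one day):
#   ps -eo pid=,etimes=,args= | find_stuck 86400 \
#       | mail -E -s "possibly stuck zfs replication" root
```

This only detects "running too long", not "making no progress"; a fancier version could also compare the receiver dataset's `written` property between runs.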