Replication stalled

Michael Herf

May 31, 2019
I have replication running for most of my LXC hosts, but recently I found one that was "stuck" on the receiver side, in a zfs rollback. (The only way to see that this is happening is to click "Replication" on every single host in the cluster.)

Consequences were:
1. All replication jobs on the sender side stopped running (for several months)
2. Neither side timed out or retried, even for jobs that were *not* the stuck job

The sender was easily fixed by killing the "zfs send" instances, after which all the other jobs on the sender ran again. The stalled "zfs rollback" required a reboot of the receiver host.

Can there be some monitoring for this, so that I get an email if a replication job takes more than a day, or if it makes no progress? Or a daily "replica health" report with warnings if things are wedged this badly?
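
Until something like that exists, a stopgap could be a small script run from cron on every node. The following is only a rough sketch, not a Proxmox feature: it assumes a procps-style `ps` that supports the `etimes` (elapsed seconds) column, and the one-day threshold, the list of watched commands, and the function names are all my own choices for illustration.

```python
import subprocess

# Assumed threshold: anything older than one day is suspicious.
MAX_AGE_SECONDS = 24 * 60 * 60
# Command fragments to watch for; adjust to what actually runs on your nodes.
WATCHED = ("zfs send", "zfs receive", "zfs recv", "zfs rollback")


def find_stale_zfs_jobs(ps_output, max_age=MAX_AGE_SECONDS):
    """Return (pid, age_seconds, command) for watched commands older than max_age."""
    stale = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that does not look like "pid age cmd".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, age, args = int(parts[0]), int(parts[1]), parts[2]
        if age > max_age and any(w in args for w in WATCHED):
            stale.append((pid, age, args))
    return stale


def read_ps():
    """Ask ps for pid, elapsed seconds and full command line of every process."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report():
    """Print one line per stale job; prints nothing when all is well."""
    for pid, age, args in find_stale_zfs_jobs(read_ps()):
        print(f"STALE pid={pid} age={age / 3600:.1f}h cmd={args}")
```

Calling `report()` from a daily cron job would only produce output when something is wedged, and cron can mail that output if `MAILTO` is set. It does not detect a job that is slow but still making progress, only one that has been alive too long.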
 
This replication lockup issue persists in the latest kernels and ZFS versions.

On the destination side, "zfs receive -F -- [poolname]" hangs for many days, and in some cases the process cannot be killed: the machine won't reboot cleanly and must be hard-reset.
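
For anyone trying to confirm the same symptom, a process that ignores kill is usually in uninterruptible sleep (state `D` in `ps`). I have not verified that this is the state here, so treat that as an assumption. A rough sketch for listing such zfs processes from `ps -eo pid,stat,args` output follows; the function name and the dataset names in the usage example are made up for illustration.

```python
def find_uninterruptible_zfs(ps_output):
    """Return (pid, stat, command) for zfs commands whose state starts with 'D'."""
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and malformed lines.
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        pid, stat, args = int(parts[0]), parts[1], parts[2]
        # Match the zfs binary itself, with or without a leading path.
        is_zfs = any(tok == "zfs" or tok.endswith("/zfs") for tok in args.split())
        if stat.startswith("D") and is_zfs:
            hits.append((pid, stat, args))
    return hits


# Usage with captured output (hypothetical dataset names):
example = (
    "  PID STAT COMMAND\n"
    "  201 D    zfs receive -F -- tank/data\n"
    "  202 Ss   zfs send -- tank/a@b\n"
)
for pid, stat, args in find_uninterruptible_zfs(example):
    print(pid, stat, args)
```

A process that shows up here across several checks, hours apart, is a good candidate to include in an upstream report.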

This happens on a variety of hosts, so it is not corruption on one particular machine. Should this be reported upstream to ZoL, or does anyone know of a Proxmox-specific fix? I wonder whether "backup" snapshots may be happening at the same time as the replication snapshots, or something similar that is unusual outside Proxmox.
 
