Random zfs replication errors

mbosma

Recently I upgraded a 3-node cluster from pve-5.3 to pve-6.3-3.
These nodes all have ZFS replication running to the two other nodes.

Since the upgrade I've been getting random errors about the replication failing.
I'm receiving an e-mail with "Replication Job: 127-2 failed - no tunnel IP received" at seemingly random intervals.
When a replication fails, the next (or a manual) replication succeeds again.
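For the manual run I just kick the job off from the shell, roughly like this (syntax from memory, so double-check the job ID):
Code:
# show the state of all replication jobs on this node
pvesr status

# re-run job 127-2 immediately instead of waiting for the next scheduled run
pvesr schedule-now 127-2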

Recreating the replication jobs or running the two replications for a VM at different intervals doesn't help.
I also checked whether any loose/orphaned ZFS snapshots were left on the target nodes, but they seem to be clean.
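For that check I simply listed the replication snapshots on each target node, something along these lines:
Code:
# replication snapshots are named __replicate_<jobid>_<timestamp>__
zfs list -t snapshot -o name,creation | grep __replicate_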

It's pretty hard to debug because I can't find any helpful information in the logs and the errors aren't persistent.

I have another cluster running the exact same setup, but installed fresh from pve-6.2-1, that doesn't experience these errors, so I'm guessing this has something to do with the upgrade.

Does anyone have a tip on how to debug this problem?
 
Have you resolved your problem? I get random replication errors, mostly during backups. Moving to PBS helped. Today I was doing some upgrades, which created some disk I/O, and started getting them. Under normal load everything is quiet. I checked my logs as well and see nothing. I'm running all SATA SSD arrays, so performance hasn't been an issue. I just keep watching to see if anyone else has experienced this.
 
Unfortunately I haven't been able to resolve the issue so far.
As these are mostly PBX VMs, they don't generate much load (especially not on the storage), and I don't run backups either.

The systems are connected using a dedicated 10G backend link for replication, so I don't suspect the network is the problem.
I'm also using SSDs, by the way.

It's pretty hard to debug because the issue pops up seemingly at random.
I have checked the logs and graphs (from Munin) to try to find a correlation, but haven't been able to find one.
 
I wanted to update this thread again.
The cluster just got updated to the newest pve-enterprise repo version (6.3-6 as of writing), including ZFS 2.0.
Even though this did not fix the issue, it did give me some more information.

Instead of the generic "no tunnel IP received" error, it now spits out "no tunnel IP received, got 'trying to acquire lock...'".
It seems like it isn't able to get a lock; my guess is that there are simply too many replications going on.

The affected cluster has 58 VMs distributed over 3 nodes, and each VM gets replicated to the other two nodes.
This results in 116 replications every 15 minutes.

The other cluster, which isn't affected, only has 25 VMs and 50 replications per 15 minutes.
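To double-check those numbers I just counted the configured jobs, something like this (I don't remember off-hand whether this lists jobs per node or cluster-wide):
Code:
# list the configured replication jobs and count them (subtract the header line)
pvesr list | wc -l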

Could this be an issue of running too many replications at once?
Is there a way I can check that without having to cat every log when an error happens?
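For now I'm working around it by grepping all the per-job logs whenever the error e-mail comes in, roughly like this:
Code:
# find which job logs in the replication log directory mention the tunnel error
grep -l "no tunnel IP received" /var/log/pve/replicate/*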
 
Just got another error and catted every file in /var/log/pve/replicate/*.

It's not a conclusive error, but I still wanted to include it.
The affected replication is 107-0.
Code:
2021-04-21 17:00:14 101-0: (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-0_1619016309__' on local-zfs:vm-101-disk-0
2021-04-21 17:00:15 101-0: end replication job
2021-04-21 17:00:06 101-1: start replication job
2021-04-21 17:00:06 101-1: guest => VM 101, running => 493
2021-04-21 17:00:06 101-1: volumes => local-zfs:vm-101-disk-0
2021-04-21 17:00:07 101-1: create snapshot '__replicate_101-1_1619017206__' on local-zfs:vm-101-disk-0
2021-04-21 17:00:07 101-1: using insecure transmission, rate limit: none
2021-04-21 17:00:07 101-1: incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-1_1619016305__ => __replicate_101-1_1619017206__)
2021-04-21 17:00:09 101-1: send from @__replicate_101-1_1619016305__ to rpool/data/vm-101-disk-0@__replicate_101-0_1619016309__ estimated size is 407K
2021-04-21 17:00:09 101-1: send from @__replicate_101-0_1619016309__ to rpool/data/vm-101-disk-0@__replicate_101-1_1619017206__ estimated size is 88.3M
2021-04-21 17:00:09 101-1: total estimated size is 88.7M
2021-04-21 17:00:09 101-1: [pve5] successfully imported 'local-zfs:vm-101-disk-0'
2021-04-21 17:00:09 101-1: delete previous replication snapshot '__replicate_101-1_1619016305__' on local-zfs:vm-101-disk-0
2021-04-21 17:00:10 101-1: (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-1_1619016305__' on local-zfs:vm-101-disk-0
2021-04-21 17:00:10 101-1: end replication job
2021-04-21 17:00:15 107-0: start replication job
2021-04-21 17:00:15 107-0: guest => VM 107, running => 12492
2021-04-21 17:00:15 107-0: volumes => local-zfs:vm-107-disk-0
2021-04-21 17:00:16 107-0: create snapshot '__replicate_107-0_1619017215__' on local-zfs:vm-107-disk-0
2021-04-21 17:00:16 107-0: using insecure transmission, rate limit: none
2021-04-21 17:00:16 107-0: incremental sync 'local-zfs:vm-107-disk-0' (__replicate_107-0_1619016318__ => __replicate_107-0_1619017215__)
2021-04-21 17:00:17 107-0: delete previous replication snapshot '__replicate_107-0_1619017215__' on local-zfs:vm-107-disk-0
2021-04-21 17:00:17 107-0: end replication job with error: no tunnel IP received, got 'trying to acquire lock...
'
2021-04-21 17:00:17 107-1: start replication job
2021-04-21 17:00:17 107-1: guest => VM 107, running => 12492
2021-04-21 17:00:17 107-1: volumes => local-zfs:vm-107-disk-0
2021-04-21 17:00:18 107-1: create snapshot '__replicate_107-1_1619017217__' on local-zfs:vm-107-disk-0
2021-04-21 17:00:19 107-1: using insecure transmission, rate limit: none
2021-04-21 17:00:19 107-1: incremental sync 'local-zfs:vm-107-disk-0' (__replicate_107-1_1619016314__ => __replicate_107-1_1619017217__)
2021-04-21 17:00:20 107-1: send from @__replicate_107-1_1619016314__ to rpool/data/vm-107-disk-0@__replicate_107-0_1619016318__ estimated size is 624B
2021-04-21 17:00:20 107-1: send from @__replicate_107-0_1619016318__ to rpool/data/vm-107-disk-0@__replicate_107-1_1619017217__ estimated size is 11.9M
2021-04-21 17:00:20 107-1: total estimated size is 11.9M
2021-04-21 17:00:20 107-1: [pve5] successfully imported 'local-zfs:vm-107-disk-0'
2021-04-21 17:00:20 107-1: delete previous replication snapshot '__replicate_107-1_1619016314__' on local-zfs:vm-107-disk-0
2021-04-21 17:00:21 107-1: (remote_finalize_local_job) delete stale replication snapshot '__replicate_107-1_1619016314__' on local-zfs:vm-107-disk-0
2021-04-21 17:00:21 107-1: end replication job
2021-04-21 17:00:27 110-0: start replication job
2021-04-21 17:00:27 110-0: guest => VM 110, running => 21489
2021-04-21 17:00:27 110-0: volumes => local-zfs:vm-110-disk-1,local-zfs:vm-110-state-Update_firmware
2021-04-21 17:00:28 110-0: create snapshot '__replicate_110-0_1619017227__' on local-zfs:vm-110-disk-1
2021-04-21 17:00:28 110-0: create snapshot '__replicate_110-0_1619017227__' on local-zfs:vm-110-state-Update_firmware
2021-04-21 17:00:28 110-0: using insecure transmission, rate limit: none
 
Thank you for the update!
I mitigated the issue by switching to secure transmission; I'll update the packages once we get a maintenance window from the client.
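In case it helps anyone else: as far as I understand, replication follows the migration settings, so the switch to secure transmission is just the migration type in /etc/pve/datacenter.cfg (the network below is only an example for our backend link):
Code:
# /etc/pve/datacenter.cfg
# secure = zfs send/receive tunnelled over SSH, insecure = plain TCP socket
migration: type=secure,network=10.10.10.0/24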
 