Replication failure improvement - staggering or bandwidth limit

Elliott Partridge

I routinely get replication job failures between the hosts in my 2-node Proxmox cluster. I have grown accustomed to ignoring them, as they always recover on the next run. Is there a way to avoid these failures? I'm thinking of staggering the replication schedule between the two hosts, as they currently run jobs at the same time. Another thought would be to limit the bandwidth of replication, but I'm not sure I want to slow everything down like that.

Has anyone had experience with this and care to share?
 
Hello

Could you share your journal from the last 2 days (assuming the issue occurred in this time)?
Code:
journalctl --since '2023-10-22' > $(hostname)-journal.txt

Also, please share your /etc/pve/storage.cfg
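
In the meantime, both ideas from your post (staggering and a bandwidth limit) can be tried per replication job with pvesr. The following is only a dry-run sketch that prints the commands rather than running them. The job IDs 100-0 and 101-0, the minute offsets, and the 50 MB/s limit are hypothetical placeholders, and the helper name pvesr_update_cmd is made up for illustration. It also assumes the short calendar form N/15 is read as "every 15 minutes, starting at minute N", so please check the real job IDs with pvesr list and the schedule syntax against the documentation first.
Code:
```shell
#!/bin/sh
# Dry-run sketch: prints pvesr commands instead of executing them.
# Job IDs, offsets and the rate value below are hypothetical examples.

# Build a command that sets a job to run every 15 minutes starting at
# minute $2, optionally capped at $3 MB/s via --rate.
pvesr_update_cmd() {
    job="$1"
    offset="$2"
    rate="${3:-}"
    cmd="pvesr update $job --schedule $offset/15"
    if [ -n "$rate" ]; then
        cmd="$cmd --rate $rate"
    fi
    echo "$cmd"
}

# Job on the first node: minutes 0, 15, 30, 45 (assumed job ID).
pvesr_update_cmd 100-0 0
# Job on the second node: shifted by 7 minutes and rate-limited (assumed job ID).
pvesr_update_cmd 101-0 7 50
```

Since the rate limit is set per job, it would only slow down the jobs you apply it to, not everything.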
 
Thanks for the prompt response!

I've attached the requested files, zipped up. Many ZFS operations end with "failed: got timeout", mostly if not exclusively during backup jobs.
 

Attachments

  • pve-journal.zip
    172.4 KB
