I recently started getting replication failures. I have two servers and a monitor node set up as a sort of poor man's HA, and the VMs replicate back and forth every 15 minutes. Replication from host02 to host01 works fine, but from host01 to host02 most of the VMs fail with:
Code:
command 'set -o pipefail && pvesm export Main_Zpool:vm-105-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_105-0_1584670268__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host02' root@192.168.20.62 -- pvesm import Main_Zpool:vm-105-disk-0 zfs - -with-snapshots 1' failed: exit code 255
For 2 or 3 of the VMs I get this instead:
Code:
command 'zfs snapshot Main-ZFS/vm-101-disk-0@__replicate_101-0_1584656345__' failed: got timeout
The two servers are connected directly by two 10 Gb DAC links (server to server, no switch), bonded together in a balance-xor configuration.
After googling a bit, I found a recommendation to delete the replicated copies on the remote server. That seemed to work for a few of the VMs, but it did not fix the timeouts and did not fix the issue permanently.
I also tried deleting and recreating the replication tasks (I waited in between to make sure each task had actually been removed).
What are some things to try? Where do I start looking?