Replication failures on zfs

anthony

I recently started getting replication failures. I have 2 servers and a monitor node set up in a sort of poor man's HA, and the VMs replicate back and forth every 15 minutes. Replications from host02 to host01 work fine; however, from host01 to host02 most of the VMs get:
Code:
command 'set -o pipefail && pvesm export Main_Zpool:vm-105-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_105-0_1584670268__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host02' root@192.168.20.62 -- pvesm import Main_Zpool:vm-105-disk-0 zfs - -with-snapshots 1' failed: exit code 255

For 2 or 3 of the VMs I get:
Code:
command 'zfs snapshot Main-ZFS/vm-101-disk-0@__replicate_101-0_1584656345__' failed: got timeout

The 2 servers are tied together with 2 10Gb DAC links (server to server, no switch) bonded in a balance-xor configuration.

After googling a bit I found someone recommending deleting the copies on the remote server. That seemed to work for a few of the copies, but it did not fix the timeouts and didn't fix the issue permanently.

I also tried deleting and recreating the replication tasks (waiting in between to ensure the task was fully removed).
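(For reference, a minimal sketch of checking and removing those jobs from the CLI with pvesr; the job ID 105-0 is just taken from the error above, adjust it to your own jobs:)
Code:
# list configured replication jobs and show their current status / last error
pvesr list
pvesr status
# delete a job and let the scheduler clean up its snapshots (example job ID)
pvesr delete 105-0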

What are some things to try? Where do I start looking?
 
That means your system is too overloaded to create new snapshots within a reasonable time.
 
They're utilizing maybe 10% CPU and 50% RAM. I can't see them being overloaded.
 
Well, you can see the error message: zfs snapshot times out. That usually means your ZFS disks are overloaded, or that ZFS is deadlocked, but then you'd see other symptoms as well (such as not being able to access disks inside the VMs).
 
Oh, like ZFS itself is overloaded... gotcha. The timeout is on 2 particular VMs; the rest get the 255 error. Is that something different? All the VMs are able to both read and write to their disks without problems.
 
Well, ZFS can't really get "overloaded" in the sense that it lacks CPU or RAM resources. The timeout rather happens because the zfs snapshot command does not return, and the script has a timeout "around" the zfs commands so it can continue in a predictable manner. So, first go ahead and examine your zpools and ZFS datasets, then try the snapshot command manually and see what ZFS has to tell you.

It might be that the underlying storage devices, or merely one of them, are giving ZFS trouble. There might also be some corruption in that particular ZFS dataset, which can also cause ZFS operations to stall. This is a case where I'd assume something like that to be the issue.
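(As a rough sketch of those checks, using the pool and dataset names from the timeout error above; the snapshot name manual_test is arbitrary:)
Code:
# pool health, pending resilvers, device errors
zpool status -v Main-ZFS
# the dataset and its existing snapshots
zfs list -t all -r Main-ZFS/vm-101-disk-0
# try the snapshot by hand and see whether it returns promptly or hangs
zfs snapshot Main-ZFS/vm-101-disk-0@manual_test
zfs destroy Main-ZFS/vm-101-disk-0@manual_test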
 
Interestingly, I was in the middle of doing that already. I was seeing some errors on one drive and decided to reinitialize it through the IPMI and reboot. When the machine came back up, I found that the problem drive had completely dropped offline and was no longer detectable, even by the IPMI/hardware. So it looks like it was a failed drive, but it failed in an interesting way.
 
The crux of SATA drives… SAS drives seem to handle these kinds of issues much better, but they're more expensive and only come in smaller capacities. ZFS zpools are set to "wait" on error by default (the failmode property), which can cause such deadlocks.
 
No, not exactly; you can set the zpool's failmode from wait to panic, which would cause a reboot if such an issue occurred.
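(For example, assuming the pool is named Main-ZFS as in the logs:)
Code:
# show the current failmode; 'wait' is the default
zpool get failmode Main-ZFS
# make a hung pool panic/reboot the node instead of blocking forever
zpool set failmode=panic Main-ZFS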
 
So I fixed the drive problem and deleted the VM images on the second machine. Everything seems to run through once, then I get the error.
What am I missing?
I've run iperf tests and I'm right at my 10Gb speed with minimal errors, but to eliminate this as a potential problem I switched the replication network to an open 1Gbps link, with no difference.
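(Roughly what such a bandwidth test looks like, assuming iperf3 on both nodes and the target IP from the error above:)
Code:
# on the receiving node (Host02)
iperf3 -s
# on the sending node, test the replication link for 30 seconds
iperf3 -c 192.168.20.62 -t 30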
IO delay is less than 1% 99% of the time; I have seen it jump to 9%, but it immediately went back down.

Code:
2020-03-23 17:00:58 100-0: start replication job
2020-03-23 17:00:58 100-0: guest => VM 100, running => 37347
2020-03-23 17:00:58 100-0: volumes => Main_Zpool:vm-100-disk-0
2020-03-23 17:01:01 100-0: create snapshot '__replicate_100-0_1584997258__' on Main_Zpool:vm-100-disk-0
2020-03-23 17:01:01 100-0: full sync 'Main_Zpool:vm-100-disk-0' (__replicate_100-0_1584997258__)
2020-03-23 17:01:01 100-0: full send of Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997141__ estimated size is 8.91G
2020-03-23 17:01:01 100-0: send from @__replicate_100-0_1584997141__ to Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997258__ estimated size is 473K
2020-03-23 17:01:01 100-0: total estimated size is 8.92G
2020-03-23 17:01:01 100-0: TIME        SENT   SNAPSHOT Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997141__
2020-03-23 17:01:02 100-0: Main-ZFS/vm-100-disk-0    name    Main-ZFS/vm-100-disk-0    -
2020-03-23 17:01:02 100-0: volume 'Main-ZFS/vm-100-disk-0' already exists
2020-03-23 17:01:02 100-0: warning: cannot send 'Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997141__': signal received
2020-03-23 17:01:02 100-0: TIME        SENT   SNAPSHOT Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997258__
2020-03-23 17:01:02 100-0: warning: cannot send 'Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997258__': Broken pipe
2020-03-23 17:01:02 100-0: cannot send 'Main-ZFS/vm-100-disk-0': I/O error
2020-03-23 17:01:02 100-0: command 'zfs send -Rpv -- Main-ZFS/vm-100-disk-0@__replicate_100-0_1584997258__' failed: exit code 1
2020-03-23 17:01:02 100-0: delete previous replication snapshot '__replicate_100-0_1584997258__' on Main_Zpool:vm-100-disk-0
2020-03-23 17:01:02 100-0: end replication job with error: command 'set -o pipefail && pvesm export Main_Zpool:vm-100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1584997258__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host02' root@<remote host IP> -- pvesm import Main_Zpool:vm-100-disk-0 zfs - -with-snapshots 1' failed: exit code 255
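(A side note on the "volume ... already exists" line in that log: it usually means the target node still has an old copy of the dataset without a matching replication snapshot, so the forced full send aborts. A minimal way to check on Host02, using the names from the log; only destroy if you're certain it's just a stale replica:)
Code:
# on Host02: is there an old copy of the disk, and which snapshots does it have?
zfs list -t all -r Main-ZFS/vm-100-disk-0
# if it's only a stale replica left over from earlier runs, remove it so the next sync starts clean
zfs destroy -r Main-ZFS/vm-100-disk-0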
 
I found something interesting on a hunch: when trying to SSH from the problem machine to the other, the SSH host key for the remote host had changed. No idea how that would have happened, so I have some investigating to do. That would explain the broken pipe message.
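(A rough way to confirm and clean that up, using the same options the replication job uses; the known_hosts path and the pvecm step are assumptions about a standard PVE cluster setup:)
Code:
# reproduce the exact connection the replication job makes; a changed host key fails here too
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host02' root@192.168.20.62 -- true
# drop the stale entry for that alias from the cluster-wide known_hosts
ssh-keygen -R Host02 -f /etc/pve/priv/known_hosts
# have Proxmox re-merge the cluster SSH known hosts and certificates
pvecm updatecerts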
 
I found something interesting on a hunch: when trying to SSH from the problem machine to the other, the SSH host key for the remote host had changed. No idea how that would have happened, so I have some investigating to do. That would explain the broken pipe message.

Did you make any headway regarding the "failed: got timeout" errors with your replication snapshots?
 
