ZFS Replication failure timeout

tubatodd

Member
Feb 3, 2020
I'm seeing quite a few ZFS replication failures and I'm not sure how to diagnose the root cause.

I'm replicating one VM to a second Proxmox host, and the second host replicates one VM back to the first. Each VM has two virtual disks with iothread=1; I've seen the same failures with iothread disabled. Replication is scheduled every 5 minutes.

What are some things to look at?


Code:
Feb 05 11:40:09 vmhost3.dev20.example.com pvesr[206905]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580920200__' failed: got timeout
Feb 05 11:41:59 vmhost3.dev20.example.com pvesr[206905]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580919600__' failed: got timeout
Feb 05 11:45:09 vmhost3.dev20.example.com pvesr[270130]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580921100__' failed: got timeout
Feb 05 11:50:07 vmhost3.dev20.example.com pvesr[331309]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580921100__' failed: got timeout
Feb 05 11:51:35 vmhost3.dev20.example.com pvesr[331309]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580920800__' failed: got timeout
Feb 05 11:51:40 vmhost3.dev20.example.com pvesr[331309]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580920800__' failed: got timeout
Feb 05 11:56:17 vmhost3.dev20.example.com pvesr[392435]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580921400__' failed: got timeout
Feb 05 11:56:25 vmhost3.dev20.example.com pvesr[392435]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580921400__' failed: got timeout
Feb 05 12:05:07 vmhost3.dev20.example.com pvesr[518555]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580922300__' failed: got timeout
Feb 05 12:10:10 vmhost3.dev20.example.com pvesr[576697]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580922300__' failed: got timeout
Feb 05 12:10:16 vmhost3.dev20.example.com pvesr[576697]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580922600__' failed: got timeout
Feb 05 12:20:06 vmhost3.dev20.example.com pvesr[690055]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580922600__' failed: got timeout
Feb 05 12:21:50 vmhost3.dev20.example.com pvesr[690055]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580922000__' failed: got timeout
Feb 05 12:22:00 vmhost3.dev20.example.com pvesr[690055]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580922000__' failed: got timeout
Feb 05 12:25:09 vmhost3.dev20.example.com pvesr[768434]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580923500__' failed: got timeout
Feb 05 12:30:09 vmhost3.dev20.example.com pvesr[818339]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580923800__' failed: got timeout
Feb 05 12:40:06 vmhost3.dev20.example.com pvesr[944399]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580923800__' failed: got timeout
Feb 05 12:41:51 vmhost3.dev20.example.com pvesr[944399]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580923200__' failed: got timeout
Feb 05 12:41:58 vmhost3.dev20.example.com pvesr[944399]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580923200__' failed: got timeout
Feb 05 12:45:08 vmhost3.dev20.example.com pvesr[1021813]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580924700__' failed: got timeout
Feb 05 12:55:21 vmhost3.dev20.example.com pvesr[1146265]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580925300__' failed: got timeout
Feb 05 12:55:21 vmhost3.dev20.example.com pvesr[1146265]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-1@__replicate_104-0_1580925300__' failed: got timeout
Feb 05 13:00:10 vmhost3.dev20.example.com pvesr[1208270]: 104-0: got unexpected replication job error - command 'zfs snapshot zfs-dev20/vm-104-disk-0@__replicate_104-0_1580925600__' failed: got timeout
Feb 05 13:16:09 vmhost3.dev20.example.com pvesr[1402147]: command 'zfs destroy zfs-dev20/vm-104-disk-0@__replicate_104-0_1580926200__' failed: got timeout
Feb 05 13:16:16 vmhost3.dev20.example.com pvesr[1402147]: command 'zfs destroy zfs-dev20/vm-104-disk-1@__replicate_104-0_1580926200__' failed: got timeout
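
One way to get an overview of a log like this is to tally the timeouts per operation and disk. The sketch below is only an illustration: the function names `parse_timeouts` and `tally` are made up here, the regular expression is written against the exact line format shown above, and it assumes the number at the end of each snapshot name is a Unix timestamp of when the snapshot was scheduled.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches the pvesr timeout lines in the format quoted above.
# Assumption: the trailing number in the snapshot name is a Unix timestamp.
LINE_RE = re.compile(
    r"command 'zfs (?P<op>\w+) (?P<dataset>[^@']+)"
    r"@__replicate_(?P<job>\d+-\d+)_(?P<epoch>\d+)__' failed: got timeout"
)


def parse_timeouts(lines):
    """Return (operation, dataset, snapshot time in UTC) for each timeout line."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            taken = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
            events.append((match["op"], match["dataset"], taken))
    return events


def tally(events):
    """Count timeouts per (operation, dataset) pair."""
    return Counter((op, dataset) for op, dataset, _ in events)


# Two lines copied from the log above.
sample = [
    "Feb 05 11:40:09 vmhost3.dev20.example.com pvesr[206905]: command 'zfs destroy "
    "zfs-dev20/vm-104-disk-1@__replicate_104-0_1580920200__' failed: got timeout",
    "Feb 05 11:45:09 vmhost3.dev20.example.com pvesr[270130]: 104-0: got unexpected "
    "replication job error - command 'zfs snapshot "
    "zfs-dev20/vm-104-disk-0@__replicate_104-0_1580921100__' failed: got timeout",
]

for (op, dataset), count in tally(parse_timeouts(sample)).items():
    print(f"{op:8s} {dataset}: {count}")
```

Feeding it the full journal output instead of `sample` would show whether the timeouts cluster on one disk or on one operation (`snapshot` versus `destroy`).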
 
Hi,

Your replication jobs are running into timeouts.
The most common reasons are:
1.) the ZFS pool is under heavy load
2.) the other node does not answer (SSH does not work)
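
Both causes can be checked from the shell on the affected node. The following is a minimal sketch, not an official procedure: `build_checks` and `run_checks` are hypothetical helper names, the pool name `zfs-dev20` is taken from the log above, and the peer hostname is a placeholder because the other node is not named in this thread. It also assumes replication connects to the peer as root over SSH.

```python
import shlex
import subprocess


def build_checks(pool, peer):
    """Return (description, argv) pairs covering the two suspected causes."""
    return [
        # Cause 1: pool health and load.
        ("pool health", ["zpool", "status", "-v", pool]),
        ("pool load, 3 samples at 5 s", ["zpool", "iostat", "-v", pool, "5", "3"]),
        # Cause 2: non-interactive SSH to the other node (assumed root login).
        ("ssh to peer", ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
                         f"root@{peer}", "true"]),
        # State of the replication jobs as seen by Proxmox.
        ("replication jobs", ["pvesr", "status"]),
    ]


def run_checks(checks, timeout=60):
    """Run each check and print its output; a failing check does not stop the rest."""
    for description, argv in checks:
        print(f"== {description}: {shlex.join(argv)}")
        try:
            result = subprocess.run(argv, capture_output=True, text=True,
                                    timeout=timeout)
            print(result.stdout or result.stderr)
        except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
            print(f"check failed: {exc}")


# Example call (placeholder peer name, run as root on the Proxmox node):
# run_checks(build_checks("zfs-dev20", "vmhost-peer.example.com"))
```

If the `zpool iostat` samples show the pool saturated around the times in the log, cause 1 is the likelier one; if the SSH check hangs or asks for a password, look at cause 2 first.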
 
