Dear all,
I'm running a three-node cluster with replicated ZFS datasets between all machines. Most VMs replicate every 15 minutes, some every two hours (a CLI sketch of one such job follows the log below). The machines are connected with a dedicated 10Gbit link for storage and a separate, dedicated 1Gbit link for cluster traffic. All three nodes run Proxmox 7.1 (see pveversion -v below for more details).
Since updating from Proxmox 7.0 to 7.1 some, but not all, replications fail with a distinct error message:
Code:
2021-12-02 09:26:01 108-1: start replication job
2021-12-02 09:26:01 108-1: guest => VM 108, running => 0
2021-12-02 09:26:01 108-1: volumes => preontank:vm-108-disk-0,preontank:vm-108-disk-1,preontank:vm-108-disk-2,preontank:vm-108-disk-3,preontank:vm-108-disk-4
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:05 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:05 108-1: using secure transmission, rate limit: none
2021-12-02 09:26:05 108-1: full sync 'preontank:vm-108-disk-0' (__replicate_108-1_1638433561__)
2021-12-02 09:26:06 108-1: full send of preontank/vm-108-disk-0@__replicate_108-1_1638433561__ estimated size is 16.0G
2021-12-02 09:26:06 108-1: total estimated size is 16.0G
2021-12-02 09:26:06 108-1: volume 'preontank/vm-108-disk-0' already exists
2021-12-02 09:26:06 108-1: warning: cannot send 'preontank/vm-108-disk-0@__replicate_108-1_1638433561__': Broken pipe
2021-12-02 09:26:06 108-1: cannot send 'preontank/vm-108-disk-0': I/O error
2021-12-02 09:26:06 108-1: command 'zfs send -Rpv -- preontank/vm-108-disk-0@__replicate_108-1_1638433561__' failed: exit code 1
2021-12-02 09:26:06 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:07 108-1: end replication job with error: command 'set -o pipefail && pvesm export preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=palicki' root@10.99.2.2 -- pvesm import preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ -allow-rename 0' failed: exit code 255
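For reference, the CLI equivalent of one of these replication jobs looks roughly like this (a sketch only; the job ID and target node name are taken from the log above, and the jobs may just as well be created via the GUI):
Code:
# CLI equivalent of the failing 15-minute job (ID 108-1, target node palicki)
pvesr create-local-job 108-1 palicki --schedule '*/15'

# the two-hourly jobs use a calendar-event schedule along the lines of
#   pvesr create-local-job <vmid>-<n> <targetnode> --schedule '*/2:00'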
Deleting the replication job, removing any leftover snapshots and disks on the replication target, and then re-creating the replication job does not help.
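On the target, that cleanup amounts to something like the following (a sketch; the destroy is obviously destructive and only makes sense when the copy on the target is disposable anyway):
Code:
# on the replication target: look for leftover datasets/snapshots of VM 108
zfs list -t all -r preontank | grep vm-108

# if stale leftovers show up and the target copy is not needed, remove them
# so the next full sync can recreate the volume (double-check before running!)
zfs destroy -r preontank/vm-108-disk-0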
This already started happening with an earlier 7.1 version, but I don't have a record of the exact package versions. Updating to the latest packages unfortunately did not help.
All ZFS pools are reported healthy. Here's a list of installed package versions and some status output:
pveversion -v:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-6
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
pvecm status:
Code:
Cluster information
-------------------
Name: PHAC
Config Version: 5
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Dec 2 09:41:17 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1.18f
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.99.1.3
0x00000002 1 10.99.1.2
0x00000003 1 10.99.1.1 (local)
pvesm status:
Code:
Name            Type     Status           Total            Used       Available        %
backup           nfs     active     19331526656     14019972096      5311554560   72.52%
local            dir     active       451089920        12896640       438193280    2.86%
media            nfs     active      5361519616        49965056      5311554560    0.93%
preontank    zfspool     active      7650410496      3345095436      4305315060   43.72%
Edit: Just double-checked again, all three nodes produce the exact same output for pveversion -v.
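Compared along these lines (a quick sketch using the cluster addresses from the pvecm output above; identical checksums mean identical package lists):
Code:
# run pveversion -v on every node and compare checksums of the output
for node in 10.99.1.1 10.99.1.2 10.99.1.3; do
    ssh root@$node pveversion -v | md5sum
done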
If required, I can provide more logs or outputs. Thank you for any help or insights!