end replication job with error: command [ssh] failed: exit code 255

lubbdi

Renowned Member
Dear all,

I'm running a three-node cluster with replicated ZFS datasets between all machines. Most VMs replicate every 15 minutes, some every two hours. The machines are connected via a dedicated 10 Gbit link for storage and a separate, dedicated 1 Gbit link for cluster traffic. All three nodes run Proxmox 7.1 (see pveversion -v below for more details).

Since updating from Proxmox 7.0 to 7.1, some, but not all, replications fail with a distinct error message:
Code:
2021-12-02 09:26:01 108-1: start replication job
2021-12-02 09:26:01 108-1: guest => VM 108, running => 0
2021-12-02 09:26:01 108-1: volumes => preontank:vm-108-disk-0,preontank:vm-108-disk-1,preontank:vm-108-disk-2,preontank:vm-108-disk-3,preontank:vm-108-disk-4
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:05 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:05 108-1: using secure transmission, rate limit: none
2021-12-02 09:26:05 108-1: full sync 'preontank:vm-108-disk-0' (__replicate_108-1_1638433561__)
2021-12-02 09:26:06 108-1: full send of preontank/vm-108-disk-0@__replicate_108-1_1638433561__ estimated size is 16.0G
2021-12-02 09:26:06 108-1: total estimated size is 16.0G
2021-12-02 09:26:06 108-1: volume 'preontank/vm-108-disk-0' already exists
2021-12-02 09:26:06 108-1: warning: cannot send 'preontank/vm-108-disk-0@__replicate_108-1_1638433561__': Broken pipe
2021-12-02 09:26:06 108-1: cannot send 'preontank/vm-108-disk-0': I/O error
2021-12-02 09:26:06 108-1: command 'zfs send -Rpv -- preontank/vm-108-disk-0@__replicate_108-1_1638433561__' failed: exit code 1
2021-12-02 09:26:06 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:07 108-1: end replication job with error: command 'set -o pipefail && pvesm export preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=palicki' root@10.99.2.2 -- pvesm import preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ -allow-rename 0' failed: exit code 255

Deleting the replication job, removing any leftover snapshots and disks on the replication target, and then re-creating the replication job does not help.
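For reference, the cleanup on the replication target looked roughly like this (a sketch only; the dataset, job ID, node name and schedule are taken from the log and description above and will differ per VM, and the zfs destroy is destructive, so double-check the names first):
Code:
# on the replication target node: look for stale replicas and snapshots
zfs list -t all -r preontank | grep vm-108

# remove the stale replica including its snapshots (destructive!)
zfs destroy -r preontank/vm-108-disk-0

# on the source node: remove and re-create the replication job
pvesr delete 108-1
pvesr create-local-job 108-1 palicki --schedule '*/15'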

This already started happening with an earlier 7.1 version, but I don't have a record of the exact package versions. An update to the latest packages unfortunately did not help.

All ZFS pools are reported healthy. Here's a list of installed package versions and some status output:

pveversion -v:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-6
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

pvecm status:
Code:
Cluster information
-------------------
Name:             PHAC
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec  2 09:41:17 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.18f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.1.3
0x00000002          1 10.99.1.2
0x00000003          1 10.99.1.1 (local)

pvesm status:
Code:
Name             Type     Status           Total            Used       Available        %
backup            nfs     active     19331526656     14019972096      5311554560   72.52%
local             dir     active       451089920        12896640       438193280    2.86%
media             nfs     active      5361519616        49965056      5311554560    0.93%
preontank     zfspool     active      7650410496      3345095436      4305315060   43.72%
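
For completeness, the pool-health check mentioned above was done along these lines on each node (standard zpool commands; output omitted, nothing was reported as faulted):
Code:
zpool status -x    # prints "all pools are healthy" when nothing is wrong
zpool list -o name,size,alloc,free,health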

Edit: Just double-checked again; all three nodes produce the exact same output for pveversion -v.

If required, I can provide more logs or output. Thank you for any help or insights!
 
Hi Moayad,

thank you for your response! Please excuse my half-baked report; I found out that email notification had already been disabled for pvesr.service previously, based on this forum thread. This --mail 0 setting no longer seems to be respected since the switch from pvesr.service to pvescheduler.
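
For context, the old workaround was a systemd drop-in override for pvesr.service roughly like the following (a sketch from memory; the exact ExecStart line should be copied from the original unit on your system):
Code:
# /etc/systemd/system/pvesr.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/pvesr run --mail 0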

Can you please give a hint about how to disable these email notifications for pvescheduler?
 
Code:
2023-12-26 17:19:09 192-0: cannot receive: local origin for clone rpool/data/vm-192-disk-1@S2023_12_18_10_54 does not exist
2023-12-26 17:19:09 192-0: cannot open 'rpool/data/vm-192-disk-1': dataset does not exist
2023-12-26 17:19:09 192-0: command 'zfs recv -F -- rpool/data/vm-192-disk-1' failed: exit code 1
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@S2023_12_18_10_54': signal received
2023-12-26 17:19:09 192-0: TIME SENT SNAPSHOT rpool/data/vm-192-disk-1@S_2023_12_23_11_58
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@S_2023_12_23_11_58': Broken pipe
2023-12-26 17:19:09 192-0: TIME SENT SNAPSHOT rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__': Broken pipe
2023-12-26 17:19:09 192-0: cannot send 'rpool/data/vm-192-disk-1': I/O error
2023-12-26 17:19:09 192-0: command 'zfs send -Rpv -- rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__' failed: exit code 1
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:base-9002-disk-0/vm-192-disk-1
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:base-9002-disk-1/vm-192-disk-0
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-disk-2
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-state-S2023_12_18_10_54
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-state-S_2023_12_23_11_58
2023-12-26 17:19:09 192-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:base-9002-disk-0/vm-192-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_192-0_1703571544__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=dsco-vm-01' root@192.168.3.61 -- pvesm import local-zfs:base-9002-disk-0/vm-192-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_192-0_1703571544__ -allow-rename 0' failed: exit code 1


Any suggestions on what I can do?
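
For what it's worth, the first error above says the local origin for the clone does not exist, i.e. vm-192-disk-1 appears to be a linked clone of a base image (base-9002) whose origin snapshot is missing on the receiving side. A first check could be something like this (standard zfs commands; dataset names taken from the log above):
Code:
# on the source node: which snapshot is this clone based on?
zfs get -H -o value origin rpool/data/vm-192-disk-1

# on the target node: is that base dataset/snapshot present?
zfs list -t all -r rpool/data | grep base-9002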