end replication job with error: command [ssh] failed: exit code 255

lubbdi
Dear all,

I'm running a three-node cluster with replicated ZFS datasets between all machines. Most VMs replicate every 15 minutes, some every two hours. The nodes are connected via a dedicated 10Gbit network for storage and a separate, dedicated 1Gbit network for cluster traffic. All three nodes run Proxmox 7.1 (see pveversion -v below for more details).
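For reference, the replication jobs and their schedules can be listed on each node with the standard pvesr tooling (shown here only for completeness, nothing custom on my side):
Code:
# list the replication jobs configured on this node
pvesr list

# show per-job status (last sync, next scheduled sync, duration, fail count)
pvesr status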

Since updating from Proxmox 7.0 to 7.1, some, but not all, replications fail with the following error message:
Code:
2021-12-02 09:26:01 108-1: start replication job
2021-12-02 09:26:01 108-1: guest => VM 108, running => 0
2021-12-02 09:26:01 108-1: volumes => preontank:vm-108-disk-0,preontank:vm-108-disk-1,preontank:vm-108-disk-2,preontank:vm-108-disk-3,preontank:vm-108-disk-4
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:04 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:05 108-1: create snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:05 108-1: using secure transmission, rate limit: none
2021-12-02 09:26:05 108-1: full sync 'preontank:vm-108-disk-0' (__replicate_108-1_1638433561__)
2021-12-02 09:26:06 108-1: full send of preontank/vm-108-disk-0@__replicate_108-1_1638433561__ estimated size is 16.0G
2021-12-02 09:26:06 108-1: total estimated size is 16.0G
2021-12-02 09:26:06 108-1: volume 'preontank/vm-108-disk-0' already exists
2021-12-02 09:26:06 108-1: warning: cannot send 'preontank/vm-108-disk-0@__replicate_108-1_1638433561__': Broken pipe
2021-12-02 09:26:06 108-1: cannot send 'preontank/vm-108-disk-0': I/O error
2021-12-02 09:26:06 108-1: command 'zfs send -Rpv -- preontank/vm-108-disk-0@__replicate_108-1_1638433561__' failed: exit code 1
2021-12-02 09:26:06 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-0
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-1
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-2
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-3
2021-12-02 09:26:07 108-1: delete previous replication snapshot '__replicate_108-1_1638433561__' on preontank:vm-108-disk-4
2021-12-02 09:26:07 108-1: end replication job with error: command 'set -o pipefail && pvesm export preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=palicki' root@10.99.2.2 -- pvesm import preontank:vm-108-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_108-1_1638433561__ -allow-rename 0' failed: exit code 255

Deleting the replication job, removing any leftover snapshots and disks on the replication target, and then re-creating the replication job does not help.
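For what it's worth, the exit code 255 comes from the ssh side of the pipe, and the 'volume ... already exists' line points at a leftover dataset on the target. This is roughly what I ran to check and clean up the target before re-creating the job, plus a plain ssh test from the source (node and dataset names taken from the log above; double-check before destroying anything):
Code:
# On the replication target (palicki / 10.99.2.2):
# look for leftover datasets or replication snapshots of this VM
zfs list -t all -r preontank | grep vm-108

# remove a stale replica dataset so a new full sync can be received
# (destructive -- make sure this really is only the stale replica)
zfs destroy -r preontank/vm-108-disk-0

# On the source node: run the same ssh command the replication job uses,
# non-interactively; a failing ssh connection (keys, host key alias) is the
# usual reason for exit code 255
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=palicki' root@10.99.2.2 -- /bin/true
echo $?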

This already started happening with an earlier 7.1 version, but I don't have a record of the exact package versions. Unfortunately, updating to the latest packages has not helped so far.

All ZFS pools report as healthy. Here is a list of the installed package versions and some status output:

pveversion -v:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-6
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

pvecm status:
Code:
Cluster information
-------------------
Name:             PHAC
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec  2 09:41:17 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.18f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.1.3
0x00000002          1 10.99.1.2
0x00000003          1 10.99.1.1 (local)

pvesm status:
Code:
Name             Type     Status           Total            Used       Available        %
backup            nfs     active     19331526656     14019972096      5311554560   72.52%
local             dir     active       451089920        12896640       438193280    2.86%
media             nfs     active      5361519616        49965056      5311554560    0.93%
preontank     zfspool     active      7650410496      3345095436      4305315060   43.72%

Edit: Just double-checked again; all three nodes produce exactly the same output for pveversion -v.

If required, I can provide more logs or output. Thank you for any help or insights!
 
Hi Moayad,

Thank you for your response! Please excuse my half-baked report: I found out that email notification had already been disabled for pvesr.service previously, based on this forum thread. That --mail 0 setting no longer seems to be respected since the switch from pvesr.service to pvescheduler.

Can you please give a hint about how to disable these email notifications for pvescheduler?
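For context, the workaround from that older thread was (as far as I recall) a systemd drop-in that made the periodic replication run pass --mail 0; with pvescheduler this override obviously has no effect anymore:
Code:
# OLD workaround, only valid while pvesr.service ran the replication jobs:
# override the unit so the periodic run is started with --mail 0
mkdir -p /etc/systemd/system/pvesr.service.d
cat > /etc/systemd/system/pvesr.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/pvesr run --mail 0
EOF
systemctl daemon-reload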
 
2023-12-26 17:19:09 192-0: cannot receive: local origin for clone rpool/data/vm-192-disk-1@S2023_12_18_10_54 does not exist
2023-12-26 17:19:09 192-0: cannot open 'rpool/data/vm-192-disk-1': dataset does not exist
2023-12-26 17:19:09 192-0: command 'zfs recv -F -- rpool/data/vm-192-disk-1' failed: exit code 1
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@S2023_12_18_10_54': signal received
2023-12-26 17:19:09 192-0: TIME SENT SNAPSHOT rpool/data/vm-192-disk-1@S_2023_12_23_11_58
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@S_2023_12_23_11_58': Broken pipe
2023-12-26 17:19:09 192-0: TIME SENT SNAPSHOT rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__
2023-12-26 17:19:09 192-0: warning: cannot send 'rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__': Broken pipe
2023-12-26 17:19:09 192-0: cannot send 'rpool/data/vm-192-disk-1': I/O error
2023-12-26 17:19:09 192-0: command 'zfs send -Rpv -- rpool/data/vm-192-disk-1@__replicate_192-0_1703571544__' failed: exit code 1
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:base-9002-disk-0/vm-192-disk-1
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:base-9002-disk-1/vm-192-disk-0
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-disk-2
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-state-S2023_12_18_10_54
2023-12-26 17:19:09 192-0: delete previous replication snapshot '__replicate_192-0_1703571544__' on local-zfs:vm-192-state-S_2023_12_23_11_58
2023-12-26 17:19:09 192-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:base-9002-disk-0/vm-192-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_192-0_1703571544__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=dsco-vm-01' root@192.168.3.61 -- pvesm import local-zfs:base-9002-disk-0/vm-192-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_192-0_1703571544__ -allow-rename 0' failed: exit code 1
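The 'local origin for clone ... does not exist' line suggests that the base image this disk was linked-cloned from (base-9002-disk-0) is missing its origin snapshot on the target, so the clone cannot be received there. A quick way to compare both sides (dataset names taken from the log above, adjust to your pool layout):
Code:
# On the source node: show which base snapshot this disk is cloned from
zfs get origin rpool/data/vm-192-disk-1

# On the source AND on the target (dsco-vm-01): list the base image and its
# snapshots -- the origin snapshot shown above must also exist on the target
zfs list -t all -r rpool/data/base-9002-disk-0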


Any suggestions on what I can do?
 
