Storage replication still failing

Aug 17, 2018
I am running a 3-node cluster with local storage (ZFS) and the latest version of PVE:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-2-pve)
pve-manager: 5.2-7 (running version: 5.2-7/8d88e66a)
pve-kernel-4.15: 5.2-5
pve-kernel-4.15.18-2-pve: 4.15.18-20
pve-kernel-4.15.18-1-pve: 4.15.18-19
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-29
pve-container: 2.0-25
pve-docs: 5.2-8
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
pve-zsync: 1.6-16
qemu-server: 5.0-32
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9


Despite a fix for storage replication [1], I still see the same issue: replication jobs fail sporadically.

[1] https://forum.proxmox.com/threads/storage-replication-constantly-failing.43347/#post-221446

From /var/log/syslog:
Sep 6 11:11:00 pve1 systemd[1]: Starting Proxmox VE replication runner...
Sep 6 11:11:00 pve1 systemd[1]: Started Session 265785 of user root.
Sep 6 11:11:01 pve1 systemd[1]: Started Session 265786 of user root.
Sep 6 11:11:01 pve1 zed: eid=1000089 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 systemd[1]: Started Session 265787 of user root.
Sep 6 11:11:02 pve1 pvesr[3987]: send from @__replicate_100-0_1536225000__ to rpool/data/vm-100-disk-1@__replicate_100-0_1536225060__ estimated size is 1.13M
Sep 6 11:11:02 pve1 pvesr[3987]: total estimated size is 1.13M
Sep 6 11:11:02 pve1 zed: eid=1000090 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: TIME SENT SNAPSHOT
Sep 6 11:11:02 pve1 zed: eid=1000091 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 zed: eid=1000092 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 zed: eid=1000093 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream
Sep 6 11:11:02 pve1 pvesr[3987]: command 'zfs recv -F -- rpool/data/vm-100-disk-1' failed: exit code 1
Sep 6 11:11:02 pve1 pvesr[3987]: exit code 255
Sep 6 11:11:02 pve1 pvesr[3987]: send/receive failed, cleaning up snapshot(s)..
Sep 6 11:11:02 pve1 zed: eid=1000094 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: 100-0: got unexpected replication job error - import failed: exit code 29

Replication job log:
2018-09-06 13:50:00 100-0: start replication job
2018-09-06 13:50:00 100-0: guest => VM 100, running => 2320
2018-09-06 13:50:00 100-0: volumes => local-SSD:vm-100-disk-1
2018-09-06 13:50:01 100-0: freeze guest filesystem
2018-09-06 13:50:01 100-0: create snapshot '__replicate_100-0_1536234600__' on local-SSD:vm-100-disk-1
2018-09-06 13:50:01 100-0: thaw guest filesystem
2018-09-06 13:50:01 100-0: incremental sync 'local-SSD:vm-100-disk-1' (__replicate_100-0_1536234540__ => __replicate_100-0_1536234600__)
2018-09-06 13:50:02 100-0: delete previous replication snapshot '__replicate_100-0_1536234600__' on local-SSD:vm-100-disk-1
2018-09-06 13:50:02 100-0: end replication job with error: import failed: exit code 29

The next attempt always succeeds.

Please let me know if you need more information that I can provide.
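
For anyone lining up the syslog excerpt with the job log: the number embedded in each replication snapshot name appears to be a Unix epoch. For example, 1536234600 is 11:50:00 UTC, which matches the 13:50:00 job log entry if the node runs at UTC+2 (an assumption on my part). A minimal Python sketch, assuming the naming pattern seen in the logs above holds in general; `snapshot_epoch` is a hypothetical helper and not part of pvesr:

```python
import re
from datetime import datetime, timezone

# Assumed pattern, inferred from the log lines above:
#   __replicate_<job-id>_<unix-epoch>__
SNAP_RE = re.compile(r"__replicate_(\d+-\d+)_(\d+)__")


def snapshot_epoch(name):
    """Return (job_id, UTC datetime) for a replication snapshot name, or None."""
    m = SNAP_RE.search(name)
    if m is None:
        return None
    job_id, epoch = m.group(1), int(m.group(2))
    return job_id, datetime.fromtimestamp(epoch, tz=timezone.utc)


if __name__ == "__main__":
    # Snapshot name taken from the syslog excerpt above.
    job, when = snapshot_epoch(
        "rpool/data/vm-100-disk-1@__replicate_100-0_1536225060__"
    )
    print(job, when.isoformat())  # 100-0 2018-09-06T09:11:00+00:00
```

This only decodes names; it does not talk to ZFS or to the replication runner.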
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
Hi,
Are all nodes on the latest PVE version, and have they all been rebooted so that they run the same kernel?
 

wolfgang

Maybe a network error:

Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream

There must be a problem in the sending process.
 
Aug 17, 2018
wolfgang said:
Maybe a network error:

Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream

There must be a problem in the sending process.

I have 2 pools: the first is a mirror of 2 SSDs, the second a mirror of 2 HDDs.
When replication fails, it is always the first VM disk, which is on the SSD pool. Could it be related to that?

Both pools are configured with thin provisioning and the disks have discard enabled. I have now moved all VM disks to the HDD pool to see whether it only happens with SSDs.
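
To back up the "it is always the SSD pool" observation with numbers, the replication job logs can be tallied per storage. A minimal Python sketch, assuming the two line formats seen in the job log earlier in this thread ("volumes => ..." and "end replication job with error"); the comma as separator between multiple volumes is an assumption, and `failures_by_storage` is a hypothetical helper, not a Proxmox tool:

```python
import re
from collections import Counter

# Line formats assumed from the replication job log posted in this thread.
VOLUMES_RE = re.compile(r"(\S+): volumes => (.+)$")
ERROR_RE = re.compile(r"(\S+): end replication job with error")


def failures_by_storage(lines):
    """Count failed replication runs per storage ID (e.g. 'local-SSD')."""
    volumes = {}          # job id -> storage IDs seen in its latest run
    failures = Counter()  # storage ID -> number of failed runs
    for line in lines:
        line = line.rstrip("\n")
        m = VOLUMES_RE.search(line)
        if m:
            job, vols = m.group(1), m.group(2)
            # Assumption: several volumes would be comma-separated.
            volumes[job] = {v.strip().split(":", 1)[0] for v in vols.split(",")}
            continue
        m = ERROR_RE.search(line)
        if m:
            for storage in volumes.get(m.group(1), ()):
                failures[storage] += 1
    return failures


if __name__ == "__main__":
    sample = [
        "2018-09-06 13:50:00 100-0: volumes => local-SSD:vm-100-disk-1",
        "2018-09-06 13:50:02 100-0: end replication job with error: import failed: exit code 29",
    ]
    print(dict(failures_by_storage(sample)))
```

A job that replicates volumes from several storages is counted once per storage, so this only narrows things down when each job uses a single pool.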
 
Aug 17, 2018
After letting replication run only on the HDD pool for a while, I am very sure the SSD pool is causing the problem.
I compared the parameters of both pools but still have no idea what the cause is.

Could you please point me in the right direction: what is special about ZFS on pure SSD?
 

wolfgang

Could you please point me in the right direction: what is special about ZFS on pure SSD?
Nothing. SSDs and HDDs are handled the same way.
SSDs are just faster.
 
Aug 17, 2018
There seems to be an issue with the QEMU guest agent: replication fails repeatedly for some VMs when the agent is enabled and running.

I tried to disable the freeze commands in /etc/qemu/qemu-ga.conf:

Code:
[general]
blacklist=guest-fsfreeze-freeze,guest-fsfreeze-thaw
and I can see in the logs that only guest-ping is answered now, but the behavior is still the same.
Only disabling/stopping the agent helps to make replication stable.

I think this is not a Proxmox-related issue but a result of combining ZFS + SSDs + replication + guest agent.
ZFS snapshots already take care of a consistent state, and the guest agent also tries to freeze the VM (quiesce).
Probably the snapshot is too fast on SSD and is already done after the agent responds; it must be something like that.
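
To rule out a typo in the agent config, the blacklist entry can be checked mechanically. A minimal Python sketch, assuming `/etc/qemu/qemu-ga.conf` is simple enough to be read by Python's `configparser` (the agent itself uses GLib's key-file parser, so this is an approximation); `blacklisted_commands` is a hypothetical helper:

```python
import configparser

# The two commands blacklisted in the config snippet above.
FREEZE_COMMANDS = {"guest-fsfreeze-freeze", "guest-fsfreeze-thaw"}


def blacklisted_commands(conf_text):
    """Return the set of commands listed under [general] blacklist=..."""
    # interpolation=None so a literal '%' in the file cannot raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("general", "blacklist", fallback="")
    return {cmd.strip() for cmd in raw.split(",") if cmd.strip()}


if __name__ == "__main__":
    sample = "[general]\nblacklist=guest-fsfreeze-freeze,guest-fsfreeze-thaw\n"
    missing = FREEZE_COMMANDS - blacklisted_commands(sample)
    print("freeze commands not blacklisted:", sorted(missing))
```

This only checks the file contents. It says nothing about whether the running agent was restarted and actually picked up the change.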
 

wolfgang

That's interesting. Which version of qemu-guest-agent do you use?
Also, which OS do you run as the guest?
 
Aug 17, 2018
The guest is Ubuntu 18.04 with the latest updates and kernel.
It has been happening since I deployed the VMs about 4 months ago, so it has occurred across several kernels and 3 guest agent versions.

apt changelog qemu-guest-agent
Code:
qemu (1:2.11+dfsg-1ubuntu7.6) bionic; urgency=medium

  [ Christian Ehrhardt ]
  * Add cpu model for z14 ZR1 (LP: #1780773)
  * d/p/ubuntu/lp-1789551-seccomp-set-the-seccomp-filter-to-all-threads.patch:
    ensure that the seccomp blacklist is applied to all threads (LP: #1789551)
    - CVE-2018-15746
  * improve s390x spectre mitigation with etoken facility (LP: #1790457)
    - debian/patches/ubuntu/lp-1790457-s390x-kvm-add-etoken-facility.patch
    - debian/patches/ubuntu/lp-1790457-partial-s390x-linux-headers-update.patch

  [ Phillip Susi ]
  * d/p/ubuntu/lp-1787267-fix-en_us-vnc-pipe.patch: Fix pipe, greater than and
    less than keys over vnc when using en_us kemaps (LP: #1787267).

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 29 Aug 2018 11:46:37 +0200

qemu (1:2.11+dfsg-1ubuntu7.5) bionic; urgency=medium

  [Christian Ehrhardt]
  * d/p/lp-1755912-qxl-fix-local-renderer-crash.patch: Fix an issue triggered
    by migrations with UI frontends or frequent guest resolution changes
    (LP: #1755912)

  [ Murilo Opsfelder Araujo ]
  * d/p/ubuntu/target-ppc-extend-eieio-for-POWER9.patch: Backport to
    extend eieio for POWER9 emulation (LP: #1787408).

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Tue, 21 Aug 2018 11:25:45 +0200

qemu (1:2.11+dfsg-1ubuntu7.4) bionic; urgency=medium

  * d/p/ubuntu/machine-type-hpb.patch: add -hpb machine type
    for host-phys-bits=true (LP: #1776189)
    - add an info about this change in debian/qemu-system-x86.NEWS

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 13 Jun 2018 10:41:34 +0200
 
