Storage replication still failing

sadai

I am running a 3-node cluster with local storage (ZFS) and the latest version of PVE:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-2-pve)
pve-manager: 5.2-7 (running version: 5.2-7/8d88e66a)
pve-kernel-4.15: 5.2-5
pve-kernel-4.15.18-2-pve: 4.15.18-20
pve-kernel-4.15.18-1-pve: 4.15.18-19
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-29
pve-container: 2.0-25
pve-docs: 5.2-8
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
pve-zsync: 1.6-16
qemu-server: 5.0-32
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9


Despite a fix for storage replication [1], I still experience the same issue: replication jobs are sporadically failing.

[1] https://forum.proxmox.com/threads/storage-replication-constantly-failing.43347/#post-221446

From /var/log/syslog:

Code:
Sep 6 11:11:00 pve1 systemd[1]: Starting Proxmox VE replication runner...
Sep 6 11:11:00 pve1 systemd[1]: Started Session 265785 of user root.
Sep 6 11:11:01 pve1 systemd[1]: Started Session 265786 of user root.
Sep 6 11:11:01 pve1 zed: eid=1000089 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 systemd[1]: Started Session 265787 of user root.
Sep 6 11:11:02 pve1 pvesr[3987]: send from @__replicate_100-0_1536225000__ to rpool/data/vm-100-disk-1@__replicate_100-0_1536225060__ estimated size is 1.13M
Sep 6 11:11:02 pve1 pvesr[3987]: total estimated size is 1.13M
Sep 6 11:11:02 pve1 zed: eid=1000090 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: TIME SENT SNAPSHOT
Sep 6 11:11:02 pve1 zed: eid=1000091 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 zed: eid=1000092 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 zed: eid=1000093 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream
Sep 6 11:11:02 pve1 pvesr[3987]: command 'zfs recv -F -- rpool/data/vm-100-disk-1' failed: exit code 1
Sep 6 11:11:02 pve1 pvesr[3987]: exit code 255
Sep 6 11:11:02 pve1 pvesr[3987]: send/receive failed, cleaning up snapshot(s)..
Sep 6 11:11:02 pve1 zed: eid=1000094 class=history_event pool_guid=0x47C9A96EEDAF31FB
Sep 6 11:11:02 pve1 pvesr[3987]: 100-0: got unexpected replication job error - import failed: exit code 29

Replication job log:

Code:
2018-09-06 13:50:00 100-0: start replication job
2018-09-06 13:50:00 100-0: guest => VM 100, running => 2320
2018-09-06 13:50:00 100-0: volumes => local-SSD:vm-100-disk-1
2018-09-06 13:50:01 100-0: freeze guest filesystem
2018-09-06 13:50:01 100-0: create snapshot '__replicate_100-0_1536234600__' on local-SSD:vm-100-disk-1
2018-09-06 13:50:01 100-0: thaw guest filesystem
2018-09-06 13:50:01 100-0: incremental sync 'local-SSD:vm-100-disk-1' (__replicate_100-0_1536234540__ => __replicate_100-0_1536234600__)
2018-09-06 13:50:02 100-0: delete previous replication snapshot '__replicate_100-0_1536234600__' on local-SSD:vm-100-disk-1
2018-09-06 13:50:02 100-0: end replication job with error: import failed: exit code 29

The next attempt always succeeds, though.
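
If it helps, the incremental send/receive that the replication job runs can also be reproduced by hand, roughly like this (pve2 is a placeholder for the actual receiving node):

Code:
# Run on the sending node (pve1); snapshot names taken from the failing job above.
# pve2 is a placeholder for the receiving node's hostname.
zfs send -v -i rpool/data/vm-100-disk-1@__replicate_100-0_1536225000__ \
    rpool/data/vm-100-disk-1@__replicate_100-0_1536225060__ \
    | ssh root@pve2 zfs recv -F -- rpool/data/vm-100-disk-1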

Please let me know if there is any more information I can provide.
 
Hi,
Are all nodes on the latest PVE version, and have they all been rebooted so they run the same kernel?
 
Maybe a network error?

Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream


There must be a problem in the sending process.
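
It may also be worth checking whether the sending pool itself reports errors, for example:

Code:
# On the sending node: look for read/write/checksum errors and optionally scrub the pool
zpool status -v rpool
zpool scrub rpool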
 
I have 2 pools: the first is a mirror of 2 SSDs, the second a mirror of 2 HDDs.
I see that when replication fails, it is always the first VM disk, which is on the SSD pool. Could it be related to that?

Both pools are configured with thin provisioning and the disks have discard enabled. I have now moved all VM disks to the HDD pool to see if it only happens with SSDs.
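
For reference, moving a disk to another storage can be done with something like this (scsi0 and local-HDD are placeholders for the actual disk and storage ID):

Code:
# Move the disk of VM 100 to the HDD-backed storage and delete the old copy
qm move_disk 100 scsi0 local-HDD --delete 1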
 
After letting the replication run only on the HDD pool for a while, I am very sure the SSD pool is causing the problem.
I checked and compared the parameters of both pools, but I have no idea what the cause is.
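
For anyone wanting to do the same comparison, something like this works (hdd-pool stands in for the actual HDD pool name):

Code:
# Dump all pool and dataset properties of both pools, then diff them
zpool get -H -o property,value all rpool    > /tmp/ssd-pool.txt
zpool get -H -o property,value all hdd-pool > /tmp/hdd-pool.txt
diff /tmp/ssd-pool.txt /tmp/hdd-pool.txt

zfs get -H -o property,value all rpool/data    > /tmp/ssd-data.txt
zfs get -H -o property,value all hdd-pool/data > /tmp/hdd-data.txt
diff /tmp/ssd-data.txt /tmp/hdd-data.txt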

Could you please point me in the right direction: what is special about ZFS on pure SSDs?
 
Could you please point me in the right direction: what is special about ZFS on pure SSDs?
Nothing, SSDs and HDDs are handled the same way.
SSDs are only faster.
 
There seems to be an issue with the QEMU guest agent: replication fails many times for some VMs if the agent is enabled and running.

I tried to disable the freeze commands in /etc/qemu/qemu-ga.conf:

Code:
[general]
blacklist=guest-fsfreeze-freeze,guest-fsfreeze-thaw
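
After editing the file, the agent has to be restarted inside the guest so it picks up the new blacklist:

Code:
# Inside the Ubuntu guest
systemctl restart qemu-guest-agent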

In the logs I can see the difference (only the guest-ping is answered now), but the behavior is still the same.
Only disabling/stopping the agent makes replication stable.
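
For anyone trying the same, disabling/stopping the agent can be done like this (using VM 100 as the example):

Code:
# On the Proxmox host: turn off the guest agent option for the VM
qm set 100 --agent 0

# Inside the Ubuntu guest: stop and disable the agent service
systemctl stop qemu-guest-agent
systemctl disable qemu-guest-agent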

I think this is not a Proxmox-related issue but a combination of ZFS + SSDs + replication + guest agent.
ZFS snapshots already take care of a consistent state, and the guest agent additionally tries to freeze (quiesce) the guest filesystems.
Probably the snapshot is so fast on the SSD that it is already done by the time the agent responds; it must be something like that.
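
One way to check whether the freeze/thaw round trip itself misbehaves would be to trigger it manually through the agent and watch the timing, for example:

Code:
# On the Proxmox host, against the running VM 100 (qm agent subcommand on PVE 5.x)
qm agent 100 fsfreeze-freeze    # freeze the guest filesystems via the agent
qm agent 100 fsfreeze-status    # should report "frozen"
qm agent 100 fsfreeze-thaw      # thaw again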
 
That's interesting. What version of the qemu-guest-agent do you use?
Also, what OS do you use as the guest?
 
The guest is Ubuntu 18.04 with the latest updates/kernel.
It has been happening since I deployed the VMs about 4 months ago, so they have been running several kernels and 3 guest agent versions.

apt changelog qemu-guest-agent
Code:
qemu (1:2.11+dfsg-1ubuntu7.6) bionic; urgency=medium

  [ Christian Ehrhardt ]
  * Add cpu model for z14 ZR1 (LP: #1780773)
  * d/p/ubuntu/lp-1789551-seccomp-set-the-seccomp-filter-to-all-threads.patch:
    ensure that the seccomp blacklist is applied to all threads (LP: #1789551)
    - CVE-2018-15746
  * improve s390x spectre mitigation with etoken facility (LP: #1790457)
    - debian/patches/ubuntu/lp-1790457-s390x-kvm-add-etoken-facility.patch
    - debian/patches/ubuntu/lp-1790457-partial-s390x-linux-headers-update.patch

  [ Phillip Susi ]
  * d/p/ubuntu/lp-1787267-fix-en_us-vnc-pipe.patch: Fix pipe, greater than and
    less than keys over vnc when using en_us kemaps (LP: #1787267).

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 29 Aug 2018 11:46:37 +0200

qemu (1:2.11+dfsg-1ubuntu7.5) bionic; urgency=medium

  [Christian Ehrhardt]
  * d/p/lp-1755912-qxl-fix-local-renderer-crash.patch: Fix an issue triggered
    by migrations with UI frontends or frequent guest resolution changes
    (LP: #1755912)

  [ Murilo Opsfelder Araujo ]
  * d/p/ubuntu/target-ppc-extend-eieio-for-POWER9.patch: Backport to
    extend eieio for POWER9 emulation (LP: #1787408).

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Tue, 21 Aug 2018 11:25:45 +0200

qemu (1:2.11+dfsg-1ubuntu7.4) bionic; urgency=medium

  * d/p/ubuntu/machine-type-hpb.patch: add -hpb machine type
    for host-phys-bits=true (LP: #1776189)
    - add an info about this change in debian/qemu-system-x86.NEWS

 -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 13 Jun 2018 10:41:34 +0200
 
