Storage replication still failing

Discussion in 'Proxmox VE: Installation and configuration' started by sadai, Sep 6, 2018.

  1. sadai (New Member, Proxmox Subscriber)
    I am running a 3 node cluster with local storage (zfs) and latest version of pve:

    proxmox-ve: 5.2-2 (running kernel: 4.15.18-2-pve)
    pve-manager: 5.2-7 (running version: 5.2-7/8d88e66a)
    pve-kernel-4.15: 5.2-5
    pve-kernel-4.15.18-2-pve: 4.15.18-20
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-38
    libpve-guest-common-perl: 2.0-17
    libpve-http-server-perl: 2.0-10
    libpve-storage-perl: 5.0-24
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-1
    lxcfs: 3.0.0-1
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-19
    pve-cluster: 5.0-29
    pve-container: 2.0-25
    pve-docs: 5.2-8
    pve-firewall: 3.0-13
    pve-firmware: 2.0-5
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-6
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.11.2-1
    pve-xtermjs: 1.0-5
    pve-zsync: 1.6-16
    qemu-server: 5.0-32
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.9-pve1~bpo9


    Despite the fix for storage replication [1], I still experience the same issue: replication jobs sporadically fail.

    [1] https://forum.proxmox.com/threads/storage-replication-constantly-failing.43347/#post-221446

    From /var/log/syslog:
    Replication job log:
    The next attempt then always succeeds.

    Please let me know if there is any more information I can provide.
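    In case it helps, this is how I check the job state; the commands below assume the stock pvesr tooling (replication is triggered by the pvesr.timer/pvesr.service systemd units):

    Code:
    # list replication jobs and their last run/state on this node
    pvesr status

    # replication runs from a systemd timer; the journal has the details
    journalctl -u pvesr.service --since today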
     
    #1 sadai, Sep 6, 2018
    Last edited: Sep 6, 2018
  2. wolfgang (Proxmox Staff Member)
    Hi,
    Are all nodes on the latest PVE version, and have they been rebooted so they all run the same kernel?
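    You can verify that quickly on each node with standard commands:

    Code:
    # full PVE package version list (compare across nodes)
    pveversion -v

    # currently running kernel
    uname -r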
     
  3. sadai (New Member, Proxmox Subscriber)
    Yes, all nodes run identical software and the same kernel (from the PVE enterprise repository).
     
  4. wolfgang (Proxmox Staff Member)
    This looks like it could be a network error:

    Sep 6 11:11:02 pve1 pvesr[3987]: cannot receive incremental stream: checksum mismatch or incomplete stream


    There must be a problem on the sending side.
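    To narrow it down, you could try a manual send/receive between the two nodes and watch where it breaks. The dataset, snapshot, and host names below are only placeholders:

    Code:
    # create a test snapshot and send it manually to the target node
    zfs snapshot rpool/data/vm-100-disk-1@manualtest
    zfs send -v rpool/data/vm-100-disk-1@manualtest | ssh root@pve2 zfs recv rpool/data/vm-100-disk-1-manualtest

    # clean up afterwards
    zfs destroy rpool/data/vm-100-disk-1@manualtest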
     
  5. sadai (New Member, Proxmox Subscriber)
    I have two pools: the first is a mirror of 2 SSDs, the second a mirror of 2 HDDs.
    When replication fails, it is always the first VM disk, which sits on the SSD pool; could it be related to that?

    Both pools are configured with thin provisioning and the disks have discard enabled. I have now moved all VM disks to the HDD pool to see whether it only happens with SSDs.
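    For reference, I moved the disks via the GUI; the CLI equivalent would be something like this (VM ID, disk, and storage names are just examples):

    Code:
    # move VM 100's first disk to the HDD-backed storage and delete the source
    qm move_disk 100 scsi0 hddpool --delete 1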
     
    #5 sadai, Sep 7, 2018
    Last edited: Sep 7, 2018
  6. sadai (New Member, Proxmox Subscriber)
    After letting replication run only on the HDD pool for a while, I am quite sure the SSD pool is causing the problem.
    I checked and compared the parameters of both pools, but I have no idea what the cause is.

    Could you please point me in the right direction: what is special about ZFS on pure SSDs?
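    This is roughly how I compared them, in case I missed something (pool names are examples):

    Code:
    # dump property/value pairs of both pools and diff them
    zpool get -H -o property,value all ssdpool > /tmp/ssd.zpool
    zpool get -H -o property,value all hddpool > /tmp/hdd.zpool
    diff /tmp/ssd.zpool /tmp/hdd.zpool

    # same for the dataset-level properties
    zfs get -H -o property,value all ssdpool > /tmp/ssd.zfs
    zfs get -H -o property,value all hddpool > /tmp/hdd.zfs
    diff /tmp/ssd.zfs /tmp/hdd.zfs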
     
  7. wolfgang (Proxmox Staff Member)
    Nothing; SSDs and HDDs are handled the same way.
    SSDs are only faster.
     
  8. sadai (New Member, Proxmox Subscriber)
    There seems to be an issue with the QEMU guest agent: replication fails repeatedly for some VMs if the agent is enabled and running.

    I tried disabling the freeze commands in the guest's /etc/qemu/qemu-ga.conf:

    Code:
    [general]
    blacklist=guest-fsfreeze-freeze,guest-fsfreeze-thaw
    In the logs I can see the difference (only guest-ping is answered now), but the behavior is the same.
    Only disabling/stopping the agent makes replication stable.
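    To be explicit, "disabling" here means both stopping the service in the guest and switching the agent option off on the host (the VM ID is an example):

    Code:
    # inside the guest
    systemctl stop qemu-guest-agent
    systemctl disable qemu-guest-agent

    # on the PVE host: turn off the agent option for this VM
    qm set 100 --agent 0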

    I think this is not a Proxmox issue per se, but the combination of ZFS + SSDs + replication + guest agent.
    ZFS snapshots already provide a consistent state, and the guest agent additionally tries to freeze the VM's filesystems (quiesce).
    Perhaps the snapshot is so fast on SSDs that it is already finished by the time the agent responds; it must be something like that.
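    If someone wants to reproduce the freeze/thaw interaction by hand, qm can talk to the agent directly (the VM ID is an example, and note that this briefly freezes guest I/O, so avoid busy production VMs):

    Code:
    # freeze all guest filesystems, check the state, then thaw again
    qm agent 100 fsfreeze-freeze
    qm agent 100 fsfreeze-status
    qm agent 100 fsfreeze-thaw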
     
  9. wolfgang (Proxmox Staff Member)
    That's interesting. What version of the qemu-guest-agent do you use?
    Also, what OS do you use as the guest?
     
  10. sadai (New Member, Proxmox Subscriber)
    The guest is Ubuntu 18.04 with the latest updates/kernel.
    It has been happening since I deployed the VMs about four months ago, so the problem has persisted across several kernels and three guest-agent versions.

    Output of apt changelog qemu-guest-agent:
    Code:
    qemu (1:2.11+dfsg-1ubuntu7.6) bionic; urgency=medium
    
      [ Christian Ehrhardt ]
      * Add cpu model for z14 ZR1 (LP: #1780773)
      * d/p/ubuntu/lp-1789551-seccomp-set-the-seccomp-filter-to-all-threads.patch:
        ensure that the seccomp blacklist is applied to all threads (LP: #1789551)
        - CVE-2018-15746
      * improve s390x spectre mitigation with etoken facility (LP: #1790457)
        - debian/patches/ubuntu/lp-1790457-s390x-kvm-add-etoken-facility.patch
        - debian/patches/ubuntu/lp-1790457-partial-s390x-linux-headers-update.patch
    
      [ Phillip Susi ]
      * d/p/ubuntu/lp-1787267-fix-en_us-vnc-pipe.patch: Fix pipe, greater than and
        less than keys over vnc when using en_us kemaps (LP: #1787267).
    
     -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 29 Aug 2018 11:46:37 +0200
    
    qemu (1:2.11+dfsg-1ubuntu7.5) bionic; urgency=medium
    
      [Christian Ehrhardt]
      * d/p/lp-1755912-qxl-fix-local-renderer-crash.patch: Fix an issue triggered
        by migrations with UI frontends or frequent guest resolution changes
        (LP: #1755912)
    
      [ Murilo Opsfelder Araujo ]
      * d/p/ubuntu/target-ppc-extend-eieio-for-POWER9.patch: Backport to
        extend eieio for POWER9 emulation (LP: #1787408).
    
     -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Tue, 21 Aug 2018 11:25:45 +0200
    
    qemu (1:2.11+dfsg-1ubuntu7.4) bionic; urgency=medium
    
      * d/p/ubuntu/machine-type-hpb.patch: add -hpb machine type
        for host-phys-bits=true (LP: #1776189)
        - add an info about this change in debian/qemu-system-x86.NEWS
    
     -- Christian Ehrhardt <christian.ehrhardt@canonical.com>  Wed, 13 Jun 2018 10:41:34 +0200
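    And the agent version actually running inside the guest, for completeness:

    Code:
    # inside the guest
    qemu-ga --version
    dpkg -s qemu-guest-agent | grep -i '^version'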
     