Replication Job: 101-0 failed

Discussion in 'Proxmox VE: Installation and configuration' started by bennn, May 13, 2019.

  1. bennn

    bennn New Member

    Hi there,

    I have a 3-node cluster; 2 of the nodes have 4 x SSDs and use ZFS with RAIDZ-1. These pools have 13% and 22% of their capacity used.

    I've recently started converting some containers to QEMU machines to take advantage of the near-realtime migration available with ZFS. I have also enabled replication to help with this.

    Here's a high level overview:
    node1 has vm1
    node2 has vm2

    Replication is every 15 minutes and copies vm2 to node1, and vm1 to node2.
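
    I set the jobs up through the GUI, but for reference the equivalent CLI would be roughly this (going by the pvesr man page; "node2" is just the name I use for it in this post):
    Code:
    # replicate vm1 (VMID 101, per the alert below) to node2, every 15 minutes
    pvesr create-local-job 101-0 node2 --schedule '*/15'
    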

    The vm2 replication was set up first and has never had an issue.
    vm1 has an intermittent issue; the email alert is basic:
    Code:
    Subject: Replication Job: 101-0 failed
    import failed: exit code 29
    
    The log from the web interface shows this:
    Code:
    2019-05-13 10:00:01 101-0: start replication job
    2019-05-13 10:00:01 101-0: guest => VM 101, running => 35788
    2019-05-13 10:00:01 101-0: volumes => local-zfs:vm-101-disk-0
    2019-05-13 10:00:02 101-0: freeze guest filesystem
    2019-05-13 10:00:02 101-0: create snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
    2019-05-13 10:00:02 101-0: thaw guest filesystem
    2019-05-13 10:00:02 101-0: incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-0_1557740701__ => __replicate_101-0_1557741601__)
    2019-05-13 10:00:04 101-0: delete previous replication snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
    2019-05-13 10:00:04 101-0: end replication job with error: import failed: exit code 29
    
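    If it's useful, I can also poke at the jobs from the CLI on the source node (going by the pvesr man page, so treat the exact flags as my reading of it):
    Code:
    # overview of all replication jobs with their state and last/next sync
    pvesr status
    
    # run the failing job by hand with verbose output (job ID from the alert above)
    pvesr run --id 101-0 --verbose
    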
    I've read some other threads on this and they all hint at the storage being overloaded; however, there is no indication of this on these nodes:
    CPU is ~1.5%
    Load is ~0.4
    RAM is ~18%
    IO is essentially idle

    I adjusted the schedule of the failing job (vm1) so that it wouldn't conflict with the incoming vm2 sync, but this has not helped.
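
    (I made that change through the GUI; on the CLI it would presumably be something like the following, where the offset minute list is my guess at the calendar-event syntax:)
    Code:
    # move vm1's job off the quarter-hour so it doesn't overlap the incoming vm2 sync
    # (the minute-list value is an assumption -- check the calendar-event docs)
    pvesr update 101-0 --schedule '5,20,35,50'
    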

    I tried to capture the output from iotop on node2 during a sync; it does not look overloaded at all.
    Code:
    Total DISK READ :     272.67 K/s | Total DISK WRITE :     386.59 K/s
    Actual DISK READ:     422.08 K/s | Actual DISK WRITE:      24.41 M/s
      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
      929 be/4 root        0.00 B/s    0.00 B/s  0.00 %  8.47 % [txg_sync]
     1514 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.44 % [kworker/u98:3]
    43974 be/4 root        0.00 B/s   33.62 K/s  0.00 %  0.22 % kvm -id 100 -name db0~=300 -machine type=pc
    42476 be/4 root        0.00 B/s   48.56 K/s  0.00 %  0.16 % kvm -id 100 -name db0~=300 -machine type=pc
    42768 be/4 root        0.00 B/s  123.26 K/s  0.00 %  0.16 % kvm -id 100 -name db0~=300 -machine type=pc
    43973 be/4 root        0.00 B/s   37.35 K/s  0.00 %  0.16 % kvm -id 100 -name db0~=300 -machine type=pc
    42473 be/4 root        0.00 B/s   82.17 K/s  0.00 %  0.08 % kvm -id 100 -name db0~=300 -machine type=pc
     2302 be/4 root        0.00 B/s  956.21 B/s  0.00 %  0.00 % rsyslogd -n [rs:main Q:Reg]
      735 be/0 root       48.56 K/s    0.00 B/s  0.00 %  0.00 % [z_rd_iss]
    36951 be/0 root      115.79 K/s    0.00 B/s  0.00 %  0.00 % [z_rd_iss]
    36953 be/0 root      108.32 K/s    0.00 B/s  0.00 %  0.00 % [z_rd_iss]
     2849 be/4 root        0.00 B/s   60.70 K/s  0.00 %  0.00 % pmxcfs [cfs_loop]
        1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
        2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    22531 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/17:1]
        4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
    43938 be/4 postfix     0.00 B/s    0.00 B/s  0.00 %  0.00 % qmgr -l -t unix -u
        7 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [mm_percpu_wq]
        8 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
        9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
    
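    If there's a better way to watch the pool while a sync is running, let me know; so far I've mostly been looking at this:
    Code:
    # per-vdev bandwidth and IOPS on the receiving pool, refreshed every 5 seconds
    zpool iostat -v rpool 5
    
    # plus a general health check
    zpool status rpool
    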
    The failures appear random; I cannot infer any pattern from them.

    Any ideas much appreciated!
    Thanks!

    Package versions:
    Code:
    proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
    pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
    pve-kernel-4.15: 5.4-1
    pve-kernel-4.15.18-13-pve: 4.15.18-37
    pve-kernel-4.15.18-12-pve: 4.15.18-36
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-51
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-13
    libpve-storage-perl: 5.0-41
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-3
    proxmox-widget-toolkit: 1.0-26
    pve-cluster: 5.0-36
    pve-container: 2.0-37
    pve-docs: 5.4-2
    pve-edk2-firmware: 1.20190312-1
    pve-firewall: 3.0-20
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-9
    pve-i18n: 1.1-4
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-3
    pve-xtermjs: 3.12.0-1
    qemu-server: 5.0-50
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.13-pve1~bpo2
    
    
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Hi,

    does the job fail only once and recover on the next run?
    Or do you have to remove the replication and create it from scratch?
     
  3. bennn

    bennn New Member

    It recovers each time (it seems to retry 5 minutes later?). There has never been more than one failure in a row.

    Let me know if there are any other logs that could be useful.
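
    So far I've been pulling them like this (assuming the replication runner is the pvesr systemd unit):
    Code:
    # messages from the replication runner on the source node (unit name assumed)
    journalctl -u pvesr.service --since today
    
    # plus the matching entries in the plain syslog
    grep pvesr /var/log/daemon.log
    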

    Thanks
     
  4. wolfgang

    wolfgang Proxmox Staff Member
    Then I guess it is the remote side that is making the trouble, or you have network problems?
    Is the pool the rpool on these nodes?
     
  5. bennn

    bennn New Member

    Both nodes use rpool, and they replicate to each other, which is why I'm confused: node2 to node1 doesn't fail, only node1 to node2.

    Both are connected through a 10Gb switch, and I can't find any obvious networking issues.
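
    (I checked with the basics between the two nodes; "node2" below stands for the other node's hostname, and iperf3 has to be installed on both:)
    Code:
    # look for packet loss or latency spikes on the replication network
    ping -c 100 node2
    
    # raw throughput test: server on node2, client on node1
    iperf3 -s              # on node2
    iperf3 -c node2 -t 30  # on node1
    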

    Got a slightly better error this time, from daemon.log:
    Code:
    May 14 09:18:00 loc0-pve1 systemd[1]: Starting Proxmox VE replication runner...
    May 14 09:18:02 loc0-pve1 zed: eid=9866 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 pvesr[7532]: send from @__replicate_101-0_1557825061__ to rpool/data/vm-101-disk-0@__replicate_101-0_1557825481__ estimated size is 1.56M
    May 14 09:18:04 loc0-pve1 pvesr[7532]: total estimated size is 1.56M
    May 14 09:18:04 loc0-pve1 zed: eid=9867 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 pvesr[7532]: TIME        SENT   SNAPSHOT
    May 14 09:18:04 loc0-pve1 zed: eid=9868 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 zed: eid=9869 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 zed: eid=9870 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 pvesr[7532]: cannot receive incremental stream: checksum mismatch or incomplete stream
    May 14 09:18:04 loc0-pve1 pvesr[7532]: command 'zfs recv -F -- rpool/data/vm-101-disk-0' failed: exit code 1
    May 14 09:18:04 loc0-pve1 pvesr[7532]: exit code 255
    May 14 09:18:04 loc0-pve1 pvesr[7532]: send/receive failed, cleaning up snapshot(s)..
    May 14 09:18:04 loc0-pve1 zed: eid=9871 class=history_event pool_guid=0x7D034AB176D92D1A
    May 14 09:18:04 loc0-pve1 pvesr[7532]: 101-0: got unexpected replication job error - import failed: exit code 29
    May 14 09:18:05 loc0-pve1 systemd[1]: Started Proxmox VE replication runner.
    
    Given the small stream size I'm not sure this could be the network, though I'm not sure what else could cause it.
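
    I guess I could take pvesr out of the picture and push a snapshot across by hand to see whether the stream survives the network, something like this (snapshot name copied from the log above, "node2" standing in for the other node):
    Code:
    # full send of a replication snapshot, piped over SSH and discarded on the
    # far side -- a crude check that the stream arrives intact
    zfs send -v rpool/data/vm-101-disk-0@__replicate_101-0_1557825481__ \
      | ssh root@node2 'cat > /dev/null'
    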

    Thanks
     
  6. udo

    udo Well-Known Member
    Proxmox Subscriber

    Hi,
    I have the same issue with the replication between two nodes.
    One VM is replicated from B to A every 15 minutes and fails approximately four times a day, while a VM with the same replication schedule from A to B fails not at all or only once a day.

    I would guess it's load (IO) and ZoL related in my case...
    The network is OK, otherwise the replication from A to B would have to fail at the same time...

    Udo
     
  7. wolfgang

    wolfgang Proxmox Staff Member
    Can you try to remove/disable the swap on the rpool?
    We have seen in the past that ZFS swaps out blocks from the ARC,
    so this could explain the hanging pool.

    Generally, we recommend removing swap from ZFS.
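
    Roughly like this, assuming the installer-default rpool/swap zvol and the matching /etc/fstab entry:
    Code:
    # stop using the ZFS-backed swap device
    swapoff /dev/zvol/rpool/swap
    
    # then remove/comment the swap line in /etc/fstab and, once nothing
    # references it anymore, drop the zvol
    zfs destroy rpool/swap
    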
     
  8. udo

    udo Well-Known Member
    Proxmox Subscriber

    Hi Wolfgang,
    in my case both servers use a separate disk for swap...
    Code:
    swapon
    NAME      TYPE      SIZE USED PRIO
    /dev/sdf1 partition  16G 147M   -2
    
    
    swapon
    NAME      TYPE      SIZE USED PRIO
    /dev/sdf1 partition  16G   0B   -2
    
    Udo
     
  9. bennn

    bennn New Member

    I do not have any swap configured on my hosts (196 GB total RAM, with 32 GB set for zfs_arc_max).
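
    The limit is set the usual way via the module option (value in bytes), e.g.:
    Code:
    # /etc/modprobe.d/zfs.conf -- 32 GiB ARC cap
    options zfs zfs_arc_max=34359738368
    
    # applied with "update-initramfs -u" and a reboot
    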
     