[SOLVED] Replication runner on all hosts and vm's broken since update

Discussion in 'Proxmox VE: Installation and configuration' started by fireon, Oct 27, 2018.

  1. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,678
    Likes Received:
    144
    Hello,

    The problem has existed for about a week. If I click on Replication, I get a timeout on all nodes in the cluster. Here is the journal:
    Code:
    Okt 27 01:27:00 backup systemd[1]: Starting Proxmox VE replication runner...
    Okt 27 01:27:01 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:02 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:03 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:04 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:05 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:06 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:07 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:08 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:09 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
    Okt 27 01:27:10 backup pvesr[28719]: error with cfs lock 'file-replication_cfg': got lock request timeout
    Okt 27 01:27:10 backup systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
    Okt 27 01:27:10 backup systemd[1]: Failed to start Proxmox VE replication runner.
    Okt 27 01:27:10 backup systemd[1]: pvesr.service: Unit entered failed state.
    Okt 27 01:27:10 backup systemd[1]: pvesr.service: Failed with result 'exit-code'.
    
    Maybe someone can help me with that?

    Code:
    proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
    pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
    pve-kernel-4.15: 5.2-10
    pve-kernel-4.15.18-7-pve: 4.15.18-27
    pve-kernel-4.15.18-5-pve: 4.15.18-24
    pve-kernel-4.15.18-4-pve: 4.15.18-23
    pve-kernel-4.15.18-3-pve: 4.15.18-22
    pve-kernel-4.15.18-2-pve: 4.15.18-21
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-40
    libpve-guest-common-perl: 2.0-18
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-30
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-3
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-20
    pve-cluster: 5.0-30
    pve-container: 2.0-28
    pve-docs: 5.2-8
    pve-firewall: 3.0-14
    pve-firmware: 2.0-5
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-6
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.11.2-1
    pve-xtermjs: 1.0-5
    pve-zsync: 1.7-1
    qemu-server: 5.0-36
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.11-pve1~bpo1
    
     
  2. acidrop

    acidrop Member

    Joined:
    Jul 17, 2012
    Messages:
    194
    Likes Received:
    4
    I'm having the same problem after upgrading a 3-node cluster to the latest package versions.
    Multicast communication works fine, but pvesr.service is unable to start because of "error with cfs lock 'file-replication_cfg': got lock request timeout".
    As a workaround, I have removed the content of the /etc/pve/replication.cfg file, and that at least seems to bring pvesr.service up.
    Once you create a new replication job though, the same error occurs ...
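    A minimal sketch of that workaround, assuming you want to keep a backup of the config first (/etc/pve is the shared cluster filesystem, so emptying the file on a single node is enough):
    Code:
    # keep a copy of the current replication config
    cp -a /etc/pve/replication.cfg /root/replication.cfg.bak

    # empty the file so pvesr no longer finds any jobs
    : > /etc/pve/replication.cfg

    # check whether the replication runner starts again
    systemctl start pvesr.service
    systemctl status pvesr.service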

    Code:
    proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
    pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
    pve-kernel-4.15: 5.2-8
    pve-kernel-4.15.18-5-pve: 4.15.18-24
    pve-kernel-4.15.18-2-pve: 4.15.18-21
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    pve-kernel-4.15.17-1-pve: 4.15.17-9
    ceph: 12.2.8-pve1
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-40
    libpve-guest-common-perl: 2.0-18
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-30
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-3
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    openvswitch-switch: 2.7.0-3
    proxmox-widget-toolkit: 1.0-20
    pve-cluster: 5.0-30
    pve-container: 2.0-29
    pve-docs: 5.2-8
    pve-firewall: 3.0-14
    pve-firmware: 2.0-5
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-6
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-38
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.11-pve1~bpo1
     
  3. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    Having exactly the same issue. Is there any workaround for now that doesn't require completely rebooting the hosts?
     
  4. acidrop

    acidrop Member

    Joined:
    Jul 17, 2012
    Messages:
    194
    Likes Received:
    4
    I don't think a reboot will make any difference; I have already tried rebooting all nodes with no luck.
    This most likely is a bug, so let's see if an updated package with a fix is released in the coming days...
    Proxmox staff will have to verify this first though.
     
  5. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,678
    Likes Received:
    144
    Rebooted too, didn't help.
     
  6. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    Just to let the Proxmox developers know, here are my package versions:

    Code:
    proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
    pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
    pve-kernel-4.15: 5.2-6
    pve-kernel-4.15.18-3-pve: 4.15.18-22
    pve-kernel-4.15.17-1-pve: 4.15.17-9
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-38
    libpve-guest-common-perl: 2.0-17
    libpve-http-server-perl: 2.0-10
    libpve-storage-perl: 5.0-25
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-1
    lxcfs: 3.0.0-1
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-19
    pve-cluster: 5.0-30
    pve-container: 2.0-26
    pve-docs: 5.2-8
    pve-firewall: 3.0-14
    pve-firmware: 2.0-5
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-6
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.11.2-1
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-33
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.9-pve1~bpo9
    The configuration has not changed in between. The problem seems to have started at the time of a scheduled replication, so a replication might have been in progress?

    Here is the syslog of two nodes showing the transition from "working" to failing:

    Code:
    Oct 25 23:59:00 athos systemd[1]: Starting Proxmox VE replication runner...
    Oct 25 23:59:00 athos systemd[1]: Started Proxmox VE replication runner.
    Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection rate 1/60s for (X) at Oct 25 23:50:51
    Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection count 1 for (X) at Oct 25 23:50:51
    Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max cache size 2 at Oct 25 23:50:51
    Oct 26 00:00:00 athos systemd[1]: Starting Proxmox VE replication runner...
    Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session opened for user root by (uid=0)
    Oct 26 00:00:01 athos CRON[19133]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
    Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session closed for user root
    Oct 26 00:00:02 athos zed[19421]: eid=2551 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:02 athos zed[19563]: eid=2552 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:02 athos zed[19646]: eid=2553 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:03 athos zed[20045]: eid=2554 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:03 athos zed[20151]: eid=2555 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:12 athos zed[38194]: eid=2556 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:12 athos zed[38290]: eid=2557 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:13 athos zed[38438]: eid=2558 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:13 athos zed[38497]: eid=2559 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:14 athos zed[38584]: eid=2560 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:14 athos zed[38665]: eid=2561 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:16 athos zed[38965]: eid=2562 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:16 athos zed[39022]: eid=2563 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:16 athos zed[39032]: eid=2564 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:16 athos zed[39107]: eid=2565 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:16 athos zed[39110]: eid=2566 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:17 athos zed[39420]: eid=2567 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:17 athos zed[39658]: eid=2568 class=history_event pool_guid=0x1E98F7F7D9A9A016
    Oct 26 00:00:18 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:19 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:20 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:21 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:22 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:23 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:24 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:25 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:26 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
    Oct 26 00:00:27 athos pvesr[19090]: error with cfs lock 'file-replication_cfg': got lock request timeout
    Oct 26 00:00:27 athos systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
    Oct 26 00:00:27 athos systemd[1]: Failed to start Proxmox VE replication runner.
    Oct 26 00:00:27 athos systemd[1]: pvesr.service: Unit entered failed state.
    Oct 26 00:00:27 athos systemd[1]: pvesr.service: Failed with result 'exit-code'.
    The other node:
    Code:
    Oct 25 23:59:00 aramis systemd[1]: Starting Proxmox VE replication runner...
    Oct 25 23:59:00 aramis systemd[1]: Started Proxmox VE replication runner.
    Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection rate 1/60s for (X) at Oct 25 23:55:52
    Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection count 1 for (X) at Oct 25 23:55:52
    Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max cache size 2 at Oct 25 23:55:52
    Oct 26 00:00:00 aramis systemd[1]: Starting Proxmox VE replication runner...
    Oct 26 00:00:01 aramis zed[25495]: eid=5402 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session opened for user root by (uid=0)
    Oct 26 00:00:01 aramis CRON[25563]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
    Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session closed for user root
    Oct 26 00:00:01 aramis zed[25976]: eid=5403 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:01 aramis zed[26134]: eid=5404 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:13 aramis zed[26096]: eid=5405 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:13 aramis zed[26113]: eid=5406 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:15 aramis zed[26415]: eid=5407 class=history_event pool_guid=0x32C807D8808E1CD9
    Oct 26 00:00:34 aramis postfix/smtpd[1998]: connect from unknown[X]
    Oct 26 00:00:34 aramis postfix/smtpd[1998]: lost connection after AUTH from unknown[178.159.36.53]
    Oct 26 00:00:34 aramis postfix/smtpd[1998]: disconnect from unknown[X] ehlo=1 auth=0/1 commands=1/2
    Oct 26 00:00:39 aramis sshd[5048]: Connection closed by X port 59672 [preauth]
    Oct 26 00:00:39 aramis sshd[5050]: Connection closed by X port 51776 [preauth]
    Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
    Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
    Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
    Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
    Oct 26 00:01:16 aramis pvesr[24571]: error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
    Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Main process exited, code=exited, status=255/n/a
    Oct 26 00:01:16 aramis systemd[1]: Failed to start Proxmox VE replication runner.
    Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Unit entered failed state.
    Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Failed with result 'exit-code'.
    Oct 26 00:01:16 aramis systemd[1]: Starting Proxmox VE replication runner...
    Hope this can help further.

    Thanks to all.
     
  7. udo

    udo Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,716
    Likes Received:
    149
    Hi,
    I don't think it has anything to do with this issue, but your version list shows that you don't use "apt dist-upgrade", which is important on Proxmox.
    "apt upgrade" isn't enough!

    Udo
     
  8. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,678
    Likes Received:
    144
    Code:
    pveupdate
    pveupgrade
     
  9. dendi

    dendi Member

    Joined:
    Nov 17, 2011
    Messages:
    85
    Likes Received:
    3
  10. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    Not sure if it's because of the time change, as it appeared a few days ago.

    I can confirm that the pvesr process has been using 100% of the CPU on the first node (sender) "athos" for hours, but not on the two other nodes (including the receiver node, porthos).
    Is it safe to kill it?

    Code:
    root     14828 99.5  0.0 495372 77404 ?        Rs   10:49   0:35 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
    Thanks.
     
  11. dendi

    dendi Member

    Joined:
    Nov 17, 2011
    Messages:
    85
    Likes Received:
    3
    Saphirblanc, you can check how old your replication snapshots are with "zfs list -t all" and see when replication stopped working.
    I "solved" it with:
    Code:
    cp -a /etc/pve/replication.cfg /root/
    vi /etc/pve/replication.cfg #clear it, only on one node because it's in a cluster fs
    systemctl stop pvesr.timer
    systemctl stop pvesr
    systemctl restart pvedaemon
    
    You have to redo all your replication jobs of course, and remember to manually delete the old replication snapshots on the source node
    and to restart pvesr.timer and pvesr afterwards (see the sketch below).
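    To illustrate the cleanup dendi describes, roughly something like this (a sketch only; the dataset and snapshot names are hypothetical examples, so adjust them to your own pools and double-check with "zfs list -t snapshot" before destroying anything):
    Code:
    # list leftover replication snapshots (they follow the __replicate_<jobid>_<timestamp>__ naming pattern)
    zfs list -t snapshot -o name | grep __replicate_

    # destroy a stale replication snapshot on the source node (hypothetical example name)
    zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__

    # after replication.cfg has been cleaned up and the jobs re-created, start the timer and runner again
    systemctl start pvesr.timer
    systemctl start pvesr.service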
     
    saphirblanc likes this.
  12. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    Thanks dendi! I was indeed able to get rid of the error 500... and create more than one replication task through the GUI, which is a big step (I have not yet restarted pvesr.timer and pvesr)!
    How can I delete the old replication snapshots? Sorry, I'm not that used to ZFS yet...

    Code:
    root@athos:/var/log# zfs list -t all
    NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
    rpool                                                     215G  3.15T   166K  /rpool
    rpool-hdd                                                 207G  3.31T   128K  /rpool-hdd
    rpool-hdd/vm-108-disk-1                                  8.68G  3.31T  8.35G  -
    rpool-hdd/vm-108-disk-1@__replicate_108-0_1540504800__    338M      -  8.35G  -
    rpool-hdd/vm-108-disk-2                                   188M  3.31T   186M  -
    rpool-hdd/vm-108-disk-2@__replicate_108-0_1540504800__   1.63M      -   186M  -
    rpool-hdd/vm-108-disk-3                                  74.6K  3.31T  74.6K  -
    rpool-hdd/vm-108-disk-3@__replicate_108-0_1540504800__      0B      -  74.6K  -
    rpool-hdd/vm-109-disk-2                                  1.92G  3.31T  1.92G  -
    rpool-hdd/vm-109-disk-2@__replicate_109-0_1540418416__    884K      -  1.92G  -
    rpool-hdd/vm-109-disk-3                                  5.38G  3.31T  4.93G  -
    rpool-hdd/vm-109-disk-3@__replicate_109-0_1540418416__    455M      -  4.93G  -
    rpool-hdd/vm-112-disk-1                                  12.0G  3.31T  11.7G  -
    rpool-hdd/vm-112-disk-1@__replicate_112-0_1540418430__    358M      -  11.2G  -
    rpool-hdd/vm-112-disk-2                                   101G  3.31T  99.9G  -
    rpool-hdd/vm-112-disk-2@__replicate_112-0_1540418430__   1.39G      -  97.8G  -
    rpool-hdd/vm-122-disk-1                                  30.9G  3.31T  30.4G  -
    rpool-hdd/vm-122-disk-1@__replicate_122-0_1540418502__    478M      -  30.4G  -
    rpool-hdd/vm-200-disk-1                                  46.1G  3.31T  42.3G  -
    rpool-hdd/vm-200-disk-1@__replicate_200-0_1540418530__   3.83G      -  42.3G  -
    rpool/ROOT                                               3.28G  3.15T   153K  /rpool/ROOT
    rpool/ROOT/pve-1                                         3.28G  3.15T  3.28G  /
    rpool/data                                                203G  3.15T   153K  /rpool/data
    rpool/data/vm-102-disk-1                                 6.73G  3.15T  6.00G  -
    rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__  99.3M      -  5.94G  -
    rpool/data/vm-102-disk-1@__replicate_102-0_1540467900__  82.5M      -  5.97G  -
    rpool/data/vm-111-disk-1                                 50.5G  3.15T  50.5G  -
    rpool/data/vm-113-disk-1                                 89.7G  3.15T  81.3G  -
    rpool/data/vm-113-disk-1@__replicate_113-0_1540418472__  8.47G      -  81.1G  -
    rpool/data/vm-117-disk-1                                 30.2G  3.15T  29.6G  -
    rpool/data/vm-117-disk-1@__replicate_117-0_1540418496__   581M      -  29.6G  -
    rpool/data/vm-127-disk-1                                 6.70G  3.15T  5.47G  -
    rpool/data/vm-127-disk-1@__replicate_127-0_1540418507__  1.23G      -  5.47G  -
    rpool/data/vm-139-disk-1                                 8.01G  3.15T  7.48G  -
    rpool/data/vm-139-disk-1@__replicate_139-0_1540631221__   545M      -  6.88G  -
    rpool/data/vm-145-disk-1                                 5.85G  3.15T  5.57G  -
    rpool/data/vm-145-disk-1@__replicate_145-0_1540418519__   283M      -  5.43G  -
    rpool/data/vm-147-disk-1                                 5.04G  3.15T  4.86G  -
    rpool/data/vm-147-disk-1@__replicate_147-0_1540418524__   181M      -  4.60G  -
    rpool/swap                                               8.50G  3.16T  1.79G  -
    Thanks for your big help!

    EDIT: simply by using:
    Code:
    zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__
    on both the sender and receiver nodes?
     
    #12 saphirblanc, Oct 28, 2018
    Last edited: Oct 28, 2018
  13. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    Well, from my side, I deleted the old replication snapshot on the source node and on the target, and then got this issue:

    Code:
    Oct 28 16:44:02 athos zed: eid=2617 class=history_event pool_guid=0x765A2359F9A05698
    Oct 28 16:44:02 athos pvesr[14216]: 102-0: got unexpected replication job error - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1540741440__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs:vm-102-disk-1 zfs - -with-snapshots 1' failed: exit code 255
    Oct 28 16:44:02 athos systemd[1]: Started Proxmox VE replication runner.
    Then I understood that it was because the disk image was still present on the target (the full clone), so I deleted it using
    Code:
    zfs destroy rpool/data/vm-102-disk-1
    Then I tried again, and I'm back to square one with the 500 error code and the pvesr service crashing on all nodes :(
     
  14. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,678
    Likes Received:
    144
    Yes, same here.
     
  15. dendi

    dendi Member

    Joined:
    Nov 17, 2011
    Messages:
    85
    Likes Received:
    3
    I tried to restore the original /etc/pve/replication.cfg but I got errors:
    Code:
    command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1540805341__ | /usr/bin/cstream -t 10000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
    Hope this will help the staff...
     
  16. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,678
    Likes Received:
    144
    I hope so too.
     
  17. acidrop

    acidrop Member

    Joined:
    Jul 17, 2012
    Messages:
    194
    Likes Received:
    4
    There's an open bug in Bugzilla; please post your info there to help narrow down this issue...

    Thanks
     
    saphirblanc likes this.
  18. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    3,973
    Likes Received:
    240
    Hi all,

    can you send the replication.cfg so we can see the replication schedules?
     
  19. saphirblanc

    saphirblanc New Member

    Joined:
    Jul 4, 2017
    Messages:
    23
    Likes Received:
    0
    When it crashed, these were the replication jobs:

    Code:
    local: 101-1
        target porthos
        schedule mon..fri
        source aramis
    
    local: 102-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 106-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 108-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 109-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 116-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 117-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 113-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 104-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 103-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 105-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 112-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 143-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 145-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 114-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 115-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 126-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 146-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 144-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 118-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 107-0
        target porthos
        schedule mon..fri
        source aramis
    
    local: 147-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 122-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 127-0
        target porthos
        schedule mon..fri
        source athos
    
    local: 200-0
        target porthos
        schedule mon..fri
        source athos
     
  20. rholighaus

    rholighaus New Member
    Proxmox VE Subscriber

    Joined:
    Dec 15, 2016
    Messages:
    28
    Likes Received:
    2
    I have had the same issue since 3am on Oct 28, so I agree that it looks like a time-change-related issue.
    I have opened ticket FHL-759-38090 for it but have not heard any suggestions on how to solve it.

    My setup relies heavily on working replication, and I see the "error with cfs lock 'file-replication_cfg': got lock request timeout (500)" message even on a node that is neither a replication target nor a source, but just a member of the same cluster:

    Code:
    # pvesr status
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    trying to acquire cfs lock 'file-replication_cfg' ...
    error with cfs lock 'file-replication_cfg': got lock request timeout
     
    #20 rholighaus, Oct 30, 2018
    Last edited: Oct 30, 2018