[SOLVED] Replication runner broken on all hosts and VMs since update

fireon

Distinguished Member
Hello,

The problem has existed for about a week. When I click on Replication, I get a timeout on all nodes in the cluster. Here is the journal:
Code:
Okt 27 01:27:00 backup systemd[1]: Starting Proxmox VE replication runner...
Okt 27 01:27:01 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:02 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:03 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:04 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:05 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:06 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:07 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:08 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:09 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:10 backup pvesr[28719]: error with cfs lock 'file-replication_cfg': got lock request timeout
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Okt 27 01:27:10 backup systemd[1]: Failed to start Proxmox VE replication runner.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Unit entered failed state.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Failed with result 'exit-code'.
Maybe someone can help me with that?

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
pve-zsync: 1.7-1
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
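For reference, a rough sketch of how to pull the journal above and check the basics (assuming the standard unit names; nothing here is specific to my setup):
Code:
# replication runner logs for today
journalctl -u pvesr.service --since today
# current state of the service and its timer
systemctl status pvesr.service pvesr.timer
# make sure the cluster itself is quorate
pvecm status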
 
I'm having the same problem after upgrading a 3-node cluster to the latest package versions.
Multicast communication works fine, but pvesr.service is unable to start because of "error with cfs lock 'file-replication_cfg': got lock request timeout".
As a workaround, I have removed the content of the /etc/pve/replication.cfg file, and that at least seems to bring pvesr.service up.
Once you create a new replication job, though, the same error occurs ...
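A minimal sketch of that workaround, in case it helps others; I keep a backup copy first, and the file only needs to be emptied on one node, since /etc/pve is the cluster-wide filesystem:
Code:
# keep a copy of the current job definitions
cp -a /etc/pve/replication.cfg /root/replication.cfg.bak
# empty the file (one node is enough, /etc/pve is cluster-wide)
: > /etc/pve/replication.cfg
# run the replication runner once to confirm it starts cleanly again
systemctl start pvesr.service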

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
 
I'm having exactly the same issue. Is there any workaround for now that doesn't require completely rebooting the host?
 
I don't think a reboot will make any difference; I have already tried rebooting all nodes with no luck.
This looks like a bug, so let's see if an updated package fix is released in one of the upcoming days...
Proxmox staff will have to verify this first, though.
 
Just to let the Proxmox developers know, here are my package versions:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-6
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-25
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

The configuration has not changed in between. It seems to have started at the time of a scheduled replication, so a replication might have been in progress?
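One way to check whether a run is still in flight (just a guess on my part) is to look for a long-running pvesr process and its elapsed time:
Code:
# show any pvesr processes together with how long they have been running
ps -eo pid,etime,stat,cmd | grep '[p]vesr'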

Here is the syslog from two nodes showing the transition from "working" to failing:

Code:
Oct 25 23:59:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 athos systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection rate 1/60s for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection count 1 for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max cache size 2 at Oct 25 23:50:51
Oct 26 00:00:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 athos CRON[19133]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:02 athos zed[19421]: eid=2551 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19563]: eid=2552 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19646]: eid=2553 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20045]: eid=2554 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20151]: eid=2555 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38194]: eid=2556 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38290]: eid=2557 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38438]: eid=2558 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38497]: eid=2559 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38584]: eid=2560 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38665]: eid=2561 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[38965]: eid=2562 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39022]: eid=2563 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39032]: eid=2564 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39107]: eid=2565 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39110]: eid=2566 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39420]: eid=2567 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39658]: eid=2568 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:18 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:19 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:20 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:21 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:22 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:23 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:24 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:25 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:26 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:27 athos pvesr[19090]: error with cfs lock 'file-replication_cfg': got lock request timeout
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Oct 26 00:00:27 athos systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Failed with result 'exit-code'.

The other node:
Code:
Oct 25 23:59:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 aramis systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection rate 1/60s for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection count 1 for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max cache size 2 at Oct 25 23:55:52
Oct 26 00:00:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 aramis zed[25495]: eid=5402 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 aramis CRON[25563]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:01 aramis zed[25976]: eid=5403 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis zed[26134]: eid=5404 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26096]: eid=5405 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26113]: eid=5406 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:15 aramis zed[26415]: eid=5407 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:34 aramis postfix/smtpd[1998]: connect from unknown[X]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: lost connection after AUTH from unknown[178.159.36.53]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: disconnect from unknown[X] ehlo=1 auth=0/1 commands=1/2
Oct 26 00:00:39 aramis sshd[5048]: Connection closed by X port 59672 [preauth]
Oct 26 00:00:39 aramis sshd[5050]: Connection closed by X port 51776 [preauth]
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:01:16 aramis pvesr[24571]: error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Main process exited, code=exited, status=255/n/a
Oct 26 00:01:16 aramis systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 26 00:01:16 aramis systemd[1]: Starting Proxmox VE replication runner...

Hope this can help further.

Thanks to all.
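In the meantime, a convenient way to watch the next scheduled run live (pvesr.timer triggers it every minute; this is just for observation, not a fix):
Code:
# follow the replication runner's journal
journalctl -f -u pvesr.service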
 
Just to let the Proxmox developers know, here are my package versions:
...
Hi,
I don't think it has anything to do with the issue, but your versions show that you don't use "apt dist-upgrade", which is important on Proxmox.
"apt upgrade" isn't enough!

Udo
 
Same problem as about a year ago, I think because of the time change (daylight saving time):
https://forum.proxmox.com/threads/pvesr-status-hanging-after-upgrade-from-5-0-to-5-1.37738/

Not sure if it's because of the time change, as it appeared a few days ago.

I can confirm that the pvesr process has been using 100% of the CPU on the first node (sender), "athos", for hours, but not on the two other nodes (including the receiver node, porthos).
Is it safe to kill it?

Code:
root     14828 99.5  0.0 495372 77404 ?        Rs   10:49   0:35 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Thanks.
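If killing it is OK, my plan would be to stop the timer first so a new run isn't launched immediately; just a sketch, using the PID from the ps output above:
Code:
# keep the timer from launching another run every minute
systemctl stop pvesr.timer
# terminate the stuck runner (PID taken from the ps output above)
kill 14828
# re-enable the schedule once things look sane again
systemctl start pvesr.timer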
 
Saphirblanc, you can check how old your replication snapshots are with "zfs list -t all" and see when replication stopped working.
I "solved" it with:
Code:
cp -a /etc/pve/replication.cfg /root/
vi /etc/pve/replication.cfg  # clear it, only on one node, because /etc/pve is a cluster filesystem
systemctl stop pvesr.timer
systemctl stop pvesr
systemctl restart pvedaemon
You have to redo all your replication jobs, of course, and remember to manually delete the old replication snapshots on the source node,
then restart pvesr.timer and pvesr.
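To find the old snapshots I mean, something like this on the source node (a sketch; replication snapshots follow the __replicate_<job>_<timestamp>__ naming):
Code:
# list leftover replication snapshots with their creation time
zfs list -t snapshot -o name,creation | grep __replicate_
# once they are cleaned up, bring the runner back
systemctl start pvesr.timer
systemctl start pvesr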
 
Saphirblanc, you can check how old your replication snapshots are with "zfs list -t all" and see when replication stopped working.
...

Thanks dendi! I was indeed able to get rid of the 500 error and create more than one replication task through the GUI, which is a big step (I have not yet restarted pvesr.timer and pvesr)!
How can I delete the old replication snapshots? Sorry, I'm not that used to ZFS yet...

Code:
root@athos:/var/log# zfs list -t all
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
rpool                                                     215G  3.15T   166K  /rpool
rpool-hdd                                                 207G  3.31T   128K  /rpool-hdd
rpool-hdd/vm-108-disk-1                                  8.68G  3.31T  8.35G  -
rpool-hdd/vm-108-disk-1@__replicate_108-0_1540504800__    338M      -  8.35G  -
rpool-hdd/vm-108-disk-2                                   188M  3.31T   186M  -
rpool-hdd/vm-108-disk-2@__replicate_108-0_1540504800__   1.63M      -   186M  -
rpool-hdd/vm-108-disk-3                                  74.6K  3.31T  74.6K  -
rpool-hdd/vm-108-disk-3@__replicate_108-0_1540504800__      0B      -  74.6K  -
rpool-hdd/vm-109-disk-2                                  1.92G  3.31T  1.92G  -
rpool-hdd/vm-109-disk-2@__replicate_109-0_1540418416__    884K      -  1.92G  -
rpool-hdd/vm-109-disk-3                                  5.38G  3.31T  4.93G  -
rpool-hdd/vm-109-disk-3@__replicate_109-0_1540418416__    455M      -  4.93G  -
rpool-hdd/vm-112-disk-1                                  12.0G  3.31T  11.7G  -
rpool-hdd/vm-112-disk-1@__replicate_112-0_1540418430__    358M      -  11.2G  -
rpool-hdd/vm-112-disk-2                                   101G  3.31T  99.9G  -
rpool-hdd/vm-112-disk-2@__replicate_112-0_1540418430__   1.39G      -  97.8G  -
rpool-hdd/vm-122-disk-1                                  30.9G  3.31T  30.4G  -
rpool-hdd/vm-122-disk-1@__replicate_122-0_1540418502__    478M      -  30.4G  -
rpool-hdd/vm-200-disk-1                                  46.1G  3.31T  42.3G  -
rpool-hdd/vm-200-disk-1@__replicate_200-0_1540418530__   3.83G      -  42.3G  -
rpool/ROOT                                               3.28G  3.15T   153K  /rpool/ROOT
rpool/ROOT/pve-1                                         3.28G  3.15T  3.28G  /
rpool/data                                                203G  3.15T   153K  /rpool/data
rpool/data/vm-102-disk-1                                 6.73G  3.15T  6.00G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__  99.3M      -  5.94G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540467900__  82.5M      -  5.97G  -
rpool/data/vm-111-disk-1                                 50.5G  3.15T  50.5G  -
rpool/data/vm-113-disk-1                                 89.7G  3.15T  81.3G  -
rpool/data/vm-113-disk-1@__replicate_113-0_1540418472__  8.47G      -  81.1G  -
rpool/data/vm-117-disk-1                                 30.2G  3.15T  29.6G  -
rpool/data/vm-117-disk-1@__replicate_117-0_1540418496__   581M      -  29.6G  -
rpool/data/vm-127-disk-1                                 6.70G  3.15T  5.47G  -
rpool/data/vm-127-disk-1@__replicate_127-0_1540418507__  1.23G      -  5.47G  -
rpool/data/vm-139-disk-1                                 8.01G  3.15T  7.48G  -
rpool/data/vm-139-disk-1@__replicate_139-0_1540631221__   545M      -  6.88G  -
rpool/data/vm-145-disk-1                                 5.85G  3.15T  5.57G  -
rpool/data/vm-145-disk-1@__replicate_145-0_1540418519__   283M      -  5.43G  -
rpool/data/vm-147-disk-1                                 5.04G  3.15T  4.86G  -
rpool/data/vm-147-disk-1@__replicate_147-0_1540418524__   181M      -  4.60G  -
rpool/swap                                               8.50G  3.16T  1.79G  -

Thanks a lot for your help!

EDIT: simply by using:
Code:
zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__
on the sender and receiver nodes?
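In case it helps anyone else: a dry run first shows what would be removed without deleting anything (my own precaution, not something dendi mentioned):
Code:
# -n = dry run, -v = verbose: print what would be destroyed, delete nothing
zfs destroy -n -v rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__
# the same command without -n actually removes the snapshot
zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__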
 
Well, on my side, I deleted the old replication snapshot from the source node and from the target, and then hit this issue:

Code:
Oct 28 16:44:02 athos zed: eid=2617 class=history_event pool_guid=0x765A2359F9A05698
Oct 28 16:44:02 athos pvesr[14216]: 102-0: got unexpected replication job error - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1540741440__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs:vm-102-disk-1 zfs - -with-snapshots 1' failed: exit code 255
Oct 28 16:44:02 athos systemd[1]: Started Proxmox VE replication runner.

Then I understood that it was because the disk image was still present on the target (the full clone), so I deleted it using
Code:
zfs destroy rpool/data/vm-102-disk-1
Then I tried again, and I'm back to the beginning: the 500 error code and the pvesr service crashing on all nodes :(
 
I tried to restore the original /etc/pve/replication.cfg but I got errors:
Code:
command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1540805341__ | /usr/bin/cstream -t 10000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
Hope this will help the staff...
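Exit code 255 is also what ssh itself returns when the connection or authentication fails, so one quick check (just a suggestion) is to run the ssh leg of the failing command by hand with the same options:
Code:
# test the SSH path exactly as the replication command uses it
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm status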
 
Hi all,

can you post your replication.cfg so we can see the replication schedules?
 
Hi all,

can you post your replication.cfg so we can see the replication schedules?

When it crashed, these were the replication jobs:

Code:
local: 101-1
    target porthos
    schedule mon..fri
    source aramis

local: 102-0
    target porthos
    schedule mon..fri
    source athos

local: 106-0
    target porthos
    schedule mon..fri
    source aramis

local: 108-0
    target porthos
    schedule mon..fri
    source athos

local: 109-0
    target porthos
    schedule mon..fri
    source athos

local: 116-0
    target porthos
    schedule mon..fri
    source aramis

local: 117-0
    target porthos
    schedule mon..fri
    source athos

local: 113-0
    target porthos
    schedule mon..fri
    source athos

local: 104-0
    target porthos
    schedule mon..fri
    source aramis

local: 103-0
    target porthos
    schedule mon..fri
    source aramis

local: 105-0
    target porthos
    schedule mon..fri
    source aramis

local: 112-0
    target porthos
    schedule mon..fri
    source athos

local: 143-0
    target porthos
    schedule mon..fri
    source aramis

local: 145-0
    target porthos
    schedule mon..fri
    source athos

local: 114-0
    target porthos
    schedule mon..fri
    source aramis

local: 115-0
    target porthos
    schedule mon..fri
    source aramis

local: 126-0
    target porthos
    schedule mon..fri
    source aramis

local: 146-0
    target porthos
    schedule mon..fri
    source aramis

local: 144-0
    target porthos
    schedule mon..fri
    source aramis

local: 118-0
    target porthos
    schedule mon..fri
    source aramis

local: 107-0
    target porthos
    schedule mon..fri
    source aramis

local: 147-0
    target porthos
    schedule mon..fri
    source athos

local: 122-0
    target porthos
    schedule mon..fri
    source athos

local: 127-0
    target porthos
    schedule mon..fri
    source athos

local: 200-0
    target porthos
    schedule mon..fri
    source athos
 
I have the same issue, also since 3 AM on Oct 28, so I agree that it looks like a time-change-related issue.
I have opened ticket FHL-759-38090 for this but have not heard any suggestions yet on how to solve it.

My setup relies heavily on working replication, and I see the "error with cfs lock 'file-replication_cfg': got lock request timeout (500)" message even on a node that is neither a replication target nor a source, but just a member of the same cluster:

Code:
# pvesr status
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
error with cfs lock 'file-replication_cfg': got lock request timeout
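Since /etc/pve/replication.cfg lives on the clustered pmxcfs, a lock timeout even on an uninvolved node makes me want to double-check general cluster health as well; a generic sketch:
Code:
# cluster quorum and membership
pvecm status
# cluster filesystem and corosync services
systemctl status pve-cluster corosync
# any pvesr run stuck and possibly holding the lock?
ps -eo pid,etime,cmd | grep '[p]vesr'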
 