[SOLVED] Replication runner on all hosts and VMs broken since update

fireon · Oct 27, 2018

Hello,

The problem has existed for about a week. If I click on Replication, I get a timeout on all nodes in the cluster. Here is the journal:

Code:

Okt 27 01:27:00 backup systemd[1]: Starting Proxmox VE replication runner...
Okt 27 01:27:01 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:02 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:03 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:04 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:05 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:06 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:07 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:08 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:09 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:10 backup pvesr[28719]: error with cfs lock 'file-replication_cfg': got lock request timeout
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Okt 27 01:27:10 backup systemd[1]: Failed to start Proxmox VE replication runner.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Unit entered failed state.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Failed with result 'exit-code'.

Maybe someone can help me with that?

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
pve-zsync: 1.7-1
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

acidrop · Oct 27, 2018

I'm having the same problem after upgrading a 3 node cluster to the latest package versions.
Multicast communication works fine, but pvesr.service is unable to start because of "error with cfs lock 'file-replication_cfg': got lock request timeout".
As a workaround, I have removed the content of /etc/pve/replication.cfg file and that at least seems to bring pvesr.service up.
Once you create a new replication job though, same error occurs ...

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

saphirblanc · Oct 27, 2018

Having exactly the same issue. Is there any workaround for now that doesn't require completely rebooting the host?

acidrop · Oct 27, 2018

Don't think that the reboot will make any difference, I have already tried rebooting all nodes with no luck.
This most likely looks like a bug, so let's see if an updated package fix will be released in one of the upcoming days...
Proxmox staff has to verify this first though..

fireon · Oct 27, 2018

saphirblanc said:
Having exactly the same issue. Is there any workaround for now that doesn't require completely rebooting the host?

Rebooted too, didn't help.

saphirblanc · Oct 27, 2018

Just to let the Proxmox developers know, here are my package versions:

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-6
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-25
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

The configuration has not changed in between. The problem seems to have started at a scheduled replication time, so a replication might have been in progress?

Here is the syslog of two nodes showing the transition from "working" to "not working":

Code:

Oct 25 23:59:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 athos systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection rate 1/60s for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection count 1 for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max cache size 2 at Oct 25 23:50:51
Oct 26 00:00:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 athos CRON[19133]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:02 athos zed[19421]: eid=2551 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19563]: eid=2552 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19646]: eid=2553 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20045]: eid=2554 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20151]: eid=2555 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38194]: eid=2556 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38290]: eid=2557 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38438]: eid=2558 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38497]: eid=2559 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38584]: eid=2560 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38665]: eid=2561 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[38965]: eid=2562 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39022]: eid=2563 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39032]: eid=2564 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39107]: eid=2565 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39110]: eid=2566 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39420]: eid=2567 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39658]: eid=2568 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:18 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:19 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:20 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:21 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:22 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:23 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:24 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:25 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:26 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:27 athos pvesr[19090]: error with cfs lock 'file-replication_cfg': got lock request timeout
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Oct 26 00:00:27 athos systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Failed with result 'exit-code'.

The other node:

Code:

Oct 25 23:59:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 aramis systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection rate 1/60s for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection count 1 for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max cache size 2 at Oct 25 23:55:52
Oct 26 00:00:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 aramis zed[25495]: eid=5402 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 aramis CRON[25563]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:01 aramis zed[25976]: eid=5403 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis zed[26134]: eid=5404 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26096]: eid=5405 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26113]: eid=5406 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:15 aramis zed[26415]: eid=5407 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:34 aramis postfix/smtpd[1998]: connect from unknown[X]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: lost connection after AUTH from unknown[178.159.36.53]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: disconnect from unknown[X] ehlo=1 auth=0/1 commands=1/2
Oct 26 00:00:39 aramis sshd[5048]: Connection closed by X port 59672 [preauth]
Oct 26 00:00:39 aramis sshd[5050]: Connection closed by X port 51776 [preauth]
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:01:16 aramis pvesr[24571]: error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Main process exited, code=exited, status=255/n/a
Oct 26 00:01:16 aramis systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 26 00:01:16 aramis systemd[1]: Starting Proxmox VE replication runner...

Hope this can help further.

Thanks to all.

udo · Oct 28, 2018

saphirblanc said:

Just to let the Proxmox developers know, here are my package versions:

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-6
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-25
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

...

Hi,
I don't think it has anything to do with the issue, but your versions show that you don't use "apt dist-upgrade", which is important on Proxmox.
"apt upgrade" isn't enough!

Udo
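One way to spot packages that lag behind on a node is to compare two saved package lists. The following is a minimal sketch, assuming each file holds `name: version` lines in the format shown in the listings above (the kind of output `pveversion -v` prints). The function name `compare_versions` and the file paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every package whose version differs between two
# saved package lists. Both files are assumed to contain "name: version"
# lines, as in the listings posted in this thread.
compare_versions() {
    awk '
        {
            i = index($0, ": ")          # split at the first ": " only
            if (i == 0) next             # skip lines without a version
            name = substr($0, 1, i - 1)
            ver  = substr($0, i + 2)
        }
        NR == FNR { old[name] = ver; next }   # first file: remember versions
        (name in old) && old[name] != ver {
            printf "%s: %s -> %s\n", name, old[name], ver
        }
    ' "$1" "$2"
}

# Usage (not run here), after saving the list on each node:
#   pveversion -v > /root/versions-nodeA.txt    # on node A
#   pveversion -v > /root/versions-nodeB.txt    # on node B
#   compare_versions /root/versions-nodeA.txt /root/versions-nodeB.txt
```

Packages that appear in only one of the two lists are ignored by this sketch; only version mismatches are reported.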

fireon · Oct 28, 2018

Code:

pveupdate
pveupgrade

dendi · Oct 28, 2018

Same problem as a year ago; I think it is caused by the time change (daylight saving time).
https://forum.proxmox.com/threads/pvesr-status-hanging-after-upgrade-from-5-0-to-5-1.37738/

saphirblanc · Oct 28, 2018

dendi said:
Same problem as a year ago; I think it is caused by the time change (daylight saving time).
https://forum.proxmox.com/threads/pvesr-status-hanging-after-upgrade-from-5-0-to-5-1.37738/

Not sure if it's because of the time change, as it appeared a few days ago.

I can confirm that the pvesr process has been using 100% of the CPU on the first node (sender) "athos" for hours, but not on the two other nodes (including the receiver node, porthos).
Is it safe to kill it?

Code:

root     14828 99.5  0.0 495372 77404 ?        Rs   10:49   0:35 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Thanks.
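To check the other nodes for the same symptom, the process list can be filtered for busy pvesr processes. This is only a sketch: `busy_pvesr` is a hypothetical helper that reads `ps -eo pid,pcpu,etime,args` output on stdin, and the default threshold of 90% is an arbitrary assumption. It reports processes and does not kill anything.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -eo pid,pcpu,etime,args" output on stdin and
# print PID, CPU percentage and elapsed time of every pvesr process at or
# above the given CPU threshold (default 90, an arbitrary choice).
busy_pvesr() {
    awk -v min="${1:-90}" '
        # The header line is skipped because "%CPU" + 0 evaluates to 0.
        $2 + 0 >= min && /\/usr\/bin\/pvesr / { print $1, $2, $3 }
    '
}

# Usage on a real node (not run here):
#   ps -eo pid,pcpu,etime,args | busy_pvesr 90
```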

dendi · Oct 28, 2018

Saphirblanc, you can check how old your replica's snapshot is with "zfs list -t all" and see when it stopped working.
I "solved" with:

Code:

cp -a /etc/pve/replication.cfg /root/
vi /etc/pve/replication.cfg #clear it, only on one node because it's in a cluster fs
systemctl stop pvesr.timer
systemctl stop pvesr
systemctl restart pvedaemon

You have to redo all your replicas, of course, and remember to manually delete the old replica's snapshot on the source node
and restart pvesr.timer and pvesr.

saphirblanc · Oct 28, 2018

dendi said:
Saphirblanc, you can check how old your replica's snapshot is with "zfs list -t all" and see when it stopped working.
I "solved" with:

Code:

cp -a /etc/pve/replication.cfg /root/
vi /etc/pve/replication.cfg #clear it, only on one node because it's in a cluster fs
systemctl stop pvesr.timer
systemctl stop pvesr
systemctl restart pvedaemon

You have to redo all your replicas, of course, and remember to manually delete the old replica's snapshot on the source node
and restart pvesr.timer and pvesr.

Thanks, dendi! I was indeed able to get rid of the error 500 and create more than one replication task through the GUI, which is a big step (I have not yet restarted pvesr.timer and pvesr)!
How can I delete the old replica's snapshot? Sorry, I'm not so used to ZFS yet...

Code:

root@athos:/var/log# zfs list -t all
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
rpool                                                     215G  3.15T   166K  /rpool
rpool-hdd                                                 207G  3.31T   128K  /rpool-hdd
rpool-hdd/vm-108-disk-1                                  8.68G  3.31T  8.35G  -
rpool-hdd/vm-108-disk-1@__replicate_108-0_1540504800__    338M      -  8.35G  -
rpool-hdd/vm-108-disk-2                                   188M  3.31T   186M  -
rpool-hdd/vm-108-disk-2@__replicate_108-0_1540504800__   1.63M      -   186M  -
rpool-hdd/vm-108-disk-3                                  74.6K  3.31T  74.6K  -
rpool-hdd/vm-108-disk-3@__replicate_108-0_1540504800__      0B      -  74.6K  -
rpool-hdd/vm-109-disk-2                                  1.92G  3.31T  1.92G  -
rpool-hdd/vm-109-disk-2@__replicate_109-0_1540418416__    884K      -  1.92G  -
rpool-hdd/vm-109-disk-3                                  5.38G  3.31T  4.93G  -
rpool-hdd/vm-109-disk-3@__replicate_109-0_1540418416__    455M      -  4.93G  -
rpool-hdd/vm-112-disk-1                                  12.0G  3.31T  11.7G  -
rpool-hdd/vm-112-disk-1@__replicate_112-0_1540418430__    358M      -  11.2G  -
rpool-hdd/vm-112-disk-2                                   101G  3.31T  99.9G  -
rpool-hdd/vm-112-disk-2@__replicate_112-0_1540418430__   1.39G      -  97.8G  -
rpool-hdd/vm-122-disk-1                                  30.9G  3.31T  30.4G  -
rpool-hdd/vm-122-disk-1@__replicate_122-0_1540418502__    478M      -  30.4G  -
rpool-hdd/vm-200-disk-1                                  46.1G  3.31T  42.3G  -
rpool-hdd/vm-200-disk-1@__replicate_200-0_1540418530__   3.83G      -  42.3G  -
rpool/ROOT                                               3.28G  3.15T   153K  /rpool/ROOT
rpool/ROOT/pve-1                                         3.28G  3.15T  3.28G  /
rpool/data                                                203G  3.15T   153K  /rpool/data
rpool/data/vm-102-disk-1                                 6.73G  3.15T  6.00G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__  99.3M      -  5.94G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540467900__  82.5M      -  5.97G  -
rpool/data/vm-111-disk-1                                 50.5G  3.15T  50.5G  -
rpool/data/vm-113-disk-1                                 89.7G  3.15T  81.3G  -
rpool/data/vm-113-disk-1@__replicate_113-0_1540418472__  8.47G      -  81.1G  -
rpool/data/vm-117-disk-1                                 30.2G  3.15T  29.6G  -
rpool/data/vm-117-disk-1@__replicate_117-0_1540418496__   581M      -  29.6G  -
rpool/data/vm-127-disk-1                                 6.70G  3.15T  5.47G  -
rpool/data/vm-127-disk-1@__replicate_127-0_1540418507__  1.23G      -  5.47G  -
rpool/data/vm-139-disk-1                                 8.01G  3.15T  7.48G  -
rpool/data/vm-139-disk-1@__replicate_139-0_1540631221__   545M      -  6.88G  -
rpool/data/vm-145-disk-1                                 5.85G  3.15T  5.57G  -
rpool/data/vm-145-disk-1@__replicate_145-0_1540418519__   283M      -  5.43G  -
rpool/data/vm-147-disk-1                                 5.04G  3.15T  4.86G  -
rpool/data/vm-147-disk-1@__replicate_147-0_1540418524__   181M      -  4.60G  -
rpool/swap                                               8.50G  3.16T  1.79G  -

Thanks for your big help!
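The trailing number in each `__replicate_<job>_<epoch>__` snapshot name appears to be a Unix timestamp. That is an assumption based on the values in the listing above: 1540504800, for example, is 2018-10-25 22:00:00 UTC. Under that assumption, a small sketch can turn the names into readable dates to show when each job last replicated. `snapshot_times` is a hypothetical helper name, and the `date -d @...` form requires GNU date.

```shell
#!/bin/sh
# Hypothetical helper: read ZFS dataset/snapshot names on stdin and print the
# UTC time encoded in each __replicate_<job>_<epoch>__ suffix.
# Assumes the trailing number is a Unix timestamp and that GNU date is used.
snapshot_times() {
    while IFS= read -r name; do
        case "$name" in
            *@__replicate_*_*__)
                epoch=${name%__}      # strip the trailing "__"
                epoch=${epoch##*_}    # keep what follows the last "_"
                printf '%s  %s\n' \
                    "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$name"
                ;;
        esac
    done
}

# Usage on a real node (not run here):
#   zfs list -H -t snapshot -o name | snapshot_times
```

Lines that are not replication snapshots (plain datasets and volumes) are skipped.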

EDIT: simply by using

Code:

zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__

on the sender and receiver nodes?
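When `zfs destroy` is given a `dataset@snapshot` argument it removes only that snapshot, not the disk itself. Since a typo here can be costly, one cautious approach is to print the candidate commands first and review them before running anything. The sketch below does only that. `print_destroy_cmds` is a hypothetical helper, and matching on `@__replicate_<vmid>-` is an assumption taken from the snapshot names in the listing above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: read dataset/snapshot names on stdin and print
# (not execute) a "zfs destroy" command for every replication snapshot that
# belongs to the given guest ID. Nothing is deleted by this function.
print_destroy_cmds() {
    vmid=$1
    while IFS= read -r name; do
        case "$name" in
            *@__replicate_"$vmid"-*) printf 'zfs destroy %s\n' "$name" ;;
        esac
    done
}

# Usage on a real node (not run here); review the output before running it:
#   zfs list -H -t snapshot -o name | print_destroy_cmds 102
```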

saphirblanc · Oct 28, 2018

Well, on my side, I deleted the old replica from the node and the target, and then got this error:

Code:

Oct 28 16:44:02 athos zed: eid=2617 class=history_event pool_guid=0x765A2359F9A05698
Oct 28 16:44:02 athos pvesr[14216]: 102-0: got unexpected replication job error - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1540741440__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs:vm-102-disk-1 zfs - -with-snapshots 1' failed: exit code 255
Oct 28 16:44:02 athos systemd[1]: Started Proxmox VE replication runner.

Then I understood that it was because the disk image (the full clone) was still present on the target, so I deleted it using:

Code:

zfs destroy rpool/data/vm-102-disk-1

Then I tried again, and I'm back where I started: the 500 error code, and the pvesr service failing on all nodes.

fireon · Oct 28, 2018

saphirblanc said:
I can confirm that the pvesr process has been using 100% of the CPU on the first node (sender) "athos" for hours, but not on the two other nodes (including the receiver node, porthos).

Yes, same here.

dendi · Oct 29, 2018

I tried to restore the original /etc/pve/replication.cfg but I got errors:

Code:

command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1540805341__ | /usr/bin/cstream -t 10000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255

Hope this will help the staff...

fireon · Oct 29, 2018

dendi said:
Hope this will help the staff...

I hope so too.

acidrop · Oct 29, 2018

There's an open bug in Bugzilla; please post your info there to help narrow down this issue.

Thanks

wolfgang · Oct 30, 2018

Hi all,

can you send the replication.cfg so we can see the replication schedules?

saphirblanc · Oct 30, 2018

wolfgang said:
Hi all,

can you send the replication.cfg so we can see the replication schedules?

Here are the replication jobs that were configured when it crashed:

Code:

local: 101-1
    target porthos
    schedule mon..fri
    source aramis

local: 102-0
    target porthos
    schedule mon..fri
    source athos

local: 106-0
    target porthos
    schedule mon..fri
    source aramis

local: 108-0
    target porthos
    schedule mon..fri
    source athos

local: 109-0
    target porthos
    schedule mon..fri
    source athos

local: 116-0
    target porthos
    schedule mon..fri
    source aramis

local: 117-0
    target porthos
    schedule mon..fri
    source athos

local: 113-0
    target porthos
    schedule mon..fri
    source athos

local: 104-0
    target porthos
    schedule mon..fri
    source aramis

local: 103-0
    target porthos
    schedule mon..fri
    source aramis

local: 105-0
    target porthos
    schedule mon..fri
    source aramis

local: 112-0
    target porthos
    schedule mon..fri
    source athos

local: 143-0
    target porthos
    schedule mon..fri
    source aramis

local: 145-0
    target porthos
    schedule mon..fri
    source athos

local: 114-0
    target porthos
    schedule mon..fri
    source aramis

local: 115-0
    target porthos
    schedule mon..fri
    source aramis

local: 126-0
    target porthos
    schedule mon..fri
    source aramis

local: 146-0
    target porthos
    schedule mon..fri
    source aramis

local: 144-0
    target porthos
    schedule mon..fri
    source aramis

local: 118-0
    target porthos
    schedule mon..fri
    source aramis

local: 107-0
    target porthos
    schedule mon..fri
    source aramis

local: 147-0
    target porthos
    schedule mon..fri
    source athos

local: 122-0
    target porthos
    schedule mon..fri
    source athos

local: 127-0
    target porthos
    schedule mon..fri
    source athos

local: 200-0
    target porthos
    schedule mon..fri
    source athos
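All 25 jobs in the listing above share the same target (porthos) and the same `mon..fri` schedule. If `mon..fri` without a time means 00:00 on each weekday, as the Proxmox calendar-event documentation appears to describe, all jobs become due at the same moment. That would be consistent with the 00:00:00 start seen in the syslog earlier in the thread. A quick way to get such an overview from a replication.cfg is sketched below. `summarize_jobs` is a hypothetical helper, and it assumes the section layout shown above (an unindented `type: id` header followed by indented `key value` lines).

```shell
#!/bin/sh
# Hypothetical helper: summarize a replication.cfg read from stdin as
# "<source> <schedule> <job count>", one line per combination.
# Assumes unindented "type: id" headers followed by indented "key value" lines.
summarize_jobs() {
    awk '
        function record_job() { if (job != "") count[src " " sched]++ }
        /^[^ \t]/ && $1 ~ /:$/ { record_job(); job = $2; src = ""; sched = "" }
        $1 == "schedule" { sched = $0; sub(/^[ \t]*schedule[ \t]+/, "", sched) }
        $1 == "source"   { src = $2 }
        END { record_job(); for (k in count) print k, count[k] }
    ' | sort
}

# Usage on a real node (not run here):
#   summarize_jobs < /etc/pve/replication.cfg
```

Grouping by source and schedule makes it easy to see how many jobs a single node has to start at once.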

rholighaus · Oct 30, 2018

I have also had the same issue since 3 am on Oct 28, so I agree that it looks like a time-change-related issue.
I have opened ticket FHL-759-38090 for this issue but have not received any suggestions on how to solve it.

My setup relies heavily on working replication, and I see the "error with cfs lock 'file-replication_cfg': got lock request timeout (500)" message even on a node that is neither a replication target nor a source, but is a member of the same cluster:

# pvesr status
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
error with cfs lock 'file-replication_cfg': got lock request timeout
