[SOLVED] Replication runner broken on all hosts and VMs since update

fireon

Distinguished Member
Hello,

The problem has existed for about a week. When I click on Replication, I get a timeout on all nodes in the cluster. Here is the journal:
Code:
Okt 27 01:27:00 backup systemd[1]: Starting Proxmox VE replication runner...
Okt 27 01:27:01 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:02 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:03 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:04 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:05 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:06 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:07 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:08 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:09 backup pvesr[28719]: trying to acquire cfs lock 'file-replication_cfg' ...
Okt 27 01:27:10 backup pvesr[28719]: error with cfs lock 'file-replication_cfg': got lock request timeout
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Okt 27 01:27:10 backup systemd[1]: Failed to start Proxmox VE replication runner.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Unit entered failed state.
Okt 27 01:27:10 backup systemd[1]: pvesr.service: Failed with result 'exit-code'.
Maybe someone can help me with that?

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
pve-zsync: 1.7-1
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
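For reference, a rough sketch of how to pull the journal above and check the basics (assuming the standard unit names; nothing here is specific to my setup):
Code:
# replication runner logs for today
journalctl -u pvesr.service --since today
# current state of the service and its timer
systemctl status pvesr.service pvesr.timer
# make sure the cluster itself is quorate
pvecm status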
 
I'm having the same problem after upgrading a 3-node cluster to the latest package versions.
Multicast communication works fine, but pvesr.service is unable to start because of "error with cfs lock 'file-replication_cfg': got lock request timeout".
As a workaround, I have removed the content of the /etc/pve/replication.cfg file, and that at least seems to bring pvesr.service up.
Once you create a new replication job, though, the same error occurs ...
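A minimal sketch of that workaround, in case it helps others; I keep a backup copy first, and the file only needs to be emptied on one node, since /etc/pve is the cluster-wide filesystem:
Code:
# keep a copy of the current job definitions
cp -a /etc/pve/replication.cfg /root/replication.cfg.bak
# empty the file (one node is enough, /etc/pve is cluster-wide)
: > /etc/pve/replication.cfg
# run the replication runner once to confirm it starts cleanly again
systemctl start pvesr.service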

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
 
I'm having exactly the same issue. Is there any workaround for now that doesn't require completely rebooting the host?
 
I don't think a reboot will make any difference; I have already tried rebooting all nodes with no luck.
This looks like a bug, so let's see if an updated package fix is released in one of the upcoming days...
Proxmox staff will have to verify this first, though.
 
Just to let the Proxmox developers know, here are my package versions:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-6
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-25
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

The configuration has not changed in between. It seems to have started at the time of a scheduled replication, so a replication might have been in progress?
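One way to check whether a run is still in flight (just a guess on my part) is to look for a long-running pvesr process and its elapsed time:
Code:
# show any pvesr processes together with how long they have been running
ps -eo pid,etime,stat,cmd | grep '[p]vesr'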

Here is the syslog from two nodes showing the transition from "working" to failing:

Code:
Oct 25 23:59:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 athos systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection rate 1/60s for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max connection count 1 for (X) at Oct 25 23:50:51
Oct 25 23:59:11 athos postfix/anvil[25155]: statistics: max cache size 2 at Oct 25 23:50:51
Oct 26 00:00:00 athos systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 athos CRON[19133]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 athos CRON[19132]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:02 athos zed[19421]: eid=2551 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19563]: eid=2552 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:02 athos zed[19646]: eid=2553 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20045]: eid=2554 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:03 athos zed[20151]: eid=2555 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38194]: eid=2556 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:12 athos zed[38290]: eid=2557 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38438]: eid=2558 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:13 athos zed[38497]: eid=2559 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38584]: eid=2560 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:14 athos zed[38665]: eid=2561 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[38965]: eid=2562 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39022]: eid=2563 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39032]: eid=2564 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39107]: eid=2565 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:16 athos zed[39110]: eid=2566 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39420]: eid=2567 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:17 athos zed[39658]: eid=2568 class=history_event pool_guid=0x1E98F7F7D9A9A016
Oct 26 00:00:18 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:19 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:20 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:21 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:22 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:23 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:24 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:25 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:26 athos pvesr[19090]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 26 00:00:27 athos pvesr[19090]: error with cfs lock 'file-replication_cfg': got lock request timeout
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Oct 26 00:00:27 athos systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:00:27 athos systemd[1]: pvesr.service: Failed with result 'exit-code'.

The other node:
Code:
Oct 25 23:59:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 25 23:59:00 aramis systemd[1]: Started Proxmox VE replication runner.
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection rate 1/60s for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max connection count 1 for (X) at Oct 25 23:55:52
Oct 25 23:59:12 aramis postfix/anvil[35118]: statistics: max cache size 2 at Oct 25 23:55:52
Oct 26 00:00:00 aramis systemd[1]: Starting Proxmox VE replication runner...
Oct 26 00:00:01 aramis zed[25495]: eid=5402 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 26 00:00:01 aramis CRON[25563]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Oct 26 00:00:01 aramis CRON[25562]: pam_unix(cron:session): session closed for user root
Oct 26 00:00:01 aramis zed[25976]: eid=5403 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:01 aramis zed[26134]: eid=5404 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26096]: eid=5405 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:13 aramis zed[26113]: eid=5406 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:15 aramis zed[26415]: eid=5407 class=history_event pool_guid=0x32C807D8808E1CD9
Oct 26 00:00:34 aramis postfix/smtpd[1998]: connect from unknown[X]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: lost connection after AUTH from unknown[178.159.36.53]
Oct 26 00:00:34 aramis postfix/smtpd[1998]: disconnect from unknown[X] ehlo=1 auth=0/1 commands=1/2
Oct 26 00:00:39 aramis sshd[5048]: Connection closed by X port 59672 [preauth]
Oct 26 00:00:39 aramis sshd[5050]: Connection closed by X port 51776 [preauth]
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:00:52 aramis postfix/smtpd[1998]: connect from X
Oct 26 00:00:52 aramis postfix/smtpd[1998]: disconnect from X helo=1 quit=1 commands=2
Oct 26 00:01:16 aramis pvesr[24571]: error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Main process exited, code=exited, status=255/n/a
Oct 26 00:01:16 aramis systemd[1]: Failed to start Proxmox VE replication runner.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Unit entered failed state.
Oct 26 00:01:16 aramis systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 26 00:01:16 aramis systemd[1]: Starting Proxmox VE replication runner...

Hope this can help further.

Thanks to all.
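In the meantime, a convenient way to watch the next scheduled run live (pvesr.timer triggers it every minute; this is just for observation, not a fix):
Code:
# follow the replication runner's journal
journalctl -f -u pvesr.service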
 
Just to let the Proxmox developers know, here are my package versions:
...
Hi,
I don't think it has anything to do with the issue, but your versions show that you don't use "apt dist-upgrade", which is important on Proxmox.
"apt upgrade" isn't enough!

Udo
 
Same problem as about a year ago, I think because of the time change (daylight saving time):
https://forum.proxmox.com/threads/pvesr-status-hanging-after-upgrade-from-5-0-to-5-1.37738/

Not sure if it's because of the time change, as it appeared a few days ago.

I can confirm that the pvesr process has been using 100% of the CPU on the first node (sender), "athos", for hours, but not on the two other nodes (including the receiver node, porthos).
Is it safe to kill it?

Code:
root     14828 99.5  0.0 495372 77404 ?        Rs   10:49   0:35 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Thanks.
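If killing it is OK, my plan would be to stop the timer first so a new run isn't launched immediately; just a sketch, using the PID from the ps output above:
Code:
# keep the timer from launching another run every minute
systemctl stop pvesr.timer
# terminate the stuck runner (PID taken from the ps output above)
kill 14828
# re-enable the schedule once things look sane again
systemctl start pvesr.timer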
 
Saphirblanc, you can check how old your replication snapshots are with "zfs list -t all" and see when replication stopped working.
I "solved" it with:
Code:
cp -a /etc/pve/replication.cfg /root/
vi /etc/pve/replication.cfg  # clear it, only on one node, because /etc/pve is a cluster filesystem
systemctl stop pvesr.timer
systemctl stop pvesr
systemctl restart pvedaemon
You have to redo all your replication jobs, of course, and remember to manually delete the old replication snapshots on the source node,
then restart pvesr.timer and pvesr.
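To find the old snapshots I mean, something like this on the source node (a sketch; replication snapshots follow the __replicate_<job>_<timestamp>__ naming):
Code:
# list leftover replication snapshots with their creation time
zfs list -t snapshot -o name,creation | grep __replicate_
# once they are cleaned up, bring the runner back
systemctl start pvesr.timer
systemctl start pvesr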
 
Saphirblanc, you can check how old your replication snapshots are with "zfs list -t all" and see when replication stopped working.
...

Thanks dendi! I was indeed able to get rid of the 500 error and create more than one replication task through the GUI, which is a big step (I have not yet restarted pvesr.timer and pvesr)!
How can I delete the old replication snapshots? Sorry, I'm not that used to ZFS yet...

Code:
root@athos:/var/log# zfs list -t all
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
rpool                                                     215G  3.15T   166K  /rpool
rpool-hdd                                                 207G  3.31T   128K  /rpool-hdd
rpool-hdd/vm-108-disk-1                                  8.68G  3.31T  8.35G  -
rpool-hdd/vm-108-disk-1@__replicate_108-0_1540504800__    338M      -  8.35G  -
rpool-hdd/vm-108-disk-2                                   188M  3.31T   186M  -
rpool-hdd/vm-108-disk-2@__replicate_108-0_1540504800__   1.63M      -   186M  -
rpool-hdd/vm-108-disk-3                                  74.6K  3.31T  74.6K  -
rpool-hdd/vm-108-disk-3@__replicate_108-0_1540504800__      0B      -  74.6K  -
rpool-hdd/vm-109-disk-2                                  1.92G  3.31T  1.92G  -
rpool-hdd/vm-109-disk-2@__replicate_109-0_1540418416__    884K      -  1.92G  -
rpool-hdd/vm-109-disk-3                                  5.38G  3.31T  4.93G  -
rpool-hdd/vm-109-disk-3@__replicate_109-0_1540418416__    455M      -  4.93G  -
rpool-hdd/vm-112-disk-1                                  12.0G  3.31T  11.7G  -
rpool-hdd/vm-112-disk-1@__replicate_112-0_1540418430__    358M      -  11.2G  -
rpool-hdd/vm-112-disk-2                                   101G  3.31T  99.9G  -
rpool-hdd/vm-112-disk-2@__replicate_112-0_1540418430__   1.39G      -  97.8G  -
rpool-hdd/vm-122-disk-1                                  30.9G  3.31T  30.4G  -
rpool-hdd/vm-122-disk-1@__replicate_122-0_1540418502__    478M      -  30.4G  -
rpool-hdd/vm-200-disk-1                                  46.1G  3.31T  42.3G  -
rpool-hdd/vm-200-disk-1@__replicate_200-0_1540418530__   3.83G      -  42.3G  -
rpool/ROOT                                               3.28G  3.15T   153K  /rpool/ROOT
rpool/ROOT/pve-1                                         3.28G  3.15T  3.28G  /
rpool/data                                                203G  3.15T   153K  /rpool/data
rpool/data/vm-102-disk-1                                 6.73G  3.15T  6.00G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__  99.3M      -  5.94G  -
rpool/data/vm-102-disk-1@__replicate_102-0_1540467900__  82.5M      -  5.97G  -
rpool/data/vm-111-disk-1                                 50.5G  3.15T  50.5G  -
rpool/data/vm-113-disk-1                                 89.7G  3.15T  81.3G  -
rpool/data/vm-113-disk-1@__replicate_113-0_1540418472__  8.47G      -  81.1G  -
rpool/data/vm-117-disk-1                                 30.2G  3.15T  29.6G  -
rpool/data/vm-117-disk-1@__replicate_117-0_1540418496__   581M      -  29.6G  -
rpool/data/vm-127-disk-1                                 6.70G  3.15T  5.47G  -
rpool/data/vm-127-disk-1@__replicate_127-0_1540418507__  1.23G      -  5.47G  -
rpool/data/vm-139-disk-1                                 8.01G  3.15T  7.48G  -
rpool/data/vm-139-disk-1@__replicate_139-0_1540631221__   545M      -  6.88G  -
rpool/data/vm-145-disk-1                                 5.85G  3.15T  5.57G  -
rpool/data/vm-145-disk-1@__replicate_145-0_1540418519__   283M      -  5.43G  -
rpool/data/vm-147-disk-1                                 5.04G  3.15T  4.86G  -
rpool/data/vm-147-disk-1@__replicate_147-0_1540418524__   181M      -  4.60G  -
rpool/swap                                               8.50G  3.16T  1.79G  -

Thanks a lot for your help!

EDIT: simply by using:
Code:
zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__
on the sender and receiver nodes?
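In case it helps anyone else: a dry run first shows what would be removed without deleting anything (my own precaution, not something dendi mentioned):
Code:
# -n = dry run, -v = verbose: print what would be destroyed, delete nothing
zfs destroy -n -v rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__
# the same command without -n actually removes the snapshot
zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1540418417__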
 
Well, on my side, I deleted the old replication snapshot from the source node and from the target, and then hit this issue:

Code:
Oct 28 16:44:02 athos zed: eid=2617 class=history_event pool_guid=0x765A2359F9A05698
Oct 28 16:44:02 athos pvesr[14216]: 102-0: got unexpected replication job error - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1540741440__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs:vm-102-disk-1 zfs - -with-snapshots 1' failed: exit code 255
Oct 28 16:44:02 athos systemd[1]: Started Proxmox VE replication runner.

Then I understood that it was because the disk image was still present on the target (the full clone), so I deleted it using
Code:
zfs destroy rpool/data/vm-102-disk-1
Then I tried again, and I'm back to the beginning: the 500 error code and the pvesr service crashing on all nodes :(
 
I tried to restore the original /etc/pve/replication.cfg but I got errors:
Code:
command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1540805341__ | /usr/bin/cstream -t 10000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
Hope this will help the staff...
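Exit code 255 is also what ssh itself returns when the connection or authentication fails, so one quick check (just a suggestion) is to run the ssh leg of the failing command by hand with the same options:
Code:
# test the SSH path exactly as the replication command uses it
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.1.3 -- pvesm status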
 
Hi all,

can you post your replication.cfg so we can see the replication schedules?
 
Hi all,

can you post your replication.cfg so we can see the replication schedules?

When it crashed, these were the replication jobs:

Code:
local: 101-1
    target porthos
    schedule mon..fri
    source aramis

local: 102-0
    target porthos
    schedule mon..fri
    source athos

local: 106-0
    target porthos
    schedule mon..fri
    source aramis

local: 108-0
    target porthos
    schedule mon..fri
    source athos

local: 109-0
    target porthos
    schedule mon..fri
    source athos

local: 116-0
    target porthos
    schedule mon..fri
    source aramis

local: 117-0
    target porthos
    schedule mon..fri
    source athos

local: 113-0
    target porthos
    schedule mon..fri
    source athos

local: 104-0
    target porthos
    schedule mon..fri
    source aramis

local: 103-0
    target porthos
    schedule mon..fri
    source aramis

local: 105-0
    target porthos
    schedule mon..fri
    source aramis

local: 112-0
    target porthos
    schedule mon..fri
    source athos

local: 143-0
    target porthos
    schedule mon..fri
    source aramis

local: 145-0
    target porthos
    schedule mon..fri
    source athos

local: 114-0
    target porthos
    schedule mon..fri
    source aramis

local: 115-0
    target porthos
    schedule mon..fri
    source aramis

local: 126-0
    target porthos
    schedule mon..fri
    source aramis

local: 146-0
    target porthos
    schedule mon..fri
    source aramis

local: 144-0
    target porthos
    schedule mon..fri
    source aramis

local: 118-0
    target porthos
    schedule mon..fri
    source aramis

local: 107-0
    target porthos
    schedule mon..fri
    source aramis

local: 147-0
    target porthos
    schedule mon..fri
    source athos

local: 122-0
    target porthos
    schedule mon..fri
    source athos

local: 127-0
    target porthos
    schedule mon..fri
    source athos

local: 200-0
    target porthos
    schedule mon..fri
    source athos
 
I have the same issue, also since 3 AM on Oct 28, so I agree that it looks like a time-change-related issue.
I have opened ticket FHL-759-38090 for this but have not heard any suggestions yet on how to solve it.

My setup relies heavily on working replication, and I see the "error with cfs lock 'file-replication_cfg': got lock request timeout (500)" message even on a node that is neither a replication target nor a source, but just a member of the same cluster:

Code:
# pvesr status
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
trying to acquire cfs lock 'file-replication_cfg' ...
error with cfs lock 'file-replication_cfg': got lock request timeout
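Since /etc/pve/replication.cfg lives on the clustered pmxcfs, a lock timeout even on an uninvolved node makes me want to double-check general cluster health as well; a generic sketch:
Code:
# cluster quorum and membership
pvecm status
# cluster filesystem and corosync services
systemctl status pve-cluster corosync
# any pvesr run stuck and possibly holding the lock?
ps -eo pid,etime,cmd | grep '[p]vesr'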
 