[SOLVED] Replication runner broken on all hosts and VMs since update

As requested by Proxmox Support:

Code:
root@carrier:/etc/pve/priv/lock# rmdir file-replication_cfg
root@carrier:/etc/pve/priv/lock# pvesr run --verbose
trying to acquire lock...
OK
error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
root@carrier:/etc/pve/priv/lock#


I think there is a serious bug somewhere. No replication for three days now.
I am getting worried.
 
So it looks like the problem is related to the timezone / DST change.

Thank you Wolfgang Bumiller for that hint in the Bugzilla ticket - it works for me. Setting the timezone to UTC on all hosts in the cluster and restarting pvesr.timer / pvesr.service seems to be a workaround for my cluster until a fix (hopefully!) appears.
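A minimal sketch of that workaround on one host, assuming timedatectl is available, would be something like:

Code:
# set the host timezone to UTC (work-around for the DST-related lock issue)
timedatectl set-timezone UTC
# restart the replication timer and runner so they pick up the change
systemctl restart pvesr.timer pvesr.service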

I am disappointed with Proxmox Support, however. I had opened a ticket, provided all the required information and even pointed to this thread and the Bugzilla ticket, but was not offered any useful help. That's a shame, as we pay for support to get quick help when something like this happens.
 
Update: it works now, but I had to delete all my replicated data and recreate the config file. Strange.
 
So it looks like the problem is related to the timezone / DST change.

Thank you Wolfgang Bumiller for that hint in the Bugzilla ticket - it works for me. Setting the timezone to UTC on all hosts in the cluster and restarting pvesr.timer / pvesr.service seems to be a workaround for my cluster until a fix (hopefully!) appears.

I am disappointed with Proxmox Support, however. I had opened a ticket, provided all the required information and even pointed to this thread and the Bugzilla ticket, but was not offered any useful help. That's a shame, as we pay for support to get quick help when something like this happens.

Based on the reports from all channels we found the issue and have already fixed it - within two days. Please note that the whole team works together on these topics; issues from subscribers like you are the most important and get priority here.

See also: https://git.proxmox.com/?p=pve-common.git;a=summary
 
Hi everybody,

I just updated my servers and I have the same problem as mentioned in this thread:

[Attached screenshot: Prob-replication_proxmox.png]

Code:
root@monserveur:~# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2018-11-02 14:55:10 CET; 11s ago
  Process: 734356 ExecStart=/usr/bin/pvesr run --mail 1 (code=exited, status=17)
 Main PID: 734356 (code=exited, status=17)
      CPU: 870ms

Nov 02 14:55:05 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:06 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:07 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:08 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:09 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:10 atlas pvesr[734356]: error with cfs lock 'file-replication_cfg': got lock request timeout
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Nov 02 14:55:10 atlas systemd[1]: Failed to start Proxmox VE replication runner.
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Unit entered failed state.
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Failed with result 'exit-code'.
root@atlas:~# systemctl stop pvesr.service
Warning: Stopping pvesr.service, but it can still be activated by:
  pvesr.timer
root@monserveur:~#
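As the warning above shows, pvesr.service is started again by pvesr.timer; while troubleshooting, the timer itself can be stopped as well (a minimal sketch):

Code:
# stop the timer so it does not keep re-launching the failing runner
systemctl stop pvesr.timer
# ...investigate...
# re-enable the scheduled replication runs afterwards
systemctl start pvesr.timer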

See my configuration:

Code:
root@monserveur:/etc/pve/priv/lock# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
root@monserveur:/etc/pve/priv/lock#

I tried to change the timezone to UTC, but I can't: UTC is not an option in my timezone list :(
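Even if the GUI does not offer UTC, it can usually be set from the shell; a sketch, assuming timedatectl is available:

Code:
# confirm UTC is a known zone, then set it on this node
timedatectl list-timezones | grep -x UTC
timedatectl set-timezone UTC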

I also tried to delete all replication snapshots on all nodes:

Code:
root@monserveur:~# zfs list -t snapshot
NAME                                                         USED  AVAIL  REFER  MOUNTPOINT
cadzfs/CT/subvol-101-disk-1@__replicate_101-0_1540681681__   264K      -  2.96G  -
cadzfs/VM/vm-100-disk-1@__replicate_100-0_1541166743__         0B      -  5.22G  -
cadzfs/VM/vm-103-disk-1@__replicate_103-0_1541167809__         0B      -  6.58G  -
cadzfs/VM/vm-104-disk-1@__replicate_104-0_1540603740__       889M      -  34.9G  -
cadzfs/VM/vm-106-disk-1@__replicate_106-0_1540603800__      44.6M      -  3.73G  -
cadzfs/VM/vm-107-disk-1@__replicate_107-0_1540675441__         0B      -  47.5G  -
cadzfs/VM/vm-108-disk-2@__replicate_108-0_1540674661__         0B      -  8.59G  -
cadzfs/VM/vm-112-disk-1@__replicate_112-0_1540600200__       405M      -  5.71G  -
root@zeus:~# zfs destroy cadzfs/CT/subvol-101-disk-1@__replicate_101-0_1540681681__
root@zeus:~# zfs destroy cadzfs/VM/vm-100-disk-1@__replicate_100-0_1541166743__
root@zeus:~# zfs destroy cadzfs/VM/vm-103-disk-1@__replicate_103-0_1541167809__
root@zeus:~# zfs destroy cadzfs/VM/vm-104-disk-1@__replicate_104-0_1540603740__
root@zeus:~# zfs destroy cadzfs/VM/vm-106-disk-1@__replicate_106-0_1540603800__
root@zeus:~# zfs destroy cadzfs/VM/vm-107-disk-1@__replicate_107-0_1540675441__
root@zeus:~# zfs destroy cadzfs/VM/vm-108-disk-2@__replicate_108-0_1540674661__
root@zeus:~# zfs destroy cadzfs/VM/vm-112-disk-1@__replicate_112-0_1540600200__
root@zeus:~# zfs list -t snapshot
no datasets available
root@monserveur:~#
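With many snapshots, a one-liner along these lines removes all __replicate_ snapshots at once (review the list produced by the first command before running the second):

Code:
# list all replication snapshots, then destroy them one by one
zfs list -H -o name -t snapshot | grep '__replicate_'
zfs list -H -o name -t snapshot | grep '__replicate_' | xargs -n 1 zfs destroy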

and I deleted the lock directory on all nodes:

Code:
root@monserveur:/etc/pve/priv/lock# rm -r file-replication_cfg/

I also tried to delete all replication configurations in the GUI, but I get the same error: "trying to acquire cfs lock 'file-replication_cfg'".

How can I solve my problem, please?
Many thanks!
 
Re,

So, I tried to:
  • Edit "/etc/pve/replication.cfg"
  • Make sure "/etc/pve/priv/lock/file-replication_cfg" is deleted on all nodes
  • Restart "pvesr.service":
    Code:
    systemctl restart pvesr.service

    and now the service is OK:
Code:
root@monserveur:/etc/pve/priv/lock# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: inactive (dead) since Fri 2018-11-02 16:02:01 CET; 7s ago
  Process: 985908 ExecStart=/usr/bin/pvesr run --mail 1 (code=exited, status=0/SUCCESS)
 Main PID: 985908 (code=exited, status=0/SUCCESS)
      CPU: 829ms

Nov 02 16:02:00 monserveur systemd[1]: Starting Proxmox VE replication runner...
Nov 02 16:02:01 monserveur systemd[1]: Started Proxmox VE replication runner.
root@monserveur:/etc/pve/priv/lock#

But there are a lot of errors in the GUI for all VM and CT replications!

Code:
...
...
2018-11-02 16:04:02 104-0: warning: cannot send 'cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__': Broken pipe
2018-11-02 16:04:02 104-0: cannot send 'cadzfs/VM/vm-104-disk-1': I/O error
2018-11-02 16:04:02 104-0: command 'zfs send -Rpv -- cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__' failed: exit code 1
....
....

Any ideas, please?

Many thanks
 
Re,

So, I tried to:
  • Edit "/etc/pve/replication.cfg"
  • Make sure "/etc/pve/priv/lock/file-replication_cfg" is deleted on all nodes
  • Restart "pvesr.service":
    Code:
    systemctl restart pvesr.service

    and now the service is OK.

But there are a lot of errors in the GUI for all VM and CT replications!

Code:
...
...
2018-11-02 16:04:02 104-0: warning: cannot send 'cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__': Broken pipe
2018-11-02 16:04:02 104-0: cannot send 'cadzfs/VM/vm-104-disk-1': I/O error
2018-11-02 16:04:02 104-0: command 'zfs send -Rpv -- cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__' failed: exit code 1
....
....

Any ideas, please?

Many thanks

Sounds like the state of your replication configuration and the state of the actual replications are no longer in sync because of the editing of the configuration files, in particular /etc/pve/replication.cfg. If you have a backup of this file, you may want to restore it and see what happens.

If restoring the config doesn't resolve the issue, then in most cases like this you have to delete the broken replication in the GUI and set it up again. If PVE fails to delete the replicated ZFS filesystem(s) on the replication target, you may have to delete them there (not on the replication source!) by hand, using
Code:
zfs destroy -r rpool/data/subvol-xxx-disk-x
(please use the right filesystem name; the one above is an example for a container filesystem on the rpool pool). Remove all related filesystems on the target, then re-create the replication on the source. It will recreate the filesystems on the target system and copy all the data, which may take a long time for big filesystems.
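Before destroying anything on the target, a quick way to double-check which datasets and snapshots were created by replication is a sketch like the following (rpool/data is just an assumed default path - adjust it to your layout):

Code:
# on the replication TARGET only: review the candidates before removing anything
zfs list -r rpool/data
zfs list -t snapshot | grep '__replicate_'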
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.
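Afterwards, the result can be verified per node, for example with:

Code:
# check the state of all replication jobs on this node
pvesr status
systemctl status pvesr.timer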
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.
Well, Wolfgang Bumiller from Proxmox recommended changing the timezone, and for me it fixed the problem without messing with configuration files and without losing my existing replications.
 
Well, Wolfgang Bumiller from Proxmox recommended changing the timezone, and for me it fixed the problem without messing with configuration files and without losing my existing replications.

Yes, but you need to make sure that all other services which depend on the timezone are still configured correctly, too.
For example, if you have a MySQL server running on a node for any reason.
 
Re,

OK, I solved my problem. On each node it is not only necessary to delete the snapshots, but also the disks of the machines that were replicated to it!
Code:
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-1
cadzfs/VM/vm-103-disk-1  cadzfs/VM/vm-104-disk-1  cadzfs/VM/vm-105-disk-1  cadzfs/VM/vm-106-disk-1  cadzfs/VM/vm-107-disk-1  cadzfs/VM/vm-108-disk-2  cadzfs/VM/vm-109-disk-1  cadzfs/VM/vm-112-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-104-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-106-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-109-disk-1
root@monserveur:~#
BE CAREFUL not to delete the disks of VMs or CTs that actually belong to that node, whether they are running or stopped!
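To double-check which guests actually belong to the node being cleaned up (their disks must stay untouched), the local guest lists can be consulted first, e.g.:

Code:
# guests configured on THIS node - do not destroy their disks
qm list
pct list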

After cleaning up all the nodes by removing these formerly synchronized disks, I was able to restart replication on each node.

Thanks
 
Hi,

It seems there are new updates in the pve-enterprise repository. Do they solve the issue?

Thanks
 
As far as I can see there was no update to libpve-common-perl 5.0, so no, not yet.
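The installed version can be checked on each node, for example with:

Code:
apt-cache policy libpve-common-perl
# or look at what pveversion reports
pveversion -v | grep libpve-common-perl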
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.

worked for me, thanks!
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.

It works!!!!! ;);)
 
