[SOLVED] Replication runner broken on all hosts and VMs since update

As requested by Proxmox Support:

Code:
root@carrier:/etc/pve/priv/lock# rmdir file-replication_cfg
root@carrier:/etc/pve/priv/lock# pvesr run --verbose
trying to acquire lock...
OK
error with cfs lock 'file-replication_cfg': got lock timeout - aborting command
root@carrier:/etc/pve/priv/lock#


I think there is a serious bug somewhere. No replication for three days now.
I am getting worried.
 
So it looks like the problem is related to the timezone / DST change.

Thank you Wolfgang Bumiller for that hint in the Bugzilla ticket - it works for me. Setting the timezone to UTC on all hosts in the cluster and restarting pvesr.timer / pvesr.service seems to be a workaround for my cluster until a fix (hopefully!) appears.
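A minimal sketch of that workaround on one host, assuming timedatectl is available, would be something like:

Code:
# set the host timezone to UTC (work-around for the DST-related lock issue)
timedatectl set-timezone UTC
# restart the replication timer and runner so they pick up the change
systemctl restart pvesr.timer pvesr.service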

I am disappointed with Proxmox Support, however. I had opened a ticket, provided all the required information and even pointed to this thread and the Bugzilla ticket, but was not offered any useful help. That's a shame, as we pay for support to get quick help when something like this happens.
 
Update: it works now, but I had to delete all my replicated data and recreate the config file. Strange.
 
So it looks like the problem is related to the timezone / DST change.

Thank you Wolfgang Bumiller for that hint in the Bugzilla ticket - it works for me. Setting the timezone to UTC on all hosts in the cluster and restarting pvesr.timer / pvesr.service seems to be a workaround for my cluster until a fix (hopefully!) appears.

I am disappointed with Proxmox Support, however. I had opened a ticket, provided all the required information and even pointed to this thread and the Bugzilla ticket, but was not offered any useful help. That's a shame, as we pay for support to get quick help when something like this happens.

Based on the reports from all channels we found the issue and have already fixed it - within two days. Please note that the whole team works together on these topics; issues from subscribers like you are the most important and get priority here.

See also: https://git.proxmox.com/?p=pve-common.git;a=summary
 
Hi everybody,

I just updated my servers and I have the same problem as mentioned in this thread:

[Attached screenshot: Prob-replication_proxmox.png]

Code:
root@monserveur:~# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2018-11-02 14:55:10 CET; 11s ago
  Process: 734356 ExecStart=/usr/bin/pvesr run --mail 1 (code=exited, status=17)
 Main PID: 734356 (code=exited, status=17)
      CPU: 870ms

Nov 02 14:55:05 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:06 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:07 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:08 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:09 atlas pvesr[734356]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 02 14:55:10 atlas pvesr[734356]: error with cfs lock 'file-replication_cfg': got lock request timeout
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
Nov 02 14:55:10 atlas systemd[1]: Failed to start Proxmox VE replication runner.
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Unit entered failed state.
Nov 02 14:55:10 atlas systemd[1]: pvesr.service: Failed with result 'exit-code'.
root@atlas:~# systemctl stop pvesr.service
Warning: Stopping pvesr.service, but it can still be activated by:
  pvesr.timer
root@monserveur:~#
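As the warning above shows, pvesr.service is started again by pvesr.timer; while troubleshooting, the timer itself can be stopped as well (a minimal sketch):

Code:
# stop the timer so it does not keep re-launching the failing runner
systemctl stop pvesr.timer
# ...investigate...
# re-enable the scheduled replication runs afterwards
systemctl start pvesr.timer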

See my configuration:

Code:
root@monserveur:/etc/pve/priv/lock# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
root@monserveur:/etc/pve/priv/lock#

I tried to change the timezone to UTC, but I can't: UTC is not an option in my timezone list :(
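Even if the GUI does not offer UTC, it can usually be set from the shell; a sketch, assuming timedatectl is available:

Code:
# confirm UTC is a known zone, then set it on this node
timedatectl list-timezones | grep -x UTC
timedatectl set-timezone UTC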

I also tried to delete all replication snapshots on all nodes:

Code:
root@monserveur:~# zfs list -t snapshot
NAME                                                         USED  AVAIL  REFER  MOUNTPOINT
cadzfs/CT/subvol-101-disk-1@__replicate_101-0_1540681681__   264K      -  2.96G  -
cadzfs/VM/vm-100-disk-1@__replicate_100-0_1541166743__         0B      -  5.22G  -
cadzfs/VM/vm-103-disk-1@__replicate_103-0_1541167809__         0B      -  6.58G  -
cadzfs/VM/vm-104-disk-1@__replicate_104-0_1540603740__       889M      -  34.9G  -
cadzfs/VM/vm-106-disk-1@__replicate_106-0_1540603800__      44.6M      -  3.73G  -
cadzfs/VM/vm-107-disk-1@__replicate_107-0_1540675441__         0B      -  47.5G  -
cadzfs/VM/vm-108-disk-2@__replicate_108-0_1540674661__         0B      -  8.59G  -
cadzfs/VM/vm-112-disk-1@__replicate_112-0_1540600200__       405M      -  5.71G  -
root@zeus:~# zfs destroy cadzfs/CT/subvol-101-disk-1@__replicate_101-0_1540681681__
root@zeus:~# zfs destroy cadzfs/VM/vm-100-disk-1@__replicate_100-0_1541166743__
root@zeus:~# zfs destroy cadzfs/VM/vm-103-disk-1@__replicate_103-0_1541167809__
root@zeus:~# zfs destroy cadzfs/VM/vm-104-disk-1@__replicate_104-0_1540603740__
root@zeus:~# zfs destroy cadzfs/VM/vm-106-disk-1@__replicate_106-0_1540603800__
root@zeus:~# zfs destroy cadzfs/VM/vm-107-disk-1@__replicate_107-0_1540675441__
root@zeus:~# zfs destroy cadzfs/VM/vm-108-disk-2@__replicate_108-0_1540674661__
root@zeus:~# zfs destroy cadzfs/VM/vm-112-disk-1@__replicate_112-0_1540600200__
root@zeus:~# zfs list -t snapshot
no datasets available
root@monserveur:~#
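With many snapshots, a one-liner along these lines removes all __replicate_ snapshots at once (review the list produced by the first command before running the second):

Code:
# list all replication snapshots, then destroy them one by one
zfs list -H -o name -t snapshot | grep '__replicate_'
zfs list -H -o name -t snapshot | grep '__replicate_' | xargs -n 1 zfs destroy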

and I deleted the lock directory on all nodes:

Code:
root@monserveur:/etc/pve/priv/lock# rm -r file-replication_cfg/

I also tried to delete all replication configurations in the GUI, but I get the same error: "trying to acquire cfs lock 'file-replication_cfg'".

How can I solve my problem, please?
Many thanks!
 
Re,

So, I tried to:
  • Edit "/etc/pve/replication.cfg"
  • Make sure "/etc/pve/priv/lock/file-replication_cfg" is deleted on all nodes
  • Restart "pvesr.service":
    Code:
    systemctl restart pvesr.service

    and now the service is OK:
Code:
root@monserveur:/etc/pve/priv/lock# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: inactive (dead) since Fri 2018-11-02 16:02:01 CET; 7s ago
  Process: 985908 ExecStart=/usr/bin/pvesr run --mail 1 (code=exited, status=0/SUCCESS)
 Main PID: 985908 (code=exited, status=0/SUCCESS)
      CPU: 829ms

Nov 02 16:02:00 monserveur systemd[1]: Starting Proxmox VE replication runner...
Nov 02 16:02:01 monserveur systemd[1]: Started Proxmox VE replication runner.
root@monserveur:/etc/pve/priv/lock#

But there are a lot of errors in the GUI for all VM and CT replications!

Code:
...
...
2018-11-02 16:04:02 104-0: warning: cannot send 'cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__': Broken pipe
2018-11-02 16:04:02 104-0: cannot send 'cadzfs/VM/vm-104-disk-1': I/O error
2018-11-02 16:04:02 104-0: command 'zfs send -Rpv -- cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__' failed: exit code 1
....
....

Any ideas, please?

Many thanks
 
Re,

So, I tried to:
  • Edit "/etc/pve/replication.cfg"
  • Make sure "/etc/pve/priv/lock/file-replication_cfg" is deleted on all nodes
  • Restart "pvesr.service":
    Code:
    systemctl restart pvesr.service

    and now the service is OK.

But there are a lot of errors in the GUI for all VM and CT replications!

Code:
...
...
2018-11-02 16:04:02 104-0: warning: cannot send 'cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__': Broken pipe
2018-11-02 16:04:02 104-0: cannot send 'cadzfs/VM/vm-104-disk-1': I/O error
2018-11-02 16:04:02 104-0: command 'zfs send -Rpv -- cadzfs/VM/vm-104-disk-1@__replicate_104-0_1541171040__' failed: exit code 1
....
....

Any ideas, please?

Many thanks

Sounds like the state of your replication configuration and the state of the actual replications are no longer in sync because of the editing of the configuration files, in particular /etc/pve/replication.cfg. If you have a backup of this file, you may want to restore it and see what happens.

If restoring the config doesn't resolve the issue, then in most cases like this you have to delete the broken replication in the GUI and set it up again. If PVE fails to delete the replicated ZFS filesystem(s) on the replication target, you may have to delete them there (not on the replication source!) by hand, using
Code:
zfs destroy -r rpool/data/subvol-xxx-disk-x
(please use the right filesystem name; the one above is an example for a container filesystem on the rpool pool). Remove all related filesystems on the target, then re-create the replication on the source. It will recreate the filesystems on the target system and copy all the data, which may take a long time for big filesystems.
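Before destroying anything on the target, a quick way to double-check which datasets and snapshots were created by replication is a sketch like the following (rpool/data is just an assumed default path - adjust it to your layout):

Code:
# on the replication TARGET only: review the candidates before removing anything
zfs list -r rpool/data
zfs list -t snapshot | grep '__replicate_'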
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.
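Afterwards, the result can be verified per node, for example with:

Code:
# check the state of all replication jobs on this node
pvesr status
systemctl status pvesr.timer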
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.
Well, Wolfgang Bumiller from Proxmox recommended changing the timezone, and for me it fixed the problem without messing with configuration files and without losing my existing replications.
 
Well, Wolfgang Bumiller from Proxmox recommended changing the timezone, and for me it fixed the problem without messing with configuration files and without losing my existing replications.

Yes, but you need to make sure that all other services which depend on the timezone are still configured correctly, too.
For example, if you have a MySQL server running on a node for any reason.
 
Re,

OK, I solved my problem. On each node it is not only necessary to delete the snapshots, but also the disks of the machines that were replicated to it!
Code:
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-1
cadzfs/VM/vm-103-disk-1  cadzfs/VM/vm-104-disk-1  cadzfs/VM/vm-105-disk-1  cadzfs/VM/vm-106-disk-1  cadzfs/VM/vm-107-disk-1  cadzfs/VM/vm-108-disk-2  cadzfs/VM/vm-109-disk-1  cadzfs/VM/vm-112-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-104-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-106-disk-1
root@monserveur:~# zfs destroy -r cadzfs/VM/vm-109-disk-1
root@monserveur:~#
BE CAREFUL not to delete the disks of VMs or CTs that actually belong to that node, whether they are running or stopped!
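To double-check which guests actually belong to the node being cleaned up (their disks must stay untouched), the local guest lists can be consulted first, e.g.:

Code:
# guests configured on THIS node - do not destroy their disks
qm list
pct list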

After cleaning up all the nodes by removing these formerly synchronized disks, I was able to restart replication on each node.

Thanks
 
Hi,

It seems there are new updates in the pve-enterprise repository. Do they solve the issue?

Thanks
 
As far as I can see there was no update to libpve-common-perl 5.0, so no, not yet.
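The installed version can be checked on each node, for example with:

Code:
apt-cache policy libpve-common-perl
# or look at what pveversion reports
pveversion -v | grep libpve-common-perl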
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.

worked for me, thanks!
 
Changing the timezone is not a good solution!
If you have services that depend on it, try to follow the steps I explained in the bug ticket.

EDIT (here is what I did):

Code:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /etc/pve/replication.cfg /tmp/replication.cfg
echo > /etc/pve/replication.cfg
systemctl start pvesr
# it will appear to be stuck - wait for it to finish
systemctl start pvesr.timer
# then restore the original configuration:
systemctl stop pvesr
systemctl stop pvesr.timer
cp -a /tmp/replication.cfg /etc/pve/replication.cfg
systemctl start pvesr
systemctl start pvesr.timer

And everything should work now.

It works!!!!! ;);)
 
