pvesr status hanging after upgrade from 5.0 to 5.1

Oct 28, 2017
When issuing 'pvesr status' on the console, the command seems to hang without producing any output.
Replication has not happened since the upgrade, according to the timestamps of the replication logfiles in /var/log/pve/replicate/.
'pvesr list', on the other hand, returns the correct list of replication jobs.

The running pvesr process also seems to consume 100% CPU without making any progress:
27792 root 20 0 305756 75644 13668 R 100.0 0.0 8:03.85 pvesr
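(To see what the spinning process is actually doing, one option is to attach strace to it; this is only a diagnostic sketch, and <PID> is a placeholder for the hung pvesr PID:)
Code:
# list any running pvesr processes
ps aux | grep -v grep | grep pvesr
# attach to the spinning process; <PID> is a placeholder
strace -f -p <PID>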

Now when I access the replication status of the node or of a container/VM via the web frontend, there is no reply and pvedaemon also gets stuck at 100% CPU. As a consequence, various features of the frontend begin to fail, including the statistics graphs, the storage overview and storage content pages, and finally authentication when logging into the frontend.
Restarting pvedaemon partially recovers the frontend until the replication page is accessed again.
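(For reference, a minimal sketch of that recovery step; including pveproxy here is only an assumption on my part, in case the proxy is affected as well:)
Code:
# restart the API daemon that gets stuck when the replication page is opened
systemctl restart pvedaemon
# assumption: restart the proxy too if the GUI stays unresponsive
systemctl restart pveproxy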

pveproxy logs a 596 error, which might be related:
10.2.42.26 - root@pam [28/10/2017:11:43:59 +0200] "GET /api2/json/nodes/speef/replication?guest=104 HTTP/1.1" 596 -
pvedaemon does not seem to log an error in messages.
'zfs list' and 'zpool list'/'zpool status' show all pools and datasets as healthy and correct.

I'm out of ideas and would appreciate any help. Thanks in advance
 

cbx

Member
Mar 2, 2012
I have the same problem, but without any update, on the Proxmox 5.0 version...
I have 4 nodes that synchronize to each other every 30 minutes. Everything was working fine, but this night replication stopped on all of them; pvesr and pvedaemon are at 100% CPU...
I have checked the server replication and everything seems fine: there is sufficient space and no errors in the logs...
I had to disable the replication tasks on all 4 nodes and restart pvesr and pvedaemon...

I have no idea what the problem is...
 
Oct 28, 2017
Confirmed: when I stop pvesr, pvesr.timer and pvedaemon, clear out /etc/pve/replication.cfg and restart the services, 'pvesr status' works. I could single out one replication job for a specific LXC container (on ZFS) which seems to cause this. The systems are now working again with the old replication.cfg minus that one container's job. I will check if there is something odd about it.
P.S. I had to destroy all ZFS replication target datasets in order to get the full sync working again. There is another thread about that, I believe.
P.P.S. After successfully replicating the first container, it got stuck again.
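(Roughly, the recovery sequence described above looks like this; the config path and services are as in the post, while moving the config aside instead of deleting it is just an assumption:)
Code:
# stop the replication runner, its timer and the API daemon
systemctl stop pvesr.timer pvesr.service pvedaemon
# clear out the replication config by moving it aside
mv /etc/pve/replication.cfg /root/replication.cfg.bak
# restart the services and check that pvesr responds again
systemctl start pvesr.timer pvedaemon
pvesr status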
 

cbx

Member
Mar 2, 2012
Confirmed: when I stop pvesr, pvesr.timer and pvedaemon, clear out /etc/pve/replication.cfg and restart the services, 'pvesr status' works. I could single out one replication job for a specific LXC container (on ZFS) which seems to cause this. The systems are now working again with the old replication.cfg minus that one container's job. I will check if there is something odd about it.
P.S. I had to destroy all ZFS replication target datasets in order to get the full sync working again. There is another thread about that, I believe.
P.P.S. After successfully replicating the first container, it got stuck again.
So the replication function is not a real production-ready solution in this state? Does someone have a solution to this problem/bug?
 

dendi

Member
Nov 17, 2011
Same for me: I have a 4-node cluster with PVE 5.0 that ran fine for over a month; yesterday at 21:50 replication stopped.
Last logs on the ZFS receiver:

Code:
2017-10-28.20:06:23 zfs recv -F -- rpool/data/vm-102-disk-1
2017-10-28.20:06:28 zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1509188701__
2017-10-28.20:26:09 zfs recv -F -- rpool/data/vm-106-disk-1
2017-10-28.20:26:14 zfs destroy rpool/data/vm-106-disk-1@__replicate_106-0_1509190200__
2017-10-28.21:40:35 zfs recv -F -- rpool/data/vm-100-disk-1
2017-10-28.21:49:57 zfs recv -F -- rpool/data/vm-100-disk-2
2017-10-28.21:49:57 zfs destroy rpool/data/vm-100-disk-1@__replicate_100-0_1509133501__
2017-10-28.21:49:58 zfs destroy rpool/data/vm-100-disk-2@__replicate_100-0_1509133501__
I'm only using KVM :)

So do I need to remove all replication jobs and the replicated data to fix this?
Thank you for any help
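(For anyone wanting to check the same thing: the receive-side entries above look like 'zpool history' output, which can be pulled on the target node with something like the following; 'rpool' matches the pool name in the log:)
Code:
# show the most recent operations recorded on the receiving pool
zpool history rpool | tail -n 20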
 

cbx

Member
Mar 2, 2012
Same for me: I have a 4-node cluster with PVE 5.0 that ran fine for over a month; yesterday at 21:50 replication stopped.
Last logs on the ZFS receiver:

Code:
2017-10-28.20:06:23 zfs recv -F -- rpool/data/vm-102-disk-1
2017-10-28.20:06:28 zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1509188701__
2017-10-28.20:26:09 zfs recv -F -- rpool/data/vm-106-disk-1
2017-10-28.20:26:14 zfs destroy rpool/data/vm-106-disk-1@__replicate_106-0_1509190200__
2017-10-28.21:40:35 zfs recv -F -- rpool/data/vm-100-disk-1
2017-10-28.21:49:57 zfs recv -F -- rpool/data/vm-100-disk-2
2017-10-28.21:49:57 zfs destroy rpool/data/vm-100-disk-1@__replicate_100-0_1509133501__
2017-10-28.21:49:58 zfs destroy rpool/data/vm-100-disk-2@__replicate_100-0_1509133501__
I'm only using KVM :)

So do I need to remove all replication jobs and the replicated data to fix this?
Thank you for any help
I also only use KVM. I have not done the resync; in my case it's not possible to redo it each time... I really need to know what the problem is...
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
Hi

Please ensure that ZFS on PVE 5.0 is at least version 0.6.5.11 and the kernel is 4.10.17-4.
Or update both nodes to PVE 5.1 and reboot so the new 4.13 kernel is loaded.
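(A quick way to check both on each node; this assumes the loaded module exposes its version under /sys/module/zfs/version, as ZFS-on-Linux builds normally do:)
Code:
# running kernel
uname -r
# loaded ZFS kernel module version
cat /sys/module/zfs/version
# installed ZFS userland package
dpkg -l zfsutils-linux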
 

dendi

Member
Nov 17, 2011
Hi wolfgang.

zfs 0.6.5.11-pve17~bpo90

uname -a
Linux pvez30 4.10.17-2-pve #1 SMP PVE 4.10.17-20 (Mon, 14 Aug 2017 11:23:37 +0200) x86_64 GNU/Linux

I did not touch anything, it just stopped.

Now I have disabled pvesr and am waiting to do more tests.

Thank you
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
Dendi,

can you please send me the output of
Code:
ps aux |grep zfs
from the source side?
 

dendi

Member
Nov 17, 2011
Of course wolfgang,
There are no ZFS processes on any of the nodes, neither sender nor receiver.
The replication tab still does not work and hangs the pvedaemon process.
I have to restart it now.
 

Benoit

Member
Jan 17, 2017
Same here... when I try to access the replication tab from the GUI I get a connection timeout, and afterwards I no longer have access to the GUI at all. Timeout on login.
I have to reboot my nodes to have GUI access again.

ps aux |grep zfs

First node:

root 27577 0.0 0.0 12788 944 pts/2 S+ 13:32 0:00 grep zfs

Second node:
root 5315 0.0 0.0 12788 1000 pts/1 S+ 13:28 0:00 grep zfs


Both nodes:

proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 

Martin Maisey

New Member
Jun 14, 2017
Hi Benoit,

Thanks - details you requested below. This is on the source, but the target is identical (bar an extra pve-kernel line; the running kernel is the same on both).

Cheers,

Martin

Code:
root@chenbro:~# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.11-1-pve: 4.10.11-9
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
Code:
root@chenbro:~# apt-get install proxmox-ve
Reading package lists... Done
Building dependency tree     
Reading state information... Done
proxmox-ve is already the newest version (5.1-30).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
4,763
316
83
Please ensure that you have the newest kernel running on both nodes (sender and receiver).
The current kernel is 4.13.8-3.
If you have different kernel versions and different ZFS versions, the send process can hang.
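(To confirm which kernel is actually loaded versus merely installed, something like this on each node should do; the grep is just a convenience:)
Code:
# kernel currently running
uname -r
# installed pve-kernel packages and the running kernel, as reported by PVE
pveversion -v | grep -i kernel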
 

Martin Maisey

New Member
Jun 14, 2017
Please ensure that you have the newest kernel running on both nodes (sender and receiver).
The current kernel is 4.13.8-3.
If you have different kernel versions and different ZFS versions, the send process can hang.
Hi Wolfgang - that kernel is running on both sender and receiver.

'ps aux |grep zfs' shows no hung ZFS processes on either sender or receiver.

'pvesr status' runs and completes on the receiver. However, on the sender it hangs, having created two 'pvesr' processes, each of which is at 100% CPU. These processes can be killed normally.
 

Martin Maisey

New Member
Jun 14, 2017
What do you mean by normally?
At the console, 'kill <pid>'. Doesn't require -9.

Can you send the output of
Code:
ps aux | grep -v grep | grep pvesr
When no 'pvesr status' is running, it looks like this:

Code:
root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 145:53 /usr/bin/perl -T /usr/bin/pvesr run
If I run 'pvesr status' on another console it looks like this:

Code:
root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     18270 99.1  0.0 305680 72180 pts/5    R+   12:49   0:17 /usr/bin/perl -T /usr/bin/pvesr status
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 147:32 /usr/bin/perl -T /usr/bin/pvesr run
Even after I have ^c'd the 'pvesr status' on the other console, top shows a pvesr process consuming 100% CPU. If I kill this process, another one pops up after a few seconds, again consuming 100% CPU.
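(The process that keeps coming back is presumably the scheduled 'pvesr run' being re-launched by the systemd timer; stopping the timer, as cbx did earlier in this thread, should keep it from reappearing while investigating:)
Code:
# stop scheduled replication and any currently running job
systemctl stop pvesr.timer pvesr.service
# re-enable later with: systemctl start pvesr.timer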
 
