pvesr status hanging after upgrade from 5.0 to 5.1

Patrick Atamaniuk · Oct 28, 2017

When issuing 'pvesr status' on the console, the command seems to be hanging without output.
Replication did not happen since the upgrade according to the timestamps of the replication logfiles in /var/log/pve/replicate/
pvesr list on the other hand returns a correct list of replication jobs.

The running pvesr process also seems to consume 100% cpu without functioning:
27792 root 20 0 305756 75644 13668 R 100.0 0.0 8:03.85 pvesr

Now when i access the replication status of the node or of a container/vm by webfrontend, there is no reply and pvedaemon gets also stuck consuming 100% cpu. In consequence various features on the frontend begin to fail, including statistic graphs, storage overview and storage content pages and finally authentication when logging into the frontend failed.
Restarting pvedaemon partially recovers the frontend until the replication page is accessed again.

pveproxy logs an error 596 which might be related:
10.2.42.26 - root@pam [28/10/2017:11:43:59 +0200] "GET /api2/json/nodes/speef/replication?guest=104 HTTP/1.1" 596 -
pvedaemon does not seem to log an error in messages.
zfs list and zpool list / status show all healthy and correct pools and dataset.

I'm out of ideas and would appreciate any help. Thanks in advance

cbx · Oct 29, 2017

I have the same problem but without anu update, in the proxmox 5.0 version...
I have 4 nodes that sincronize in other all 30 mn, all was working fine, but this night, replication stop on all, pvesr and pvedaemon are on 100% cpu...
I have check the servier replication and all seem fine, there is sufficient space, no error in logs...
I have had to disable replication task on all of 4 node, and restart pvesr and pvedaemon...

I have no ideas about what is the problem...

Patrick Atamaniuk · Oct 29, 2017

Confirmed, when i stop pvesr, pvesr.timer, pvedaemon and clear out /etc/pve/replication.cfg and restart the services, 'pvesr status' works. I could single out one replication job for a specific lxc container (on zfs) which seems to cause this. The systems are now working again with the old replication.cfg minus that one container's job. I will check if there is something odd about that.
P.S. i had to destroy all zfs replication target datasets in order to have the full sync working again. There is another thread about that i believe.
P.P.S After replicating successfully the first container, it got stuck again.

cbx · Oct 29, 2017

Patrick Atamaniuk said:
Confirmed, when i stop pvesr, pvesr.timer, pvedaemon and clear out /etc/pve/replication.cfg and restart the services, 'pvesr status' works. I could single out one replication job for a specific lxc container (on zfs) which seems to cause this. The systems are now working again with the old replication.cfg minus that one container's job. I will check if there is something odd about that.
P.S. i had to destroy all zfs replication target datasets in order to have the full sync working again. There is another thread about that i believe.
P.P.S After replicating successfully the first container, it got stuck again.

So the replication function is not an real production solution in this state? Someone have some solution to thie problem/bug?

dendi · Oct 29, 2017

Same for me, I have a 4 node cluster, PVE 5.0 running fine for over a month, yesterday at 21:50 replication stopped.
Last logs on zfs receiver:

Code:

2017-10-28.20:06:23 zfs recv -F -- rpool/data/vm-102-disk-1
2017-10-28.20:06:28 zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1509188701__
2017-10-28.20:26:09 zfs recv -F -- rpool/data/vm-106-disk-1
2017-10-28.20:26:14 zfs destroy rpool/data/vm-106-disk-1@__replicate_106-0_1509190200__
2017-10-28.21:40:35 zfs recv -F -- rpool/data/vm-100-disk-1
2017-10-28.21:49:57 zfs recv -F -- rpool/data/vm-100-disk-2
2017-10-28.21:49:57 zfs destroy rpool/data/vm-100-disk-1@__replicate_100-0_1509133501__
2017-10-28.21:49:58 zfs destroy rpool/data/vm-100-disk-2@__replicate_100-0_1509133501__

I'm using only kvm

So I need to remove all replication jobs and replicated data to solve?
Thank you for any help

cbx · Oct 29, 2017

dendi said:
Same for me, I have a 4 node cluster, PVE 5.0 running fine for over a month, yesterday at 21:50 replication stopped.
Last logs on zfs receiver:

Code:

2017-10-28.20:06:23 zfs recv -F -- rpool/data/vm-102-disk-1 2017-10-28.20:06:28 zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1509188701__ 2017-10-28.20:26:09 zfs recv -F -- rpool/data/vm-106-disk-1 2017-10-28.20:26:14 zfs destroy rpool/data/vm-106-disk-1@__replicate_106-0_1509190200__ 2017-10-28.21:40:35 zfs recv -F -- rpool/data/vm-100-disk-1 2017-10-28.21:49:57 zfs recv -F -- rpool/data/vm-100-disk-2 2017-10-28.21:49:57 zfs destroy rpool/data/vm-100-disk-1@__replicate_100-0_1509133501__ 2017-10-28.21:49:58 zfs destroy rpool/data/vm-100-disk-2@__replicate_100-0_1509133501__

I'm using only kvm

So I need to remove all replication jobs and replicated data to solve?
Thank you for any help

I use also only KVM. I have not do the resync, in my case it's not possible to do it again each time... I really need what is the problem....

dendi · Oct 30, 2017

Any suggestion from the staff?

wolfgang · Nov 2, 2017

Hi

Please ensure that the zfs on PVE5.0 is minimum on zfs 0.6.5.11 and kernel is 4.10.17-4.
Or update both nodes to PVE5.1 and reboot so the new kernel 4.13 is loaded.

dendi · Nov 2, 2017

Hi wolfgang.

zfs 0.6.5.11-pve17~bpo90

uname -a
Linux pvez30 4.10.17-2-pve #1 SMP PVE 4.10.17-20 (Mon, 14 Aug 2017 11:23:37 +0200) x86_64 GNU/Linux

I did not touch anythink, it just stopped

Now I did disable pvesr, waiting to do more tests.

Thank you

wolfgang · Nov 2, 2017

Dendi,

can you send me please the output of

Code:

ps aux |grep zfs

from the source side

dendi · Nov 2, 2017

Of course wolfgang,
There are no zfs processes on all nodes, sender and receiver.
The replication tab still does not work and hangs the pve-daemon process.
I have to restart it now.

Benoit · Nov 28, 2017

Same here ... when i try to access to replication tab from GUI i have connection timeout and after i don't have access anymore to the GUI. Timeout on logon.
I have to reboot my nodes to have GUI access again.

ps aux |grep zfs

first node

root 27577 0.0 0.0 12788 944 pts/2 S+ 13:32 0:00 grep zfs

Second node :
root 5315 0.0 0.0 12788 1000 pts/1 S+ 13:28 0:00 grep zfs

Both nodes :

proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

Martin Maisey · Dec 11, 2017

I have exactly this problem too - has anyone been able to make any progress?

Benoit · Dec 12, 2017

Can you put the output of

pveversion -v ?

Benoit · Dec 12, 2017

Hello,

Try to do this :

apt-get install promox-ve

Martin Maisey · Dec 12, 2017

Hi Benoit,

Thanks - details you requested below. This is on the source, but the target is identical (bar an extra pve-kernel line; but the running kernel is the same on both).

Cheers,

Martin

Code:

root@chenbro:~# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.11-1-pve: 4.10.11-9
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

Code:

root@chenbro:~# apt-get install proxmox-ve
Reading package lists... Done
Building dependency tree     
Reading state information... Done
proxmox-ve is already the newest version (5.1-30).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

wolfgang · Dec 12, 2017

Pleas ensure the you have on both (sender and receiver) Nodes the newest kernel running.
Current kernel is 4.13.8-3
If you have different kernel version and different zfs versions the send process can hang.

Martin Maisey · Dec 12, 2017

wolfgang said:
Pleas ensure the you have on both (sender and receiver) Nodes the newest kernel running.
Current kernel is 4.13.8-3
If you have different kernel version and different zfs versions the send process can hang.

Hi Wolfgang - that kernel is running on both sender and receiver.

'ps aux |grep zfs' shows no hung ZFS processes on either sender or receiver.

'pvesr status' runs and completes on the receiver. However on the sender it hangs having created two 'pvesr' processes, each of which is at 100% CPU. These processes can be killed normally.

wolfgang · Dec 12, 2017

Martin Maisey said:
These processes can be killed normally.

What you mean with normally?

can you send the output of

Code:

ps aux | grep -v grep | grep pvesr

Martin Maisey · Dec 12, 2017

wolfgang said:
What you mean with normally?

At the console, 'kill <pid>'. Doesn't require -9.

Code:
can you send the output of

Code:

ps aux | grep -v grep | grep pvesr

When no hung pvesr process is present it looks like this:

Code:

root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 145:53 /usr/bin/perl -T /usr/bin/pvesr run

If I run 'pvesr status' on another console it looks like this:

Code:

root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     18270 99.1  0.0 305680 72180 pts/5    R+   12:49   0:17 /usr/bin/perl -T /usr/bin/pvesr status
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 147:32 /usr/bin/perl -T /usr/bin/pvesr run

Even after I have ^c'd the 'pvesr status' on the other console, top shows a pvesr process consuming 100% CPU. If I kill this process, another one pops up after a few seconds, again consuming 100% CPU.

pvesr status hanging after upgrade from 5.0 to 5.1

New Member

Active Member

New Member

Active Member

Renowned Member

Active Member

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

Renowned Member

Renowned Member

Active Member

Renowned Member

Renowned Member

Active Member

Proxmox Retired Staff

Active Member

Proxmox Retired Staff

Active Member

We value your privacy