pvesr status hanging after upgrade from 5.0 to 5.1

Oct 28, 2017
When I run 'pvesr status' on the console, the command hangs without producing any output.
According to the timestamps of the replication logfiles in /var/log/pve/replicate/, no replication has run since the upgrade.
'pvesr list', on the other hand, returns a correct list of replication jobs.
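
A quick way to confirm that replication has stalled is to list the logfiles that were modified recently. This is only a sketch: the helper name recent_logs and the 60-minute window are my own choices, not anything that ships with PVE.

```shell
# List files under a directory that were modified within the last N minutes.
# recent_logs is a hypothetical helper name; the directory is passed in by the caller.
recent_logs() {
    dir="$1"
    mins="$2"
    find "$dir" -type f -mmin "-$mins"
}
```

For example, `recent_logs /var/log/pve/replicate 60` printing nothing would suggest that no job has written a log in the last hour.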

The running pvesr process also seems to consume 100% CPU without doing any work:
27792 root 20 0 305756 75644 13668 R 100.0 0.0 8:03.85 pvesr

Now, when I access the replication status of the node or of a container/VM through the web frontend, there is no reply and pvedaemon also gets stuck consuming 100% CPU. As a consequence, various features of the frontend begin to fail, including the statistics graphs, the storage overview and storage content pages, and finally authentication when logging in to the frontend.
Restarting pvedaemon partially recovers the frontend until the replication page is accessed again.

pveproxy logs an error 596 which might be related:
10.2.42.26 - root@pam [28/10/2017:11:43:59 +0200] "GET /api2/json/nodes/speef/replication?guest=104 HTTP/1.1" 596 -
pvedaemon does not seem to log an error in messages.
zfs list and zpool list / status show all pools and datasets as healthy and correct.
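
To see which API requests end in 596, the access log can be filtered by status code. A minimal sketch, assuming every line has the same layout as the pveproxy line quoted above (request path in field 7, status in the second-to-last field); status_hits is a made-up helper name.

```shell
# Print the request path of every access-log line with the given HTTP status.
# Assumes the layout of the log line quoted above: field 7 is the path,
# the second-to-last field is the status code.
status_hits() {
    awk -v s="$1" '$(NF-1) == s { print $7 }'
}
```

It reads the log on stdin, e.g. `status_hits 596 < /var/log/pveproxy/access.log` (that path is where I would expect the pveproxy access log; adjust it if yours lives elsewhere).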

I'm out of ideas and would appreciate any help. Thanks in advance
 
I have the same problem, but without any update, on Proxmox 5.0...
I have 4 nodes that replicate to each other every 30 minutes. Everything was working fine, but last night replication stopped on all of them, and pvesr and pvedaemon are at 100% CPU...
I have checked the replication on the servers and everything seems fine: there is sufficient space and there are no errors in the logs...
I had to disable the replication tasks on all 4 nodes and restart pvesr and pvedaemon...

I have no idea what the problem is...
 
Confirmed: when I stop pvesr, pvesr.timer and pvedaemon, clear out /etc/pve/replication.cfg and restart the services, 'pvesr status' works. I could single out one replication job for a specific LXC container (on ZFS) which seems to cause this. The systems are now working again with the old replication.cfg minus that one container's job. I will check if there is something odd about that.
P.S. I had to destroy all ZFS replication target datasets in order to get the full sync working again. There is another thread about that, I believe.
P.P.S. After successfully replicating the first container, it got stuck again.
 
So the replication feature is not a real production solution in this state? Does anyone have a solution to this problem/bug?
 
Same for me, I have a 4 node cluster, PVE 5.0 running fine for over a month, yesterday at 21:50 replication stopped.
Last logs on zfs receiver:

Code:
2017-10-28.20:06:23 zfs recv -F -- rpool/data/vm-102-disk-1
2017-10-28.20:06:28 zfs destroy rpool/data/vm-102-disk-1@__replicate_102-0_1509188701__
2017-10-28.20:26:09 zfs recv -F -- rpool/data/vm-106-disk-1
2017-10-28.20:26:14 zfs destroy rpool/data/vm-106-disk-1@__replicate_106-0_1509190200__
2017-10-28.21:40:35 zfs recv -F -- rpool/data/vm-100-disk-1
2017-10-28.21:49:57 zfs recv -F -- rpool/data/vm-100-disk-2
2017-10-28.21:49:57 zfs destroy rpool/data/vm-100-disk-1@__replicate_100-0_1509133501__
2017-10-28.21:49:58 zfs destroy rpool/data/vm-100-disk-2@__replicate_100-0_1509133501__

I'm using only kvm :-)

So do I need to remove all replication jobs and replicated data to solve this?
Thank you for any help
 
I also use only KVM. I have not done the resync; in my case it's not possible to do it again each time... I really need to know what the problem is...
 
Hi

Please ensure that on PVE 5.0 you are running at least ZFS 0.6.5.11 and kernel 4.10.17-4.
Alternatively, update both nodes to PVE 5.1 and reboot so that the new 4.13 kernel is loaded.
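
The version check above can be scripted on each node. This is a rough sketch, not an official tool: version_ge is a hypothetical helper built on GNU `sort -V`, and treating both stated versions as minimums is my reading of the advice.

```shell
# Return success if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on GNU sort's -V version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

On a node it could be used like `version_ge "$(uname -r | sed 's/-pve$//')" 4.10.17-4 || echo "kernel too old"`. The loaded ZFS module version can usually be read from /sys/module/zfs/version and compared against 0.6.5.11 in the same way.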
 
Hi wolfgang.

zfs 0.6.5.11-pve17~bpo90

uname -a
Linux pvez30 4.10.17-2-pve #1 SMP PVE 4.10.17-20 (Mon, 14 Aug 2017 11:23:37 +0200) x86_64 GNU/Linux

I did not touch anything; it just stopped.

I have now disabled pvesr while waiting to do more tests.

Thank you
 
Dendi,

Can you please send me the output of
Code:
ps aux |grep zfs
from the source side?
 
Of course wolfgang,
There are no zfs processes on any of the nodes, neither sender nor receiver.
The replication tab still does not work and hangs the pve-daemon process.
I have to restart it now.
 
Same here... when I try to access the replication tab from the GUI I get a connection timeout, and afterwards I no longer have access to the GUI at all: the login times out.
I have to reboot my nodes to get GUI access again.

ps aux |grep zfs

first node

root 27577 0.0 0.0 12788 944 pts/2 S+ 13:32 0:00 grep zfs

Second node :
root 5315 0.0 0.0 12788 1000 pts/1 S+ 13:28 0:00 grep zfs


Both nodes :

proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
Hi Benoit,

Thanks - details you requested below. This is on the source, but the target is identical (bar an extra pve-kernel line; but the running kernel is the same on both).

Cheers,

Martin

Code:
root@chenbro:~# pveversion -v
proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.10.11-1-pve: 4.10.11-9
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

Code:
root@chenbro:~# apt-get install proxmox-ve
Reading package lists... Done
Building dependency tree     
Reading state information... Done
proxmox-ve is already the newest version (5.1-30).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
 
Please ensure that the newest kernel is running on both nodes (sender and receiver).
The current kernel is 4.13.8-3.
If the nodes run different kernel versions and different ZFS versions, the send process can hang.
 
Hi Wolfgang - that kernel is running on both sender and receiver.

'ps aux |grep zfs' shows no hung ZFS processes on either sender or receiver.

'pvesr status' runs and completes on the receiver. However on the sender it hangs having created two 'pvesr' processes, each of which is at 100% CPU. These processes can be killed normally.
 
What do you mean by "normally"?

At the console, 'kill <pid>'. Doesn't require -9.

Can you send the output of
Code:
ps aux | grep -v grep | grep pvesr

When no hung pvesr process is present it looks like this:

Code:
root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 145:53 /usr/bin/perl -T /usr/bin/pvesr run

If I run 'pvesr status' on another console it looks like this:

Code:
root@chenbro:~# ps aux | grep -v grep | grep pvesr
root     18270 99.1  0.0 305680 72180 pts/5    R+   12:49   0:17 /usr/bin/perl -T /usr/bin/pvesr status
root     19037 99.9  0.0 305628 71912 ?        Rs   10:21 147:32 /usr/bin/perl -T /usr/bin/pvesr run

Even after I have ^c'd the 'pvesr status' on the other console, top shows a pvesr process consuming 100% CPU. If I kill this process, another one pops up after a few seconds, again consuming 100% CPU.
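
To keep an eye on the respawning process, the ps output can be reduced to just the busy pvesr entries. A small sketch, assuming the `ps aux` column layout shown above; busy_pvesr and the default threshold of 90% are my own choices.

```shell
# Read `ps aux` output on stdin and print PID, %CPU and pvesr subcommand
# for every pvesr process at or above the given CPU threshold (default 90).
# busy_pvesr is a hypothetical helper name.
busy_pvesr() {
    awk -v min="${1:-90}" '$3 + 0 >= min && /\/usr\/bin\/pvesr/ { print $2, $3, $NF }'
}
```

Running `ps aux | busy_pvesr` every few seconds should show whether a new PID appears each time the previous one is killed.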