pvestatd problem

Hello,

I have a problem with my Proxmox cluster:

Everything was working fine, but I did a "dist-upgrade" yesterday (from an up-to-date 4.4 to 5.1-28 with the 4.13.8-2-pve kernel).

As soon as I upgraded the first node, I started having problems: my nodes no longer show a green light in the web administration.
I thought it was because my nodes were not all on the same version, so I updated all of them.

Sadly the problem is still there, even after restarting the nodes (one by one, and all together early this morning).

When I "systemctl restart pvestatd.service", it works well for 3 to 5 minutes, my /etc/pve/.rdd file is correct, but after some times (3-5 min), everything disapear again (so all my nodes are "red" on the web administration and I only see numbers on it, no more VM/CT names or status).

On all my nodes, "systemctl status pvestatd.service" returns an "active (running)" status.
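
For reference, here is roughly how I watch whether the stats data keeps being written (I believe the per-node RRD files live under /var/lib/rrdcached/db/, but the exact path may differ):

systemctl restart pvestatd.service
# watch the RRD files for this node; when pvestatd gets blocked, the timestamps stop advancing
watch -n 30 'ls -l /var/lib/rrdcached/db/pve2-node/'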

Thank you for your help :)

PS: Additional info:
- My Ceph cluster seems to be working well.
- pveversion -v :
proxmox-ve: 5.1-28 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.44-1-pve: 4.4.44-84
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.1-pve3
 
Thank you very much for your answer, @dietmar!

Sadly, "pvesm status" returns... nothing! It hangs, with no output on any node.
 
@Dzy: Thank you, I'll check that! I hope it is the same issue!

edit: I have no containers on these nodes. Should I try this anyway?
 
No, those are completely unrelated problems. If pvesm does not return, you have a storage that is hanging and blocking pvestatd.
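
A rough way to narrow down which storage is hanging (just a sketch: it reads the storage IDs from /etc/pve/storage.cfg, uses a 10 second timeout, and assumes pvesm status accepts --storage on your version):

# run pvesm with a timeout so the prompt comes back even if a storage hangs
timeout 10 pvesm status || echo "pvesm status did not return within 10s"
# then check each configured storage individually to find the one that blocks
for sid in $(awk -F '[: ]+' '/^[a-z]+:/ {print $2}' /etc/pve/storage.cfg); do
    timeout 10 pvesm status --storage "$sid" >/dev/null 2>&1 || echo "storage $sid hangs or fails"
done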
 
I have the same problem.

When I boot the kernel "pve-kernel-4.13.8-2-pve", pvesm status hangs.

I could track it down to the external ceph storage.

With "pve-kernel-4.13.4-1-pve" everything works as expected. ceph is accessible.

With the new kernel, the command

/usr/bin/rados -p rbd <many parameters> df

hangs. It looks like I cannot access Ceph. I'm going to investigate more later today.

On my test cluster without external Ceph, everything works as expected.
 
Funnily enough!

I added a machine to the cluster and updated everything to the latest version. The kernel is pve-kernel-4.13.8-2-pve. Everything works. The repository is no-subscription.


I removed the one machine that does not work from the cluster and reinstalled it exactly the same way as the other machine.

As soon as I use the pve-kernel-4.13.8-2-pve kernel, rados stops working. With 4.13.4 it works.

On 4.13.8-2-pve, when I execute rados ...

/usr/bin/rados -p rbd -m 10.0.2.131,10.0.2.132,10.0.2.133 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/ceph.keyring --debug-rados=20 df

2017-12-03 11:14:52.934783 7f01dca6bb40 1 librados: starting msgr at :/0
2017-12-03 11:14:52.934787 7f01dca6bb40 1 librados: starting objecter
2017-12-03 11:14:52.934906 7f01dca6bb40 1 librados: setting wanted keys
2017-12-03 11:14:52.934908 7f01dca6bb40 1 librados: calling monclient init
2017-12-03 11:14:52.940187 7f01dca6bb40 1 librados: init done
2017-12-03 11:14:52.940192 7f01dca6bb40 10 librados: wait_for_osdmap waiting

The process hangs at wait_for_osdmap.

On 4.13.4 everything works without changing anything in the configuration. Just the other kernel.
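
In case it helps with debugging, the same call can be run with a few extra debug switches to see where the monitor/OSD map traffic stalls (same monitor addresses and keyring as above; the debug levels are just what I would try, not an official recipe):

/usr/bin/rados -p rbd -m 10.0.2.131,10.0.2.132,10.0.2.133 \
    --auth_supported cephx -n client.admin \
    --keyring /etc/pve/priv/ceph/ceph.keyring \
    --debug-rados=20 --debug-ms=1 --debug-monc=20 df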
 
I have some more information:

With kernel 4.13.8-2-pve, it works on all my single-CPU Core i3/i5/i7 systems.

With kernel 4.13.8-2-pve, it does NOT work on my dual-CPU Xeon systems.

Exactly the same installation. In both cases I use Open vSwitch.
 
OK, so... my problem is solved, but... I don't know why or how!

It's a little bit scary, but it works now.

Yesterday morning all the lights turned green, but I didn't do anything special (I just started "ceph-mgr", though "systemctl status ceph-mgr.target" was OK before, so I don't know if that was the key).
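
If anyone wants to check the same thing, this is roughly what I looked at (assuming the manager instance is named after the node, which I think is the Proxmox default):

# check the per-node manager instance and the cluster-wide mgr status
systemctl status ceph-mgr@$(hostname -s).service
ceph -s    # the "mgr:" line should show an active manager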

Should I mark this post as "solved"? Or wait a little bit?
 
It is the network card, which all these systems also have in common. The ones that do not work have Intel NICs, use the igb driver, and use jumbo frames.

The ones that work all use Realtek. It has nothing to do with the CPU.
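
A quick way to confirm whether jumbo frames actually pass end to end (a sketch; 10.0.2.131 is one of the monitor addresses from the earlier post, and 8972 is a 9000 byte MTU minus 28 bytes of IP/ICMP headers):

# send full-size packets with the "don't fragment" bit set towards a Ceph monitor
ping -M do -s 8972 -c 3 10.0.2.131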
 
The current pve-kernel-4.13.8-3-pve in version -30 from pve-no-subscription fixes jumbo frames with igb.
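
A minimal sketch of moving to the fixed kernel, assuming the pve-no-subscription repository is already configured:

apt-get update
apt-get install pve-kernel-4.13.8-3-pve
# reboot into the new kernel, then re-test "pvesm status" and the rados call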
 
