pvestatd problem

Hello,

I have a problem with my Proxmox cluster:

Everything was working fine, but I did a "dist-upgrade" yesterday (from an up-to-date 4.4 to 5.1-28 with the 4.13.8-2-pve kernel).

As soon as I upgraded the first node, I started having problems: my nodes no longer show a green light in the web administration.
I thought it was because my nodes were not all on the same version, so I updated all of them.

Sadly the problem is still there, even after restarting the nodes (one by one, and all together early this morning).

When I "systemctl restart pvestatd.service", it works well for 3 to 5 minutes, my /etc/pve/.rdd file is correct, but after some times (3-5 min), everything disapear again (so all my nodes are "red" on the web administration and I only see numbers on it, no more VM/CT names or status).

On all my nodes, "systemctl status pvestatd.service" returns an "active (running)" status.
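
For reference, here is roughly how I watch whether the stats data keeps being written (I believe the per-node RRD files live under /var/lib/rrdcached/db/, but the exact path may differ):

systemctl restart pvestatd.service
# watch the RRD files for this node; when pvestatd gets blocked, the timestamps stop advancing
watch -n 30 'ls -l /var/lib/rrdcached/db/pve2-node/'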

Thank you for your help :)

PS: Additional info:
- My Ceph cluster seems to be working well.
- pveversion -v :
proxmox-ve: 5.1-28 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.44-1-pve: 4.4.44-84
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.1-pve3
 
Thank you very much for your answer, @dietmar!

Sadly, "pvesm status" returns... nothing! It hangs, with no output on any node.
 
@Dzy: Thank you, I'll check that! I hope it is the same issue!

edit: I have no containers on these nodes. Should I try this anyway?
 
No, those are completely unrelated problems. If pvesm does not return, you have a storage that is hanging and blocking pvestatd.
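
A rough way to narrow down which storage is hanging (just a sketch: it reads the storage IDs from /etc/pve/storage.cfg, uses a 10 second timeout, and assumes pvesm status accepts --storage on your version):

# run pvesm with a timeout so the prompt comes back even if a storage hangs
timeout 10 pvesm status || echo "pvesm status did not return within 10s"
# then check each configured storage individually to find the one that blocks
for sid in $(awk -F '[: ]+' '/^[a-z]+:/ {print $2}' /etc/pve/storage.cfg); do
    timeout 10 pvesm status --storage "$sid" >/dev/null 2>&1 || echo "storage $sid hangs or fails"
done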
 
I have the same problem.

When I boot the kernel "pve-kernel-4.13.8-2-pve", pvesm status hangs.

I could track it down to the external ceph storage.

With "pve-kernel-4.13.4-1-pve" everything works as expected. ceph is accessible.

With the new kernel, the command

/usr/bin/rados -p rbd <many parameters> df

hangs. It looks like I cannot access Ceph. I'm going to investigate more later today.

On my test cluster without external Ceph, everything works as expected.
 
Funnily enough!

I added a machine to the cluster and updated everything to the latest version. The kernel is pve-kernel-4.13.8-2-pve. Everything works. The repository is no-subscription.


I removed the one machine that does not work from the cluster and reinstalled it exactly the same way as the other machine.

As soon as I use the pve-kernel-4.13.8-2-pve kernel, rados stops working. With 4.13.4 it works.

On 4.13.8-2-pve, when I execute rados ...

/usr/bin/rados -p rbd -m 10.0.2.131,10.0.2.132,10.0.2.133 --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/ceph.keyring --debug-rados=20 df

2017-12-03 11:14:52.934783 7f01dca6bb40 1 librados: starting msgr at :/0
2017-12-03 11:14:52.934787 7f01dca6bb40 1 librados: starting objecter
2017-12-03 11:14:52.934906 7f01dca6bb40 1 librados: setting wanted keys
2017-12-03 11:14:52.934908 7f01dca6bb40 1 librados: calling monclient init
2017-12-03 11:14:52.940187 7f01dca6bb40 1 librados: init done
2017-12-03 11:14:52.940192 7f01dca6bb40 10 librados: wait_for_osdmap waiting

The process hangs at wait_for_osdmap.

On 4.13.4 everything works without changing anything in the configuration. Just the other kernel.
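
In case it helps with debugging, the same call can be run with a few extra debug switches to see where the monitor/OSD map traffic stalls (same monitor addresses and keyring as above; the debug levels are just what I would try, not an official recipe):

/usr/bin/rados -p rbd -m 10.0.2.131,10.0.2.132,10.0.2.133 \
    --auth_supported cephx -n client.admin \
    --keyring /etc/pve/priv/ceph/ceph.keyring \
    --debug-rados=20 --debug-ms=1 --debug-monc=20 df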
 
I have some more information:

With kernel 4.13.8-2-pve, it works on all my single-CPU Core i3/i5/i7 systems.

With kernel 4.13.8-2-pve, it does NOT work on my dual-CPU Xeon systems.

Exactly the same installation. In both cases I use Open vSwitch.
 
OK, so... my problem is solved, but... I don't know why or how!

It's a little bit scary, but it works now.

Yesterday morning all the lights turned green, but I didn't do anything special (I just started "ceph-mgr", though "systemctl status ceph-mgr.target" was OK before, so I don't know if that was the key).
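
If anyone wants to check the same thing, this is roughly what I looked at (assuming the manager instance is named after the node, which I think is the Proxmox default):

# check the per-node manager instance and the cluster-wide mgr status
systemctl status ceph-mgr@$(hostname -s).service
ceph -s    # the "mgr:" line should show an active manager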

Should I mark this post as "solved"? Or wait a little bit?
 
It is the network card, which all these systems also have in common. The ones that do not work have Intel NICs, use the igb driver, and use jumbo frames.

The ones that work all use Realtek. It has nothing to do with the CPU.
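
A quick way to confirm whether jumbo frames actually pass end to end (a sketch; 10.0.2.131 is one of the monitor addresses from the earlier post, and 8972 is a 9000 byte MTU minus 28 bytes of IP/ICMP headers):

# send full-size packets with the "don't fragment" bit set towards a Ceph monitor
ping -M do -s 8972 -c 3 10.0.2.131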
 
The current pve-kernel-4.13.8-3-pve in version -30 from pve-no-subscription fixes jumbo frames with igb.
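
A minimal sketch of moving to the fixed kernel, assuming the pve-no-subscription repository is already configured:

apt-get update
apt-get install pve-kernel-4.13.8-3-pve
# reboot into the new kernel, then re-test "pvesm status" and the rados call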
 
