After update to 8.2.4 all nodes in cluster going grey

Soryu

New Member
Jun 18, 2024
Hello folks,

after updating to 8.2.4, all nodes in the cluster go grey (marked with a grey question mark) a couple of minutes after startup.

There were no issues during the update process. PVE version: pve-manager/8.2.4/faa83925c9641325

As all the VMs keep working properly, it seems that only the PVE Status Daemon is affected.

Running systemctl restart pvestatd on each host brings the respective host back to "life". After the manual restart of the daemon there is (so far) no grey question mark. This behaviour is reproducible: after a reboot the host goes grey again after a couple of minutes, and a manual restart of pvestatd seems to solve the issue.
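For clusters with several hosts, the per-host restart can be scripted. A hedged sketch, assuming hypothetical node names pve1–pve3 and root SSH access between the nodes; the loop below only prints each command as a dry run, so remove the leading echo to actually execute it:

```shell
# Hypothetical node names - replace with your own cluster's hosts.
NODES="pve1 pve2 pve3"

# Dry run: print the restart command for each node.
# Remove the leading 'echo' to actually run it over SSH (as root).
for node in $NODES; do
    echo ssh "root@$node" systemctl restart pvestatd
done
```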

Is there anything running in a timeout after reboot?

Best regards,

Joerg
 
I've run into the same problem; I have to restart pvestatd after every reboot or else everything is greyed out.
 
Hi, can you please post
  • the full output of pveversion -v
  • the file generated by the following command?
    Code:
    journalctl -b -u pvestatd.service > pvestatd-log.txt
 
FYI - I started no VM on this host pve2 after reboot.

root@pve2:/tmp# journalctl -b -u pvestatd.service
Jun 20 13:23:35 pve2 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 13:23:51 pve2 pvestatd[1588]: starting server
Jun 20 13:23:51 pve2 systemd[1]: Started pvestatd.service - PVE Status Daemon.
root@pve2:/tmp#

 

I'm having the same issue.

pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

pvestatd-log.txt
Jun 20 10:44:22 hostname systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 10:44:24 hostname pvestatd[1442]: starting server
Jun 20 10:44:25 hostname systemd[1]: Started pvestatd.service - PVE Status Daemon.
Jun 20 11:19:42 hostname systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Jun 20 11:19:43 hostname pvestatd[1442]: received signal TERM
Jun 20 11:19:43 hostname pvestatd[1442]: server closing
Jun 20 11:19:43 hostname pvestatd[1442]: server stopped
Jun 20 11:19:44 hostname systemd[1]: pvestatd.service: Deactivated successfully.
Jun 20 11:19:44 hostname systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Jun 20 11:19:44 hostname systemd[1]: pvestatd.service: Consumed 5.572s CPU time.
Jun 20 11:19:44 hostname systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 11:19:46 hostname pvestatd[6180]: starting server
Jun 20 11:19:46 hostname systemd[1]: Started pvestatd.service - PVE Status Daemon.

Thanks!
 
Thanks. It looks like the journal restricted to pvestatd.service is not helpful, so we might have to look at the full journal. Could you run journalctl -b and extract the excerpt up until the point where you manually restarted pvestatd?
 
Shorter log file attached.
FYI - When I initiate a reboot on the GUI then the status is fine (question mark gone) for some seconds until the node reboots.
 

Same here: when I initiate a reboot from the GUI, the status is green until the reboot starts. One more thing I can mention: I have a 3-node cluster. I updated the first node yesterday and it went into grey mode; the Summary page was empty and all storage was greyed out as well. I updated node 2 and node 3 today and they are grey now too, but the first node is green again and looks fine.
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1 (EDIT: kernel is unrelated, see [1]), I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3) (EDIT: kernel is unrelated, see [1]).

[1] https://forum.proxmox.com/threads/a...odes-in-cluster-going-grey.149172/post-676672
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1, I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3).
I do not have ESXi storage defined and I get this bug after about 24hrs of uptime.
 
FYI - the "5-minute grey node" bug does not occur if the ESXi storage is disabled on my test cluster.
Thanks fweber for the second workaround.
 
Small single-host PVE setup
1 x M.2 SSD
mounted ESXi storage: yes

Behaviour:
Storage "local" + "local-lvm" greyed out right after boot; datacenter and running VMs after 5 minutes.
A manual "systemctl restart pvestatd" afterwards fixes the problem until the next reboot.

Next try:
Rebooted the system - storages were greyed out again.
Deleted the mounted ESXi storage ---> the greyed-out storage instantly switched to green,
and stayed in this state; after a reboot everything was fine as well.

Hope this helps.
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1, I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3).
You are right!

I finished my migration and deleted the ESXi storage, and I have had no grey question mark after a reboot (yet!).
 
Thanks for confirming the presence of an ESXi storage! It turns out the kernel was unrelated to this issue. The issue was most likely caused by a regression in libpve-storage-perl=8.2.2 that inadvertently made pvestatd block after activating an ESXi storage. We have reverted [1] the commit introducing the regression. I'll update you here once the updated package is available in pvetest/no-subscription. Until then, I'd suggest restarting pvestatd as a workaround, as already pointed out by @Soryu.

Details: The first pvestatd process started after boot activates the ESXi storage, blocks afterwards and cannot send status information to the other nodes, so the node goes gray in the GUI after 5 minutes without status updates. Restarting pvestatd is a workaround because the second pvestatd process realizes the ESXi storage is already active and does not try to activate it again.
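The mechanism above can be sketched as a toy shell model. To be clear, this is not the real pve-storage code; the state file merely stands in for the "storage already active" marker that the second pvestatd sees:

```shell
# Toy model of the failure mode, NOT the real pve-storage code.
# Activation only blocks when the storage is not yet marked active,
# which is why a restarted pvestatd gets past it.
STATE=$(mktemp -u)   # path for the stand-in "already active" marker

activate() {
    if [ -e "$STATE" ]; then
        # Path taken by the second pvestatd after the manual restart.
        echo "already active - skipping activation"
        return 0
    fi
    : > "$STATE"
    # In the buggy libpve-storage-perl 8.2.2, the first pvestatd would
    # hang at this point and never send status updates again.
    echo "activating (first pvestatd would hang here)"
}

activate   # first pvestatd after boot
activate   # second pvestatd after 'systemctl restart pvestatd'
```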

Thanks for reporting this!

I do not have ESXi storage defined and I get this bug after about 24hrs of uptime.
If you don't have an ESXi storage, this sounds like a different issue. Can you check the journal after 24h of uptime? If you want, please open a new thread (and mention me via @fweber) and attach the relevant portions of the journal there.

[1] https://git.proxmox.com/?p=pve-storage.git;a=commit;h=9a594f594dd40295db1b55c442f3d55c0be3bbfc
 
