After update to 8.2.4 all nodes in cluster going grey

Soryu

New Member
Jun 18, 2024
Hello folks,

after updating to 8.2.4, all nodes in the cluster go grey (marked with a grey question mark) a couple of minutes after startup.

There were no issues during the update process. PVE version: pve-manager/8.2.4/faa83925c9641325

As all the VMs keep working properly, it seems that only the PVE Status Daemon is affected.

Running systemctl restart pvestatd on each host brings the respective host back to "life". After the manual restart of the daemon there is (so far) no grey question mark. This behaviour is reproducible: after a reboot the host goes grey again after a couple of minutes, and a manual restart of pvestatd seems to solve the issue.
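For clusters with several hosts, the per-host restart can be scripted. A hedged sketch, assuming hypothetical node names pve1–pve3 and root SSH access between the nodes; the loop below only prints each command as a dry run, so remove the leading echo to actually execute it:

```shell
# Hypothetical node names - replace with your own cluster's hosts.
NODES="pve1 pve2 pve3"

# Dry run: print the restart command for each node.
# Remove the leading 'echo' to actually run it over SSH (as root).
for node in $NODES; do
    echo ssh "root@$node" systemctl restart pvestatd
done
```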

Is there anything running in a timeout after reboot?

Best regards,

Joerg
 
I've run into the same problem; I have to restart pvestatd after every reboot or else everything is greyed out.
 
Hi, can you please post
  • the full output of pveversion -v
  • the file generated by the following command?
    Code:
    journalctl -b -u pvestatd.service > pvestatd-log.txt
 
FYI - I started no VM on this host pve2 after reboot.

root@pve2:/tmp# journalctl -b -u pvestatd.service
Jun 20 13:23:35 pve2 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 13:23:51 pve2 pvestatd[1588]: starting server
Jun 20 13:23:51 pve2 systemd[1]: Started pvestatd.service - PVE Status Daemon.
root@pve2:/tmp#

 

I'm having the same issue.

pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

pvestatd-log.txt
Jun 20 10:44:22 hostname systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 10:44:24 hostname pvestatd[1442]: starting server
Jun 20 10:44:25 hostname systemd[1]: Started pvestatd.service - PVE Status Daemon.
Jun 20 11:19:42 hostname systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Jun 20 11:19:43 hostname pvestatd[1442]: received signal TERM
Jun 20 11:19:43 hostname pvestatd[1442]: server closing
Jun 20 11:19:43 hostname pvestatd[1442]: server stopped
Jun 20 11:19:44 hostname systemd[1]: pvestatd.service: Deactivated successfully.
Jun 20 11:19:44 hostname systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Jun 20 11:19:44 hostname systemd[1]: pvestatd.service: Consumed 5.572s CPU time.
Jun 20 11:19:44 hostname systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 20 11:19:46 hostname pvestatd[6180]: starting server
Jun 20 11:19:46 hostname systemd[1]: Started pvestatd.service - PVE Status Daemon.

Thanks!
 
Thanks. It looks like the journal restricted to pvestatd.service is not helpful, so we might have to look at the full journal. Could you run journalctl -b and extract the excerpt up until the point where you manually restarted pvestatd?
 
Shorter log file attached.
FYI - When I initiate a reboot on the GUI then the status is fine (question mark gone) for some seconds until the node reboots.
 

Same here: when I initiate a reboot from the GUI, the status is green until the reboot starts. One more thing I can mention: I have a 3-node cluster. I updated the first node yesterday and it went into grey mode; the Summary page was empty and all storage was greyed out as well. I updated node 2 and node 3 today and they are grey now too, but the first node is green again and looks fine.
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1 (EDIT: kernel is unrelated, see [1]), I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3) (EDIT: kernel is unrelated, see [1]).

[1] https://forum.proxmox.com/threads/a...odes-in-cluster-going-grey.149172/post-676672
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1, I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3).
I do not have ESXi storage defined and I get this bug after about 24hrs of uptime.
 
FYI - the "5-minute grey node" bug does not occur if the ESXi storage is disabled on my test cluster.
Thanks fweber for the second workaround.
 
Small single-host PVE setup
1 x M.2 SSD
mounted ESXi storage: yes

Behaviour:
Storage "local" + "local-lvm" greyed out right after boot; datacenter and running VMs after 5 minutes.
A manual "systemctl restart pvestatd" afterwards fixes the problem until the next reboot.

Next try:
Rebooted the system - storages were greyed out again.
Deleted the mounted ESXi storage ---> the greyed-out storage instantly switched to green,
and stayed in this state; after a reboot everything was fine as well.

Hope this helps.
 
Thanks for the journals. Do I see correctly that all affected users have an ESXi storage defined? If I define an ESXi storage and boot into kernel 6.8.8-1, I also see gray nodes ~5 minutes after boot, and restarting pvestatd seems to fix it. We'll look into the issue tomorrow and keep you posted here.

Restarting pvestatd seems like the easiest workaround for now, but if that does not work for you, you could also try booting into an older kernel (e.g. 6.8.4-3).
You are right!

I finished my migration and deleted the ESXi storage, and I have had no grey question mark after a reboot (yet!).
 
Thanks for confirming the presence of an ESXi storage! It turns out the kernel was unrelated to this issue. The issue was most likely caused by a regression in libpve-storage-perl=8.2.2 that inadvertently made pvestatd block after activating an ESXi storage. We have reverted [1] the commit introducing the regression. I'll update you here once the updated package is available in pvetest/no-subscription. Until then, I'd suggest restarting pvestatd as a workaround, as already pointed out by @Soryu.

Details: The first pvestatd process started after boot activates the ESXi storage, blocks afterwards and cannot send status information to the other nodes, so the node goes gray in the GUI after 5 minutes without status updates. Restarting pvestatd is a workaround because the second pvestatd process realizes the ESXi storage is already active and does not try to activate it again.
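The mechanism above can be sketched as a toy shell model. To be clear, this is not the real pve-storage code; the state file merely stands in for the "storage already active" marker that the second pvestatd sees:

```shell
# Toy model of the failure mode, NOT the real pve-storage code.
# Activation only blocks when the storage is not yet marked active,
# which is why a restarted pvestatd gets past it.
STATE=$(mktemp -u)   # path for the stand-in "already active" marker

activate() {
    if [ -e "$STATE" ]; then
        # Path taken by the second pvestatd after the manual restart.
        echo "already active - skipping activation"
        return 0
    fi
    : > "$STATE"
    # In the buggy libpve-storage-perl 8.2.2, the first pvestatd would
    # hang at this point and never send status updates again.
    echo "activating (first pvestatd would hang here)"
}

activate   # first pvestatd after boot
activate   # second pvestatd after 'systemctl restart pvestatd'
```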

Thanks for reporting this!

I do not have ESXi storage defined and I get this bug after about 24hrs of uptime.
If you don't have an ESXi storage, this sounds like a different issue. Can you check the journal after 24h of uptime? If you want, please open a new thread (and mention me via @fweber) and attach the relevant portions of the journal there.

[1] https://git.proxmox.com/?p=pve-storage.git;a=commit;h=9a594f594dd40295db1b55c442f3d55c0be3bbfc
 
