[SOLVED] Grey question marks at containers, VMs, hosts and storage

Womumba

Active Member
Jul 24, 2019
Hello all,

I'm running a PVE cluster consisting of three HP DL380 Gen 10 hosts. All updates from PVE 6 through 9 always went flawlessly, until I updated from 9.0.5 to 9.1.6 via apt update && apt upgrade, which made the names of the LXC containers and VMs vanish from the web GUI. They, as well as the hosts and storages, now show a grey question mark instead.

Numerous attempts at restarting pveproxy, corosync and pve-cluster on all nodes didn't help. Restarting pvestatd makes the question marks on hosts, LXC containers and VMs go away, only for them to return after about three minutes.

I found no way to get rid of the storage question marks.
Trying to show the status of any container or VM runs into a connection timed out (596) error.
Trying to show the status of a storage results in a communication failure (0) or connection timed out (596) error, even for local storage.

Measures I tried to solve the problem include:
a) removing those IP addresses from /etc/corosync/corosync.conf which do not listen at UDP port 5405 (there were some configured as ring1_addr)
b) pinning the kernel to 6.14.11-5-pve

None of these steps helped.

Storage (e.g. the shares holding our VM disk images) is mostly attached via NFS and seems to work fine. Even migrating a VM's compute resources between nodes works.

I should also mention that running pvesm status on any node never returns; it must be interrupted to regain the prompt.
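(Side note for anyone hitting the same thing: wrapping such calls in timeout(1) at least keeps the shell usable. A sketch below; since pvesm needs a PVE node, `sleep 5` stands in for the hanging call here.)

```shell
# Bound a potentially hanging command with timeout(1).
# On a PVE node this would be:  timeout 10 pvesm status
# Here `sleep 5` stands in for the hanging pvesm call.
timeout 1 sleep 5
rc=$?
# timeout(1) exits with status 124 when it had to kill the command
echo "exit code: $rc"
```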

Please find the output of pveversion -v below. Please let me know which additional information is needed. Any help will be appreciated.
Code:
proxmox-ve: 9.1.0 (running kernel: 6.14.11-5-pve)
pve-manager: 9.1.6 (running version: 9.1.6/71482d1833ded40a)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17: 6.17.13-1
proxmox-kernel-6.17.13-1-pve-signed: 6.17.13-1
proxmox-kernel-6.17.9-1-pve-signed: 6.17.9-1
proxmox-kernel-6.14.11-5-pve-signed: 6.14.11-5
proxmox-kernel-6.14: 6.14.11-5
ceph-fuse: 19.2.3-pve1
corosync: 3.1.10-pve1
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx12
intel-microcode: 3.20251111.1~deb13u1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.7
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.4-1
proxmox-backup-file-restore: 4.1.4-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.8
pve-cluster: 9.0.7
pve-container: 6.1.2
pve-docs: 9.1.2
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-1
pve-ha-manager: 5.1.1
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-7
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.4
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.4.0-pve1
 
until I updated from 9.0.5 to 9.1.6 via apt update && apt upgrade which made the names of LXC containers
Are you certain that you used "apt upgrade"? That could have broken something.
Always use apt dist-upgrade; otherwise, some dependencies can get messed up!

See:
https://lists.proxmox.com/pipermail/pve-devel/2025-March/068874.html
and https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates

pvestatd is responsible for the status reporting. Since you restarted it and it stalled again after three minutes, is there anything in the logs?

journalctl -u pvestatd
 
Given that "pvesm" talks somewhat directly to storage accessible to the node where you are running it from, chances are that you have an issue with storage access that trips up "pvestatd" and other components.
I would remove all storage pools from storage.cfg, except "local" and "local-zfs/lvm", reboot/restart and monitor the system. If it stays stable - add one of your pools back, continue to monitor. If the issue comes back, troubleshoot that storage access.
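For reference, a stripped-down storage.cfg along these lines might look like the following (the entries shown are typical defaults, not your actual config - adjust names and paths to your setup):

```
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
```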

Cheers



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Are you certain that you used "apt upgrade"? That could have broken something.
Always use apt dist-upgrade; otherwise, some dependencies can get messed up!

See:
https://lists.proxmox.com/pipermail/pve-devel/2025-March/068874.html
and https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates
You are right, the docs explicitly recommend apt dist-upgrade in any case.
So, to answer your question: yes, I'm certain I used apt upgrade to install the packages last week, following my habit of only using apt dist-upgrade for major release upgrades. I probably have to rethink that habit...
pvestatd is responsible for the status reporting. Since you restarted it and it stalled again after three minutes, is there anything in the logs?

journalctl -u pvestatd
I see only messages from restarting pvestatd, but none at all when it stalls:

Code:
Mar 02 15:43:56 node1 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Mar 02 15:43:59 node1 pvestatd[873039]: received signal TERM
Mar 02 15:43:59 node1 pvestatd[873039]: server closing
Mar 02 15:43:59 node1 pvestatd[873039]: server stopped
Mar 02 15:44:00 node1 systemd[1]: pvestatd.service: Deactivated successfully.
Mar 02 15:44:00 node1 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Mar 02 15:44:00 node1 systemd[1]: pvestatd.service: Consumed 6.246s CPU time, 205.9M memory peak.
Mar 02 15:44:00 node1 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Mar 02 15:44:02 node1 pvestatd[885085]: starting server
Mar 02 15:44:02 node1 systemd[1]: Started pvestatd.service - PVE Status Daemon.
 
Given that "pvesm" talks somewhat directly to storage accessible to the node where you are running it from, chances are that you have an issue with storage access that trips up "pvestatd" and other components.
I would remove all storage pools from storage.cfg, except "local" and "local-zfs/lvm", reboot/restart and monitor the system. If it stays stable - add one of your pools back, continue to monitor. If the issue comes back, troubleshoot that storage access.

Thank you for your proposal.
Since all my VMs' storage is on remote storage and moving it to local storage doesn't work, this would mean shutting down all VMs. Please understand that I don't yet feel desperate enough for that approach, but circumstances might force me to take it in a few days if no better way emerges.

Cheers.
 
Thank you for the update @Womumba .
Perhaps if you share the contents of your /etc/pve/storage.cfg, someone may point out what you should be looking at or testing.
Based on pvesm hanging, there is clearly an issue with one of your remote storages being probed. You can try to use "strace" or "tcpdump" to get closer to the culprit.
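Before reaching for strace/tcpdump, a plain TCP probe of the NFS port can already tell whether the NAS answers at all. A sketch below; 192.0.2.10 is a placeholder address (substitute your NAS), and bash's /dev/tcp redirection is used so no extra tools are required:

```shell
# Quick probe: can we open a TCP connection to the NFS port at all?
# 192.0.2.10 is a placeholder (TEST-NET) address - substitute your NAS.
nas=192.0.2.10
if timeout 3 bash -c "exec 3<>/dev/tcp/$nas/2049" 2>/dev/null; then
    status="reachable"
else
    status="not reachable"
fi
echo "NFS TCP port 2049 on $nas: $status"
```

If the port is unreachable, the problem is on the NAS or the network path, not in PVE.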

Best of luck.


 
Thank you for the update @Womumba .
Perhaps if you share the contents of your /etc/pve/storage.cfg, someone may point out what you should be looking at or testing.
Based on pvesm hanging, there is clearly an issue with one of your remote storages being probed. You can try to use "strace" or "tcpdump" to get closer to the culprit.

Now I can say you were definitely pointing in the right direction:
pvesm list stalls when I run it against shares on one particular NAS, while shares on the other NAS (luckily my production one) can be queried successfully. On the problematic NAS, I can't even list the contents of the share, so it seems to be an issue with that NAS itself!
So I think I could remove the culprit's shares from my storage.cfg without losing much...

Cheers!
 
Great to hear @Womumba. If you think your question is solved, you can add the SOLVED prefix to the thread by editing the original post and selecting the appropriate subject prefix.


Hm, removing the affected shares via the GUI doesn't work - Connection error - Timeout.
Would it be sufficient to remove them directly from the storage.cfg with a text editor? And then restart a certain service, I guess?
 
Hm, removing the affected shares via GUI doesn't work -
Of course; it has to interact with pvestatd, and that is hung trying to talk to the NAS.
Just remove them from storage.cfg.
You can then follow that with:
systemctl try-reload-or-restart pvedaemon pveproxy pvestatd pvescheduler pve-ha-lrm
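If hand-editing feels risky, the stanza can also be commented out mechanically. A sketch - the storage name "nas2-nfs" and the file contents are invented examples, and it operates on a copy under /tmp rather than on /etc/pve/storage.cfg itself:

```shell
# Create a sample storage.cfg to work on (contents are made up).
cat > /tmp/storage.cfg <<'EOF'
dir: local
        path /var/lib/vz
        content iso,backup

nfs: nas2-nfs
        server 192.0.2.10
        export /srv/vmdata
        path /mnt/pve/nas2-nfs
        content images
EOF

# Comment out every line of the "nfs: nas2-nfs" stanza. A stanza ends at
# the next blank line or the next unindented header line.
awk '
  /^nfs: nas2-nfs$/ { skip = 1 }
  skip && ((/^[^[:space:]]/ && !/^nfs: nas2-nfs$/) || /^$/) { skip = 0 }
  { print (skip ? "#" $0 : $0) }
' /tmp/storage.cfg > /tmp/storage.cfg.new

# Show that the stanza header is now commented out
grep '^#nfs: nas2-nfs' /tmp/storage.cfg.new
```

Review /tmp/storage.cfg.new before copying it into place, then reload the services as above.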



 
Good news:
After commenting out the shares on the affected NAS in /etc/pve/storage.cfg and restarting pvestatd, my PVE cluster works fine again! pvesm status no longer stalls and shows the expected result. All question marks in the GUI have vanished, and status is available for containers, VMs and storage. This state has also survived the three-minute threshold.

Many thanks to all who have made suggestions.

Subtle hint:
systemctl try-reload-or-restart pvedaemon pveproxy pvestatd pvescheduler pve-ha-lrm didn't do the trick in this case.
I had to systemctl restart pvestatd.

Thanks again and goodbye!