Proxmox GUI shows some guests as status unknown?

berbe01

I updated in June. Ever since, the web GUI randomly shows some guests as status unknown for an unpredictable length of time: sometimes 5 seconds, sometimes 2 minutes. It doesn't affect specific guests; it happens to random guests at random intervals. It then fixes itself and they appear normally again.

I have rebooted twice since the update: once right after it and once yesterday.

This happens on multiple computers/browsers, so it is not a cache issue.


[Image: jAGO54M.png]





Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-3-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-9
pve-kernel-helper: 7.2-9
pve-kernel-5.4: 6.4-17
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.4.189-1-pve: 5.4.189-1
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
You should tell us more about your storages and VM configs (for example, the output of pvesm status, cat /etc/pve/storage.cfg, qm config 137, and qm config 139).
I would guess you are storing your VMs on an NFS/SMB share and sometimes lose the connection.
 
VMs are stored locally. I only store backups on NFS.
BigRAID and OtherRAID are a RAID 10 and a RAID 5 array on a Dell R720 with an LSI 9200-8e RAID card.
MediaRAID is an MD1200 connected to the R720 via SAS cables and an LSI SAS 9211-8i.
None of the drives show health errors in iDRAC.



pvesm status:

Code:
Name             Type     Status           Total            Used       Available        %
BigRAID           dir     active      2112646060      1875957252       129298296   88.80%
FreeNAS           nfs     active      1825425664      1027596672       797828992   56.29%
MediaRAID         dir     active     38908703184     12308789912     24646635936   31.64%
OtherRAID         dir     active       959258692       182667132       727790232   19.04%
images            dir   disabled               0               0               0      N/A
local             dir     active        98497780        48475708        44972524   49.22%
local-lvm     lvmthin     active       361758720               0       361758720    0.00%



storage.cfg:

Code:
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

dir: images
        disable
        path /template/iso/
        content images
        prune-backups keep-last=1
        shared 0

dir: BigRAID
        path /mnt/RAID10
        content vztmpl,rootdir,images,snippets,iso,backup
        prune-backups keep-last=1
        shared 0

dir: OtherRAID
        path /mnt/RAID5
        content images,vztmpl,rootdir,iso,backup
        prune-backups keep-last=1
        shared 0

dir: MediaRAID
        path /mnt/MediaRAID
        content iso,vztmpl,rootdir,snippets,images
        shared 0

nfs: FreeNAS
        export /mnt/Proxmox
        path /mnt/pve/FreeNAS
        server 10.12.1.4
        content snippets,images,backup,rootdir,iso,vztmpl
        prune-backups keep-last=3




qm config 137:


Code:
balloon: 0
boot: order=virtio0;net0
cores: 2
memory: 4096
name: NextCloud
net0: virtio=5E:76:70:A0:F3:8C,bridge=vmbr2,tag=74
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=4f4f1889-0f0f-46e8-8898-01936604838f
sockets: 2
virtio0: BigRAID:137/vm-137-disk-0.qcow2,size=64G
vmgenid: 53ba2b16-7742-4a25-ba88-062ec06eae23




qm config 139:

Code:
boot: order=virtio0;net0
cores: 2
memory: 4096
name: Mint
net0: virtio=06:DE:D9:B7:5B:4B,bridge=vmbr2
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=c24e9968-898e-4082-bee0-292944cbe2d3
sockets: 1
virtio0: BigRAID:139/vm-139-disk-0.qcow2,size=32G
vmgenid: 2e132d66-95aa-418c-8341-1436146942b0
 
Hi,
maybe journalctl -b -u pvestatd.service or /var/log/syslog contain some related messages/errors.
 
I have the same issue when one of my NFS shares from another server stops responding. This share is used as a directory (but not to store VMs) on my Proxmox machine.
This shows the 'unknown' status, but I don't think you should have other issues because of it.
 
I haven't had time to monitor it again, but on Monday or Tuesday I'm going to watch for it again and check the logs fiona mentioned.

I looked at syslog and journalctl yesterday but didn't see any reference to VM 139.
 
Hi,
I have the same issue when one of my NFS shares from another server stops responding. This share is used as a directory (but not to store VMs) on my Proxmox machine.
This shows the 'unknown' status, but I don't think you should have other issues because of it.
In this case it's unfortunately expected, because pvestatd (which is responsible for collecting status information) gets stuck when trying to query the hanging NFS mount. We are planning to rewrite pvestatd in the future to make it more robust in such scenarios, but hanging NFS mounts are not handled particularly nicely in Linux. You can unmount it with umount -l -f /path/to/mount (a reboot is still the better option, but only if you had submounts). From the man page:
Code:
-l, --lazy
              Lazy unmount.  Detach the filesystem from the file hierarchy now, and clean up all references to this filesystem as soon as it is not busy anymore.

A system reboot would be expected in near future if you're going to use this option for network filesystem or local filesystem with submounts. The recommended use-case for umount -l is to
prevent hangs on shutdown due to an unreachable network share where a normal umount will hang due to a downed server or a network partition. Remounts of the share will not be possible.
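
To find out whether a mount is hanging without getting your own shell stuck on it, something like the following might help. This is only a minimal sketch and not part of Proxmox: path_responds is a hypothetical helper, the 5-second default timeout is an arbitrary assumption, and the idea is simply to run the stat call in a child process so that a dead NFS mount blocks the child rather than the caller.

```python
import subprocess
import sys

# One-liner executed in a child interpreter: stat the path given as argv[1].
_PROBE = "import os, sys; os.stat(sys.argv[1])"


def path_responds(path, timeout=5.0):
    """Return True if a stat of `path` succeeds within `timeout` seconds.

    A missing path also yields False, so this cannot tell "hanging"
    apart from "not there"; it only says the path is not usable right now.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means os.stat() returned without raising.
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Request a kill but do not wait for it: a process blocked on a
        # dead mount may not exit promptly.
        proc.kill()
        return False
```

For the setup in this thread one would call it with the NFS mount point from storage.cfg, for example path_responds("/mnt/pve/FreeNAS").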

I haven't had time to monitor it again, but on Monday or Tuesday I'm going to watch for it again and check the logs fiona mentioned.

I looked at syslog and journalctl yesterday but didn't see any reference to VM 139.
Even if 139 wasn't explicitly mentioned, were there any other errors/warnings? You can post the logs here if you want.
 
I observed it for about an hour today and didn't see it happen, so maybe it resolved itself.
Logs from last week:




Code:
Aug 22 03:03:33 proxmox systemd[1]: Reloading PVE Status Daemon.
Aug 22 03:03:34 proxmox pvestatd[1597922]: send HUP to 3260
Aug 22 03:03:34 proxmox pvestatd[3260]: received signal HUP
Aug 22 03:03:34 proxmox pvestatd[3260]: server shutdown (restart)
Aug 22 03:03:34 proxmox systemd[1]: Reloaded PVE Status Daemon.
Aug 22 03:03:35 proxmox pvestatd[3260]: restarting server
Aug 22 04:15:26 proxmox pvestatd[3260]: unable to get PID for CT 102 (not running?)
Aug 22 04:15:27 proxmox pvestatd[3260]: modified cpu set for lxc/103: 6,11
Aug 22 04:15:27 proxmox pvestatd[3260]: modified cpu set for lxc/107: 13,30
Aug 22 04:15:27 proxmox pvestatd[3260]: modified cpu set for lxc/132: 3-4,18,31
Aug 22 05:05:47 proxmox pvestatd[3260]: VM 109 qmp command failed - VM 109 not running
Aug 22 08:54:43 proxmox pvestatd[3260]: status update time (6.278 seconds)
Aug 22 09:44:32 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 22 09:44:33 proxmox pvestatd[3260]: status update time (6.960 seconds)
Aug 22 09:44:42 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 22 09:44:43 proxmox pvestatd[3260]: status update time (6.936 seconds)
Aug 22 10:39:53 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 22 10:39:54 proxmox pvestatd[3260]: status update time (6.912 seconds)
Aug 22 10:40:03 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 22 10:40:04 proxmox pvestatd[3260]: status update time (6.859 seconds)
Aug 22 15:58:28 proxmox pvestatd[3260]: unable to get PID for CT 105 (not running?)
Aug 22 15:58:29 proxmox pvestatd[3260]: modified cpu set for lxc/102: 9
Aug 22 15:58:29 proxmox pvestatd[3260]: modified cpu set for lxc/132: 3-4,10,18
Aug 22 15:58:29 proxmox pvestatd[3260]: unable to get PID for CT 105 (not running?)
Aug 22 20:05:10 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 23 20:05:14 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 24 20:05:23 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 25 20:05:33 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 27 22:08:58 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 27 22:08:59 proxmox pvestatd[3260]: status update time (6.922 seconds)
Aug 27 22:09:08 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 27 22:09:09 proxmox pvestatd[3260]: status update time (6.970 seconds)
Aug 27 22:16:38 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Aug 27 22:16:39 proxmox pvestatd[3260]: status update time (6.915 seconds)
Aug 27 23:59:43 proxmox pvestatd[3260]: VM 114 qmp command failed - VM 114 not running
Aug 28 00:04:38 proxmox pvestatd[3260]: status update time (6.560 seconds)
Aug 28 20:05:35 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 29 20:05:36 proxmox pvestatd[3260]: auth key pair too old, rotating..
Aug 30 20:05:36 proxmox pvestatd[3260]: auth key pair too old, rotating..


Not sure what that qmp error is.

qm config 114:

Code:
balloon: 4096
boot: c
bootdisk: virtio0
cores: 4
memory: 6192
name: WindowsPMAT
net0: virtio=12:12:6A:33:F1:77,bridge=vmbr2,firewall=1,link_down=1,tag=66
net1: virtio=D6:51:50:05:34:2F,bridge=vmbr5,firewall=1
numa: 1
onboot: 1
ostype: win10
parent: CleanV2_Spice
scsihw: virtio-scsi-pci
smbios1: uuid=c990d835-27a7-4e96-9df8-63aaa3e65062
sockets: 2
spice_enhancements: videostreaming=all
vga: qxl
virtio0: BigRAID:114/vm-114-disk-0.qcow2,size=106G
vmgenid: a1e8cde3-1865-47b7-98fd-a47d2c6047ed
 
Well, the status update sometimes takes a few seconds, but not that long. Is this a stand-alone node or part of a cluster? How does the IO/CPU/etc. load on the node look when the issue happens?
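
To get a quick overview of how long the logged status updates took, the "status update time" lines can be pulled out of the pvestatd log. A minimal sketch, assuming the line format shown in the log posted earlier in this thread; status_update_times is a hypothetical helper name, not a Proxmox tool.

```python
import re

# Matches e.g. "pvestatd[3260]: status update time (6.278 seconds)"
UPDATE_RE = re.compile(
    r"pvestatd\[\d+\]: status update time \(([\d.]+) seconds\)"
)


def status_update_times(lines):
    """Return the duration in seconds of every logged status update line."""
    times = []
    for line in lines:
        match = UPDATE_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times
```

Feeding it the saved output of journalctl -b -u pvestatd.service line by line and taking max() of the result shows the slowest logged update.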

Not sure what that qmp error is.
It can happen when there's some load-heavy operation with the VM going on, but might also have other reasons. You could check in the syslog if there was some other task around the time the error happened.
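
Checking the syslog around each error can be scripted. The following is a rough sketch under several assumptions: the lines use the classic syslog format seen in this thread ("Aug 22 09:44:32 proxmox pvestatd[3260]: ..."), the year is not in the log and has to be supplied (2022 here), month names are in English, and the 60-second window is arbitrary. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# timestamp, host, program, optional [pid], message
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) ([^:\[]+)(?:\[\d+\])?: (.*)$"
)


def parse(line, year=2022):
    """Split a syslog line into (timestamp, program, message), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, _host, program, message = match.groups()
    # Classic syslog timestamps carry no year, so one has to be assumed.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, program, message


def neighbours_of_qmp_errors(lines, window_seconds=60, year=2022):
    """Map each 'qmp command failed' entry to nearby non-pvestatd lines."""
    parsed = [p for p in (parse(line, year) for line in lines) if p]
    window = timedelta(seconds=window_seconds)
    result = {}
    for when, program, message in parsed:
        if program != "pvestatd" or "qmp command failed" not in message:
            continue
        # Collect entries from other programs logged within the window.
        result[(when, message)] = [
            (w, prog, msg)
            for w, prog, msg in parsed
            if prog != "pvestatd" and abs(w - when) <= window
        ]
    return result
```

An empty list for an error means nothing else was logged within the window, which would point away from a concurrent task as the cause.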
 
Stand-alone.

I didn't look at IO at the time, but here are the maximums.
The spike happened on 8/20, but the load looks normal around 8/22 and 8/27, when the status updates took longer than normal.
[Image: 2022-09-02_3-16-18.png]




Here's the disk IO of VM 137, which I originally reported.

[Image: 2022-09-02_4-38-08.png]
 