Cluster node semi-offline?

danb35

Since rebooting one node of my three-node cluster this morning, I'm seeing some strange behavior, in that one node (often, but not always, the one I rebooted) appears offline, like this:
[screenshot: the node shown as offline in the GUI tree]
The node is up and running, I can ssh to it, and in this case, I'm even logged into that node's web GUI. The datacenter summary looks good:
[screenshot: datacenter summary]

I can ping at least one of the VMs that should be (and apparently is) running on pve1. I can even click on a running VM and it will show its status, though CPU/memory reporting stopped about half an hour ago:
[screenshot: running VM's status page]

Where should I be looking to correct this?

Again, it's a three-node cluster running on three nearly identical nodes of a Dell PowerEdge C6220 (the only difference is the amount of RAM, 80+ GB), each with 2x Xeon E5-2680 v2. pveversion -v on pve1 shows:
Code:
root@pve1:~# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-4
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

While I was typing this, the same thing happened to pve3, which is the node I rebooted this morning (and have since rebooted a couple of times), but it reversed itself within a minute or so. pve1, however, remains as shown.
 
This would indicate a failure in 'pvestatd'. It can have many causes; often it is (network) storage related. Check the logs ('journalctl -e') on the affected node.
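For example (standard journalctl/systemctl invocations, nothing Proxmox-specific; the unit filter just narrows the output to the status daemon):
Code:
# jump to the end of the full journal on the affected node
journalctl -e

# or limit the output to the pvestatd unit, optionally following new entries
journalctl -u pvestatd -e
journalctl -u pvestatd -f

# quick health check of the daemon itself
systemctl status pvestatd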
 
Both were correct:
Code:
root@pve1:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-08-25 16:33:32 EDT; 1 weeks 4 days ago
    Process: 2426 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
    Process: 1476418 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
   Main PID: 2532 (pvestatd)
      Tasks: 1 (limit: 96465)
     Memory: 96.0M
        CPU: 7h 8min 19.427s
     CGroup: /system.slice/pvestatd.service
             └─2532 pvestatd

Sep 06 08:33:11 pve1 pvestatd[2532]: got timeout
Sep 06 08:33:11 pve1 pvestatd[2532]: unable to activate storage 'NFS' - directory '/mnt/pve/NFS' does not exist or is unreachable
Sep 06 08:33:13 pve1 pvestatd[2532]: got timeout
Sep 06 08:33:13 pve1 pvestatd[2532]: unable to activate storage 'FreeNAS-mirror-NFS' - directory '/mnt/pve/FreeNAS-mirror-NFS' does>
Sep 06 08:33:20 pve1 pvestatd[2532]: got timeout
Sep 06 08:33:20 pve1 pvestatd[2532]: unable to activate storage 'FreeNAS-mirror-NFS' - directory '/mnt/pve/FreeNAS-mirror-NFS' does>
Sep 06 08:33:22 pve1 pvestatd[2532]: got timeout
Sep 06 08:33:22 pve1 pvestatd[2532]: unable to activate storage 'NFS' - directory '/mnt/pve/NFS' does not exist or is unreachable
Sep 06 08:33:30 pve1 pvestatd[2532]: got timeout
Sep 06 08:33:30 pve1 pvestatd[2532]: unable to activate storage 'NFS' - directory '/mnt/pve/NFS' does not exist or is unreachable

Restarting that (systemctl restart pvestatd) cleared the immediate problem, and those two NFS mounts did eventually come back online.
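For anyone who hits the same thing, the recovery was roughly the following (pvesm status is the stock PVE storage tool; output will obviously differ per setup):
Code:
# restart the status daemon on the affected node
systemctl restart pvestatd

# confirm the daemon is running again
systemctl status pvestatd

# verify the NFS storages report as active once more
pvesm status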

I expect that issue was caused by the TrueNAS server rebooting, though that had happened several hours earlier. Is there a way to "refresh" the NFS mounts in such cases?
 
It should come back automatically after a while; if I'm not mistaken, the kernel NFS client retries by default. If it doesn't, maybe the server is using an outdated NFS version, or something on that side requires a full reconnect of sorts?
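If a mount does stay stale, one possible way to force a reconnect (just a sketch, assuming the storage is named 'NFS' as in the log above, not an official procedure) is to toggle the storage so PVE unmounts and remounts it, or to lazy-unmount the path by hand:
Code:
# toggle the storage definition so PVE tears down and re-establishes the mount
pvesm set NFS --disable 1
pvesm set NFS --disable 0

# or force/lazy-unmount the stale path and let pvestatd remount it on activation
umount -f -l /mnt/pve/NFS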
 
