One node in PVE cluster greyed out, commands hang

Aug 8, 2019
I have a strange problem with one of the hypervisors in my PVE cluster. No changes other than adding a VM on that node today.

One of the nodes in the cluster is shown greyed out, with no names for any of its hosted VMs. Just grey question marks.

I can ssh to the host, but most commands hang the ssh session. I cannot even run "pvecm status" or "df -h".

VMs running on the affected host do not seem to be impacted.

Any ideas? Seems that the /etc/pve mount is gone or unreachable.

I need to recover if possible without restarting all the VM guests.

Here's the view from another node in the cluster, installed from the same ISO:

root@pve-02:~# pvecm status
Quorum information
------------------
Date: Tue Oct 20 19:06:25 2020
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1/364
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.44.172
0x00000002 1 192.168.44.174
0x00000003 1 192.168.44.190
0x00000004 1 192.168.44.171 (local)

root@pve-02:/etc# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
>>Any ideas? Seems that the /etc/pve mount is gone or unreachable.

Restart the cluster service: "systemctl restart pve-cluster"

It'll launch a "pmxcfs" process, which is the process that mounts /etc/pve.

(And verify that you don't have any files in /etc/pve/.. before restarting.)
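
A minimal sketch of that pre-restart check, assuming the advice means the /etc/pve directory should hold no stray files while pmxcfs is not mounted. `dir_has_entries` is a hypothetical helper name, not a Proxmox tool, and the 5-second limit is an arbitrary choice. Note that `timeout` may not be able to stop a process that is stuck in uninterruptible I/O.

```shell
#!/bin/sh
# Sketch only: report whether a directory has any entries, giving up
# after 5 seconds so a hung mount is less likely to block the shell.
# Exit status: 0 = entries found, 1 = empty, 2 = listing failed or timed out.
dir_has_entries() {
    dir=$1
    # ls -A lists everything except "." and ".."
    entries=$(timeout 5 ls -A "$dir" 2>/dev/null) || return 2
    [ -n "$entries" ]
}
```

On the affected node this could be called as `dir_has_entries /etc/pve` before running `systemctl restart pve-cluster`.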
 
Will doing a "systemctl restart pve-cluster" kill the running VMs?

As for looking at the contents of /etc/pve:
If I run that command from an ssh session, the session hangs.
I ran ps -ef from the iLO console and it hung indefinitely. I can now only manage the server via ssh.
 
>>Will doing a "systemctl restart pve-cluster" kill the running VMs?
No. You can restart the different Proxmox services without any impact on the VMs.

>>As for looking at the contents of /etc/pve:
>>If I run that command from an ssh session, the session hangs.
>>I ran ps -ef from the iLO console and it hung indefinitely. I can now only manage the server via ssh.

Is /etc/pve still mounted? ("Seems that the /etc/pve mount is gone or unreachable.")
What is the output of the "df" command?

Normally, if you have a "pmxcfs" process running (started by pve-cluster.service), /etc/pve should be mounted.
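
Since "df" itself may hang on a broken mount, here is a rough sketch of checking both conditions without touching /etc/pve. It only reads /proc/mounts and looks for the process by name. `mount_listed` and `pmxcfs_running` are hypothetical helper names chosen for illustration.

```shell
#!/bin/sh
# mount_listed PATH [MOUNTS_FILE]: succeed if PATH appears as a mount point.
# Reads the mount table only; it never accesses PATH itself.
mount_listed() {
    # Field 2 of each /proc/mounts line is the mount point
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# pmxcfs_running: succeed if a process named exactly "pmxcfs" exists.
pmxcfs_running() {
    pgrep -x pmxcfs >/dev/null 2>&1
}
```

For example, `mount_listed /etc/pve && echo mounted` and `pmxcfs_running && echo running` together show which of the two cases applies.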

If /etc/pve is still mounted but the pmxcfs process is not running, you can try a lazy unmount (umount -lf /etc/pve), then start the pve-cluster service again.
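
The two steps above could be wrapped roughly like this. It is a sketch, not a tested recovery procedure: `run`, `recover_pve_mount`, and the `DRY_RUN` switch are names made up for this example, and the guard simply refuses to unmount while a pmxcfs process still exists.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_pve_mount: lazy-unmount a stale /etc/pve, then start pve-cluster.
# Does nothing (and returns 1) if a pmxcfs process is still running.
recover_pve_mount() {
    if pgrep -x pmxcfs >/dev/null 2>&1; then
        echo "pmxcfs is running; not unmounting"
        return 1
    fi
    run umount -lf /etc/pve
    run systemctl start pve-cluster
}
```

Calling `DRY_RUN=1 recover_pve_mount` first prints the commands without executing them, which is a cheap way to double-check before running it for real as root.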

>>I ran ps -ef from the iLO console and it hung indefinitely.
That's really strange. I hope that you don't have a physical problem on your server...
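
If ps itself hangs, one possible way to look for stuck processes is to read /proc/<pid>/status directly. This is only a sketch, and it assumes those files still respond when ps does not, which is not confirmed in this thread. `list_dstate` is a hypothetical helper that prints the PID and name of every process in state D (uninterruptible sleep).

```shell
#!/bin/sh
# list_dstate [PROC_ROOT]: print "PID NAME" for each process whose
# State field in its status file is D (uninterruptible sleep).
list_dstate() {
    root=${1:-/proc}
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        dir=${f%/status}
        pid=${dir##*/}
        # Processes can exit between the glob and the read; ignore errors
        awk -v pid="$pid" '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            END       { if (state == "D") print pid, name }
        ' "$f" 2>/dev/null
    done
}
```

Running `list_dstate` with no argument scans the live /proc; many processes stuck in state D at once would fit with I/O that never completes, for example on a hung mount or failing disk.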