One node in PVE cluster greyed out hangs

Riaan Timmerman · Oct 20, 2020

I have a strange problem with one of the hypervisors in my PVE cluster. No changes other than adding a VM on that node today.

One of the nodes in the cluster shows greyed out, with no names for any of its hosted VMs, just grey question marks.

I can ssh to the host, but most commands hang the ssh session. I cannot even run "pvecm status" or "df -h".

VMs running on the impacted host do not seem to be affected.

Any ideas? It seems that the /etc/pve mount is gone or unreachable.

I need to recover if possible without restarting all the VM guests.

Here's the view from another node in the cluster, installed from the same iso:

root@pve-02:~# pvecm status
Quorum information
------------------
Date: Tue Oct 20 19:06:25 2020
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1/364
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.44.172
0x00000002 1 192.168.44.174
0x00000003 1 192.168.44.190
0x00000004 1 192.168.44.171 (local)

root@pve-02:/etc# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

spirit · Oct 20, 2020

>>Any ideas? Seems that the /etc/pve mount is gone or unreachable.

Restart it: "systemctl restart pve-cluster"

It'll launch a "pmxcfs" process, which is the process that mounts /etc/pve.

(And verify that you don't have any files in /etc/pve/.. before the restart.)
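Since commands touching /etc/pve have been hanging the session, one way to look at the directory is to put a time limit on the command. This is only a sketch: `probe` is a hypothetical helper name, and it relies on the coreutils `timeout` command. A process stuck in uninterruptible sleep may not react to signals at all, so it is still safer to try this from a session you can afford to lose.

```shell
# Hypothetical helper: run a command with a time limit (in seconds) so that a
# hung /etc/pve is less likely to block the SSH session indefinitely.
# Usage: probe LIMIT COMMAND [ARGS...]
probe() {
    limit="$1"
    shift
    # Send TERM after $limit seconds, then KILL 2 seconds later if still alive.
    timeout -k 2 "$limit" "$@"
}
```

For example, `probe 5 ls -A /etc/pve` lists the directory with a 5 second limit. An exit status of 124 means the limit was reached.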

Riaan Timmerman · Oct 20, 2020

Will doing a "systemctl restart pve-cluster" kill the running vms?

As to looking at the contents in /etc/pve:
If I run that command from an ssh session, the session hangs.
I did a ps -ef from the iLO console and it hung indefinitely. I can now only manage the server via ssh.

spirit · Oct 20, 2020

Riaan Timmerman said:
Will doing a "systemctl restart pve-cluster" kill the running vms?

No. You can restart the different Proxmox services without any impact on the VMs.

>>As to looking at the contents in /etc/pve:
>>If I run that command from an ssh session, the session hangs.
>>I did a ps -ef from the iLO console and it hung indefinitely. I can now only manage the server via ssh.

Is /etc/pve still mounted? ("Seems that the /etc/pve mount is gone or unreachable.")
What is the output of the "df" command?

normally, if you have a "pmxcfs" process running (started with pve-cluster.service), the /etc/pve should be mounted.
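Because "df" itself was hanging, the two conditions above can also be checked without touching the mount point. The sketch below is an assumption-laden example, not an official Proxmox tool: `is_mounted` and `pmxcfs_running` are hypothetical helper names. The first only reads the kernel's mount table, and the second uses `pgrep`, which may itself stall on a badly hung system.

```shell
# Hypothetical helper: succeed if mount point $1 is listed in the mount table
# $2 (defaults to /proc/mounts). It reads the table only and never accesses
# the mount point itself.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical helper: succeed if a process named exactly "pmxcfs" is running.
pmxcfs_running() {
    pgrep -x pmxcfs >/dev/null
}
```

Usage: `is_mounted /etc/pve && echo mounted` and `pmxcfs_running && echo running`. If the first succeeds and the second fails, that matches the "mounted but no pmxcfs process" case.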

if you still have /etc/pve but pmxcfs process is not running, you can try to do a lazy umount (umount -lf /etc/pve), then start pve-cluster service again.
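The lazy unmount and service start from the paragraph above could be wrapped as follows. This is a minimal sketch: the function names and the DRY_RUN variable are made up for illustration, and by default the commands are only printed so they can be reviewed first.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# The two steps suggested above: lazily unmount the stale /etc/pve, then
# start pve-cluster so that pmxcfs mounts it again.
recover_pmxcfs() {
    run umount -lf /etc/pve
    run systemctl start pve-cluster
}
```

Calling `recover_pmxcfs` prints the two commands. `DRY_RUN=0 recover_pmxcfs` actually runs them, which needs root and only makes sense when /etc/pve is still mounted but no pmxcfs process is running.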

>>I did a ps -ef from the iLO console and it hung indefinitely.

That's really strange. I hope you don't have a physical problem on your server...
