Node showing "?" after NFS VM hangs, qm command stuck

Oct 14, 2025
Hi everyone, I've encountered a very tricky issue with my PVE cluster and I'm looking for a way to recover without rebooting the physical host.

Environment Setup:
  • 3-node PVE Cluster.
  • Each node uses LACP (Bonding) with two NICs to handle PVE Management, VM traffic, and PBS backup traffic.
  • A VM on pve3003 acts as an NFS Server, using PCIe passthrough for an SSD.
  • The entire cluster (including the node itself) mounts this NFS share for ISO storage.
What Happened:
Everything was fine until I tried uploading a large ISO from pve3001's WebUI. While waiting for the transfer to pve3003, the node pve3003 suddenly displayed a question mark (disconnected state) in the cluster view.

Current Status:
  1. I've force-unmounted the NFS paths on the nodes (umount -f -l). Now df -h works.
  2. However, pve3003 still shows a "?" in the WebUI and cannot be managed via other nodes.
  3. I've tried restarting pve-cluster, pvedaemon, and pveproxy, but it didn't help.
  4. Critically, any qm commands (like qm list) executed on pve3003 hang indefinitely.
  5. I checked dmesg and the system console, and both are flooded with "nfs: server not responding, still trying" messages along with kernel call traces related to I/O wait (screenshot attached: 2026-03-17_11-28 m.png).
  6. Other VMs on pve3003 are still running, but the NFS VM is completely unresponsive. It seems the passthrough I/O or the VM process is stuck.
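
A minimal sketch for narrowing down which processes are actually stuck, assuming the hung qm and management processes are sitting in uninterruptible sleep (state "D"). That is typical for tasks blocked on an unresponsive NFS server, but it is not confirmed on this node. The helper name filter_dstate is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep the ps header line plus every process whose
# STAT column starts with "D" (uninterruptible sleep, usually blocked on I/O).
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Columns: PID, process state, kernel wait channel, full command line.
ps -eo pid,stat,wchan:32,args | filter_dstate
```

If qm, pvestatd, or the NFS VM's kvm process show up here with an NFS- or RPC-related wait channel, they are probably still waiting on the dead mount, which would explain why restarting the services did not clear the "?" state.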
Since this host runs critical services, a reboot would cause significant downtime. Is there any way to force reset these hung management processes (especially qm) and restore the node's cluster status without a full hardware reboot?