Subject: PVE 8.4.1 Single Node: Persistent Ghost Node & VMs After Recovery - Cannot Reuse IDs (Detailed Steps Tried)
Hello Proxmox Community,
I'm seeking help with a stubborn UI state inconsistency on a single Proxmox VE 8.4.1 node (using pve-no-subscription repo). After recovering from a significant /etc/pve corruption, I'm left with ghost entries that prevent me from reusing my original VM IDs, despite extensive troubleshooting.
Initial Situation & Recovery:
- The host (Cloud9.Proxmox.10) experienced a critical failure where /etc/pve (managed by pmxcfs) became corrupted and unmountable.
- Recovery involved stopping the PVE services, moving the corrupted /etc/pve aside, creating a new empty /etc/pve, and starting from a clean /var/lib/pve-cluster/config.db (the original and its backups were also unusable). A rough command summary for the recovery and the fixes below follows this list.
- This successfully brought pmxcfs and core services back online.
- Internal 596 Cert Errors: Persistent "certificate verify failed" (596) errors were resolved by running sudo pvecm updatecerts --force. (Note: the /usr/sbin/pvekey binary seems absent from the 8.4.1 packages, but the command worked.)
- External HTTPS: Secured access via https://prox.ihearvoices.ai:8006 by copying Let's Encrypt certs to /etc/pve/nodes/Cloud9/pveproxy-ssl.* and using a client /etc/hosts entry.
- PBS Connection: Fixed initial "Network Unreachable" errors by disabling Cloudflare proxying for the PBS hostname.
- Storage Visibility: The data ZFS storage pool (rpool/data) was initially not selectable in the UI. This was fixed by removing (sudo pvesm remove data) and re-adding (sudo pvesm add zfspool data ...) the storage definition via CLI.
- Snapshot Failures: "Out of space" errors during snapshots were resolved by removing unnecessary ZFS refreservations from the VM disk volumes (sudo zfs set refreservation=none ...).
- VM Restoration: VMs 111 & 112 were successfully restored from PBS backups using new IDs (9111, 9112) to local storage initially, then moved to the data storage pool once it became visible. These VMs are running correctly.
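For reference, the recovery and certificate steps roughly followed the commands below (reconstructed from memory/shell history, so the ordering and the backup/file names are approximate, not exact):

# stop the services that use /etc/pve, move the corrupted state aside
systemctl stop pve-cluster pvedaemon pveproxy
mv /etc/pve /etc/pve.corrupt                  # backup names here are illustrative
mkdir /etc/pve
mv /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db.corrupt
# pmxcfs comes back up with a fresh config.db, then the API services
systemctl start pve-cluster pvedaemon pveproxy

# regenerate the node certificates (this cleared the 596 errors)
pvecm updatecerts --force

# install the Let's Encrypt cert for the web UI (standard LE file names assumed)
cp fullchain.pem /etc/pve/nodes/Cloud9/pveproxy-ssl.pem
cp privkey.pem /etc/pve/nodes/Cloud9/pveproxy-ssl.key
systemctl restart pveproxy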
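Likewise for the storage and snapshot fixes (the zvol name in the last line is just an example; the command was repeated for each affected disk volume):

# remove and re-add the ZFS storage definition so it becomes selectable again
pvesm remove data
pvesm add zfspool data --pool rpool/data      # plus content/options as appropriate

# drop the refreservation that caused the "out of space" snapshot errors
zfs set refreservation=none rpool/data/vm-9111-disk-0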
The Core Problem: Ghost Node & Ghost VMs:
- The UI tree displays two node entries:
- Cloud9 (Uppercase C): The correct, active node containing running VMs 9111 & 9112 and correctly configured storage.
- cloud9 (Lowercase c): A ghost node entry, likely from the pre-corruption state.
- Under the ghost node cloud9, the UI lists the original VMs: 111, 112, 178.
- These ghost VMs do not exist (verification commands are shown after this list):
- Their .conf files are confirmed deleted from /etc/pve/nodes/Cloud9/qemu-server/.
- sudo qm list only shows the active VMs (9111, 9112).
- The file /etc/pve/.vmlist does not exist on the system.
- Functional Impact: The system incorrectly believes IDs 111 and 112 are still in use because of these ghost entries. I cannot re-assign my active VMs (9111, 9112) their original IDs (111, 112), and attempting a restore using those IDs also fails.
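To be explicit about that verification, these are roughly the checks behind the statements above (the output matches what I described; I can paste it if useful):

# no config files for 111/112/178 under the active node
ls -la /etc/pve/nodes/Cloud9/qemu-server/
# only the active VMs are known to qm
qm list
# the .vmlist file really is absent
ls -la /etc/pve/.vmlist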
Troubleshooting Steps Already Tried:
- Multiple restarts of pve-cluster, pvedaemon, pveproxy.
- Full host reboot.
- Extensive browser cache/storage clearing (hard refresh, private windows, clear site data via dev tools).
- Verified deletion of ghost VM .conf files.
- Verified qm list output is correct.
- Attempted cluster state reset by removing /etc/pve/.clusterlog and /etc/pve/.version and restarting services.
- Checked for /etc/pve/.vmlist (confirmed non-existent, so cannot be edited).
- Forced storage rescans (pvesm scan zfs) and re-added storage definition.
- Checked ZFS permissions (zfs allow -u www-data ...).
- Searched for any residual files containing the ghost VM IDs using find /etc/pve/nodes/Cloud9/ -type f \( -name "*111*" ... \) (written out in full after this list). None were found, apart from the (now deleted) VM configs that matched for a brief period.
- Checked recent PVE service logs (journalctl) - no obvious errors related to state inconsistency after the 596 fix.
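For reference, the main commands from those steps (the find pattern here is the full version that I abbreviated with "..." above, covering the three ghost IDs):

# repeated service restarts
systemctl restart pve-cluster pvedaemon pveproxy
# attempted state reset (these are pmxcfs-generated files)
rm /etc/pve/.clusterlog /etc/pve/.version
# forced storage rescan
pvesm scan zfs
# search for residual files referencing the ghost IDs
find /etc/pve/nodes/Cloud9/ -type f \( -name "*111*" -o -name "*112*" -o -name "*178*" \)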
My Questions:
- Is there another cache, index file, or database entry within the Proxmox system (besides .vmlist) that might be holding this stale node/VM information?
- Are there any pvesh commands or other low-level tools that can be used to forcefully query and potentially purge these specific ghost entries from the cluster's internal state? (A rough example of the kind of query I mean is after these questions.)
- Could this be a known bug in PVE 8.4.1 related to node renaming or recovery from corruption that requires a specific workaround or patch?
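To illustrate that second question: I'm assuming the ghost entries come from whatever backs the resource list the UI reads, i.e. something along the lines of:

# read the cluster resource list (presumably what the UI tree is built from)
pvesh get /cluster/resources --type vm

If there is an endpoint or lower-level tool that can purge stale entries from that state, that is exactly what I'm after.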
Thank you for your time and expertise!