Hello everyone, I would like to share some issues I’m facing with a distributed 7-node cluster on Hetzner, hoping to gather suggestions or ideas on how to resolve them.
Cluster Configuration:
• Nodes: 7, distributed across Germany and Finland.
• Purpose: Simplify VM migration and replicate two virtual machines. Other than that, nodes operate mostly independently.
• Storage: Each node has a ZFS RAID1 mirror (2 disks) for the operating system and a separate ZFS RAID1 mirror (2 disks) for the VMs.
• Networking: Dedicated VLANs, including one for internal cluster communication with failover to public IPs.
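Before attempting any cluster-level recovery, we plan to verify that the local storage and the replication jobs for the two replicated VMs are healthy on every node. A minimal sketch of those checks with standard ZFS and Proxmox tools (nothing here is specific to our setup):
# reports "all pools are healthy" when both mirrors on the node are fine
zpool status -x
# overview of configured storages and whether they are active
pvesm status
# state of the storage replication jobs, on nodes where pve-cluster still runs
pvesr status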
Context:
I have not actively managed this cluster for several months. Recently, I was contacted to address a critical issue that arose after upgrading some nodes to Proxmox VE 8.3. The main symptoms observed were:
• Significant slowdowns in accessing the GUI (both the main node and individual nodes).
• Root password not being recognized in certain cases (despite being correct).
• The pve-cluster service stopped working on one node.
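For what it is worth, the slow GUI and the rejected root password both look consistent with pmxcfs (the pve-cluster filesystem) being unhealthy, since the API needs it to load the access control list. The first checks we ran were roughly the following (a generic sketch, not the exact commands from our logs):
systemctl status pve-cluster corosync pveproxy
journalctl -u pve-cluster -b --no-pager | tail -n 50   # recent pmxcfs messages
pvecm status                                           # quorum and membership as corosync sees it
ls /etc/pve                                            # hangs or looks empty when pmxcfs is broken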
Troubleshooting Attempts:
1. Initial Checks:
• corosync was operational, and all nodes were visible to each other. However, pve-cluster was not synchronizing correctly between nodes, even though no errors were shown during startup.
2. Forcing SSH Key Updates:
• During a cluster synchronization attempt, we noticed SSH connectivity issues between some nodes. We regenerated the SSH keys to address this.
3. Critical Issue:
• After regenerating the SSH keys, all directories in /etc/pve synchronized, wiping out the VM and LXC configurations on all nodes. Fortunately, an earlier backup allowed us to manually restore the configurations, stopping the pve-cluster service first so we could work on the directory.
4. Current Situation:
After restoring the configurations, pve-cluster no longer starts. Directory permissions appear correct, but:
• The GUI does not work.
• CLI commands related to VM and LXC management return errors:
qm list
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
Unable to load access control list: Connection refused
• We attempted to restart a non-critical node, but this did not resolve the issue. The VMs and LXCs remain operational on the other six nodes, which we cannot reboot without risking further failures.
5. Resetting /etc/pve:
• Removing the data in /etc/pve allows pve-cluster to start, but some directories are missing:
/etc/pve# ls -la
total 9
drwxr-xr-x 2 root www-data 0 Jan 1 1970 .
drwxr-xr-x 97 root root 198 Nov 27 10:39 ..
-r--r----- 1 root www-data 155 Jan 1 1970 .clusterlog
-r--r----- 1 root www-data 1173 Nov 27 00:52 corosync.conf
-rw-r----- 1 root www-data 2 Jan 1 1970 .debug
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/node03
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/node03/lxc
-r--r----- 1 root www-data 409 Jan 1 1970 .members
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/node03/openvz
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/node03/qemu-server
-r--r----- 1 root www-data 0 Jan 1 1970 .rrd
-r--r----- 1 root www-data 942 Jan 1 1970 .version
-r--r----- 1 root www-data 18 Jan 1 1970 .vmlist
• Attempts to restore directories from the backup fail with permission errors:
cp -rf /root/etc-pve-backup/nodes ./
cp: cannot create directory './nodes': Permission denied
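If I understand pmxcfs correctly, the permission errors above are expected: /etc/pve is not an ordinary directory but the FUSE view of the pve-cluster database, so nothing can be copied into it unless pmxcfs itself is running and writable on that node. The sequence we are considering is roughly the following, based on the local mode (pmxcfs -l) described in the Proxmox documentation; the backup path is ours, the rest is only a sketch and we would appreciate corrections before running it:
# stop the cluster filesystem and make sure no stale instance is left behind
systemctl stop pve-cluster
killall pmxcfs 2>/dev/null
# keep a copy of the current database before touching anything
cp /var/lib/pve-cluster/config.db /root/config.db.$(date +%F)
# start pmxcfs in local mode so /etc/pve becomes writable without quorum
pmxcfs -l
# copy the saved guest configurations back into the mounted filesystem
cp -r /root/etc-pve-backup/nodes /etc/pve/
# shut down the local instance cleanly and start the regular service again
killall pmxcfs
systemctl start pve-cluster
What we are unsure about is whether files restored this way would survive the node rejoining the cluster, given that a synchronization already wiped them once, so confirmation on that point would be especially welcome.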
Current Objective:
We are not aiming to restore the cluster as a unified system. Instead, the primary goals are:
1. Regain control of the VM and LXC configurations to manage them if needed.
2. Perform updated backups and migrate VMs to a newly configured cluster, if necessary.
3. Preserve the operational VMs with the most recent data, as the latest PBS backups are outdated.
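On point 1, if restoring /etc/pve keeps failing, my understanding is that the guest definitions can still be read directly from the SQLite database that backs pmxcfs on each node, and fresh backups can then be taken as usual once qm and pct respond again. A sketch of what we would try (the table and column names are from memory, VMID 101 and <target-storage> are just placeholders, so please correct me if the schema is different):
# always work on a copy, never on the live database
# (the sqlite3 CLI may need to be installed: apt install sqlite3)
cp /var/lib/pve-cluster/config.db /root/config.db.copy
# list the guest configuration files pmxcfs knows about
sqlite3 /root/config.db.copy "SELECT name FROM tree WHERE name LIKE '%.conf';"
# dump one configuration; 'name' holds only the basename, so the owning node
# still has to be resolved through the parent/inode hierarchy
sqlite3 /root/config.db.copy "SELECT data FROM tree WHERE name = '101.conf';"
# once the tooling works again, take an up-to-date backup before any migration
vzdump 101 --mode snapshot --storage <target-storage>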
If anyone has ideas, suggestions, or procedures to resolve these issues, I would greatly appreciate your input.
Thank you in advance for your support!
Mattia