Issues with a 7-node cluster

Hello everyone, I would like to share some issues I’m facing with a distributed 7-node cluster on Hetzner, hoping to gather suggestions or ideas on how to resolve them.

Cluster Configuration:
Nodes: 7, distributed across Germany and Finland.
Purpose: Simplify VM migration and replicate two virtual machines. Other than that, nodes operate mostly independently.
Storage: Each node is configured with ZFS RAID1 (2 disks) for the operating system and ZFS RAID1 (2 disks) for the VMs.
Networking: Dedicated VLANs, including one for internal cluster communication with failover to public IPs.

Context:
I have not actively managed this cluster for several months. Recently, I was contacted to address a critical issue that arose after upgrading some nodes to Proxmox VE 8.3. The main symptoms observed were:
Significant slowdowns in accessing the GUI (both the main node and individual nodes).
Root password not being recognized in certain cases (despite being correct).
The pve-cluster service stopped working on one node.

Troubleshooting Attempts:

1. Initial Checks:
• corosync was operational, and all nodes were visible to each other. However, pve-cluster was not synchronizing correctly between nodes, even though no errors were shown during startup.
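
For reference, this is roughly how we checked it (standard commands, shown here only as a sketch):

pvecm status                               # corosync membership and quorum as seen from this node
systemctl status pve-cluster corosync      # is pmxcfs (the pve-cluster service) itself running?
journalctl -b -u pve-cluster -u corosync   # sync problems usually leave traces here even when startup looks clean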

2. Forcing SSH Key Updates:
• During a cluster synchronization attempt, we noticed SSH connectivity issues between some nodes. We regenerated the SSH keys to address this.

3. Critical Issue:
• After regenerating the SSH keys, all directories in /etc/pve synchronized, wiping out the VM and LXC configurations on all nodes. Fortunately, we had an earlier backup that allowed us to manually restore the configurations, after first stopping the pve-cluster service so we could work on the directory.

4. Current Situation:
After restoring the configurations, pve-cluster no longer starts. Directory permissions appear correct, but:
• The GUI does not work.
• CLI commands related to VM and LXC management return errors:

qm list
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
Unable to load access control list: Connection refused

• We attempted to restart a non-critical node, but this did not resolve the issue. The VMs and LXC containers remain operational on the other 6 nodes, which cannot be rebooted without risking further failures.
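
As far as I understand it, the "Connection refused" messages above just mean the management tools cannot reach the pmxcfs IPC service, i.e. they point back at pve-cluster being down rather than at the VMs themselves. A quick check, as a sketch:

findmnt /etc/pve     # no output here means pmxcfs is not mounted, so qm/pct commands will keep failing this way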

5. Resetting /etc/pve:

• Removing the data in /etc/pve allows pve-cluster to start, but some directories are missing:

/etc/pve# ls -la
total 9
drwxr-xr-x 2 root www-data 0 Jan 1 1970 .
drwxr-xr-x 97 root root 198 Nov 27 10:39 ..
-r--r----- 1 root www-data 155 Jan 1 1970 .clusterlog
-r--r----- 1 root www-data 1173 Nov 27 00:52 corosync.conf
-rw-r----- 1 root www-data 2 Jan 1 1970 .debug
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 local -> nodes/node03
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 lxc -> nodes/node03/lxc
-r--r----- 1 root www-data 409 Jan 1 1970 .members
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 openvz -> nodes/node03/openvz
lr-xr-xr-x 1 root www-data 0 Jan 1 1970 qemu-server -> nodes/node03/qemu-server
-r--r----- 1 root www-data 0 Jan 1 1970 .rrd
-r--r----- 1 root www-data 942 Jan 1 1970 .version
-r--r----- 1 root www-data 18 Jan 1 1970 .vmlist

• Attempts to restore directories from the backup fail with permission errors:

cp -rf /root/etc-pve-backup/nodes ./
cp: cannot create directory './nodes': Permission denied
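
Until /etc/pve accepts writes again, the backup copies of the configs are at least enough to see where each VM's disks live, since they are plain text. A sketch, using the backup path and node name from above:

grep -E '^(scsi|virtio|ide|sata|efidisk)[0-9]+:' /root/etc-pve-backup/nodes/node03/qemu-server/*.conf   # lists each VM's disk volumes and the storage they sit on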

Current Objective:
We are not aiming to restore the cluster as a unified system. Instead, the primary goals are:
1. Regain control of the VM and LXC configurations to manage them if needed.
2. Perform updated backups and migrate VMs to a newly configured cluster, if necessary.
3. Preserve the operational VMs with the most recent data, as the latest PBS backups are outdated.

If anyone has ideas, suggestions, or procedures to resolve these issues, I would greatly appreciate your input.

Thank you in advance for your support!
Mattia
 
Hello,

What is the latency between the hosts in the two different locations? Clustering in Proxmox VE uses Corosync, which is extremely sensitive to latency. We recommend stable connections with less than 5 ms of latency [1]. When a node loses quorum, the /etc/pve directory goes into read-only mode, which would explain what you are seeing.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network_requirements
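
You can check this directly on an affected node, for example:

pvecm status                # the "Quorate:" line shows whether this node is part of a quorate partition
corosync-quorumtool -s      # the same information straight from corosync
# without quorum pmxcfs keeps /etc/pve read-only; `pvecm expected 1` can make a single
# node writable again, but only use that on a node that is really meant to run alone
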
Hi Maximiliano, don’t worry, latency has never been an issue for the cluster, as it has been running without any problems since 2021.

Currently, on the “sacrificial” machine, I managed to copy the missing files into the /etc/pve directory by disabling Corosync. Using the CLI, I regained control over the VMs. The web interface is now fully functional as well, although only as a single host. While this is acceptable for now, everything appears greyed out, so I am likely missing something related to state management.
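
For anyone who ends up in the same situation, the gist of it was roughly the following (a simplified sketch; the node name and backup path are from my setup, and `pmxcfs -l` is the local mode described in the admin guide, so double-check there before running this on a node you care about):

systemctl stop pve-cluster corosync       # stop the cluster stack on the node being recovered
pmxcfs -l                                 # run the cluster filesystem in local mode so /etc/pve is writable without quorum
cp /root/etc-pve-backup/nodes/node03/qemu-server/*.conf /etc/pve/nodes/node03/qemu-server/
cp /root/etc-pve-backup/nodes/node03/lxc/*.conf /etc/pve/nodes/node03/lxc/
killall pmxcfs                            # stop the local instance again once the configs are back
systemctl start pve-cluster

If the web interface then shows everything greyed out, as in my case, that is usually pvestatd not reporting status rather than pmxcfs itself, so restarting pvestatd and pveproxy is worth a try.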

Soon, I will test this on a production machine (which I cannot reboot) to see if I can regain control without interrupting active services.

Any suggestions or advice would be greatly appreciated.

Thank you so much,
Mattia
 
Hi Maximiliano, don’t worry, latency has never been an issue for the cluster, as it has been running without any problems since 2021.

What would be the latency? Note that from Finland to Germany there are over 1000 km between most pairs of points; at the speed of light alone that takes around 3 ms. If you take into account round trips, that fiber-optic cables only carry signals at about 2/3 of the speed of light, and that the cable (if there is one) is probably not a perfectly straight line from server to server, it is quite unlikely the link meets the 5 ms requirement.
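
As a rough back-of-the-envelope calculation (assuming ~1000 km of fiber and signal propagation at about 2/3 of c, i.e. roughly 200,000 km/s):

awk 'BEGIN { one_way = 1000 / 200000 * 1000; printf "one way: %.1f ms, round trip: %.1f ms\n", one_way, 2 * one_way }'
# prints: one way: 5.0 ms, round trip: 10.0 ms -- already at the limit before any routing or queuing overhead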
 
The latency between servers is around 2ms for those in the same location, while it ranges from 10ms to 35ms for servers in different locations, depending on whether the link is 1Gbps or 10Gbps.

I understand and appreciate the technical requirements you pointed me to; I do not question their validity, and, as the documentation states, such a setup might work, but without guarantees. It is possible that the issue arose following network maintenance by the provider, although I consider it unlikely. That said, I will certainly keep this requirement in mind for future projects, even though similar setups have been running successfully for over 6 years.

However, as I mentioned in my post, my primary goal is not to restore the corrupted cluster, which I believe would be a challenging task, but to regain access to the VMs and data, as this is the most critical concern for the end client.

Thank you once again for your input and clarification, which, as I’ve said, I will definitely take into account. If you have any further suggestions for recovery, I would greatly appreciate them.

Best regards,

Mattia
 
while it ranges from 10ms to 35ms for servers in different locations
Probably something changed recently; perhaps more bandwidth is being used on the network (which in turn can increase latency), or something changed at layer 1. In any case, we do have customers who manage to run such clusters, but there is no guarantee that such a setup will continue to work.


• Removing the data in /etc/pve allows pve-cluster to start, but some directories are missing:

Note that /etc/pve contains all the configuration data used by Proxmox VE and is a cluster filesystem (pmxcfs). Deleting files there on one node can and will delete them on the other nodes under the right circumstances.

The VM configs are at /etc/pve/qemu-server, and the location of their disk data depends on which storage plugin was used (/var/lib/vz/images/ for the `local` storage, at least). With those two pieces, restoring them should be easy. If you have a node where the VM is still present, you can use the backup feature.
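
For example, once a VM is manageable from the CLI again, a fresh backup is a single command (VM ID, target storage and compression are placeholders to adapt):

vzdump 100 --storage local --mode snapshot --compress zstd

The same command also handles containers if you pass a container ID instead.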

Getting the cluster back up would require having the exact same contents of /etc/pve/ on all nodes; that might not be that hard depending on its current state. If you have one node where you know the contents are OK, you could sync them over to all nodes by copying the SQLite database that backs the cluster filesystem; see [1] for more details.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery
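
As a sketch, the recovery described in [1] boils down to the following on the node being repaired (default paths; stop the cluster stack first and read the linked section before touching a production node):

systemctl stop pve-cluster corosync
cp /var/lib/pve-cluster/config.db /root/config.db.broken     # keep a copy of the current state, just in case
# copy config.db from a node whose /etc/pve contents are known to be good, then:
systemctl start corosync pve-cluster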
 
