[SOLVED] Solved 2.5 Months of Blood, Sweat and Tears

SCrisler · Mar 16, 2021

I'm posting this in the hope that it saves someone's sanity.

The starting point for this yarn is a 2-node cluster running on identical Xeon hardware using dedicated 1 gig motherboard Ethernet ports. The cluster ran flawlessly from its inception. In December, new hardware was added in the form of matching Ryzen 3950X processors on Asus PRO WS X570-ACE motherboards with a PCIe4 4-port 10 gig Ethernet card. Naturally the desire was to add the 2 new nodes to the existing cluster. As part of the upgrade, a Unifi 10 gig backbone switch was added and 4 ports were carved out as a VLAN to carry the Corosync traffic.

In my first attempt, I used the web GUI to add one of the Ryzen machines to the cluster. pvecm status would show 3 machines in the cluster, but the web GUI immediately went wonky, with gray question marks and no ability to manage the cluster. All nodes were set up in /etc/hosts and could ssh to each other. Corosync was running on its own link0 network. As soon as I turned off the Ryzen machine or removed it from the cluster, everything worked perfectly again. I wiped the Ryzen clean and did a fresh install, this time joining from the command line. Same grayed-out web GUI. pvecm updatecerts --force would yield a timeout error.

Finally, needing to get the Ryzen machines online, I gave up trying to join them to the existing cluster and instead joined them to each other, which worked perfectly. So at this point I had 2 clusters, each consisting of 2 machines. In case you are wondering, the clusters coexisted just fine on the same network over the 4-port VLAN. I still wasn't happy, because I wanted one cluster with 4 machines in it, not two clusters.

The Xeon machines started as a 5.4 cluster before being upgraded to 6.0. I thought maybe some residue of 5.4 was making the cluster fail to function correctly. Accordingly, I migrated all machines to a single Xeon and removed the other Xeon from its cluster. I then reinstalled the latest version of Proxmox on it from scratch. Using the web GUI, I joined the Xeon to the Ryzen cluster. This worked better than when I joined the Ryzen machine to the Xeon cluster, but it still resulted in the nodes going gray and the console losing the ability to connect to running machines. Once again I joined and unjoined several times and tried pvecm updatecerts, with the same timeout error. The one thing I noticed is that the certs were pretty much all missing from the Xeon node. pvecm status was happy to report that I had a 3-node cluster with 3 nodes reporting and everything quorate. If I attempted to do anything with the Xeon node via the web GUI, I had no access. Through this whole process, I was scouring this forum and the internet for any hints about what was going wrong. All of that, of course, was a dead end. Once again, when I removed the 3rd (Xeon) node from the cluster, everything worked perfectly.

Finally, in desperation, I said maybe there is something funky about the fact that the Xeon nodes are on 1 gigabit and the Ryzen nodes are on 10 gigabit. Keep in mind that I could ping from any node to any node, and the recommended IP tool, whose exact name I cannot recall, worked perfectly and showed no problem. The Xeon machines have an add-on 2-port 10 gigabit Ethernet card, so I swapped things around so that Corosync would run on a 10 gig port. Well, what do you know: pvecm updatecerts --force worked almost instantly, and I now have a fully functioning 3-node cluster.

Now, I still don't know the ultimate problem. Maybe Corosync doesn't like a mix of 10 gig and 1 gig, maybe the off-brand SFP+ switch adapters are not quite right, maybe Proxmox didn't find exactly the right driver for the Ethernet adapters. Bottom line: if you are having trouble adding a node to your cluster, don't overlook the fact that your problem could be at the hardware level. Corosync worked just well enough that I spent 2.5 months chasing what I thought was a configuration issue.
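
For anyone who wants to rule out the link itself before digging into configuration, here is a minimal sketch (not something from the original troubleshooting) that reads the negotiated speed and the error/drop counters for one interface from Linux sysfs. The function names and the expected speed are illustrative assumptions, and a clean report does not prove the link is healthy; it only catches the obvious cases.

```python
from pathlib import Path

# Per-interface counters exposed by the Linux kernel under
# /sys/class/net/<iface>/statistics/
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it cannot be read.

    Reading `speed` fails on some interfaces (for example when the link is
    down), so unreadable files are reported as None instead of raising.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def link_report(iface, sysfs_root="/sys/class/net"):
    """Collect negotiated speed (Mbit/s) and error counters for one interface."""
    base = Path(sysfs_root) / iface
    report = {"iface": iface, "speed_mbps": read_int(base / "speed")}
    for name in COUNTERS:
        report[name] = read_int(base / "statistics" / name)
    return report


def find_problems(report, expected_mbps):
    """List anything that looks off: unexpected speed or non-zero counters."""
    problems = []
    if report["speed_mbps"] != expected_mbps:
        problems.append(
            f"speed is {report['speed_mbps']} Mbit/s, expected {expected_mbps}"
        )
    for name in COUNTERS:
        if report[name]:  # None (unreadable) and 0 are both skipped
            problems.append(f"{name} = {report[name]}")
    return problems
```

Run on each node against the interface that carries the Corosync link, for example find_problems(link_report("enp5s0f0"), 10000), where the interface name is a placeholder for whatever your node uses. Comparing the output across nodes makes a speed mismatch or a climbing error counter easy to spot.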
