[SOLVED] Cluster join to 6.4-4 master fails from new 7.1-2 node

Built a new 7.1-2 PMG server on a Linode-hosted VM (Debian 11) to replace a 6.4-4 server on another hosting platform.
Moved the subscription key from the old server to the new server, OK.
Deleted the old 6.4-4 server from the cluster, OK, leaving two locally hosted servers, sf-01 (master) and sf-02 (node).

Established a VPN tunnel from the new server to the locally hosted servers, just like for the decommissioned server.

Tried by command line, `pmgcm join X.X.X.X` with the local IP of master server sf-01, without success.
Tried by UI to join with the local IP address and fingerprint of master server sf-01, without success.
Tried by UI to join with the local IP address, password, and fingerprint of master server sf-01, without success.

All three servers have keys installed and can ssh to each other as root without passwords.
pmgmirror and pmgtunnel are dead on the new server node.
pmgmirror and pmgtunnel are running and synced on nodes sf-01 and sf-02.

The error returned by the command line join attempt is:

cluster join failed: 500 Can't connect to xx.xx.xx.xxx:8006 (hostname verification failed)
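
Since the join dialog asks for the master's fingerprint and the error comes from the TLS connection to port 8006, one thing that may be worth checking is which certificate the master actually presents over the VPN. The following is only a sketch: `format_fingerprint` and `fetch_fingerprint` are hypothetical helper names, and it assumes the fingerprint to compare against is the colon-separated SHA-256 digest of the certificate. It does not by itself explain the "hostname verification failed" message.

```python
import hashlib
import ssl


def format_fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 digest of a DER certificate as colon-separated hex."""
    digest = hashlib.sha256(der_cert).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_fingerprint(host: str, port: int = 8006) -> str:
    """Fetch the certificate a host presents and return its fingerprint.

    ssl.get_server_certificate() does not validate the certificate when no
    CA file is given, so this also works for self-signed certificates.
    """
    pem = ssl.get_server_certificate((host, port))
    return format_fingerprint(ssl.PEM_cert_to_DER_cert(pem))
```

Calling `print(fetch_fingerprint("X.X.X.X"))` from the new node, with the master's local IP in place of `X.X.X.X`, prints a value that can be compared by eye with the fingerprint entered in the join dialog.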

I've attached logs: sf-01 is the master, sf-03 is the new node I'm trying to add. The event takes place at Mar 02 10:44:25.

Thanks in advance
Bruce
 

Attachments

  • sf-03.txt (538 bytes)
  • sf-01.txt (1 KB)

Please post the output of:
* `pmgversion -v` from all involved systems
* `pmgcm status` from all involved systems
* `cat /etc/pmg/cluster.conf` from all involved systems
* a larger portion of the journal while trying to join a node
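
Because root can already ssh between the nodes without a password, the requested output could be gathered in one pass. This is a rough sketch, not an official tool: the host names are the ones used in this thread, `build_ssh_command` and `collect` are hypothetical helpers, and the one-hour journal window is an assumption that should be widened or narrowed to cover the join attempt.

```python
import subprocess

# Host names as used in this thread; adjust to your own nodes.
HOSTS = ["sf-01", "sf-02", "sf-03"]

COMMANDS = [
    "pmgversion -v",
    "pmgcm status",
    "cat /etc/pmg/cluster.conf",
    # Assumption: the last hour of the journal covers the join attempt.
    "journalctl --since '1 hour ago' --no-pager",
]


def build_ssh_command(host, remote_cmd):
    """Build the argument list for running one command on a node as root."""
    return ["ssh", f"root@{host}", remote_cmd]


def collect(hosts=HOSTS, commands=COMMANDS):
    """Run every command on every host and return one combined text report."""
    report = []
    for host in hosts:
        for cmd in commands:
            result = subprocess.run(
                build_ssh_command(host, cmd), capture_output=True, text=True
            )
            report.append(f"### {host}: {cmd}\n{result.stdout}{result.stderr}")
    return "\n".join(report)
```

Running `print(collect())` on any machine that can reach all three nodes produces a single report to paste into the thread, after masking IP addresses as needed.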
 
We came to the conclusion that we needed to upgrade our other nodes to 7.1-2, and followed the directions here:
https://pmg.proxmox.com/wiki/index.php/Upgrade_from_6.x_to_7.0
which is an excellent guide and worked as it should. Once the upgrades were complete, I restored the PMG backup to the master node, checking all three boxes. That may have been a mistake, since we learned that it leads to this error message when trying to create the cluster on the master:
DBD::pg::db do failed: ERROR: duplicate key value violates unique constraint "localstat_pkey" DETAIL: Key ("time", cid)=(1643014800, 1) already exists. at /usr/share/perl5/PMG/DBTools.pm line 937.
Researching, we found the offending data row in the localstat table and removed it. We were then able to create the cluster successfully and add the two other nodes. The local node synced up and changed status to active, but the master and the remote node never switched to active status. I removed the nodes from the cluster last night and removed cluster.conf from the nodes. I tried to add them back this morning and am waiting for the first one to sync. The node numbers are not consecutive anymore: the master node is 1, and the newly added node is now 4. We will post our findings. Thanks.
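
For anyone hitting the same duplicate-key error, the key in the message can be decoded and turned into statements for inspecting and removing that one row. This is a minimal sketch under assumptions: the table and column names are taken from the error message itself, `epoch_to_utc` and `localstat_row_sql` are hypothetical helpers, and the generated SQL is meant to be reviewed and run by hand in `psql` against the PMG database after taking a backup.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds):
    """Render a Unix timestamp as a readable UTC time."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def localstat_row_sql(time_value, cid):
    """Return (select, delete) statements for one ("time", cid) key.

    Assumption: the columns are named "time" and cid, as the error reports.
    """
    where = f'WHERE "time" = {int(time_value)} AND cid = {int(cid)}'
    return (
        f"SELECT * FROM localstat {where};",
        f"DELETE FROM localstat {where};",
    )


# The key from the error message above.
print(epoch_to_utc(1643014800))  # 2022-01-24 09:00:00 UTC
for statement in localstat_row_sql(1643014800, 1):
    print(statement)
```

Running the SELECT first shows exactly what would be removed before the DELETE is executed.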

Edit: We removed all nodes from the cluster and restarted. Then we removed all rows with cid > 1 from the clusterinfo, cgreylist, cmailstore, cstatistic and localstat tables. We removed cluster.conf from each node, leaving it in place on the master. We examined cluster.conf on the master node and removed the entries for nodes that were no longer joined to the cluster. We then restarted each node again and were able to add the nodes back. They all synced up OK in just a few minutes. The only remaining oddity is that the node IDs are not consecutive: the master is 1, the nodes are 5 and 6. I can live with that.
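
The table cleanup described above could be written out as a single transaction. This is a sketch of what was done here, not a supported procedure: it assumes each of the five tables has a `cid` column, `cleanup_sql` is a hypothetical helper, and the output should be reviewed and run manually in `psql` on the master only after a backup and with all other nodes removed from the cluster.

```python
# Tables named in the edit above.
TABLES = ["clusterinfo", "cgreylist", "cmailstore", "cstatistic", "localstat"]


def cleanup_sql(tables=TABLES, keep_cid=1):
    """Build one transaction that deletes rows with cid above keep_cid.

    Assumption: every listed table has an integer column named cid.
    """
    lines = ["BEGIN;"]
    lines += [f"DELETE FROM {table} WHERE cid > {int(keep_cid)};" for table in tables]
    lines.append("COMMIT;")
    return "\n".join(lines)


print(cleanup_sql())
```

Wrapping the deletes in BEGIN and COMMIT means that if any statement fails, the transaction can be rolled back and no table is left half cleaned.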

Cheers!

Bruce
 