[SOLVED] Existing Node Orphaned After Adding New

arjones5

Member
Feb 18, 2020
I'm having a cluster issue after adding a new node. One of the existing nodes (which was powered off at the time of the join) is now causing problems when it's powered back on. pve-cluster and corosync both seem happy on all nodes, but the behavior goes completely erratic as soon as prox-prod01 is powered on, even though it behaved completely normally yesterday. We are testing ZFS vs. hardware RAID, so this particular node was completely reimaged.

When I started digging into it, I found that the corosync configs on the healthy nodes and the problem child are mismatched, and the log files indicate that prox-prod01 thinks it's on an island. Dare I try to edit prox-prod01's corosync.conf manually?

When I try to connect via GUI, I get an unfamiliar error:
Code:
Error hostname lookup 'prox-prod01' failed - failed to get address info for: prox-prod01: Name or service not known (500)
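
As a sanity check on the lookup error itself (assuming node names are resolved via /etc/hosts, which is the usual setup on PVE), this is roughly what I'd verify on the node serving the GUI:
Code:
# does the name resolve at all?
getent hosts prox-prod01

# if not, an /etc/hosts entry like this (prox-prod01's ring0 address from the config below) fixes the lookup
10.2.0.10 prox-prod01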

Corosync.conf from a healthy node:
Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: prox-lab01
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.0.20
  }
  node {
    name: prox-prod01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.0.10
  }
  node {
    name: prox-prod02
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.2.0.11
  }
  node {
    name: prox-prod03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.0.12
  }
  node {
    name: prox-raspi01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.0.30
  }
  node {
    name: prox-raspi02
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.2.0.31
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Castle
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Corosync.conf from the unhealthy node (note the two missing node entries and the older config_version):
Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: prox-lab01
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.0.20
  }
  node {
    name: prox-prod01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.0.10
  }
  node {
    name: prox-prod03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.0.12
  }
  node {
    name: prox-raspi01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.0.30
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Castle
  config_version: 5
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Because the server wasn't online at the time of the join, but more than half of the nodes were, the join still went through; the offline node, however, never received the updated config, SSH keys, and the like.
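
You can confirm that on the problem node by comparing the config versions and looking at the membership corosync currently sees, roughly like this:
Code:
# healthy nodes show config_version: 7, prox-prod01 still shows 5
grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf

# show quorum and membership as this node sees it
pvecm status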

I personally think it might be a safer route to just remove and re-install this node (with a DIFFERENT name), and this time only do the join when all the nodes are reachable ;)
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
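
Roughly, the linked procedure boils down to something like this, run from a node that is still healthy, after the node you want gone has been powered off and will not come back with the same identity (using prox-prod01 as the example here):
Code:
# check the current membership first
pvecm nodes

# remove the powered-off node from the cluster configuration
pvecm delnode prox-prod01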

You could of course also wait for someone else to chime in on whether there is a way to fix this "manually", at the risk of breaking things further. Either way, now is probably a good time to check that your backups are up to date and working.
 
I personally think it might be a safer route to just remove and re-install this node (with a DIFFERENT name), and this time only do the join when all the nodes are reachable ;)
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
I've gotten pretty good at reimaging nodes so long as they don't jump back on the network with the same IP/hostname/cluster join info.
You could of course also wait for someone else to chime in on whether there is a way to fix this "manually", at the risk of breaking things further. Either way, now is probably a good time to check that your backups are up to date and working.
I've learned that lesson the hard way before, so I have several months of backups. The workload on this cluster is very small, so reimaging the entire cluster wouldn't be the end of the world, and it would give me an excuse to go ZFS across the board.
 
When I started digging into it, I found that the corosync configs on the healthy nodes and the problem child are mismatched, and the log files indicate that prox-prod01 thinks it's on an island. Dare I try to edit prox-prod01's corosync.conf manually?
You could copy the /etc/corosync/corosync.conf file from a good node over to the node that was offline, then restart the corosync service on it. Since it will then have a matching config_version, it should join back into the cluster, and pmxcfs should sync up any changes in the /etc/pve directory.

You could, though I don't think it is necessary, run pvecm updatecerts on the nodes as well so that SSH keys and such are updated/synced.
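
Something along these lines should do it (I'm using prox-prod01's ring0 address from your config, adjust as needed):
Code:
# from a healthy node, copy the current corosync config to the node that was offline
scp /etc/corosync/corosync.conf root@10.2.0.10:/etc/corosync/corosync.conf

# then, on prox-prod01, restart corosync so it picks up the newer config_version
systemctl restart corosync

# optionally, refresh SSH keys/certificates across the cluster
pvecm updatecerts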
 
You could copy the /etc/corosync/corosync.conf file from a good node over to the node that was offline, then restart the corosync service on it. Since it will then have a matching config_version, it should join back into the cluster, and pmxcfs should sync up any changes in the /etc/pve directory.

You could, though I don't think it is necessary, run pvecm updatecerts on the nodes as well so that SSH keys and such are updated/synced.
This worked beautifully without the key updates. Thanks!!
 
