[SOLVED] Existing Node Orphaned After Adding New

arjones5

Member
Feb 18, 2020
I'm having a cluster issue after adding a new node. One of the existing nodes (which was powered off at the time of the join) is now causing problems when it's powered back on. pve-cluster and corosync both seem happy on all nodes, but the behavior goes completely erratic as soon as prox-prod01 is powered on, even though it behaved completely normally yesterday. We are testing ZFS vs. hardware RAID, so this particular node was completely reimaged.

When I started digging into it, I found that the corosync configs on the healthy nodes and the problem child are mismatched, and the log files indicate that prox-prod01 thinks it's on an island. Dare I try to edit prox-prod01's corosync.conf manually?

When I try to connect via GUI, I get an unfamiliar error:
Code:
Error hostname lookup 'prox-prod01' failed - failed to get address info for: prox-prod01: Name or service not known (500)
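
As a sanity check on the lookup error itself (assuming node names are resolved via /etc/hosts, which is the usual setup on PVE), this is roughly what I'd verify on the node serving the GUI:
Code:
# does the name resolve at all?
getent hosts prox-prod01

# if not, an /etc/hosts entry like this (prox-prod01's ring0 address from the config below) fixes the lookup
10.2.0.10 prox-prod01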

Corosync.conf from a healthy node:
Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: prox-lab01
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.0.20
  }
  node {
    name: prox-prod01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.0.10
  }
  node {
    name: prox-prod02
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.2.0.11
  }
  node {
    name: prox-prod03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.0.12
  }
  node {
    name: prox-raspi01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.0.30
  }
  node {
    name: prox-raspi02
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.2.0.31
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Castle
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Corosync.conf from the unhealthy node (note the two missing node entries and the older config_version):
Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: prox-lab01
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.0.20
  }
  node {
    name: prox-prod01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.0.10
  }
  node {
    name: prox-prod03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.0.12
  }
  node {
    name: prox-raspi01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.0.30
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Castle
  config_version: 5
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Because the server wasn't online at the time of the join, but more than half of the nodes were, the join still went through; the offline node, however, never received the updated config, SSH keys, and the like.
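
You can confirm that on the problem node by comparing the config versions and looking at the membership corosync currently sees, roughly like this:
Code:
# healthy nodes show config_version: 7, prox-prod01 still shows 5
grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf

# show quorum and membership as this node sees it
pvecm status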

I personally think it might be a safer route to just remove and re-install this node (with a DIFFERENT name), and this time only do the join when all the nodes are reachable ;)
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
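
Roughly, the linked procedure boils down to something like this, run from a node that is still healthy, after the node you want gone has been powered off and will not come back with the same identity (using prox-prod01 as the example here):
Code:
# check the current membership first
pvecm nodes

# remove the powered-off node from the cluster configuration
pvecm delnode prox-prod01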

You could of course also wait for someone else to chime in on whether there is a way to fix this "manually", at the risk of breaking things further. Either way, now is probably a good time to check that your backups are up to date and working.
 
I personally think it might be a safer route to just remove and re-install this node (with a DIFFERENT name), and this time only do the join when all the nodes are reachable ;)
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
I've gotten pretty good at reimaging nodes so long as they don't jump back on the network with the same IP/hostname/cluster join info.
You could of course also wait for someone else to chime in on whether there is a way to fix this "manually", at the risk of breaking things further. Either way, now is probably a good time to check that your backups are up to date and working.
I've learned that lesson the hard way before, so I have several months of backups. The workload on this cluster is very small, so reimaging the entire cluster wouldn't be the end of the world, and it would give me an excuse to go ZFS across the board.
 
When I started digging into it, I found that the corosync configs on the healthy nodes and the problem child are mismatched, and the log files indicate that prox-prod01 thinks it's on an island. Dare I try to edit prox-prod01's corosync.conf manually?
You could copy the /etc/corosync/corosync.conf file from a good node over to the node that was offline, then restart the corosync service on it. Since it will then have a matching config_version, it should join back into the cluster, and pmxcfs should sync up any changes in the /etc/pve directory.

You could, though I don't think it is necessary, run pvecm updatecerts on the nodes as well so that SSH keys and such are updated/synced.
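
Something along these lines should do it (I'm using prox-prod01's ring0 address from your config, adjust as needed):
Code:
# from a healthy node, copy the current corosync config to the node that was offline
scp /etc/corosync/corosync.conf root@10.2.0.10:/etc/corosync/corosync.conf

# then, on prox-prod01, restart corosync so it picks up the newer config_version
systemctl restart corosync

# optionally, refresh SSH keys/certificates across the cluster
pvecm updatecerts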
 
You could copy the /etc/corosync/corosync.conf file from a good node over to the node that was offline, then restart the corosync service on it. Since it will then have a matching config_version, it should join back into the cluster, and pmxcfs should sync up any changes in the /etc/pve directory.

You could, though I don't think it is necessary, run pvecm updatecerts on the nodes as well so that SSH keys and such are updated/synced.
This worked beautifully without the key updates. Thanks!!
 
