Cluster hangs after adding a new node

Kabel1IT

May 25, 2023
Hi, we have a 3-node cluster. One day, during a bulk migration, the last live migration got stuck and the GUI stopped responding (the page loads, but login fails; the VMs keep running unaffected), until I killed the node and ran
Code:
pvecm expected 2
I moved the one VM config to a healthy node (it runs just fine there), manually removed the node from the cluster following the official guide, reinstalled the node, and set up the network. Buuuut - the moment I try to join that node to the cluster, pretty much all cluster services stop responding after a while (even over SSH) until I kill the node. The join operation never finishes; it just sits there until I restart corosync. The GUI shows the node in red (because it's off...), the configs show the node where they should, but the moment it boots, the cluster becomes unresponsive.
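For reference, the removal procedure I followed was roughly this (a sketch only - the VMID 100 is a placeholder, and the config path depends on whether it is a QEMU VM or a container):

```shell
# On a healthy node, with the dead node already powered off:
# move the VM config from the dead node's directory to a live node's
# directory inside the cluster filesystem (VMID 100 is hypothetical)
mv /etc/pve/nodes/proxmox04/qemu-server/100.conf \
   /etc/pve/nodes/proxmox06/qemu-server/

# then remove the dead node from the cluster
pvecm delnode proxmox04
```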
What am I doing wrong??
 
Code:
May 29 12:27:37 proxmox06 pvedaemon[844439]: <root@pam> adding node proxmox04 to cluster
May 29 12:27:37 proxmox06 pmxcfs[3283024]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 14)
May 29 12:27:38 proxmox06 corosync[3280084]:   [CFG   ] Config reload requested by node 3
May 29 12:27:38 proxmox06 corosync[3280084]:   [TOTEM ] Configuring link 0
May 29 12:27:38 proxmox06 corosync[3280084]:   [TOTEM ] Configured link number 0: local addr: 10.233.1.46, port=5405
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 pmxcfs[3283024]: [status] notice: update cluster info (cluster name  Proxmoxcluster2, version = 14)
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] rx: host: 1 link: 0 is up
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Sync members[3]: 1 2 3
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Sync joined[1]: 1
May 29 12:27:47 proxmox06 corosync[3280084]:   [TOTEM ] A new membership (1.2184) was formed. Members joined: 1
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: members: 1/12974, 2/204349, 3/3283024
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: starting data syncronisation
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: members: 1/12974, 2/204349, 3/3283024
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: starting data syncronisation
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Members[3]: 1 2 3
May 29 12:27:47 proxmox06 corosync[3280084]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: received sync request (epoch 1/12974/00000002)
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: received sync request (epoch 1/12974/00000002)
May 29 12:27:55 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 11 12 13
May 29 12:27:58 proxmox06 pvedaemon[1712750]: starting termproxy UPID:proxmox06:001A226E:9C9DCDD1:64747E2E:vncshell::root@pam:
May 29 12:27:58 proxmox06 pvedaemon[832679]: <root@pam> starting task UPID:proxmox06:001A226E:9C9DCDD1:64747E2E:vncshell::root@pam:
May 29 12:27:58 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 1b 11 12 13
May 29 12:27:58 proxmox06 pvedaemon[831865]: <root@pam> successful auth for user 'root@pam'
May 29 12:27:58 proxmox06 systemd[1]: Created slice User Slice of UID 0.
May 29 12:27:58 proxmox06 systemd[1]: Starting User Runtime Directory /run/user/0...
May 29 12:27:58 proxmox06 systemd[1]: Finished User Runtime Directory /run/user/0.
May 29 12:27:58 proxmox06 systemd[1]: Starting User Manager for UID 0...
May 29 12:27:58 proxmox06 systemd[1712760]: Queued start job for default target Main User Target.
May 29 12:27:58 proxmox06 systemd[1712760]: Created slice User Application Slice.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Paths.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Timers.
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG network certificate management daemon.
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Sockets.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Basic System.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Main User Target.
May 29 12:27:58 proxmox06 systemd[1712760]: Startup finished in 130ms.
May 29 12:27:58 proxmox06 systemd[1]: Started User Manager for UID 0.
May 29 12:27:58 proxmox06 systemd[1]: Started Session 8264 of user root.
May 29 12:28:00 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12 13
May 29 12:28:05 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12 13
May 29 12:28:10 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12
May 29 12:28:13 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 11 12 1a
May 29 12:28:18 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 11 12 1a 20
May 29 12:28:23 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:25 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:31 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:33 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:36 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 11 12 1a
May 29 12:28:41 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 11 12 1a
May 29 12:28:46 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 28 11 12 1a
May 29 12:28:46 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 28 11 12 1a
May 29 12:28:48 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 12 1a
May 29 12:28:51 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 2a 12 1a
May 29 12:28:53 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 2a 1a
May 29 12:28:58 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 2a 2c 1a

And it just keeps trying...
 
Thank you for the syslog!

May 29 12:28:00 proxmox06 corosync[3280084]: [TOTEM ] Retransmit List: 18 19 1a 11 12 13
- Can you please check the network connectivity between the new node and the existing cluster nodes? (SSH between nodes, ping, nc, etc.)
- Do you have the firewall enabled on both sides?
- Can you also share the corosync.conf from the cluster side?
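A quick sketch of such checks, run from the joining node toward an existing node (IPs taken from this thread; corosync/knet uses UDP port 5405 by default):

```shell
# basic reachability
ping -c 3 10.233.1.46

# probe the corosync port; note that -z with -u is not a reliable UDP
# test (UDP is connectionless), but an ICMP port-unreachable reply
# would at least reveal a closed/filtered port
nc -z -u -w 2 10.233.1.46 5405

# verify SSH works between the nodes
ssh root@10.233.1.46 true && echo "SSH ok"
```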
 
corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Proxmox05
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.233.1.45
  }
  node {
    name: proxmox04
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.233.1.44
  }
  node {
    name: proxmox06
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.233.1.46
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmoxcluster2
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token: 10000
  version: 2
}

The firewall has no rules but seems to be enabled... note that the cluster worked fine for a year with this...
[screenshot: firewall settings]
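To see what the firewall is actually doing on each side, a couple of standard checks (nothing here is specific to my setup):

```shell
# show whether the Proxmox VE firewall daemon considers itself enabled
pve-firewall status

# dump the currently loaded rules to confirm nothing blocks
# corosync traffic (UDP 5405) between the nodes
iptables-save | grep -i -E 'drop|5405'
```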

ping from the joining node (should I add the names to /etc/hosts?)
Code:
root@proxmox04:~# ping proxmox06
ping: proxmox06: Temporary failure in name resolution
root@proxmox04:~# ping 10.233.1.46
PING 10.233.1.46 (10.233.1.46) 56(84) bytes of data.
64 bytes from 10.233.1.46: icmp_seq=1 ttl=64 time=0.184 ms
64 bytes from 10.233.1.46: icmp_seq=2 ttl=64 time=3.05 ms
64 bytes from 10.233.1.46: icmp_seq=3 ttl=64 time=0.218 ms
64 bytes from 10.233.1.46: icmp_seq=4 ttl=64 time=0.186 ms
^C
--- 10.233.1.46 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3046ms
rtt min/avg/max/mdev = 0.184/0.910/3.052/1.236 ms
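The name-resolution failure above suggests the joining node is missing entries for the other cluster members. A minimal /etc/hosts sketch, with entries assumed from the IPs and the `kabel1it.local` domain seen elsewhere in this thread:

```shell
# /etc/hosts on proxmox04 (hypothetical entries, adjust to your network)
10.233.1.45  proxmox05.kabel1it.local  proxmox05
10.233.1.46  proxmox06.kabel1it.local  proxmox06
```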

ping to joining node
Code:
root@proxmox06:~# ping proxmox04
PING proxmox04.kabel1it.local (10.233.1.44) 56(84) bytes of data.
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=1 ttl=64 time=0.247 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=2 ttl=64 time=0.298 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=3 ttl=64 time=0.281 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=4 ttl=64 time=0.265 ms
^C
--- proxmox04.kabel1it.local ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3039ms
rtt min/avg/max/mdev = 0.247/0.272/0.298/0.018 ms
 
All nodes have the network configured like this (the only differences are the name and the last digit of the IP address):
[screenshot: network configuration]

I saw posts where an LACP bond was the problem, but as I wrote - it worked fine for a year.
 
Thank you for the output!

Everything looks OK to me. Can you also provide us with the output of `corosync-cfgtool -s`?
 
with node off:
Code:
root@proxmox06:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.233.1.46
        status:
                nodeid:          1:     disconnected
                nodeid:          2:     connected
                nodeid:          3:     localhost

with node on:
Code:
root@proxmox06:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.233.1.46
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
 
Nope, the cluster is on 7.2 and the node is on 7.4 (I can't find the 7.2 ISO anywhere)... any chance you could make it available again?
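Since the version mismatch turned out to matter here, one way to compare the cluster stack across nodes is to check the relevant package versions on each node (the grep pattern is just an example):

```shell
# run on every node and compare; mismatched corosync/knet versions
# between the cluster and a joining node can cause join problems
pveversion -v | grep -E 'corosync|libknet'
```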
 
Can I update the cluster by leaving all VMs on one node, updating the other, migrating the VMs, and then updating the remaining node, or should I expect things to break? One node should have the resources to run everything, but downtime on the VMs is a problem... will the cluster retain quorum if 1 of 2 nodes is down?
 
can I update the cluster by leaving all VMs on one node, updating the other, migrating the VMs, and then updating the remaining node, or should I expect things to break?
Yes, this is the common practice for a cluster upgrade.

will the cluster retain quorum if 1 of 2 nodes is down?
No, both nodes are required to maintain quorum if you have only 2 nodes in the cluster. However, you can work around that by lowering the expected votes with the `pvecm expected 1` command.
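The rolling-upgrade sequence described above could look roughly like this (a sketch only - VMID 100 and the node names are placeholders from this thread):

```shell
# 1. migrate the VMs off the node to be upgraded (repeat per VM)
qm migrate 100 proxmox06 --online

# 2. on the node keeping the VMs, lower the expected votes so it
#    stays quorate while the other node reboots
pvecm expected 1

# 3. upgrade and reboot the now-empty node
apt update && apt full-upgrade
reboot
```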
 
Well, with the other nodes on version 7.4, it joined with no problems - consider this fixed.

One note though - during the upgrade, even with `pvecm expected 1`, the VMs froze until the other node rebooted...
 
well, with the other nodes on version 7.4, it joined with no problems, consider this fixed
Glad to read that!

one note though - during the upgrade, even with pvecm expected 1 the VMs froze until the other node rebooted...
Without more information or the syslog, I can't say exactly what happened; you can provide us with the syslog for further investigation into the root cause of the frozen VMs at update time. But I assume the VMs froze because the node did not have enough resources to handle all the VMs.
 
Each node is a dual Xeon 6148 (80 cores total) with 512 GB RAM; with all the VMs migrated to just one, it still shows single-digit CPU load and about 30% RAM usage. They have more than enough resources :p
I probably phrased it wrong - the VMs froze while the empty node was rebooting after the upgrade to 7.4; the moment `pvecm status` reported 2 nodes, everything worked happily ever after. This happened with both nodes (all VMs on one while the other was rebooting after its upgrade). During the node reboot, `pvecm status` kept reporting the cluster as quorate.
 
