Cluster hangs after adding a new node

Kabel1IT

May 25, 2023
Hi, we have a 3-node cluster. One day, during a bulk migration, the last live migration got stuck and the GUI stopped responding (the page loads, but login fails; the VMs keep running unaffected), until I killed the node and ran
Code:
pvecm expected 2
I moved the one VM config to a healthy node (it runs just fine there), manually removed the node from the cluster following the official guide, reinstalled the node, and set up the network. Buuuut - the moment I try to join that node to the cluster, pretty much all cluster services stop responding after a while (even over SSH) until I kill the node. The join operation never finishes; it just sits there until I restart corosync. The GUI shows the node in red (because it's off...), the configs show the node where they should, but the moment it boots, the cluster becomes unresponsive.
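For reference, the removal procedure I followed was roughly this (a sketch only - the VMID 100 is a placeholder, and the config path depends on whether it is a QEMU VM or a container):

```shell
# On a healthy node, with the dead node already powered off:
# move the VM config from the dead node's directory to a live node's
# directory inside the cluster filesystem (VMID 100 is hypothetical)
mv /etc/pve/nodes/proxmox04/qemu-server/100.conf \
   /etc/pve/nodes/proxmox06/qemu-server/

# then remove the dead node from the cluster
pvecm delnode proxmox04
```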
What am I doing wrong??
 
Code:
May 29 12:27:37 proxmox06 pvedaemon[844439]: <root@pam> adding node proxmox04 to cluster
May 29 12:27:37 proxmox06 pmxcfs[3283024]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 14)
May 29 12:27:38 proxmox06 corosync[3280084]:   [CFG   ] Config reload requested by node 3
May 29 12:27:38 proxmox06 corosync[3280084]:   [TOTEM ] Configuring link 0
May 29 12:27:38 proxmox06 corosync[3280084]:   [TOTEM ] Configured link number 0: local addr: 10.233.1.46, port=5405
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:38 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 has no active links
May 29 12:27:38 proxmox06 pmxcfs[3283024]: [status] notice: update cluster info (cluster name  Proxmoxcluster2, version = 14)
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] rx: host: 1 link: 0 is up
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 29 12:27:46 proxmox06 corosync[3280084]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Sync members[3]: 1 2 3
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Sync joined[1]: 1
May 29 12:27:47 proxmox06 corosync[3280084]:   [TOTEM ] A new membership (1.2184) was formed. Members joined: 1
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: members: 1/12974, 2/204349, 3/3283024
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: starting data syncronisation
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: members: 1/12974, 2/204349, 3/3283024
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: starting data syncronisation
May 29 12:27:47 proxmox06 corosync[3280084]:   [QUORUM] Members[3]: 1 2 3
May 29 12:27:47 proxmox06 corosync[3280084]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [dcdb] notice: received sync request (epoch 1/12974/00000002)
May 29 12:27:47 proxmox06 pmxcfs[3283024]: [status] notice: received sync request (epoch 1/12974/00000002)
May 29 12:27:55 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 11 12 13
May 29 12:27:58 proxmox06 pvedaemon[1712750]: starting termproxy UPID:proxmox06:001A226E:9C9DCDD1:64747E2E:vncshell::root@pam:
May 29 12:27:58 proxmox06 pvedaemon[832679]: <root@pam> starting task UPID:proxmox06:001A226E:9C9DCDD1:64747E2E:vncshell::root@pam:
May 29 12:27:58 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 1b 11 12 13
May 29 12:27:58 proxmox06 pvedaemon[831865]: <root@pam> successful auth for user 'root@pam'
May 29 12:27:58 proxmox06 systemd[1]: Created slice User Slice of UID 0.
May 29 12:27:58 proxmox06 systemd[1]: Starting User Runtime Directory /run/user/0...
May 29 12:27:58 proxmox06 systemd[1]: Finished User Runtime Directory /run/user/0.
May 29 12:27:58 proxmox06 systemd[1]: Starting User Manager for UID 0...
May 29 12:27:58 proxmox06 systemd[1712760]: Queued start job for default target Main User Target.
May 29 12:27:58 proxmox06 systemd[1712760]: Created slice User Application Slice.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Paths.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Timers.
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG network certificate management daemon.
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 29 12:27:58 proxmox06 systemd[1712760]: Listening on GnuPG cryptographic agent and passphrase cache.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Sockets.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Basic System.
May 29 12:27:58 proxmox06 systemd[1712760]: Reached target Main User Target.
May 29 12:27:58 proxmox06 systemd[1712760]: Startup finished in 130ms.
May 29 12:27:58 proxmox06 systemd[1]: Started User Manager for UID 0.
May 29 12:27:58 proxmox06 systemd[1]: Started Session 8264 of user root.
May 29 12:28:00 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12 13
May 29 12:28:05 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12 13
May 29 12:28:10 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 1a 11 12
May 29 12:28:13 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 11 12 1a
May 29 12:28:18 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 11 12 1a 20
May 29 12:28:23 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:25 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:31 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:33 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 11 12 1a 20
May 29 12:28:36 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 11 12 1a
May 29 12:28:41 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 11 12 1a
May 29 12:28:46 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 28 11 12 1a
May 29 12:28:46 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 22 26 28 11 12 1a
May 29 12:28:48 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 12 1a
May 29 12:28:51 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 2a 12 1a
May 29 12:28:53 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 28 2a 1a
May 29 12:28:58 proxmox06 corosync[3280084]:   [TOTEM ] Retransmit List: 18 19 2a 2c 1a

And it just keeps trying...
 
Thank you for the syslog!

May 29 12:28:00 proxmox06 corosync[3280084]: [TOTEM ] Retransmit List: 18 19 1a 11 12 13
- Can you please check the network connectivity between the new node and the existing cluster nodes? (SSH between nodes, ping, nc, etc.)
- Do you have the firewall enabled on both sides?
- Can you also share the corosync.conf from the cluster side?
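A quick sketch of such checks, run from the joining node toward an existing node (IPs taken from this thread; corosync/knet uses UDP port 5405 by default):

```shell
# basic reachability
ping -c 3 10.233.1.46

# probe the corosync port; note that -z with -u is not a reliable UDP
# test (UDP is connectionless), but an ICMP port-unreachable reply
# would at least reveal a closed/filtered port
nc -z -u -w 2 10.233.1.46 5405

# verify SSH works between the nodes
ssh root@10.233.1.46 true && echo "SSH ok"
```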
 
corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Proxmox05
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.233.1.45
  }
  node {
    name: proxmox04
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.233.1.44
  }
  node {
    name: proxmox06
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.233.1.46
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmoxcluster2
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token: 10000
  version: 2
}

The firewall has no rules but seems to be enabled... note that the cluster worked fine for a year with this...
[screenshot: firewall settings]
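To see what the firewall is actually doing on each side, a couple of standard checks (nothing here is specific to my setup):

```shell
# show whether the Proxmox VE firewall daemon considers itself enabled
pve-firewall status

# dump the currently loaded rules to confirm nothing blocks
# corosync traffic (UDP 5405) between the nodes
iptables-save | grep -i -E 'drop|5405'
```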

ping from the joining node (should I add the names to /etc/hosts?)
Code:
root@proxmox04:~# ping proxmox06
ping: proxmox06: Temporary failure in name resolution
root@proxmox04:~# ping 10.233.1.46
PING 10.233.1.46 (10.233.1.46) 56(84) bytes of data.
64 bytes from 10.233.1.46: icmp_seq=1 ttl=64 time=0.184 ms
64 bytes from 10.233.1.46: icmp_seq=2 ttl=64 time=3.05 ms
64 bytes from 10.233.1.46: icmp_seq=3 ttl=64 time=0.218 ms
64 bytes from 10.233.1.46: icmp_seq=4 ttl=64 time=0.186 ms
^C
--- 10.233.1.46 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3046ms
rtt min/avg/max/mdev = 0.184/0.910/3.052/1.236 ms
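The name-resolution failure above suggests the joining node is missing entries for the other cluster members. A minimal /etc/hosts sketch, with entries assumed from the IPs and the `kabel1it.local` domain seen elsewhere in this thread:

```shell
# /etc/hosts on proxmox04 (hypothetical entries, adjust to your network)
10.233.1.45  proxmox05.kabel1it.local  proxmox05
10.233.1.46  proxmox06.kabel1it.local  proxmox06
```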

ping to joining node
Code:
root@proxmox06:~# ping proxmox04
PING proxmox04.kabel1it.local (10.233.1.44) 56(84) bytes of data.
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=1 ttl=64 time=0.247 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=2 ttl=64 time=0.298 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=3 ttl=64 time=0.281 ms
64 bytes from proxmox04.kabel1it.local (10.233.1.44): icmp_seq=4 ttl=64 time=0.265 ms
^C
--- proxmox04.kabel1it.local ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3039ms
rtt min/avg/max/mdev = 0.247/0.272/0.298/0.018 ms
 
All nodes have the network configured like this (the only differences are the name and the last digit of the IP address):
[screenshot: network configuration]

I saw posts where an LACP bond was the problem, but as I wrote - it worked fine for a year.
 
Thank you for the output!

Everything looks OK to me. Can you also provide us with the output of `corosync-cfgtool -s`?
 
with node off:
Code:
root@proxmox06:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.233.1.46
        status:
                nodeid:          1:     disconnected
                nodeid:          2:     connected
                nodeid:          3:     localhost

with node on:
Code:
root@proxmox06:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.233.1.46
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
 
Nope, the cluster is on 7.2 and the node is on 7.4 (I can't find the 7.2 ISO anywhere)... any chance you could make it available again?
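Since the version mismatch turned out to matter here, one way to compare the cluster stack across nodes is to check the relevant package versions on each node (the grep pattern is just an example):

```shell
# run on every node and compare; mismatched corosync/knet versions
# between the cluster and a joining node can cause join problems
pveversion -v | grep -E 'corosync|libknet'
```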
 
Can I update the cluster by leaving all VMs on one node, updating the other, migrating the VMs, and then updating the remaining node, or should I expect things to break? One node should have the resources to run everything, but downtime on the VMs is a problem... will the cluster retain quorum if 1 of 2 nodes is down?
 
can I update the cluster by leaving all VMs on one node, updating the other, migrating the VMs, and then updating the remaining node, or should I expect things to break?
Yes, this is the common practice for a cluster upgrade.

will the cluster retain quorum if 1 of 2 nodes is down?
No, both nodes are required to maintain quorum if you have only 2 nodes in the cluster. However, you can work around that by lowering the expected votes with the `pvecm expected 1` command.
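The rolling-upgrade sequence described above could look roughly like this (a sketch only - VMID 100 and the node names are placeholders from this thread):

```shell
# 1. migrate the VMs off the node to be upgraded (repeat per VM)
qm migrate 100 proxmox06 --online

# 2. on the node keeping the VMs, lower the expected votes so it
#    stays quorate while the other node reboots
pvecm expected 1

# 3. upgrade and reboot the now-empty node
apt update && apt full-upgrade
reboot
```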
 
Well, with the other nodes on version 7.4, it joined with no problems - consider this fixed.

One note though - during the upgrade, even with `pvecm expected 1`, the VMs froze until the other node rebooted...
 
well, with the other nodes on version 7.4, it joined with no problems, consider this fixed
Glad to read that!

one note though - during the upgrade, even with pvecm expected 1 the VMs froze until the other node rebooted...
Without more information or the syslog, I can't say exactly what happened; you can provide us with the syslog for further investigation into the root cause of the frozen VMs at update time. But I assume the VMs froze because the node did not have enough resources to handle all the VMs.
 
Each node is a dual Xeon 6148 (80 cores total) with 512 GB RAM; with all the VMs migrated to just one, it still shows single-digit CPU load and about 30% RAM usage. They have more than enough resources :p
I probably phrased it wrong - the VMs froze while the empty node was rebooting after the upgrade to 7.4; the moment `pvecm status` reported 2 nodes, everything worked happily ever after. This happened with both nodes (all VMs on one while the other was rebooting after its upgrade). During the node reboot, `pvecm status` kept reporting the cluster as quorate.
 
