Cluster reboot after adding new node

PaNggNaP · Jun 27, 2024

Hi Folks,

Today I added 5 new nodes to the existing cluster. I found that the disk RAID configuration on one of my servers had a problem, so I reconfigured the RAID on that server.
One hour after adding a new node, my whole cluster rebooted, and all the VMs on it rebooted as well.

Proxmox version: 6.4-15

Do you have any idea why the whole cluster rebooted after adding a new node?

Is this a common issue with Proxmox?

jsterr · Jun 27, 2024

Hi @PaNggNaP, without any logs it is hard to help. Usually when nodes reboot, it's because of problems with corosync: high latency on the link can lead to whole-cluster reboots. Please send the output of:

root@PMX8:~# journalctl -u corosync.service

Because you are using PVE 6 (please update, it's EOL!), you might need to check it with a different command, or by looking at the syslog:

cat /var/log/syslog | grep corosync

Please also post:

/etc/pve/corosync.conf
/etc/network/interfaces

PaNggNaP · Jun 28, 2024

Hi @jsterr, thank you for your suggestion.

The log output files are attached.

Could you please check and advise what could have triggered the cluster reboot?

Appreciate your help!!

jsterr · Jul 1, 2024

You are using bond0 for too much. bond0, which consists of ens1f0 and ens1f1, is used for all of the following:

#25G-BOND-FOR DATA NETWORK
#VLAN BRIDGE FOR VMS DATA PLANE
# PROXMOX - CONTROL NETWORK
# PROXMOX INTERNET ACCESS - TEMPORARY
# STORAGE INTERNAL
# STORAGE EXTERNAL
# PROXMOX - OOB MANAGEMENT NETWORK

The recommendation for corosync is to NOT share its link with other services. Best practice is a separate link (and physical port) that is used only for corosync and nothing else. Why? Because corosync needs low network latency, and the more the ports are used by other services, the more likely it is that you get node fencing because of too-high latency. This can lead to complete cluster reboots, because corosync timestamps (which are influenced by high latency) differ too much from each other, which can lead to a reboot of all nodes.

Regarding corosync I can see:

Code:

nodelist {
  node {
    name: CTF01H
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.7.64.11
    ring1_addr: 10.2.4.11
  }

But I can only find 10.7.64.x in your network config. Where is ring1? It seems like 10.2.4.x is not available on all nodes ...
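A quick way to cross-check this yourself is a minimal Python sketch like the one below. It pulls every ringN_addr out of a corosync.conf text and reports the ones that do not fall inside any network configured on the node. The function names are made up for illustration, the /24 prefix length is an assumption (the posted config only shows 10.7.64.x), and it assumes the ring addresses are IP literals, not hostnames.

```python
import ipaddress
import re


def ring_addresses(corosync_conf: str) -> dict:
    """Collect ringN_addr values from corosync.conf text, grouped by ring name."""
    rings = {}
    pattern = r"^\s*(ring\d+)_addr:\s*(\S+)"
    for ring, addr in re.findall(pattern, corosync_conf, re.MULTILINE):
        rings.setdefault(ring, []).append(addr)
    return rings


def uncovered_rings(corosync_conf: str, local_networks: list) -> dict:
    """Return ring addresses that are outside every network in local_networks.

    Assumes IP literals; a hostname would raise ValueError in ip_address().
    """
    nets = [ipaddress.ip_network(n) for n in local_networks]
    missing = {}
    for ring, addrs in ring_addresses(corosync_conf).items():
        bad = [a for a in addrs
               if not any(ipaddress.ip_address(a) in n for n in nets)]
        if bad:
            missing[ring] = bad
    return missing


# Node entry as quoted in this thread (closing braces added so it is complete).
SAMPLE = """
nodelist {
  node {
    name: CTF01H
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.7.64.11
    ring1_addr: 10.2.4.11
  }
}
"""

# Assumption: the node only has 10.7.64.0/24 configured (prefix length guessed).
print(uncovered_rings(SAMPLE, ["10.7.64.0/24"]))
# prints {'ring1': ['10.2.4.11']}
```

On a real node you would pass the contents of /etc/pve/corosync.conf and the networks actually listed in /etc/network/interfaces. Any ring that shows up in the result has no matching interface on that node.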

Long story short: your network config seems far more complex than it needs to be. Without all the details I can't tell 100%, but it looks like too much config and too much complexity. Looking at the available ports you have, I would recommend:

2x 25G for Storage (Storage + Corosync Link1 fallback)
2x 10G for VMs
2x 10G for Backups
1x 1G for Web-UI
1x 1G for Corosync (Link0)

Edit: if you search the syslog for corosync, you can see all the corosync knet errors, for example "has no active links".
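If the syslog is large, a small script can do the same search and count how often the links dropped. The following is only a rough Python sketch: the helper names are made up, and the sample lines are invented for illustration (they are not taken from the attached logs). It matches on plain substrings, so it makes no assumption about the exact log layout beyond the "has no active links" message mentioned above.

```python
def corosync_lines(lines):
    """Keep only the log lines that mention corosync."""
    return [line.rstrip("\n") for line in lines if "corosync" in line]


def count_no_active_links(lines):
    """Count corosync lines reporting that a host has no active links."""
    return sum(1 for line in corosync_lines(lines)
               if "has no active links" in line)


# Invented sample lines, for illustration only (not from the real logs).
SAMPLE_LINES = [
    "Jun 27 10:00:01 node1 corosync[1234]:   [KNET  ] host: host: 2 has no active links\n",
    "Jun 27 10:00:02 node1 systemd[1]: Started Session 42 of user root.\n",
    "Jun 27 10:00:03 node1 corosync[1234]:   [TOTEM ] A new membership was formed.\n",
]

print(len(corosync_lines(SAMPLE_LINES)))    # prints 2
print(count_no_active_links(SAMPLE_LINES))  # prints 1

# Against a real file (path as mentioned earlier in the thread):
# with open("/var/log/syslog", errors="replace") as f:
#     print(count_no_active_links(f))
```

A count that jumps around the time of the reboot would support the theory that corosync lost its links before the nodes were fenced.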

PaNggNaP · Jul 2, 2024

Hi @jsterr Thank you so much for your help.
Your suggestion is well-taken.
