Cluster reboot after adding new node

PaNggNaP

New Member
Jun 27, 2024
Hi Folks,

Today I added 5 new nodes to the existing cluster and found that the disk RAID configuration on one of my servers had a problem, so I reconfigured the RAID on that server.
One hour after adding a new node, my whole cluster rebooted, and all the VMs on it rebooted as well.

Proxmox version: 6.4-15


Do you have any idea why the whole cluster rebooted after adding a new node?

Is this a common issue with Proxmox?
 
Hi @PaNggNaP, without any logs it is hard to help. When nodes reboot, it is usually because of problems with corosync, typically high latency on the link, which can lead to whole-cluster reboots. Please send the output of:

root@PMX8:~# journalctl -u corosync.service

Because you are using PVE 6 (please update, it's EOL!), you might need to check it with a different command, or by looking at cat /var/log/syslog | grep corosync

Please also post:
  • /etc/pve/corosync.conf
  • /etc/network/interfaces
 
Hi @jsterr thank you for your suggestion.

The log output files are attached.

Could you please check and advise what could be the possible cause triggering cluster reboot?

Appreciate your help!!
 

You are using bond0 for too much. bond0 (ens1f0 and ens1f1) is used for all of the following:

  • #25G-BOND-FOR DATA NETWORK
  • #VLAN BRIDGE FOR VMS DATA PLANE
  • # PROXMOX - CONTROL NETWORK
  • # PROXMOX INTERNET ACCESS - TEMPORARY
  • # STORAGE INTERNAL
  • # STORAGE EXTERNAL
  • # PROXMOX - OOB MANAGEMENT NETWORK
The recommendation for corosync is NOT to share its link with other services. Best practice is a separate link (and physical port) that is used only for corosync and nothing else. Why? Because corosync needs low network latency, and the more the ports are used by other services, the more likely it is that a node gets fenced because latency is too high. This can lead to complete cluster reboots: the corosync timestamps (which are affected by high latency) differ too much from each other, which can cause all nodes to reboot.


Regarding corosync I can see:

Code:
nodelist {
  node {
    name: CTF01H
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.7.64.11
    ring1_addr: 10.2.4.11
  }

But I can only find 10.7.64.x in your network config. Where is ring1? It seems 10.2.4.x is not available on all nodes ...
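If you want to check this on every node without reading the files by eye, here is a minimal Python sketch. It assumes the ring addresses in corosync.conf are plain IP literals (not hostnames) and that you pass in the networks configured on the node yourself. The function names and the /24 prefix are my own assumptions for illustration, they are not taken from your config.

```python
import ipaddress
import re


def ring_addresses(corosync_conf: str) -> dict:
    """Map each node name to its {ringN_addr: address} entries."""
    nodes = {}
    # A 'node { ... }' block contains no nested braces.
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", corosync_conf):
        fields = dict(
            re.findall(r"^[ \t]*(\w+):[ \t]*(\S+)[ \t]*$", block, re.MULTILINE)
        )
        name = fields.get("name", "?")
        nodes[name] = {
            key: value
            for key, value in fields.items()
            if re.fullmatch(r"ring\d+_addr", key)
        }
    return nodes


def unreachable_rings(nodes: dict, local_networks: list) -> list:
    """Return (node, ring, address) for addresses outside all local_networks."""
    nets = [ipaddress.ip_network(net, strict=False) for net in local_networks]
    missing = []
    for name, rings in nodes.items():
        for ring, addr in sorted(rings.items()):
            ip = ipaddress.ip_address(addr)
            if not any(ip in net for net in nets):
                missing.append((name, ring, addr))
    return missing


# The node entry quoted above, with the nodelist block closed for parsing.
SAMPLE = """
nodelist {
  node {
    name: CTF01H
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.7.64.11
    ring1_addr: 10.2.4.11
  }
}
"""

# Assumption: only 10.7.64.0/24 is configured on the node.
print(unreachable_rings(ring_addresses(SAMPLE), ["10.7.64.0/24"]))
```

With only 10.7.64.0/24 in the list, the ring1 address 10.2.4.11 is reported as missing. That matches what I see in your interfaces file.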

Long story short: your network config seems far too complex for what it needs to be. Without all the details I can't tell 100%, but it looks like too much config and too much complexity. Looking at the ports you have available, I would recommend:

  • 2x 25G for Storage (Storage + Link1 Coro Fallback)
  • 2x 10G for VMs
  • 2x 10G for Backups
  • 1x 1G for Web-UI
  • 1x 1G for Corosync (Link0)
Edit: if you search the syslog for corosync, you can see all the corosync knet errors ("has no active links", for example).
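To pull those lines out of a large syslog, something like the following sketch could help. It only does a plain substring match. The function names are hypothetical, and the default phrase is just the one mentioned above, so adjust it to whatever you see in your own log.

```python
from typing import Iterable, List


def corosync_lines(lines: Iterable[str], phrase: str = "has no active links") -> List[str]:
    """Return the lines that mention corosync and contain `phrase`."""
    return [
        line.rstrip("\n")
        for line in lines
        if "corosync" in line.lower() and phrase in line
    ]


def scan_file(path: str, phrase: str = "has no active links") -> List[str]:
    """Apply corosync_lines() to a log file; the caller supplies the path."""
    with open(path, errors="replace") as handle:
        return corosync_lines(handle, phrase)
```

On an affected node you would call scan_file("/var/log/syslog") and compare the timestamps of the matches with the time of the reboot.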
 