Replacing the switch responsible for cluster Link 0

Nehemiah

New Member
Jul 14, 2022
We need to replace a 10GbE switch that carries Link 0 in our cluster. The switch has been failing lately with several unscheduled reboots. Every time the switch rebooted, our whole cluster went down and the individual nodes rebooted, even though we have two other rings in the cluster: another 10GbE ring and a 1GbE ring. Is there a way I can prevent the cluster from going down when I replace the switch?

The ring connection is dedicated to cluster traffic only. In theory the nodes should be able to continue on ring 2 and allow us to replace the switch safely. Can I force the cluster to ring 2? I might not understand rings in corosync correctly.
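
For reference, with the knet transport and link_mode: passive, corosync prefers the link with the highest knet_link_priority and only falls back to the others when it fails. Below is a minimal sketch of how the priorities could be raised for link 2 in /etc/pve/corosync.conf; the priority values are illustrative, and config_version has to be bumped whenever the file is edited:

Code:
totem {
  cluster_name: shepherd
  config_version: 7          # must be incremented on every edit
  interface {
    linknumber: 0
    knet_link_priority: 10   # lowest priority: the link on the failing switch
  }
  interface {
    linknumber: 1
    knet_link_priority: 20
  }
  interface {
    linknumber: 2
    knet_link_priority: 30   # highest priority: passive mode prefers this link
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}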
 
Can you post your /etc/pve/corosync.conf file?
What is the output of pvecm status?
I assume you have guests configured as HA?

Can the nodes ping each other on the other corosync links?
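
As a side note, the per-link state that knet sees can be checked directly on each node; a quick sketch (the second command is only available on newer corosync 3.x releases):

Code:
# show the local node's knet link status (link IDs, connected state)
corosync-cfgtool -s

# per-node view including link details, on newer corosync versions
corosync-cfgtool -n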
 
All nodes are able to ping each other on the IP addresses mentioned below. Here is my configuration:

Code:
Cluster information
-------------------
Name:             shepherd
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jul 14 09:45:50 2022
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          1.17d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.0.1 (local)
0x00000002          1 10.10.0.2
0x00000003          1 10.10.0.3
0x00000004          1 10.10.0.5
0x00000005          1 10.10.0.6
0x00000006          1 10.10.0.4

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: asher
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.0.4
    ring1_addr: 10.0.0.23
    ring2_addr: 192.168.54.23
  }
  node {
    name: gad
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.0.3
    ring1_addr: 10.0.0.22
    ring2_addr: 192.168.54.22
  }
  node {
    name: judah
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.1
    ring1_addr: 10.0.0.20
    ring2_addr: 192.168.54.20
  }
  node {
    name: manasseh
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.0.6
    ring1_addr: 10.0.0.25
    ring2_addr: 192.168.54.25
  }
  node {
    name: naftali
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.0.5
    ring1_addr: 10.0.0.24
    ring2_addr: 192.168.54.24
  }
  node {
    name: reuben
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.2
    ring1_addr: 10.0.0.21
    ring2_addr: 192.168.54.21
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: shepherd
  config_version: 6
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
We replaced the switch without any issues today. I guess this topic can be closed, although I'm still not sure why my nodes fenced themselves because of the faulty switch. Proxmox VE is a great product.
 
The only thing I can imagine is that the other networks were affected as well. If Corosync cannot establish a connection and you have HA guests on a node, that node will fence itself after a minute or two to make sure the guests are definitely off before the (hopefully) remaining cluster starts them again.
You do have multiple networks configured for Corosync to fall back to. Why that did not work is hard to say without further diagnostics. I assume that either the network was set up in such a way that the other Corosync networks also ran through that switch (VLANs?), or the failure had some other side effect that hindered the normal operation of the other networks. Hard to say without more information.
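
If it happens again, the corosync and HA manager logs around the time of the reboot usually show whether all links dropped at once. A sketch of how those could be pulled on a node (service names as on a standard Proxmox VE install, time window just an example):

Code:
# corosync membership / knet link messages around the incident
journalctl -u corosync --since "2022-07-14 09:00" --until "2022-07-14 10:00"

# HA local resource manager, which triggers self-fencing when quorum is lost
journalctl -u pve-ha-lrm --since "2022-07-14 09:00" --until "2022-07-14 10:00"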
 
It turned out that the nodes didn't actually go down because of the switch rebooting, as I had assumed. After we replaced the switch the issue emerged again. Looking in the IPMI logs I noticed that the servers logged 'Brown-out recovery', and the culprit is that both UPS units backing these servers (which also happen to power the mentioned switch) need new batteries. I suspect all of these issues will go away once we install the new battery packs. It doesn't help that the servers in question are over 4,500 miles away.
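
For anyone hitting something similar: the BMC system event log can be read with ipmitool, which is where events like 'Brown-out recovery' typically show up. A sketch (hostname and credentials below are placeholders):

Code:
# read the system event log of a remote BMC over the network
ipmitool -I lanplus -H judah-ipmi.example.com -U admin -P 'secret' sel elist

# or locally on the node itself, if the IPMI kernel drivers are loaded
ipmitool sel elist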
 
