Replacing the switch responsible for cluster Link 0

Nehemiah

New Member
Jul 14, 2022
We need to replace a 10GbE switch that carries Link 0 in our cluster. The switch has been failing lately with several unscheduled reboots. Every time the switch rebooted, our whole cluster went down and the individual nodes rebooted, even though we have two other rings in the cluster: another 10GbE ring and a 1GbE ring. Is there a way I can prevent the cluster from going down when I replace the switch?

The ring connection is dedicated to cluster traffic only. In theory the nodes should be able to continue on ring 2 and allow us to replace the switch safely. Can I force the cluster to ring 2? I might not understand rings in corosync correctly.
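
For reference, with the knet transport and link_mode: passive, corosync prefers the link with the highest knet_link_priority and only falls back to the others when it fails. Below is a minimal sketch of how the priorities could be raised for link 2 in /etc/pve/corosync.conf; the priority values are illustrative, and config_version has to be bumped whenever the file is edited:

Code:
totem {
  cluster_name: shepherd
  config_version: 7          # must be incremented on every edit
  interface {
    linknumber: 0
    knet_link_priority: 10   # lowest priority: the link on the failing switch
  }
  interface {
    linknumber: 1
    knet_link_priority: 20
  }
  interface {
    linknumber: 2
    knet_link_priority: 30   # highest priority: passive mode prefers this link
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}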
 
Can you post your /etc/pve/corosync.conf file?
What is the output of pvecm status?
I assume you have guests configured as HA?

Can the nodes ping each other on the other corosync links?
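
As a side note, the per-link state that knet sees can be checked directly on each node; a quick sketch (the second command is only available on newer corosync 3.x releases):

Code:
# show the local node's knet link status (link IDs, connected state)
corosync-cfgtool -s

# per-node view including link details, on newer corosync versions
corosync-cfgtool -n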
 
All nodes are able to ping each other on the IP addresses mentioned below. Here is my configuration:

Code:
Cluster information
-------------------
Name:             shepherd
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jul 14 09:45:50 2022
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          1.17d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.0.1 (local)
0x00000002          1 10.10.0.2
0x00000003          1 10.10.0.3
0x00000004          1 10.10.0.5
0x00000005          1 10.10.0.6
0x00000006          1 10.10.0.4

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: asher
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.0.4
    ring1_addr: 10.0.0.23
    ring2_addr: 192.168.54.23
  }
  node {
    name: gad
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.0.3
    ring1_addr: 10.0.0.22
    ring2_addr: 192.168.54.22
  }
  node {
    name: judah
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.1
    ring1_addr: 10.0.0.20
    ring2_addr: 192.168.54.20
  }
  node {
    name: manasseh
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.0.6
    ring1_addr: 10.0.0.25
    ring2_addr: 192.168.54.25
  }
  node {
    name: naftali
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.0.5
    ring1_addr: 10.0.0.24
    ring2_addr: 192.168.54.24
  }
  node {
    name: reuben
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.2
    ring1_addr: 10.0.0.21
    ring2_addr: 192.168.54.21
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: shepherd
  config_version: 6
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
We replaced the switch without any issues today. I guess this topic can be closed, although I'm still not sure why my nodes fenced themselves because of the faulty switch. Proxmox VE is a great product.
 
The only thing I can imagine is that the other networks were affected as well. If Corosync cannot establish a connection and you have HA guests on a node, that node will fence itself after a minute or two to make sure the guests are definitely off before the (hopefully) remaining cluster starts them again.
You do have multiple networks configured for Corosync to fall back to. Why that did not work is hard to say without further diagnostics. I assume that either the network was set up in such a way that the other Corosync networks also ran through that switch (VLANs?), or the failure had some other side effect that hindered the normal operation of the other networks. Hard to say without more information.
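
If it happens again, the corosync and HA manager logs around the time of the reboot usually show whether all links dropped at once. A sketch of how those could be pulled on a node (service names as on a standard Proxmox VE install, time window just an example):

Code:
# corosync membership / knet link messages around the incident
journalctl -u corosync --since "2022-07-14 09:00" --until "2022-07-14 10:00"

# HA local resource manager, which triggers self-fencing when quorum is lost
journalctl -u pve-ha-lrm --since "2022-07-14 09:00" --until "2022-07-14 10:00"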
 
It turned out that the nodes didn't actually go down because of the switch rebooting, as I had assumed. After we replaced the switch the issue emerged again. Looking in the IPMI logs I noticed that the servers logged 'Brown-out recovery', and the culprit is that both UPS units backing these servers (which also happen to power the mentioned switch) need new batteries. I suspect all of these issues will go away once we install the new battery packs. It doesn't help that the servers in question are over 4,500 miles away.
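
For anyone hitting something similar: the BMC system event log can be read with ipmitool, which is where events like 'Brown-out recovery' typically show up. A sketch (hostname and credentials below are placeholders):

Code:
# read the system event log of a remote BMC over the network
ipmitool -I lanplus -H judah-ipmi.example.com -U admin -P 'secret' sel elist

# or locally on the node itself, if the IPMI kernel drivers are loaded
ipmitool sel elist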
 
