[SOLVED] Proxmox VE Cluster reboots after switch shutdown

comfreak

Hello all,

A week ago, over the weekend, I did some maintenance work, for which I had to shut down the two switches of the Proxmox/Ceph cluster one after the other. We are using the system as HCI for our VMs, which works pretty well, and we are happy with it.

Every server has a connection to both switches. The Ceph NICs (client and cluster traffic together) are configured via a Linux bond (bond110), the Corosync NICs via the configuration made when joining the cluster, and the public NICs also via a Linux bond (bond0). Ceph, Corosync, and public traffic are separated via VLANs.
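For context, the relevant part of /etc/network/interfaces on pve01 looks roughly like this (simplified sketch - the slave NICs and addresses are taken from the logs and routes below, the miimon and bridge options are just typical defaults and may differ from our actual config):

Code:
# Ceph client + cluster traffic, active-backup across both switches
auto bond110
iface bond110 inet static
        address 10.110.1.1/16
        bond-slaves enp131s0f0 enp131s0f1
        bond-mode active-backup
        bond-miimon 100

# First Corosync link (vmbr1202 with 10.120.1.2 is set up the same way
# on the second Corosync NIC)
auto vmbr1201
iface vmbr1201 inet static
        address 10.120.1.1/16
        bridge-ports enp131s0f2
        bridge-stp off
        bridge-fd 0

# Public traffic
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode active-backup
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 192.168.213.222/23
        gateway 192.168.213.91
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0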

[Attached images: ProxmoxCluster.png, 1618842610022.png]

1) I shut down switch 2, which was no problem - probably because the servers' NICs connected to that switch were the backup slaves.
2) I did the maintenance, booted switch 2 again, and waited until it had finished starting. I could see in the syslog that the NICs were connected again.
3) I shut down switch 1, and after a while the servers started to reboot.

That surprised me, as I thought this configuration would allow me to shut down one switch without interruption.

I tried to understand what was going on, but could not find any reason for it. Probably Corosync did not get back in sync.

Code:
Apr 10 16:23:22 pve01 kernel: [262883.826693] i40e 0000:83:00.0 enp131s0f0: NIC Link is Down
Apr 10 16:23:22 pve01 kernel: [262883.828315] i40e 0000:83:00.2 enp131s0f2: NIC Link is Down
Apr 10 16:23:22 pve01 kernel: [262883.915091] bond110: (slave enp131s0f0): link status definitely down, disabling slave
Apr 10 16:23:22 pve01 kernel: [262883.915093] bond110: (slave enp131s0f1): making interface the new active one
Apr 10 16:23:22 pve01 kernel: [262884.327246] igb 0000:06:00.0 eno1: igb: eno1 NIC Link is Down
Apr 10 16:23:22 pve01 kernel: [262884.331177] bond0: (slave eno1): link status definitely down, disabling slave
Apr 10 16:23:22 pve01 kernel: [262884.331179] bond0: (slave eno2): making interface the new active one
Apr 10 16:23:22 pve01 kernel: [262884.331181] device eno1 left promiscuous mode
Apr 10 16:23:22 pve01 kernel: [262884.331263] device eno2 entered promiscuous mode
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 5 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 5 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 4 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 4 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 2 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 2 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 3 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] link: host: 3 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 5 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 5 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 4 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 4 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 2 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 2 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 3 has no active links
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 10 16:23:23 pve01 corosync[3324]:   [KNET  ] host: host: 3 has no active links
Apr 10 16:23:23 pve01 kernel: [262884.855437] vmbr1201: port 1(enp131s0f2) entered disabled state
Apr 10 16:23:24 pve01 corosync[3324]:   [TOTEM ] Token has not been received in 2212 ms
Apr 10 16:23:24 pve01 corosync[3324]:   [TOTEM ] A processor failed, forming new configuration.
Apr 10 16:23:28 pve01 corosync[3324]:   [TOTEM ] A new membership (1.166) was formed. Members left: 2 3 4 5
Apr 10 16:23:28 pve01 corosync[3324]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5
Apr 10 16:23:28 pve01 corosync[3324]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 10 16:23:28 pve01 corosync[3324]:   [QUORUM] Members[1]: 1
Apr 10 16:23:28 pve01 pmxcfs[3109]: [dcdb] notice: members: 1/3109
Apr 10 16:23:28 pve01 corosync[3324]:   [MAIN  ] Completed service synchronization, ready to provide service.

Why did all servers reboot?

I attached the relevant syslog parts. If you need more information, please let me know.
 

Attachments

  • syslog.log (13.1 KB)
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 5 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 5 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 4 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 4 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 2 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 2 link: 1 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 3 link: 0 is down
Apr 10 16:23:23 pve01 corosync[3324]: [KNET ] link: host: 3 link: 1 is down
Somehow both corosync links (and probably other interfaces as well) went down at the same time.

I would investigate why the connections to the second switch went down once you shut down the first one, and whether you can reproduce that.

One thing that could be the reason for Corosync is that both NICs have addresses in the same subnet - meaning effectively only one of the NICs will be used for routing. Separating the Corosync networks into different subnets should result in each NIC being the route for its own subnet.

Check the output of ip route. There is likely only one NIC actually routing the traffic to the 10.120.0.0/16 network.
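A quick way to see which NIC the kernel actually picks is ip route get with the Corosync address of another node (the address below is just an example, use one from your setup):

Code:
# shows the outgoing device and source address chosen for that destination
ip route get 10.120.2.1

With both bridge interfaces in the same /16, traffic to a peer's ring0 and ring1 address will usually leave through the same device, so one failing uplink can take down both Corosync links at once.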
 
Thanks for your feedback. I also saw that all NICs went down and wondered why.

Yes, I already had the idea to test it again without any VMs running, so as not to break anything by accident. :)

Code:
root@pve01:~# ip route
default via 192.168.213.91 dev vmbr0 proto kernel onlink
10.110.0.0/16 dev bond110 proto kernel scope link src 10.110.1.1
10.120.0.0/16 dev vmbr1201 proto kernel scope link src 10.120.1.1
10.120.0.0/16 dev vmbr1202 proto kernel scope link src 10.120.1.2
192.168.212.0/23 dev vmbr0 proto kernel scope link src 192.168.213.222

It seems both NICs (10.120.1.1 and 10.120.1.2) are in the routing table?

Is it best practice to separate the Corosync NICs into different subnets? Would that also be best practice for the Ceph NICs?

Maybe the Corosync config helps?

Code:
root@pve01:~# cat /etc/pve/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.120.1.1
    ring1_addr: 10.120.1.2
  }
  node {
    name: pve02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.120.2.1
    ring1_addr: 10.120.2.2
  }
  node {
    name: pve03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.120.3.1
    ring1_addr: 10.120.3.2
  }
  node {
    name: pve04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.120.4.1
    ring1_addr: 10.120.4.2
  }
  node {
    name: pve05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.120.5.1
    ring1_addr: 10.120.5.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Meta
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Edit: Attached Corosync config.
 
That's a somewhat crazy setup, having every node in a different subnet and both links in the same subnet... Maybe that's the core of the problem with the switch restarts.

You can test the switches with a setup like this:
Code:
node1 ring0 x.x.x.1
node1 ring1 x.x.y.1
node2 ring0 x.x.x.2
node2 ring1 x.x.y.2
...
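Applied to your five nodes, that could for example look like this (the two /24 subnets are just placeholders, not a recommendation for specific ranges):

Code:
# ring0 addresses in one small subnet, ring1 addresses in another
pve01: ring0 10.121.0.1   ring1 10.122.0.1
pve02: ring0 10.121.0.2   ring1 10.122.0.2
pve03: ring0 10.121.0.3   ring1 10.122.0.3
pve04: ring0 10.121.0.4   ring1 10.122.0.4
pve05: ring0 10.121.0.5   ring1 10.122.0.5
# e.g. 10.121.0.0/24 for link 0 and 10.122.0.0/24 for link 1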
 
That's a somewhat crazy setup, having every node in a different subnet and both links in the same subnet
The nodes are all in the same subnet, see the /16 in the CIDR ;)

But yes, I would avoid having multiple interfaces with addresses in the same subnet. It can cause all kinds of unexpected behavior, and I think the problem you ran into is one of them.

Try to place the two Corosync links into separate (much smaller) subnets and, if you can, test it again - it should then behave better, with one link staying up. To avoid such a large impact on the working cluster, you can disable the HA services for that time.

First stop all LRM services in the cluster and then the CRM. To start them again, do it in the same order: first start the LRM services on all nodes, then the CRM.
Code:
systemctl stop pve-ha-lrm

systemctl stop pve-ha-crm
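For completeness, the corresponding start commands afterwards (run on all nodes, LRM first as described above):

Code:
systemctl start pve-ha-lrm

systemctl start pve-ha-crm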
 
Exactly - all nodes and their NICs are in the same subnet (/16).

Thanks, I will try your suggestion!
 
Short feedback: Yes - separating the networks worked! Thank you!
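For anyone finding this later: the end result is that each Corosync NIC gets its own route, along these lines (placeholder /24 subnets, not my actual output):

Code:
10.121.0.0/24 dev vmbr1201 proto kernel scope link src 10.121.0.1
10.122.0.0/24 dev vmbr1202 proto kernel scope link src 10.122.0.1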

/closed
 
