Hi,
I'm running a 4-node cluster (PVE 5.0) with multiple network cards.
Ceph and Corosync run over a Brocade VDX 6740 switch in switchport mode; the internet gateway and the VM bridge are on a normal 1 Gb ProCurve switch. No VLANs, no trunking, just plain switching.
I'm 1600 km away from the datacenter, and one NIC, port, or cable of the 10 Gb network failed.
After rebooting the node, it hung at the Mellanox NIC. I connected via the BMC remote console and changed the error control setting in the BIOS; the node came up again.
I configured Ceph on a bridge vmbr1 and Corosync on an alias vmbr1:1, using the IPs from the failed NIC. I can ping all member nodes on all IPs, and multicast tested with omping is working.
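For reference, the relevant part of /etc/network/interfaces now looks roughly like this (addresses and the bridge port are placeholders; the real addresses are the ones that used to live on the failed 10 Gb NIC):

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.13        # Ceph IP (example)
        netmask 255.255.255.0
        bridge_ports eth2          # example port name
        bridge_stp off
        bridge_fd 0

auto vmbr1:1
iface vmbr1:1 inet static
        address 10.10.20.13        # Corosync IP (example)
        netmask 255.255.255.0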
Ceph is working, but the corosync service does not start. corosync.conf is the same on all nodes, and on all other nodes the failed node is still listed in /.members.
When I list the members on the failed node, the output is only (nodename: "xxx.xxx.srv03.xx" version: 0), and the first line of the corosync.service status output is "no resources configured". But nothing except the IP binding was changed.
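Roughly, the checks on the failed node were something like this (commands from memory; the member list is the one pmxcfs publishes):

cat /etc/pve/.members        # only shows the nodename and version: 0 here
systemctl status corosync    # first line: "no resources configured"
pvecm status                 # reports no quorum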
Some VMs and LXC containers were up and running on Ceph storage when the NIC failed.
Why is there no quorum after binding the same IP to another NIC? How can I re-add the node?
Thanks for your support.