Hi,
I'm running a 4-node cluster (PVE 5.0) with multiple network cards.
Ceph and Corosync run over a Brocade VDX 6740 switch in switchport mode; the internet gateway and the VM bridge are on a normal 1 Gb ProCurve switch. No VLANs, no trunking, just plain switching.
I'm 1600 km away from the datacenter, and one NIC, port, or cable of the 10 Gb network failed.
After rebooting the node, it hung at the Mellanox NIC. I connected via the BMC remote console and changed the error control setting in the BIOS; the node came up again.
I configured Ceph on a bridge vmbr1 and Corosync on an alias vmbr1:1, using the IPs from the failed NIC. I can ping all member nodes on all IPs, and multicast tested with omping is working.
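For reference, the relevant part of /etc/network/interfaces now looks roughly like this (addresses and the bridge port are placeholders; the real addresses are the ones that used to live on the failed 10 Gb NIC):

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.13        # Ceph IP (example)
        netmask 255.255.255.0
        bridge_ports eth2          # example port name
        bridge_stp off
        bridge_fd 0

auto vmbr1:1
iface vmbr1:1 inet static
        address 10.10.20.13        # Corosync IP (example)
        netmask 255.255.255.0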
Ceph is working, but the corosync service does not start. corosync.conf is the same on all nodes, and on all other nodes the failed node is still listed in /.members.
When I list the members on the failed node, the output is only (nodename: "xxx.xxx.srv03.xx" version: 0), and the first line of the corosync.service status output is "no resources configured". But nothing except the IP binding was changed.
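Roughly, the checks on the failed node were something like this (commands from memory; the member list is the one pmxcfs publishes):

cat /etc/pve/.members        # only shows the nodename and version: 0 here
systemctl status corosync    # first line: "no resources configured"
pvecm status                 # reports no quorum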
Some VMs and LXC containers were up and running on Ceph storage when the NIC failed.
Why is there no quorum after binding the same IP to another NIC? How can I re-add the node?
Thanks for your support.