Cluster unavailable after adding about 15 Nodes

Luhahn

New Member
Sep 16, 2020
Hi,

we're running Proxmox 6.2. When we add a certain number of nodes to the cluster, Proxmox starts to lose connection to all of the nodes and the cluster shuts down completely. All nodes are still reachable over SSH. This state persists until we remove about 1-2 nodes from the cluster; after that it is stable again.
It does not matter which of the nodes we remove, and no particular order of adding the nodes is required to reproduce this behavior.

dmesg reports that a task is hanging:

Bash:
[ 4714.510601] INFO: task pvesr:2692 blocked for more than 362 seconds.
[ 4714.510632]       Tainted: P          IO      5.4.60-1-pve #1
[ 4714.510649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4714.510672] pvesr           D    0  2692      1 0x00000000
[ 4714.510674] Call Trace:
[ 4714.510683]  __schedule+0x2e6/0x6f0
[ 4714.510687]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[ 4714.510689]  schedule+0x33/0xa0
[ 4714.510692]  rwsem_down_write_slowpath+0x2ed/0x4a0
[ 4714.510694]  down_write+0x3d/0x40
[ 4714.510696]  filename_create+0x8e/0x180
[ 4714.510697]  do_mkdirat+0x59/0x110
[ 4714.510699]  __x64_sys_mkdir+0x1b/0x20
[ 4714.510702]  do_syscall_64+0x57/0x190
[ 4714.510704]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4714.510706] RIP: 0033:0x7f67b5d920d7
[ 4714.510710] Code: Bad RIP value.
[ 4714.510711] RSP: 002b:00007ffed8d437b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 4714.510712] RAX: ffffffffffffffda RBX: 0000560f367db260 RCX: 00007f67b5d920d7
[ 4714.510713] RDX: 0000560f3621b3d4 RSI: 00000000000001ff RDI: 0000560f3a877be0
[ 4714.510713] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[ 4714.510714] R10: 0000000000000000 R11: 0000000000000246 R12: 0000560f37c149e8
[ 4714.510715] R13: 0000560f3a877be0 R14: 0000560f3a4f0f70 R15: 00000000000001ff
At first we thought it was network-related due to a bug in the Intel firmware, but that should have been fixed by a BIOS update applied yesterday.
We also tried upgrading the kernel and installing the intel-microcode package.

We're currently a bit out of ideas. Does anyone know which task exactly is hanging?
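In case it helps with the diagnosis: `pvesr` is the Proxmox storage-replication scheduler, and the trace shows it blocking on a `mkdir` inside `/etc/pve`, the pmxcfs cluster filesystem, which suspends writes whenever the cluster loses quorum. So the hang looks like a symptom of the quorum loss rather than its cause. A sketch of commands to check the quorum state (each one is guarded so the snippet also runs on a non-PVE host):

```shell
# pvesr blocks on mkdir inside /etc/pve (pmxcfs), which suspends writes
# when quorum is lost. Check quorum and the cluster filesystem:
quorum_info=$(pvecm status 2>/dev/null || echo "pvecm not available on this host")
echo "$quorum_info"                    # quorum state, vote counts
pmxcfs_info=$(systemctl status pve-cluster 2>/dev/null || echo "pve-cluster unit not found")
echo "$pmxcfs_info"                    # pmxcfs, which backs /etc/pve
journalctl -u corosync -n 50 --no-pager 2>/dev/null || true   # recent membership logs
```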
 

Luhahn

Corosync also constantly reports that the token was lost / not received. I've attached the part of the syslog from when the error showed up.
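For reference, the token messages can be pulled out of the syslog like this (a sketch; the pattern matches the usual corosync TOTEM wording, e.g. "Token has not been received in ... ms"):

```shell
# Extract recent corosync token/membership messages from the syslog.
matches=$(grep -iE "corosync.*(token|membership|knet)" /var/log/syslog 2>/dev/null | tail -n 20)
echo "${matches:-no matching syslog entries on this host}"
```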
 

Attachments

Rares

Member
Feb 28, 2017
Do you have a separate/dedicated network for corosync? Is there congestion on the network? Do you combine storage with the corosync network? What connections/capacity do you have between the nodes?
 

Luhahn

We have a separate network for Proxmox only, which includes corosync. The Proxmox nodes also get their internet connection over this link. There is no congestion that we're aware of; for example, we ran several iperf3 tests over the link at full speed for about 30 minutes and there were no outages. We do not combine storage with the corosync network. The nodes run on a 1 Gbit network. The setup is roughly: 5 nodes -> switch A -> switch B -> switch C -> rest of the nodes.
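Worth noting that corosync is far more sensitive to latency spikes than to bandwidth, so an iperf3 throughput test can pass while the token still times out, especially across three switch hops. Latency can be measured with omping, the tool the Proxmox docs suggest for this (node names below are placeholders; run the same command on every node at the same time):

```shell
# Measure corosync-relevant point-to-point latency with omping.
# node01..node03 are placeholder hostnames.
if command -v omping >/dev/null 2>&1; then
    result=$(omping -c 600 -i 1 -q node01 node02 node03)
else
    result="omping not installed (apt-get install omping)"
fi
echo "$result"
```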
 
Jun 8, 2016
Johannesburg, South Africa
Try changing your Corosync configuration to being a little less trigger happy. Herewith a working sample:
Code:
[admin@kvm5a ~]# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm5a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.254.1.2
  }
  node {
    name: kvm5b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.254.1.3
  }
  node {
    name: kvm5c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.254.1.4
  }
  node {
    name: kvm5d
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.254.1.5
  }
  node {
    name: kvm5e
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.254.1.6
  }
  node {
    name: kvm5f
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.254.1.7
  }
  node {
    name: kvm5g
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.254.1.8
  }
  node {
    name: kvm5h
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.254.1.9
  }
  node {
    name: kvm5i
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.254.1.10
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: kvm5
  config_version: 17
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  token: 1000
  version: 2
}
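If you do adopt a configuration like the above, note that corosync and pmxcfs only pick up the new file when config_version has been incremented. A minimal sketch of the bump, demonstrated on a throwaway copy with a hypothetical stanza (on a real cluster you would edit /etc/pve/corosync.conf so pmxcfs distributes it):

```shell
# Demonstrate the config_version bump on a throwaway copy.
cat > /tmp/corosync.conf.new <<'EOF'
totem {
  cluster_name: kvm5
  config_version: 17
  token: 1000
}
EOF
# Increment config_version; a file whose version did not grow is ignored.
awk '/config_version:/ { $2 = $2 + 1 } { print }' \
    /tmp/corosync.conf.new > /tmp/corosync.conf.bumped
grep config_version /tmp/corosync.conf.bumped
```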

We generally implement 5 uplinks per server: one for dedicated out-of-band management access (Intel RMM / Dell iDRAC / HP iLO), a 2 x 10G SFP+ LACP bond for Ceph and Corosync, and a 2 x 10G UTP LACP bond without recirculation for VM traffic.

Herewith the network init script for using OvS:
Code:
[admin@kvm5a ~]# cat /etc/network/interfaces
auto lo
iface lo inet loopback

allow-vmbr0 bond0
iface bond0 inet manual
ovs_bridge vmbr0
ovs_type OVSBond
ovs_bonds eth0 eth1
pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
ovs_options bond_mode=balance-slb lacp=active other_config:lacp-time=fast other_config:bond-rebalance-interval=60000 tag=1 vlan_mode=native-untagged
mtu 9216

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports bond0 vlan1
mtu 9216

allow-vmbr0 vlan1
iface vlan1 inet static
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=1
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 198.19.17.66
netmask 255.255.255.224
gateway 198.19.17.65
mtu 1500

allow-vmbr1 bond1
iface bond1 inet manual
ovs_bridge vmbr1
ovs_type OVSBond
ovs_bonds eth2 eth3
pre-up ( ifconfig eth2 mtu 9216 && ifconfig eth3 mtu 9216 )
ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
mtu 9216

auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
ovs_type OVSBridge
ovs_ports bond1 vlan33
mtu 9216

allow-vmbr1 vlan33
iface vlan33 inet static
ovs_type OVSIntPort
ovs_bridge vmbr1
ovs_options tag=33
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.254.1.2
netmask  255.255.255.0
mtu 9212

Fencing a node is a massively disruptive event, you really only want this to happen when it's really offline...
 
vlan1 is untagged for VM traffic on the 1st LACP bond, which runs balance-slb to avoid double processing packets and to hash outgoing data evenly across the slave interfaces. We run primarily virtual firewalls and routers on this cluster, so packets per second at low latency is our priority.
Intel network cards have integrated flow directors, which steer inbound data for a stream back to the IRQ lines of the cores generating that stream's outbound data.

vlan33 is tagged for Ceph *and* Corosync on the 2nd LACP bond (vmbr1 on bond1)
 

Luhahn

Thank you, I will test the corosync config you've provided. In addition, we're trying to rearrange our network setup today. Maybe this helps :)
 
Should it help you, or anyone else:

Herewith an example of converting a host using the Linux bridge to OvS (Open vSwitch).

Original Linux bridge:
Code:
    auto lo
    iface lo inet loopback
   
    auto bond0
    iface bond0 inet manual
            slaves eth0,eth1
            bond_miimon 100
            bond_mode 802.3ad
            bond_lacp_rate 1
            mtu 9000
   
    auto eth0
    iface eth0 inet manual
            bond-master bond0
            bond-primary eth0
            mtu 9000
   
    auto eth1
    iface eth1 inet manual
            bond-master bond0
            mtu 9000
   
    auto vmbr0
    iface vmbr0 inet static
            address 192.168.15.2
            netmask 255.255.255.0
            gateway 192.168.15.1
            bridge_ports bond0
            bridge_stp off
            bridge_fd 0
            mtu 1500
New OvS:
Code:
apt-get install openvswitch-switch;

cat /etc/network/interfaces
    auto lo
    iface lo inet loopback
   
    allow-vmbr0 bond0
    iface bond0 inet manual
            ovs_bridge vmbr0
            ovs_type OVSBond
            ovs_bonds eth0 eth1
            pre-up ( ifconfig eth0 mtu 9000 && ifconfig eth1 mtu 9000 )
            ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
            mtu 9000
   
    auto vmbr0
    allow-ovs vmbr0
    iface vmbr0 inet manual
            ovs_type OVSBridge
            ovs_ports bond0 vlan1
            mtu 9000
   
    allow-vmbr0 vlan1
    iface vlan1 inet static
            ovs_type OVSIntPort
            ovs_bridge vmbr0
            ovs_options tag=1
            ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
            address 192.168.15.2
            netmask 255.255.255.0
            gateway 192.168.15.1
            mtu 1500

In the example above the server is a Dell where the maximum MTU on the NICs is limited to 9000 bytes. You can test this ahead of editing the network configuration file (/etc/network/interfaces) by running 'ifconfig eth0 mtu 9216' and working backwards.

In both examples:
  • bond0 = LACP in 'fast' mode using eth0 and eth1
  • untagged traffic on bond is associated with 'vlan1' interface (controlled by 'tag=1 vlan_mode=native-untagged')
  • Double reference vmbr0 (with 'ovs_bridge vmbr0' in interfaces and 'ovs_ports bond0 vlan1' in bridge definition)
  • Set the MTU of the slave and bond interfaces to the lowest maximum your network cards and switches support. This implies enabling jumbo frames, or setting the maximum frame size manually, on your switches. Netgear M4300 switches support frames of over 10K bytes, most Intel cards support 9216, and many Dell 'tg3' NICs are limited to 9000 or even 8996 bytes. Try to use an even multiple of the default 1500-byte frame size to reduce overhead.
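The MTU probing mentioned above can also be done end to end with ping and the don't-fragment flag, which verifies the whole switch path rather than just the local NIC. A sketch (the peer address is a placeholder):

```shell
# An IPv4 ICMP echo adds 28 bytes of headers (20 IP + 8 ICMP), so a
# 9000-byte MTU carries at most an 8972-byte ping payload.
mtu=9000
payload=$((mtu - 28))
echo "probing a $mtu MTU path: ping -M do -c 3 -s $payload <peer>"
# On a live network (placeholder peer address):
#   ping -M do -c 3 -s "$payload" 10.254.1.3
```

If the ping fails with "message too long", work the payload downwards until it passes.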
 
