[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Status
Not open for further replies.
Active-backup should work fine too; I don't see any reason not to use it.
Sorry, I got that mixed up. I meant balance-tcp and was referring to https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#_linux_bond which says that
If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.

But during my limited testing it looks as if other bonding modes can work fine too.

(Perhaps getting a bit OT here.)
 

The only really stable bonding modes are active-backup and LACP. (On Open vSwitch you have LACP with balance-tcp or LACP with balance-slb; this is the same as a Linux bond in LACP mode with the layer2 or layer3+4 hash algorithm.)

The problem with balancing a single TCP connection across multiple links is that packets can arrive out of order at the switch, and you can get drops and retransmits. (That's why LACP load balancing always keeps the same TCP connection on the same link.)
That's also why other Linux bond modes like balance-rr can give you problems.
 
Yeah, according to the lacp specs, correct ordering is required, so since I'm on lacp+balance-tcp I'm good from that standpoint. The xor hashing both in ovs and the Cisco switch with src-dst-tcp should be 100% deterministic and make each tcp stream stay on a single interface.
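
As a rough illustration of why a deterministic hash keeps each stream on one link, the sketch below XORs a few flow fields and takes the result modulo the number of links. This is a simplified stand-in, not the actual hash used by Open vSwitch or by Cisco switches; the function name `flow_link` and the choice of input fields are assumptions made for the example.

```shell
#!/bin/sh
# Simplified flow-to-link mapping (assumption: NOT the real OVS/Cisco hash).
# Arguments: last octet of source IP, last octet of destination IP,
#            source port, destination port, number of links in the bond.
flow_link() {
    echo $(( ($1 ^ $2 ^ $3 ^ $4) % $5 ))
}

# The same 4-tuple always maps to the same link index, so packets of one
# TCP stream never take different physical paths and cannot be reordered
# by the bond itself.
flow_link 9 10 40000 8006 2
flow_link 9 10 40000 8006 2
```

Both calls print the same link index. A round-robin mode such as balance-rr has no such property, because consecutive packets of one stream are spread over all links.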

This is my understanding anyway, and I was just thinking whether the recommendation to only use active-backup for corosync still holds up (and why, except for packet reordering) or if it perhaps stems from pre-lacp days?

I've found various bonding recommendations for corosync rrp traffic other places (including using balance-rr fwiw, which is obviously sketchy...)

Please tell me if this belongs in another thread.
 
This is my understanding anyway, and I was just thinking whether the recommendation to only use active-backup for corosync still holds up (and why, except for packet reordering) or if it perhaps stems from pre-lacp days?
This comes from here:
http://lists.linux-ha.org/pipermail/linux-ha/2013-January/046295.html

Not sure about multicast with LACP. But now there is no more multicast, so there is no reason not to use LACP.

I've found various bonding recommendations for corosync rrp traffic other places (including using balance-rr fwiw, which is obviously sketchy...)
rrp is another way to handle failover, at the corosync level.

(I have personally been using LACP with corosync since 2012 without any problems, but with udpu, so no multicast.)
 
We have two clusters in which we host virtual routers and firewalls. Heavy network traffic causes jitter and sometimes even packet loss with the default LACP OvS configuration so we run a sort of hybrid. The root cause is that Intel X520 network cards support receive side steering where they compute a hash and then pass return traffic that is part of an identified stream back to the same queue that sent the packet.

With the default 'balance-tcp' mode the packets get transmitted and then recirculate through OvS to be sent out the balanced interface, which causes problems. In the environment with a lot of virtual routers and firewalls we run things slightly differently: we use balance-slb together with LACP to periodically rebalance the streams.

Normal OvS LACP:
Code:
ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=200 vlan_mode=native-untagged


No recirculation:
Code:
ovs_options bond_mode=balance-slb lacp=active other_config:lacp-time=fast other_config:bond-rebalance-interval=60000 tag=200 vlan_mode=native-untagged


NB: This is not the case in the cluster with the 2 x 1 GbE links where we regularly observe corosync having problems...


@spirit: The problem cluster with the 2 x 1 GbE links was running a ping between all three hosts and we observe no network connectivity issues with any other application, Ceph or diagnostic pings.
 
Sep 25 07:45:52 node-22 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Sep 25 07:45:52 node-22 systemd[1]: corosync.service: Failed with result 'signal'.

Code:
dpkg -l |grep knet
ii  libknet1:amd64                       1.11-pve2                       amd64        kronosnet core switching implementation

After a simple restart of corosync the cluster was quorate again.
 
if I understand you correctly, your "problematic cluster" is configured using a single knet link (with a bond of two interfaces underneath)?

if so, it would be interesting to test the following:
- drop token timeout to default value again (effectively: 1000ms + (N-2) * 650ms, where N == number of nodes)
- set knet_ping_interval of the link to 200 (ms)
- set knet_ping_timeout to 5000 (ms)
- set knet_pong_count to 1

this should have the effect that the token will time out on the corosync level before knet decides a link is down, instead of the other way round. it will also slightly increase the load since it will now send ~5 pings/s, instead of ~1.

the default settings are to send a ping every token timeout / 10, with a timeout of token timeout / 5 for marking a link down. so if a single ping times out and marks the link as down, it's already hard to get 5 good pongs back before the token times out (sending 5 pings takes at least half of the token timeout!). this is probably the reason why we see such big problems in slightly unstable networks.

for single-link setups, we want corosync to detect token failure rather than knet to detect link failure, and for knet to mark links as up fast. for multi-link setups, we actually want knet to detect link failure fast to do a failover before the token times out, but marking links as up again fast is not as important since we want to switch back to a previously failed link only if it is really stable again. upstream is currently working on better auto-detection/configuration for this interaction.
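
The arithmetic in this post can be checked with a small sketch. It uses only the numbers given above (token 1000 ms, coefficient 650 ms, ping interval of token / 10, ping timeout of token / 5); the function names are made up for the example, and the clamp for fewer than 3 nodes is an assumption about how the coefficient is applied.

```shell
#!/bin/sh
# Effective token timeout in ms, per the formula quoted above:
# token + (N - 2) * coefficient, applied for 3 or more nodes.
effective_token_ms() {
    token=$1; nodes=$2; coeff=${3:-650}
    if [ "$nodes" -ge 3 ]; then
        echo $(( token + (nodes - 2) * coeff ))
    else
        echo "$token"
    fi
}

# knet defaults as described in the post (token / 10 and token / 5).
default_ping_interval_ms() { echo $(( $1 / 10 )); }
default_ping_timeout_ms()  { echo $(( $1 / 5 )); }

t=$(effective_token_ms 1000 3)
echo "effective token timeout: ${t} ms"
echo "default ping interval:   $(default_ping_interval_ms "$t") ms"
echo "default ping timeout:    $(default_ping_timeout_ms "$t") ms"
echo "time to send 5 pings:    $(( 5 * $(default_ping_interval_ms "$t") )) ms"
```

For three nodes this gives an effective token timeout of 1650 ms and 825 ms to send five pings, which is the "half of the token timeout" mentioned above. With the suggested values (interval 200, timeout 5000, pong count 1) the knet timeout of 5000 ms is well above 1650 ms, so the token times out first.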
 
Looks like the corosync crash should be fixed from this PR: https://github.com/kronosnet/kronosnet/pull/257

And is now part of pvetest ... anyone experiencing the corosync crash, please try it! Direct link here if not on pvetest:

http://download.proxmox.com/debian/dists/buster/pvetest/binary-amd64/libknet1_1.11-pve2_amd64.deb

I still have nodes at 5.4 in my cluster (I suspended my upgrades when I discovered this issue). Is there an updated libknet for this release? I'm using this repo: http://download.proxmox.com/debian/corosync-3/.

Thanks.
 
In my environment, with libknet* 1.12-pve1 (from the no-subscription repo), the cluster has become much more stable (no "link down" messages and no corosync segfaults so far, after more than 48 hours).
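
To check whether a node already has at least the version reported as stable here, something like the following sketch could be used. The helper names `knet_version` and `version_ge` are made up for the example, and `sort -V` (GNU coreutils) is used only as an approximation of Debian version ordering.

```shell
#!/bin/sh
# Extract the libknet1 version from `dpkg -l` output read on stdin.
knet_version() {
    awk '$2 ~ /^libknet1/ { print $3; exit }'
}

# True when version $1 >= version $2, using GNU `sort -V` ordering
# (assumption: close enough to Debian ordering for these version strings).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Sample line copied from the thread; on a node, pipe `dpkg -l` in instead.
installed=$(printf '%s\n' \
    'ii  libknet1:amd64  1.11-pve2  amd64  kronosnet core switching implementation' \
    | knet_version)

if version_ge "$installed" 1.12-pve1; then
    echo "libknet1 $installed: at or above 1.12-pve1"
else
    echo "libknet1 $installed: older than 1.12-pve1"
fi
```

On a real Debian-based node, `dpkg --compare-versions "$installed" ge 1.12-pve1` is the more reliable comparison, since it follows Debian's own version rules.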
 
we have a test build already, should be available soon on the public repo as well.

available now both for the Stretch -> Buster / PVE 5.4 -> PVE 6.x upgrade repository ('corosync-3') as well as in PVE-6's pve-enterprise repo!
 
@spirit

Good thinking. I scanned through the corosync 3 documentation, and my understanding is that the token timeout is automatically adjusted by the coefficient when there are 3 or more nodes, so I made the following changes:

On all three nodes initially:
Code:
systemctl stop pve-ha-lrm;
systemctl stop pve-ha-crm;


Then I edited the PVE-distributed Corosync configuration file (remember to increment config_version):
Code:
[admin@kvm1 ~]# cat /etc/pve/corosync.conf
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: kvm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 1.1.7.9
  }
  node {
    name: kvm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 1.1.7.10
  }
  node {
    name: kvm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 1.1.7.11
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 6
  interface {
    linknumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 1000
}


Essentially, this sets link 0 to ping at an interval of 0.2 s with a timeout of 5 s, and to count a single pong as success.

Then ran the following on all three nodes to activate the changes:
Code:
systemctl restart corosync;
systemctl restart pve-cluster.service;
corosync-cfgtool -s;
pvecm status;

# Run the corosync-cfgtool and pvecm commands to ensure everything is back up and in quorum

systemctl start pve-ha-lrm;
systemctl start pve-ha-crm;
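
Since forgetting to increment config_version is an easy mistake, the bump can be scripted. The sketch below is one possible approach, not part of the procedure described in this post; the function name `bump_config_version` is made up, and it only rewrites text passed on stdin rather than touching any file.

```shell
#!/bin/sh
# Read a corosync.conf on stdin and write it to stdout with
# config_version incremented by one; all other lines pass through unchanged.
bump_config_version() {
    awk '/^[ \t]*config_version:/ { sub(/[0-9]+/, $2 + 1) } { print }'
}

# Small sample based on the totem section above.
printf '%s\n' 'totem {' '  cluster_name: cluster1' '  config_version: 6' '}' \
    | bump_config_version
```

A cautious workflow (an assumption here, not something the post describes) is to write the result to a separate file, review it, and only then move it over /etc/pve/corosync.conf.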
 
great! looking forward to your feedback!
 
I still have nodes at 5.4 in my cluster (I suspended my upgrades when I discovered this issue). Is there an updated libknet for this release? I'm using this repo: http://download.proxmox.com/debian/corosync-3/.

I also suspended the upgrade because of this issue. I was just about to give it a second try today (because I would have some spare time over the next few days to do the full upgrade of my 7 nodes), but then I saw https://github.com/kronosnet/kronosnet/issues/261 (via https://bugzilla.proxmox.com/show_bug.cgi?id=2326#c40). So now I will wait again and see what comes out of it :-(
 
@spirit

Good thinking. I scanned through the corosync 3 documentation, and my understanding is that the token timeout is automatically adjusted by the coefficient when there are 3 or more nodes, so I made the following changes:

I did not really understand the network topics discussed above. I have a 7-node cluster on Intel NUCs. Would this config also be an option for me? (Sorry for the dumb question.)
 
Hello,
in my case it seems that the problem with corosync has been solved with the last update:
Code:
# dpkg -l | grep knet
ii  libknet1:amd64                       1.12-pve1                       amd64        kronosnet core switching implementation
Before this update, corosync continuously reported "link down":
Code:
Sep 27 00:02:14 proxmox101 corosync[13516]:   [TOTEM ] Retransmit List: 570
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] link: host: 4 link: 0 is down
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] link: host: 2 link: 0 is down
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] host: host: 2 has no active links
Sep 27 00:02:26 proxmox101 corosync[13516]:   [KNET  ] rx: host: 4 link: 0 is up
Sep 27 00:02:26 proxmox101 corosync[13516]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 27 00:02:26 proxmox101 corosync[13516]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 27 00:02:26 proxmox101 corosync[13516]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 27 00:02:35 proxmox101 corosync[13516]:   [TOTEM ] Retransmit List: 5c2
Sep 27 00:02:48 proxmox101 corosync[13516]:   [KNET  ] link: host: 5 link: 0 is down
Sep 27 00:02:48 proxmox101 corosync[13516]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 27 00:02:48 proxmox101 corosync[13516]:   [KNET  ] host: host: 5 has no active links
Sep 27 00:02:55 proxmox101 corosync[13516]:   [KNET  ] rx: host: 5 link: 0 is up
Sep 27 00:02:55 proxmox101 corosync[13516]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
and I had to restart corosync every 5 hours to keep it from crashing the server:
Code:
into /lib/systemd/system/corosync.service

add the line

WatchdogSec=18000
Now I have removed the periodic corosync restart, I no longer see "link down" messages in the syslog, and the cluster (6 nodes) has been stable since Friday.
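
To judge whether an update really made the "link down" messages go away, the events can be counted per peer host. The sketch below assumes the log line format shown in this post; the function name `link_down_per_host` is made up for the example.

```shell
#!/bin/sh
# Count knet "link down" events per peer host from syslog lines on stdin.
link_down_per_host() {
    awk '/\[KNET *\] link: host: [0-9]+ link: [0-9]+ is down/ {
             for (i = 1; i <= NF; i++)
                 if ($i == "host:") { n[$(i + 1)]++; break }
         }
         END { for (h in n) print "host " h ": " n[h] }' | sort
}

# Three sample lines copied from the log excerpt in this post.
link_down_per_host <<'EOF'
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] link: host: 4 link: 0 is down
Sep 27 00:02:20 proxmox101 corosync[13516]:   [KNET  ] link: host: 2 link: 0 is down
Sep 27 00:02:26 proxmox101 corosync[13516]:   [KNET  ] rx: host: 4 link: 0 is up
EOF
```

On a node, the same function could be fed from the journal, for example `journalctl -u corosync | link_down_per_host`.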
 
@spirit
Just a quick update: now a week in, I removed the second fallback ring/link and kept it on one vRack ring only, without adjusting timeouts or anything else (basically the default config otherwise).
The patch works well. There is an occasional fault on that line for a second, without any consequences as far as I can tell for now.
 