Our small HP system cluster, which has bnx2 NICs, is the only one still experiencing regular problems...
Is this a general issue, or do only certain hardware/software combinations see it?
And what about a fresh install of V6, does it have issues with corosync v3?
AFAIK @spirit also has been running larger installations for quite a while without issues.
Yes, indeed: I've been running the corosync 3 beta for 6 months (on Proxmox 5 with kernel 4.15), with no problems so far. I'm using Mellanox ConnectX-4 cards with 2x10Gb LACP bonding. The cluster has 16 nodes, without any special tuning of the corosync configuration.
[root@kvm1 ~]# ethtool -i eth0
driver: bnx2
version: 2.2.6
firmware-version: bc 5.2.3 NCSI 2.0.6
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
Do you see any bnx2 errors in kern.log or dmesg?
kernel 5.0.21-1-pve
eno2 is dedicated to corosync - runs over vRack at OVH (I know, I know)
it was stable with 4 nodes and tinc for about a week
then it was stable for 2 weeks with 3 nodes, using vRack/second NIC instead of tinc
since adding node 4 and the second NIC, corosync crashes the whole thing and reports MTU changes
Do you have the corosync log?
# pveversion -v ?
https://github.com/corosync/corosync/blob/master/exec/totemsrp.c
static int pause_flush (struct totemsrp_instance *instance)
{
    ...
    if ((now_msec - timestamp_msec) > (instance->totem_config->token_timeout / 2)) {
        log_printf (instance->totemsrp_log_level_notice,
            "Process pause detected for %d ms, flushing membership messages.", (unsigned int)(now_msec - timestamp_msec));
        ...
    }
    ...
}
Could the problem be related to jumbo frames and/or dual ring configuration?
I'm facing the same issue - corosync randomly hangs on different nodes.
I have two rings, 10GbE + 1GbE, with MTU = 9000 on both networks.
@bofh
So, I think you can try configuring a higher token timeout in corosync.conf,
maybe try with 10s:
token: 10000
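For what it's worth, here is a rough Python sketch of how the token value relates to the "Process pause detected" check in the totemsrp.c excerpt, which fires once the measured pause exceeds half of token_timeout. This is not corosync code: the name `pause_detected` and the 3000 ms stall are made up for illustration, and it assumes the effective token_timeout equals the configured token, whereas corosync may adjust the runtime value based on the node count.

```python
def pause_detected(now_msec: int, timestamp_msec: int, token_timeout: int) -> bool:
    """Mirror the C condition: the elapsed time exceeds half the token timeout."""
    # Integer division, matching the C expression token_timeout / 2.
    return (now_msec - timestamp_msec) > (token_timeout // 2)


if __name__ == "__main__":
    stall_msec = 3000  # hypothetical stall length, for illustration only
    # 1000 is an arbitrary comparison value; 10000 is the value suggested in the thread.
    for token in (1000, 10000):
        print(token, pause_detected(stall_msec, 0, token))
```

With these made-up numbers, a 3 s stall trips the check at token 1000 but not at token 10000, where the threshold would be 5000 ms.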
I'm really frustrated by the update to Corosync 3. The cluster has been unstable for 2 months.
I have a 5-node cluster: 2 nodes are on the latest version 5, 3 nodes are on the latest version 6.
Nodes are disconnected randomly; the cluster stays online sometimes for a few hours, sometimes a few minutes or a few days. Yesterday pve1 didn't want to connect to the cluster. I tried restarting corosync or the node, but with no result.
Paradoxically, I had to take these steps:
I tried to run corosync on the 2nd Ethernet NIC on pve1 and pve3 (another subnet)... no connection. pve3 didn't connect with pve1, but pve1 connected with pve2, pve4 and pve5! That should have been impossible, because they were set to the 1st subnet, not the same one as pve1.
The next step was to edit the corosync conf with subnet 1 on pve3, and all nodes in the cluster are online again.
I have a question... which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync... maybe this was "the miracle".
I'm going to try adding ring1_addr to the config... maybe with 2 IPs it will be OK.
Have you already applied the latest libknet update (on both the Proxmox 5 nodes (with the corosync 3 repo) and the Proxmox 6 nodes)?
It's /etc/pve/corosync.conf; it will overwrite the one in /etc/corosync and be distributed to all nodes.
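If you do end up editing /etc/pve/corosync.conf by hand, keep in mind that config_version in the totem section has to be increased with every change. Below is a minimal Python sketch of such an edit applied to the file's text. The helper name `bump_and_set_token`, the sample config and its values are all hypothetical, and it assumes config_version lives in the totem section, so a token line added right after it ends up there too.

```python
import re

VERSION_RE = re.compile(r"^([ \t]*config_version:[ \t]*)(\d+)[ \t]*$", re.MULTILINE)
TOKEN_RE = re.compile(r"^([ \t]*token:[ \t]*)\d+[ \t]*$", re.MULTILINE)


def bump_and_set_token(conf_text: str, token_ms: int) -> str:
    """Return a copy of conf_text with token set and config_version incremented."""
    if VERSION_RE.search(conf_text) is None:
        raise ValueError("no config_version line found")

    if TOKEN_RE.search(conf_text):
        # Replace the value on the first existing token line.
        conf_text = TOKEN_RE.sub(lambda m: f"{m.group(1)}{token_ms}", conf_text, count=1)
    else:
        # Assumption: config_version sits in the totem section, so a token line
        # added directly after it (same indentation) lands in totem as well.
        def add_token(m):
            indent = re.match(r"[ \t]*", m.group(1)).group(0)
            return f"{m.group(0)}\n{indent}token: {token_ms}"

        conf_text = VERSION_RE.sub(add_token, conf_text, count=1)

    # Increment config_version so the edited file counts as a newer config.
    return VERSION_RE.sub(
        lambda m: f"{m.group(1)}{int(m.group(2)) + 1}", conf_text, count=1
    )


# Hypothetical sample, not taken from anyone's cluster.
SAMPLE = """totem {
  cluster_name: example
  config_version: 7
  version: 2
}
"""

if __name__ == "__main__":
    print(bump_and_set_token(SAMPLE, 10000))
```

It only transforms a string; writing the result back (ideally to a copy that you review first) is left out on purpose.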
Yeah, restarting the "faulty" node won't help you; you need to restart corosync on the other nodes.
It's a paradox, but this is how it works for whatever reason.
That's not true... sometimes you have to restart the disconnected nodes, sometimes the connected ones.
I've added
options bnx2 disable_msi=1
to /etc/modprobe.d/bnx2.conf
and rebooted all hosts (checked lspci -v for MSI-X: Enable- afterwards). So far no more corosync segfaults, but I have yet to re-enable HA; I don't really trust it yet.
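In case it helps someone verifying the same workaround, here is a small Python sketch that pulls the MSI / MSI-X enable flags out of `lspci -v` text for a single device. The function name `msi_states` and the sample line are illustrative assumptions; only the `MSI-X: Enable-` marker comes from the post above.

```python
import re

# Matches e.g. "MSI: Enable+" or "MSI-X: Enable-" inside a capabilities line.
CAP_RE = re.compile(r"\b(MSI(?:-X)?): Enable([+-])")


def msi_states(lspci_text: str) -> dict:
    """Map 'MSI' / 'MSI-X' to True (enabled) or False (disabled)."""
    return {name: sign == "+" for name, sign in CAP_RE.findall(lspci_text)}


# Hypothetical, trimmed capabilities line used only to exercise the parser.
SAMPLE = "Capabilities: [a0] MSI-X: Enable- Count=9 Masked-"

if __name__ == "__main__":
    print(msi_states(SAMPLE))
```

Feed it the text for one device at a time (for example the NIC at the bus address that ethtool -i reports), since flags from several devices would overwrite each other in the result.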