[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Same behavior even with secauth: off, but now corosync dies with a different signal (FPE instead of SEGV):
Code:
Aug 16 13:29:56 dev-proxmox14 kernel: [1235183.092490] traps: corosync[30294] trap divide error ip:7f5ddd3c78c6 sp:7f5dd17bfa50 error:0 in libknet.so.1.2.0[7f5ddd3bc000+13000]
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Failed with result 'signal'.
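
For anyone wanting to dig into the crash location, the trap line gives everything needed: the faulting offset inside libknet is the ip minus the mapping base (0x7f5ddd3c78c6 - 0x7f5ddd3bc000 = 0xb8c6), which addr2line can map to a function once debug symbols are installed. The library path below is an assumption; adjust it to wherever libknet.so.1.2.0 lives on your system:
Code:
# offset into libknet.so = faulting ip - mapping base from the trap line
addr2line -f -e /usr/lib/x86_64-linux-gnu/libknet.so.1.2.0 0xb8c6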
 
Adding a node to an existing cluster results in all existing nodes being fenced and restarting.
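
For context, the join itself was presumably the standard pvecm procedure, i.e. run on the joining node and pointed at any existing member (the address here is just picked from the nodelist below as an example):
Code:
pvecm add 10.250.1.2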

Also, this was with secauth: off and token: 10000

/etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm7a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.1.2
  }
  node {
    name: kvm7b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.1.3
  }
  node {
    name: kvm7c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.250.1.4
  }
  node {
    name: kvm7d
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.250.1.5
  }
  node {
    name: kvm7e
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.250.1.6
  }
  node {
    name: kvm7f
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.250.1.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: kvm7
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: off
  token: 10000
  version: 2
}
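
For checking whether the links and quorum actually come up with this config, the standard tools work on each node:
Code:
# knet link status per node/link
corosync-cfgtool -s
# membership and quorum state as PVE sees it
pvecm status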
 
What is my best option now? The cluster only survives for roughly a day, and since it's more or less a production cluster, I'm in a position where I'll have to attempt a harsh downgrade with a lot of overtime.

My initial intention with the update was to escape this LXC bug (https://github.com/lxc/lxd/issues/4468),
and now I've got a much bigger problem with these corosync issues.
 
Having similar issues here, with corosync 3 consuming over 6 GB of memory on two of my Proxmox v6 nodes:

(screenshots: graphs showing corosync memory consumption on the two affected nodes)

...and this is with the `secauth: off` workaround applied.
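
For anyone wanting to track this without screenshots, the resident size of the corosync process is easy to watch directly:
Code:
# RSS/VSZ of the running corosync process, in KiB
ps -C corosync -o pid,rss,vsz,cmd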
 
what is my best option now?

Hard to say. We're on PVE 6.0-5 with the same issue. We've now introduced a really weird workaround where we use cron to:
1) start corosync every 5 minutes
2) restart pve-cluster, then let everything settle and let Proxmox do its "job"
3) stop corosync again
4) start over 4 minutes later.

It's a pain, but it works "for now". Manual HA, if you will. It keeps our sysadmins on their toes when a monitoring message arrives or somebody restarts a container.
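
A sketch of such a cron entry (the schedule and the 45-second settle delay here are examples, not our exact setup):
Code:
# /etc/cron.d/corosync-cycle (hypothetical): every 5 minutes start corosync,
# restart pve-cluster, let things settle, then stop corosync again
*/5 * * * * root systemctl start corosync && systemctl restart pve-cluster && sleep 45 && systemctl stop corosync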
 
I modified the corosync systemd service (located at /lib/systemd/system/corosync.service) to auto-restart corosync every 12 hours by adding the following to the [Service] section:

Code:
Restart=always
WatchdogSec=43200

Steps followed:

Code:
vim /lib/systemd/system/corosync.service
<add the above to the [Service] section>
save, exit vim
systemctl daemon-reload
service corosync restart
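
Note that /lib/systemd/system/corosync.service belongs to the package and may be overwritten on upgrade; a drop-in override survives updates and achieves the same:
Code:
systemctl edit corosync
# in the editor that opens, add:
#   [Service]
#   Restart=always
#   WatchdogSec=43200
# the drop-in lands in /etc/systemd/system/corosync.service.d/ and
# systemd is reloaded automatically; then apply it:
systemctl restart corosync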
 
I think that workaround is not enough, because corosync synchronization still fails for a few minutes around each restart.
 
I've been fighting this issue for several weeks, and I hoped it would be OK once I upgraded all nodes to v6... but it seems that is not the right way.
I tried several procedures (e.g. adding ring1 to the config), restarting the cluster, and more... but no luck, guys.
The better way was to restart corosync on all nodes, and... now I'm waiting for nodes to drop offline again, but so far all nodes in the cluster are still green.
I have PVE 5.4-13 on all nodes (except node 1, which is on 6.0-5) and the cluster is working properly.
I'll try to arrange a time when the cluster can be offline; the 12h corosync restart from Jeff is probably the only right way.
 
I've downgraded corosync to 2.4.4-pve1 and it just works.

Our problem now is an unstable igb module in the 5.0.18 kernel, which randomly hangs the network on "Intel Corporation I350 Gigabit Network Connection" cards,
but that's a tale for another support post.
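
For anyone wanting to do the same, a sketch of the downgrade, assuming the 2.4.4-pve1 packages are still available in your configured repos (the exact library package list is an assumption and may differ on your system):
Code:
apt install corosync=2.4.4-pve1 libcorosync-common4=2.4.4-pve1 \
  libcmap4=2.4.4-pve1 libcpg4=2.4.4-pve1 \
  libquorum5=2.4.4-pve1 libvotequorum8=2.4.4-pve1
# keep apt from pulling corosync 3.x back in on the next upgrade
apt-mark hold corosync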
 
We have been experiencing the same issue; the corosync service kept crashing on and off on different nodes of our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces:

Code:
echo 'net.ipv6.conf.eno4.disable_ipv6 = 1' >> /etc/sysctl.conf
sysctl -p

** Replace 'eno4' with the network interface of the cluster network
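
A variant that keeps /etc/sysctl.conf untouched is a drop-in under /etc/sysctl.d/ (the file name is arbitrary); either way, verify the flag afterwards:
Code:
echo 'net.ipv6.conf.eno4.disable_ipv6 = 1' > /etc/sysctl.d/90-cluster-no-ipv6.conf
sysctl --system
# should now print 1:
sysctl net.ipv6.conf.eno4.disable_ipv6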
 
Interesting...

From the corosync 3.0.0 release notes:
https://github.com/corosync/corosync/wiki/Corosync-3.0.0-Release-Notes

Code:
ip_version config setting has new default ipv6-4 (resolve IPv6 address first, if it fails try IPv4). For more info please consult corosync.conf(5) man page. To achieve old behavior (IPv4 only) please set totem.ip_version setting to ipv4.

I don't know if you are in production, but you could try this in corosync.conf:

Code:
totem {
  ip_version: ipv4
}
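
Whether the running corosync actually picked the value up can be checked in its runtime object database (if the key is absent, the compiled-in default applies):
Code:
corosync-cmapctl -g totem.ip_version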
 
I also tried that, but after restarting corosync, for some reason it reverts to the ipv6-4 setting again after a short while.
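
A likely explanation for the "reset": on PVE, the local /etc/corosync/corosync.conf is managed and gets overwritten from the cluster-wide copy on pmxcfs, so the edit has to go there, with config_version bumped so all nodes accept it:
Code:
# edit the cluster-wide config, not /etc/corosync/corosync.conf
nano /etc/pve/corosync.conf
# set ip_version: ipv4 inside totem { } and increment config_version;
# the change then propagates to /etc/corosync/corosync.conf on all nodes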
 
For us (running 6.0-5 with corosync 3.0.2-pve2), ip_version: ipv4 was already the (default?) setting, and we have the stability issues nevertheless.

But I'll try disabling IPv6 on the interface and see if that helps. I'll have to move the cluster interface to a dedicated NIC then, but that's probably a good idea anyway.
 
Just to update everyone: the change we made didn't do the trick; corosync is still crashing randomly (it just stayed stable for about 24h and then started again). We thought we had solved it, but it doesn't seem to be a solution after all.
 
Do you mean that disabling IPv6 on the interface with sysctl didn't fix it?
 
Same here. Disabled IPv6 on the interfaces - still the same random crashes. Today I upgraded to libknet 1.10-pve2 from the enterprise repo; let's see if that changes anything.
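
For comparing notes: the installed and available libknet builds can be checked with a plain apt query (libknet1 is the binary package name here):
Code:
apt policy libknet1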
 