[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues


andrew sp · New Member · Jul 31, 2019
Same behavior, even with secauth: off, but now corosync crashes with a different signal (not SEGV, but FPE):
Code:
Aug 16 13:29:56 dev-proxmox14 kernel: [1235183.092490] traps: corosync[30294] trap divide error ip:7f5ddd3c78c6 sp:7f5dd17bfa50 error:0 in libknet.so.1.2.0[7f5ddd3bc000+13000]
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Failed with result 'signal'.
 
Member · Jun 8, 2016 · Johannesburg, South Africa
Adding a node to an existing cluster results in all existing nodes being fenced and restarting.

Also, this was with secauth: off and token: 10000

/etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm7a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.1.2
  }
  node {
    name: kvm7b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.1.3
  }
  node {
    name: kvm7c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.250.1.4
  }
  node {
    name: kvm7d
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.250.1.5
  }
  node {
    name: kvm7e
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.250.1.6
  }
  node {
    name: kvm7f
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.250.1.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: kvm7
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: off
  token: 10000
  version: 2
}
 

andrew sp · New Member · Jul 31, 2019
What is my best option now? The cluster roughly lives for a day, and since it's more or less a production one, I'm looking at a harsh downgrade with a lot of overtime.

My initial intention with the update was to escape this LXC bug (https://github.com/lxc/lxd/issues/4468),
and now I've got a much bigger problem with these corosync issues.
 

Jeff Billimek · Member · Feb 16, 2018
Having similar issues, with corosync 3 consuming over 6 GB of memory on two of my Proxmox v6 nodes:

[screenshots: node memory graphs showing corosync's memory usage climbing to several GB]

This is with the `secauth: off` workaround already applied.
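For anyone who wants to put numbers on this from the shell rather than from graphs, a quick way to watch corosync's memory is plain procps/systemd tooling (nothing Proxmox-specific; MemoryCurrent only reports a value when memory accounting is enabled for the unit):

Code:
# resident and virtual memory of the corosync process, in KiB
ps -o pid,rss,vsz,cmd -C corosync

# memory systemd accounts to the corosync unit, in bytes (may show "[not set]")
systemctl show corosync -p MemoryCurrent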
 
Member · Jun 1, 2011 · Austria · fhstp.ac.at
What is my best option now?

Hard to say. We're on PVE 6.0-5 with the same issue. We introduced a really odd workaround where we use cron to:
1) start corosync every 5 minutes
2) restart pve-cluster, then let everything settle and let Proxmox do its "job"
3) stop corosync again
4) start over 4 minutes later

It's a pain, but it works "for now" (see the sketch below). Manual HA, if you will. It keeps our sysadmins on their toes whenever a monitoring message arrives or somebody restarts a container.
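For reference, a minimal sketch of what such a cron-driven cycle could look like as a cron.d entry. The 5-minute schedule and the pve-cluster restart come from the description above; the 240-second sleep (to approximate "start over 4 minutes later") and the file name are assumptions and would need tuning per cluster:

Code:
# /etc/cron.d/corosync-cycle -- rough sketch of the workaround described above
# every 5 minutes: start corosync, restart pve-cluster, let things settle, then stop corosync
*/5 * * * * root systemctl start corosync && systemctl restart pve-cluster && sleep 240 && systemctl stop corosync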
 

Jeff Billimek · Member · Feb 16, 2018
I modified the corosync systemd service (located at /lib/systemd/system/corosync.service) to auto-restart corosync every 12 hours by adding the following to the [Service] section:

Code:
Restart=always
WatchdogSec=43200

Steps followed:

Code:
vim /lib/systemd/system/corosync.service
<add the above to the [Service] section>
save, exit vim
systemctl daemon-reload
service corosync restart
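
One small note on the same approach: /lib/systemd/system/corosync.service belongs to the package, so a direct edit can be lost on the next corosync update. The standard systemd way to keep the same two lines is a drop-in override, for example (this is just the generic systemd mechanism, not something Proxmox-specific):

Code:
systemctl edit corosync
# in the editor that opens, add:
#   [Service]
#   Restart=always
#   WatchdogSec=43200
systemctl daemon-reload
systemctl restart corosync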
 

Ivan Gersi · Active Member · May 29, 2016
I modified the corosync systemd service (located at /lib/systemd/system/corosync.service) to auto-restart corosync every 12 hours by adding the following to the [Service] section: ...

I think this is not enough, because the corosync synchronisation already fails within a few minutes.
 

Ivan Gersi · Active Member · May 29, 2016
I've been fighting this issue for several weeks and hoped it would be fine once I upgraded all nodes to v6... but it seems that is not the right way.
I tried several procedures (e.g. adding ring1 to the config), restarting the cluster and more... but no luck, guys.
The better way was to restart corosync on all nodes, and... now I'm waiting for nodes to go into offline mode, but all nodes in the cluster are still green.
I have PVE 5.4-13 on all nodes (except node 1, which is on 6.0-5) and the cluster is working properly.
I'll try to catch the moment when the cluster goes down; probably Jeff's 12-hour corosync restart is the only workable way.
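If it helps, restarting corosync on all nodes can be done with a simple SSH loop from one node; the node names below are only placeholders borrowed from the config earlier in this thread:

Code:
# restart corosync on each node in turn (replace the names with your own nodes)
for node in kvm7a kvm7b kvm7c kvm7d kvm7e kvm7f; do
    ssh root@"$node" 'systemctl restart corosync'
done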
 

andrew sp · New Member · Jul 31, 2019
I've downgraded corosync to 2.4.4-pve1 and it just works.

Our problem now is the unstable igb module in 5.0.18, which randomly hangs the network on "Intel Corporation I350 Gigabit Network Connection" cards,
but that's a tale for another support post.
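In case someone else goes the same route: to keep apt from pulling corosync 3.x back in after the downgrade, an apt pin (or simply apt-mark hold corosync) should do. A sketch only, assuming the 2.4.x packages are still available in your configured repositories:

Code:
# /etc/apt/preferences.d/corosync2
Package: corosync
Pin: version 2.4.*
Pin-Priority: 1001

apt-cache policy corosync should then show a 2.4.x version as the installation candidate.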
 

Jema · New Member · Jun 3, 2019
We have been experiencing the same issue: the corosync service kept crashing on and off on different nodes of our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces:

Code:
echo net.ipv6.conf.eno4.disable_ipv6 = 1 >> /etc/sysctl.conf
sysctl -p

** Replace 'eno4' with the network interface of your cluster network.
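To double-check that the change actually took effect after sysctl -p, something like this should show the flag at 1 and no IPv6 addresses left on the interface (again, substitute your own cluster NIC for eno4):

Code:
sysctl net.ipv6.conf.eno4.disable_ipv6   # expect: ... = 1
ip -6 addr show dev eno4                 # expect: no output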
 

spirit · Famous Member · Apr 2, 2010 · www.odiso.com
We have been experiencing the same issue: the corosync service kept crashing on and off on different nodes of our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces ...

Interesting...

From the corosync 3 changelog:
https://github.com/corosync/corosync/wiki/Corosync-3.0.0-Release-Notes

Code:
ip_version config setting has new default ipv6-4 (resolve IPv6 address first, if it fails try IPv4). For more info please consult corosync.conf(5) man page. To achieve old behavior (IPv4 only) please set totem.ip_version setting to ipv4.

I don't know if you are in production, but could you try this in corosync.conf:

Code:
totem {
  ip_version: ipv4
}
 

Jema · New Member · Jun 3, 2019
Interesting...

I don't know if you are in production, but could you try this in corosync.conf:

Code:
totem {
  ip_version: ipv4
}

Also tried that, but after restarting corosync, for some reason the setting resets back to ipv6-4 after a short period of time.
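A guess as to why the edit keeps reverting: on Proxmox VE the cluster-wide /etc/pve/corosync.conf is the authoritative copy, and pmxcfs writes it out over the local /etc/corosync/corosync.conf, so changes made only to the local file get overwritten. The documented way to change corosync settings on PVE is roughly as follows (editor choice is arbitrary; whether a restart is needed depends on the setting):

Code:
# edit the cluster-wide copy, not /etc/corosync/corosync.conf
nano /etc/pve/corosync.conf
#   - set "ip_version: ipv4" in the totem section
#   - increment config_version (e.g. 8 -> 9) so the change is propagated

# then restart corosync on the nodes if needed
systemctl restart corosync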
 
Member · Jun 1, 2011 · Austria · fhstp.ac.at
Code:
totem {
  ip_version: ipv4
}

For us (running 6.0-5 with corosync 3.0.2-pve2) this was already the (default?) setting,
and we have the stability issues nevertheless.

But I'll try disabling IPv6 on the interface - let's see if that helps.
I'll have to move the cluster interface to a dedicated NIC then, but that's probably
a good idea anyway.
 

Jema · New Member · Jun 3, 2019
Just to update everyone: the change we made didn't do the trick, corosync is still crashing randomly (it only stayed stable for about 24 hours and then started again). We thought we had solved it, but it doesn't seem to be a real solution.
 

spirit · Famous Member · Apr 2, 2010 · www.odiso.com
Just to update everyone: the change we made didn't do the trick, corosync is still crashing randomly ...
Do you mean that disabling IPv6 on the interface with sysctl didn't fix it?
 

Jema · New Member · Jun 3, 2019
Do you mean that disabling IPv6 on the interface with sysctl didn't fix it?

It is much more stable, but the corosync service still drops on and off on nodes. Less often than before, though.
 