[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Same behavior even with secauth: off, but now corosync dies with a different signal (FPE instead of SEGV):
Code:
Aug 16 13:29:56 dev-proxmox14 kernel: [1235183.092490] traps: corosync[30294] trap divide error ip:7f5ddd3c78c6 sp:7f5dd17bfa50 error:0 in libknet.so.1.2.0[7f5ddd3bc000+13000]
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug 16 13:29:56 dev-proxmox14 systemd[1]: corosync.service: Failed with result 'signal'.
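
For anyone wanting to dig into the crash location, the trap line gives everything needed: the faulting offset inside libknet is the ip minus the mapping base (0x7f5ddd3c78c6 - 0x7f5ddd3bc000 = 0xb8c6), which addr2line can map to a function once debug symbols are installed. The library path below is an assumption; adjust it to wherever libknet.so.1.2.0 lives on your system:
Code:
# offset into libknet.so = faulting ip - mapping base from the trap line
addr2line -f -e /usr/lib/x86_64-linux-gnu/libknet.so.1.2.0 0xb8c6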
 
Adding a node to an existing cluster results in all existing nodes being fenced and restarting.
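
For context, the join itself was presumably the standard pvecm procedure, i.e. run on the joining node and pointed at any existing member (the address here is just picked from the nodelist below as an example):
Code:
pvecm add 10.250.1.2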

Also, this was with secauth: off and token: 10000

/etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm7a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.250.1.2
  }
  node {
    name: kvm7b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.250.1.3
  }
  node {
    name: kvm7c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.250.1.4
  }
  node {
    name: kvm7d
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.250.1.5
  }
  node {
    name: kvm7e
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.250.1.6
  }
  node {
    name: kvm7f
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.250.1.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: kvm7
  config_version: 8
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: off
  token: 10000
  version: 2
}
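
For checking whether the links and quorum actually come up with this config, the standard tools work on each node:
Code:
# knet link status per node/link
corosync-cfgtool -s
# membership and quorum state as PVE sees it
pvecm status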
 
What is my best option now? The cluster only survives for roughly a day, and since it's more or less a production cluster, I'm in a position where I'll have to attempt a harsh downgrade with a lot of overtime.

My initial intention with the update was to escape this LXC bug (https://github.com/lxc/lxd/issues/4468),
and now I've got a much bigger problem with these corosync issues.
 
Having similar issues here, with corosync 3 consuming over 6 GB of memory on two of my Proxmox v6 nodes:

(screenshots: graphs showing corosync memory consumption on the two affected nodes)

...and this is with the `secauth: off` workaround applied.
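
For anyone wanting to track this without screenshots, the resident size of the corosync process is easy to watch directly:
Code:
# RSS/VSZ of the running corosync process, in KiB
ps -C corosync -o pid,rss,vsz,cmd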
 
what is my best option now?

Hard to say. We're on PVE 6.0-5 with the same issue. We've now introduced a really weird workaround where we use cron to:
1) start corosync every 5 minutes
2) restart pve-cluster, then let everything settle and let Proxmox do its "job"
3) stop corosync again
4) start over 4 minutes later.

It's a pain, but it works "for now". Manual HA, if you will. It keeps our sysadmins on their toes when a monitoring message arrives or somebody restarts a container.
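
A sketch of such a cron entry (the schedule and the 45-second settle delay here are examples, not our exact setup):
Code:
# /etc/cron.d/corosync-cycle (hypothetical): every 5 minutes start corosync,
# restart pve-cluster, let things settle, then stop corosync again
*/5 * * * * root systemctl start corosync && systemctl restart pve-cluster && sleep 45 && systemctl stop corosync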
 
I modified the corosync systemd service (located at /lib/systemd/system/corosync.service) to auto-restart corosync every 12 hours by adding the following to the [Service] section:

Code:
Restart=always
WatchdogSec=43200

Steps followed:

Code:
vim /lib/systemd/system/corosync.service
<add the above to the [Service] section>
save, exit vim
systemctl daemon-reload
service corosync restart
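
Note that /lib/systemd/system/corosync.service belongs to the package and may be overwritten on upgrade; a drop-in override survives updates and achieves the same:
Code:
systemctl edit corosync
# in the editor that opens, add:
#   [Service]
#   Restart=always
#   WatchdogSec=43200
# the drop-in lands in /etc/systemd/system/corosync.service.d/ and
# systemd is reloaded automatically; then apply it:
systemctl restart corosync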
 
I think that workaround is not enough, because corosync synchronization still fails for a few minutes around each restart.
 
I've been fighting this issue for several weeks, and I hoped it would be OK once I upgraded all nodes to v6... but it seems that is not the right way.
I tried several procedures (e.g. adding ring1 to the config), restarting the cluster, and more... but no luck, guys.
The better way was to restart corosync on all nodes, and... now I'm waiting for nodes to drop offline again, but so far all nodes in the cluster are still green.
I have PVE 5.4-13 on all nodes (except node 1, which is on 6.0-5) and the cluster is working properly.
I'll try to arrange a time when the cluster can be offline; the 12h corosync restart from Jeff is probably the only right way.
 
I've downgraded corosync to 2.4.4-pve1 and it just works.

Our problem now is an unstable igb module in the 5.0.18 kernel, which randomly hangs the network on "Intel Corporation I350 Gigabit Network Connection" cards,
but that's a tale for another support post.
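
For anyone wanting to do the same, a sketch of the downgrade, assuming the 2.4.4-pve1 packages are still available in your configured repos (the exact library package list is an assumption and may differ on your system):
Code:
apt install corosync=2.4.4-pve1 libcorosync-common4=2.4.4-pve1 \
  libcmap4=2.4.4-pve1 libcpg4=2.4.4-pve1 \
  libquorum5=2.4.4-pve1 libvotequorum8=2.4.4-pve1
# keep apt from pulling corosync 3.x back in on the next upgrade
apt-mark hold corosync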
 
We have been experiencing the same issue; the corosync service kept crashing on and off on different nodes of our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces:

Code:
echo 'net.ipv6.conf.eno4.disable_ipv6 = 1' >> /etc/sysctl.conf
sysctl -p

** Replace 'eno4' with the network interface of the cluster network
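
A variant that keeps /etc/sysctl.conf untouched is a drop-in under /etc/sysctl.d/ (the file name is arbitrary); either way, verify the flag afterwards:
Code:
echo 'net.ipv6.conf.eno4.disable_ipv6 = 1' > /etc/sysctl.d/90-cluster-no-ipv6.conf
sysctl --system
# should now print 1:
sysctl net.ipv6.conf.eno4.disable_ipv6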
 
Interesting...

From the corosync 3.0.0 release notes:
https://github.com/corosync/corosync/wiki/Corosync-3.0.0-Release-Notes

Code:
ip_version config setting has new default ipv6-4 (resolve IPv6 address first, if it fails try IPv4). For more info please consult corosync.conf(5) man page. To achieve old behavior (IPv4 only) please set totem.ip_version setting to ipv4.

I don't know if you are in production, but you could try this in corosync.conf:

Code:
totem {
  ip_version: ipv4
}
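
Whether the running corosync actually picked the value up can be checked in its runtime object database (if the key is absent, the compiled-in default applies):
Code:
corosync-cmapctl -g totem.ip_version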
 
I also tried that, but after restarting corosync, for some reason it reverts to the ipv6-4 setting again after a short while.
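
A likely explanation for the "reset": on PVE, the local /etc/corosync/corosync.conf is managed and gets overwritten from the cluster-wide copy on pmxcfs, so the edit has to go there, with config_version bumped so all nodes accept it:
Code:
# edit the cluster-wide config, not /etc/corosync/corosync.conf
nano /etc/pve/corosync.conf
# set ip_version: ipv4 inside totem { } and increment config_version;
# the change then propagates to /etc/corosync/corosync.conf on all nodes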
 
For us (running 6.0-5 with corosync 3.0.2-pve2), ip_version: ipv4 was already the (default?) setting, and we have the stability issues nevertheless.

But I'll try disabling IPv6 on the interface and see if that helps. I'll have to move the cluster interface to a dedicated NIC then, but that's probably a good idea anyway.
 
Just to update everyone: the change we made didn't do the trick; corosync is still crashing randomly (it just stayed stable for about 24h and then started again). We thought we had solved it, but it doesn't seem to be a solution after all.
 
Do you mean that disabling IPv6 on the interface with sysctl didn't fix it?
 
Same here. Disabled IPv6 on the interfaces - still the same random crashes. Today I upgraded to libknet 1.10-pve2 from the enterprise repo; let's see if that changes anything.
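
For comparing notes: the installed and available libknet builds can be checked with a plain apt query (libknet1 is the binary package name here):
Code:
apt policy libknet1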
 