corosync crashes when the network is unstable

Alibek

Well-Known Member
Jan 13, 2017
Hi all!

I ran into the following situation:
When a link is unstable (for example, a 10 Gbps network card attached to a switch with RJ45 connectors that have no gold plating, so the contacts on the card and on the connector can oxidize), the switch may bring the link up and down repeatedly. After that, corosync goes "crazy": it starts loading one CPU core at 30-100%, and after a few hours it crashes. The crashes hit nodes arbitrarily, for example 5 of 8 cluster servers.

I fix that with a simple script:
Code:
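#!/usr/bin/env bash
# Force-kill the stuck corosync process (SIGKILL).
killall -9 corosync
sleep 2
# Stop the HA local resource manager, then the HA cluster resource manager.
systemctl stop pve-ha-lrm.service
sleep 2
systemctl stop pve-ha-crm.service
sleep 2
# Restart the Proxmox API daemon.
systemctl restart pvedaemon.service
sleep 2
# Bring the local resource manager back up.
systemctl start pve-ha-lrm.service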

But the Corosync code probably needs a more thorough check.
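
For anyone who wants to run a recovery script like the one above only when corosync actually looks stuck, here is a minimal sketch of a CPU check. The function names and the CPU_LIMIT variable are assumptions made for illustration and are not part of Proxmox or corosync; the default of 30 is simply the lower bound of the 30-100% range reported above. Note that ps reports CPU time averaged over the lifetime of the process, so this is a rough indicator rather than an instantaneous reading.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and CPU_LIMIT are hypothetical, not from the thread.

CPU_LIMIT="${CPU_LIMIT:-30}"   # percent of one core (assumed threshold)

# cpu_over_limit PERCENT LIMIT
# Succeeds (exit status 0) if PERCENT is strictly greater than LIMIT.
# awk is used because the value reported by ps can be fractional.
cpu_over_limit() {
    awk -v cpu="$1" -v limit="$2" 'BEGIN { exit !(cpu + 0 > limit + 0) }'
}

# corosync_cpu
# Prints the %CPU that ps reports for corosync, or 0 if it is not running.
# Caveat: procps ps computes CPU time / process lifetime (an average).
corosync_cpu() {
    local cpu
    cpu="$(ps -C corosync -o %cpu= 2>/dev/null | head -n 1 | tr -d ' ')"
    echo "${cpu:-0}"
}

# corosync_needs_recovery
# Succeeds if the reported corosync CPU usage is above CPU_LIMIT.
corosync_needs_recovery() {
    cpu_over_limit "$(corosync_cpu)" "$CPU_LIMIT"
}
```

A cron job or systemd timer could then source these functions and run the recovery steps only when corosync_needs_recovery succeeds. How often to check and which threshold to use depend on the cluster.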

Code:
# pveversion --verbose
proxmox-ve: 5.2-2 (running kernel: 4.15.18-4-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-7
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-27
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
Hi,
"But probably it is necessary to check more thoroughly the code of Corosync."
Corosync is not designed for unreliable networks.
Corosync is a real-time messaging service, and you can't make a real-time application latency-tolerant, because that is the opposite of what a real-time application is.
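
Since Corosync expects a stable network, it may help to confirm first that the link really is flapping. The sketch below counts "link down" events in kernel log text read from stdin. The function name is hypothetical, and the pattern assumes the NIC driver logs a line containing "Link is Down"; many drivers do, but the exact wording varies, so adjust it to match your own logs.

```shell
#!/usr/bin/env bash
# Sketch only: count_link_down is a hypothetical helper, and the
# "Link is Down" wording is an assumption that depends on the NIC driver.

# count_link_down
# Reads log text on stdin and prints the number of lines that mention
# a link-down event (case-insensitive match).
count_link_down() {
    # grep -c prints the number of matching lines. It exits with status 1
    # when there are no matches, so "|| true" keeps the function's status 0.
    grep -ci 'link is down' || true
}
```

For example, `journalctl -k --since "1 hour ago" | count_link_down` or `dmesg | count_link_down` would print the number of matching lines. A count that keeps growing points at the cabling or connectors rather than at Corosync.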
 
