[SOLVED] Corosync not synchronized with other nodes

Petr.114

Hello, I am facing an issue with the cluster/corosync.

If one node in the cluster is unavailable for around 30 minutes, because of a lost internet connection, hardware maintenance or something else, the node won't rejoin the cluster after starting/reconnecting.
The solution I found is to restart corosync on all nodes, but that is not a reliable solution, because someone has to do it manually.
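For reference, the manual workaround looks roughly like this (just a sketch; run as root on every node, and restarting pve-cluster as well may or may not be needed in your case):

systemctl restart corosync
systemctl restart pve-cluster   # optional: also restarts the cluster filesystem
pvecm status                    # check whether the node rejoined and the cluster is quorate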

There is also an issue with the corosync config located in /etc/pve/corosync.conf: I am not able to open this one, only the one located in /etc/corosync/corosync.conf.
I am getting this error when opening it in nano: [ Error reading lock file /etc/pve/.corosync.conf.swp: Not enough data read ]

I would be glad for any ideas.
Thank you

Here is some information about my pveversion output, corosync configuration and logs.
root@havirov-prox1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.4.22-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-7
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.4.22-1-pve: 5.4.22-1
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: backup
    nodeid: 5
    quorum_votes: 1
    ring0_addr: backup
  }
  node {
    name: havirov-prox1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: havirov-prox1
  }
  node {
    name: prox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: prox1
  }
  node {
    name: prox1-brno
    nodeid: 9
    quorum_votes: 1
    ring0_addr: prox1-brno
  }
  node {
    name: prox2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox2
  }
  node {
    name: prox2-brno
    nodeid: 7
    quorum_votes: 1
    ring0_addr: prox2-brno
  }
  node {
    name: prox3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: prox3
  }
  node {
    name: prox4
    nodeid: 6
    quorum_votes: 1
    ring0_addr: prox4
  }
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cutter-pv
  config_version: 42
  interface {
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}
 

Attachments

  • corosync syslog.txt (8.1 KB)
Don't use SCTP, use IP addresses instead of names for your ring0_addr, don't set a token value unless you have a very good reason, and it's not ringnumber, it's linknumber.
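For illustration, those changes would make the relevant parts of the configuration look roughly like this (just a sketch; the IP addresses below are placeholders, and config_version must be incremented whenever the file is changed):

totem {
  cluster_name: cutter-pv
  config_version: 43
  interface {
    linknumber: 0              # 'linknumber' instead of the old 'ringnumber'
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

nodelist {
  node {
    name: prox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.12     # placeholder IP instead of the hostname
  }
  # ... remaining nodes analogous
}

Note that the token and knet_transport lines are dropped here, as advised above.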
 
Hello, we had long-running issues with cluster stability; the cluster was falling apart a few times a week.

We found this thread on the forum, post #5:
https://forum.proxmox.com/threads/a...-after-upgrade-5-4-to-6-0-4.56425/post-260570
So we set knet_transport to sctp and token to 10000.

We made this change just a few weeks ago, and since then the cluster has been stable.

I will change the ring0_addr entries to IPs and correct ringnumber to linknumber, but I am not sure whether I want to delete the knet_transport and token lines.
Maybe I will give deleting knet_transport a shot, but I think I should at least keep token, because some servers have high ping and I think the cluster failures were happening because of token timeouts.
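If only the timeout turns out to be needed, keeping just that one setting would look roughly like this (a sketch; the 10000 ms value is simply the one used earlier in this thread, not a recommendation):

totem {
  # ... rest of the totem section unchanged ...
  token: 10000   # token timeout in milliseconds
}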

I will let you know after some testing.
 
So, I changed the names to IP addresses and ringnumber to linknumber. Everything now works perfectly; a node reconnects to the cluster in a moment.

Opening /etc/pve/corosync.conf still shows the error message:
[ Error reading lock file /etc/pve/.corosync.conf.swp: Not enough data read ]
Is that a problem? I don't see any errors and everything seems to work fine.

Thank you very much, I really appreciate your help.
 
I guess you didn't edit the corosync.conf file as suggested in our admin guide: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration

A short explanation: there are two corosync.conf files:
/etc/corosync/corosync.conf
/etc/pve/corosync.conf

The one under /etc/pve is managed by our cluster filesystem and therefore synced between all nodes; the other one is the actual node-specific configuration file. Only change the one under /etc/pve unless you really know what you are doing. If the cluster isn't quorate, you may need to copy the file manually to all nodes.
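A rough sketch of the edit procedure described in the admin guide (the temporary file names are just examples):

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new                        # make the changes and increment config_version
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak   # keep a backup of the current config
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf   # moving it into place applies it cluster-wide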
 
Thank you, now I understand how the two corosync.conf files relate and how to edit them. I appreciate it.
 
Hello, unfortunately we are facing the problem again.
This morning the node started misbehaving and eventually restarted; after that, it did not reconnect to the cluster, even after manually restarting corosync on all nodes.
The problem started at 07:28 and continued until the node restarted at 08:35.
I will be glad for any ideas.
Thank you

proxmox-ve: 6.1-2 (running kernel: 5.4.24-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-8
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Attachments

  • syslog.txt (720.1 KB)
The libknet1 1.16-pve1 update fixed the problem with nodes not connecting to the cluster after a restart/network failure.

The problem with nodes restarting still persists.
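For anyone hitting the same thing, the update is roughly the following (assuming the fixed package is available in your configured Proxmox repository):

apt update
apt install libknet1        # or a full "apt dist-upgrade"
dpkg -l libknet1            # verify the installed version (1.16-pve1 here)
systemctl restart corosync  # restart corosync so it uses the updated library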
 
