[SOLVED] Corosync not synchronized with other nodes

Petr.114

Hello, I am facing an issue with the cluster/corosync.

If one node in the cluster is unavailable for around 30 minutes, because of a lost internet connection, hardware maintenance or something else, the node won't rejoin the cluster after starting/reconnecting.
The solution I found is to restart corosync on all nodes, but that is not a reliable solution, because someone has to do it manually.
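For reference, the manual workaround looks roughly like this (just a sketch; run as root on every node, and restarting pve-cluster as well may or may not be needed in your case):

systemctl restart corosync
systemctl restart pve-cluster   # optional: also restarts the cluster filesystem
pvecm status                    # check whether the node rejoined and the cluster is quorate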

There is also an issue with the corosync config located in /etc/pve/corosync.conf: I am not able to open this one, only the one located in /etc/corosync/corosync.conf.
I am getting this error when opening it in nano: [ Error reading lock file /etc/pve/.corosync.conf.swp: Not enough data read ]

I would be glad for any ideas.
Thank you

Here is some information about my pveversion output, corosync configuration and logs.
root@havirov-prox1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.4.22-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-7
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.4.22-1-pve: 5.4.22-1
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: backup
    nodeid: 5
    quorum_votes: 1
    ring0_addr: backup
  }
  node {
    name: havirov-prox1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: havirov-prox1
  }
  node {
    name: prox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: prox1
  }
  node {
    name: prox1-brno
    nodeid: 9
    quorum_votes: 1
    ring0_addr: prox1-brno
  }
  node {
    name: prox2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox2
  }
  node {
    name: prox2-brno
    nodeid: 7
    quorum_votes: 1
    ring0_addr: prox2-brno
  }
  node {
    name: prox3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: prox3
  }
  node {
    name: prox4
    nodeid: 6
    quorum_votes: 1
    ring0_addr: prox4
  }
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cutter-pv
  config_version: 42
  interface {
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}
 

Attachments

  • corosync syslog.txt (8.1 KB)
Don't use SCTP, use IP addresses instead of names for your ring0_addr, don't set a token value unless you have a very good reason, and it's not ringnumber, it's linknumber.
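For illustration, those changes would make the relevant parts of the configuration look roughly like this (just a sketch; the IP addresses below are placeholders, and config_version must be incremented whenever the file is changed):

totem {
  cluster_name: cutter-pv
  config_version: 43
  interface {
    linknumber: 0              # 'linknumber' instead of the old 'ringnumber'
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

nodelist {
  node {
    name: prox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.12     # placeholder IP instead of the hostname
  }
  # ... remaining nodes analogous
}

Note that the token and knet_transport lines are dropped here, as advised above.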
 
Hello, we had long-running issues with cluster stability; the cluster was falling apart a few times a week.

We found this thread on the forum, post #5:
https://forum.proxmox.com/threads/a...-after-upgrade-5-4-to-6-0-4.56425/post-260570
So we set knet_transport to sctp and token to 10000.

We made this change just a few weeks ago, and since then the cluster has been stable.

I will change the ring0_addr entries to IPs and correct ringnumber to linknumber, but I am not sure whether I want to delete the knet_transport and token lines.
Maybe I will give deleting knet_transport a shot, but I think I should at least keep token, because some servers have high ping and I think the cluster failures were happening because of token timeouts.
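If only the timeout turns out to be needed, keeping just that one setting would look roughly like this (a sketch; the 10000 ms value is simply the one used earlier in this thread, not a recommendation):

totem {
  # ... rest of the totem section unchanged ...
  token: 10000   # token timeout in milliseconds
}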

I will let you know after some testing.
 
So, I changed the names to IP addresses and ringnumber to linknumber. Everything now works perfectly; a node reconnects to the cluster in a moment.

Opening /etc/pve/corosync.conf still shows the error message:
[ Error reading lock file /etc/pve/.corosync.conf.swp: Not enough data read ]
Is that a problem? I don't see any errors and everything seems to work fine.

Thank you very much, I really appreciate your help.
 
I guess you didn't edit the corosync.conf file as suggested in our admin guide: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration

A short explanation: there are two corosync.conf files:
/etc/corosync/corosync.conf
/etc/pve/corosync.conf

The one under /etc/pve is managed by our cluster filesystem and therefore synced between all nodes; the other one is the actual node-specific configuration file. Only change the one under /etc/pve unless you really know what you are doing. If the cluster isn't quorate, you may need to copy the file manually to all nodes.
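A rough sketch of the edit procedure described in the admin guide (the temporary file names are just examples):

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new                        # make the changes and increment config_version
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak   # keep a backup of the current config
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf   # moving it into place applies it cluster-wide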
 
Thank you, now I understand how the two corosync.conf files relate and how to edit them. I appreciate it.
 
Hello, unfortunately we are facing the problem again.
This morning the node started misbehaving and eventually restarted; after that, it did not reconnect to the cluster, even after manually restarting corosync on all nodes.
The problem started at 07:28 and continued until the node restarted at 08:35.
I will be glad for any ideas.
Thank you

proxmox-ve: 6.1-2 (running kernel: 5.4.24-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-8
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Attachments

  • syslog.txt (720.1 KB)
The libknet1 1.16-pve1 update fixed the problem with nodes not connecting to the cluster after a restart/network failure.

The problem with nodes restarting still persists.
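For anyone hitting the same thing, the update is roughly the following (assuming the fixed package is available in your configured Proxmox repository):

apt update
apt install libknet1        # or a full "apt dist-upgrade"
dpkg -l libknet1            # verify the installed version (1.16-pve1 here)
systemctl restart corosync  # restart corosync so it uses the updated library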
 
