Corosync error messages since reboot

Hello,

Today we had a problem after adding a new node to a 7-node Proxmox cluster. After adding the node, the whole cluster rebooted.
Since this reboot we have constantly (every second) been getting the following message on two of our 8 nodes:

Code:
corosync[2710]: [TOTEM ] Retransmit List: 1c23c

We checked multicast with the help of omping, but everything seems fine; no packet loss.
The nodes also have enough headroom and aren't overloaded.
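The check was roughly the usual omping test (node1/node2/node3 are placeholders for our actual hostnames, run on all nodes in parallel):
Code:
# roughly the test we ran; hostnames are placeholders
omping -c 10000 -i 0.001 -F -q node1 node2 node3
# longer run over ~10 minutes
omping -c 600 -i 1 -q node1 node2 node3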

Do you have any idea how to fix this problem, and maybe why the cluster rebooted after adding the new node?
I can provide the whole syslog via PM.

Regards,
Alex
 
Hi,

are the nodes all on the same Proxmox VE version?

If yes, can you post the exact version you are using?
Code:
pveversion -v

maybe why the cluster rebooted after adding the new node?
I guess you lost quorum during the join process.
The reboot only happens on quorum loss if you have HA enabled on at least one guest.
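You can check whether HA is currently enabled for any guest with, for example:
Code:
# lists HA-managed resources and their current state
ha-manager status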
 
I will check the version! Thank you for this hint.
(Checked it, the new node is on 6.2 and the old nodes are on 6.1)

We had 7 nodes in the cluster and now have 8. Can you imagine why we would lose quorum when adding a new node?

Regards,
Alex
 
(Checked it, the new node is on 6.2 and the old nodes are on 6.1)
This should be fine, but I would recommend upgrading all nodes to 6.2.
There were some bugs in corosync, and to rule them out as the cause, please perform an upgrade.
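As a rough sketch, the upgrade itself is the usual apt-based procedure, run node by node:
Code:
# on each node, with the appropriate package repository configured
apt update
apt full-upgrade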

Proxmox VE 6 does not use multicast anymore; UDP unicast is used now.
Are the nodes all on the same switch?
Can you send the corosync config so we can check it for errors?

Code:
cat /etc/pve/corosync.conf
 
Hello, the nodes aren't on the same switch; they are split across two DCs with multiple fiber links in between.

We will do the upgrade as soon as possible. As there were no major corosync changes between 6.1 and 6.2 (checked in the changelogs), we installed the new node on the newest version. I can send you the log output of the cluster join via private message if that helps.

Corosync config:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: xx-prox1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 172.19.1.130
  }
  node {
    name: xx-prox15
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.19.1.144
  }
  node {
    name: xx-prox32
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.1.161
  }
  node {
    name: xx-prox9
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.19.1.138
  }
  node {
    name: xx-prox23
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.19.1.24
  }
  node {
    name: xx-prox31
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.19.1.32
  }
  node {
    name: xx-prox32
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.19.1.33
  }

  node {
    name: xx-prox-quorum
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.19.1.254
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xx-prox
  config_version: 18
  interface {
    bindnetaddr: 172.19.1.161
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
The config looks good.

I guess it is a latency problem.
I would recommend monitoring your network and checking whether there are many latency spikes.
Corosync is very latency sensitive.
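As a simple sketch (using one of the ring0 addresses from your corosync.conf as an example target), a longer-running ping between the nodes already shows whether the max/mdev values spike:
Code:
# run between cluster nodes for a while and watch the max/mdev values
ping -c 600 -i 1 172.19.1.24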
 
In this screenshot you can see a behavior for which we could not find a reason so far. It looks like the IP addresses of the nodes change when a new member is added to the cluster.

[Screenshot: image (1).jpg]
 
You still haven't checked/provided the output of:

> pveversion -v

Adding a node to the cluster does not change the IP configuration of a node; please check again what you are seeing.
 
Hi Tom, I see the following output in the log:
[Screenshot: 1593094221397.png]

Here is the output of pveversion -v:

Proxmox 6.1 nodes (all except the new one)
Code:
root@xx-xx1-prox31:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-5
pve-kernel-helper: 6.1-5
pve-kernel-4.15: 5.4-14
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-6
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Proxmox 6.2 (the new node that was added to the cluster)
Code:
root@xx-xx2-prox1:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
config_version: 18

You have quite a high config version for 8 nodes; what manual changes were done here?

I see the following output in the log:

IIRC, this can be ignored; it's more a logging issue than anything else. If you want to get rid of it, you should be able to do so by ordering the node entries in corosync.conf by their ID.
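As a sketch based on the config you posted, the nodelist would then start like this (edit /etc/pve/corosync.conf and remember to increase config_version when changing it):
Code:
nodelist {
  node {
    name: xx-prox32
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.1.161
  }
  node {
    name: xx-prox23
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.19.1.24
  }
  # ... remaining entries follow in ascending nodeid order (3 through 8) ...
}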

libknet1: 1.14-pve1

It'd be good to get that to match the newer node (1.15-pve1 at the time of writing); it's responsible for the underlying cluster communication transport stack, and there are a bunch of fixes included in the newer version.
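After the upgrade you can quickly verify that all nodes match, for example:
Code:
# should report the same libknet1 version on every node
pveversion -v | grep libknet1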

after adding a new node to a 7 node proxmox cluster

How was this node added? Webinterface, CLI, manually?

What's the current status of the cluster? Did you perform any actions (reboots, etc.)?
Is it functional (pvecm status), and do you "just" get those log messages?
 
You have quite a high config version for 8 nodes; what manual changes were done here?

No manual changes; we only added and removed nodes from the cluster.

It'd be good to get that to match the newer node (1.15-pve1 at the time of writing); it's responsible for the underlying cluster communication transport stack, and there are a bunch of fixes included in the newer version.

I will do an upgrade asap.

How was this node added? Webinterface, CLI, manually?

The node was added via CLI.

What's the current status of the cluster? Did you perform any actions (reboots, etc.)?
While adding the node to the cluster we didn't perform any reboot manually, but due to an unidentified problem the cluster rebooted by itself.

Code:
root@xx-prox32:~# pvecm status
Cluster information
-------------------
Name:             xx-xxx-prox
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Jul  3 10:32:16 2020
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000004
Ring ID:          1.2907
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   8
Highest expected: 8
Total votes:      8
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.19.1.161
0x00000002          1 172.19.1.24
0x00000003          1 172.19.1.254
0x00000004          1 172.19.1.33 (local)
0x00000005          1 172.19.1.32
0x00000006          1 172.19.1.138
0x00000007          1 172.19.1.144
0x00000008          1 172.19.1.130
 
