Corosync error messages since reboot

Hello,

Today we had a problem after adding a new node to a 7-node Proxmox cluster. After adding the node, the whole cluster rebooted.
Since this reboot we have constantly (every second) been getting the following message on two of our 8 nodes:

Code:
corosync[2710]: [TOTEM ] Retransmit List: 1c23c

We checked multicast with the help of omping, but everything seems fine; no packet loss.
The nodes also have enough headroom and aren't overloaded.
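The check was roughly the usual omping test (node1/node2/node3 are placeholders for our actual hostnames, run on all nodes in parallel):
Code:
# roughly the test we ran; hostnames are placeholders
omping -c 10000 -i 0.001 -F -q node1 node2 node3
# longer run over ~10 minutes
omping -c 600 -i 1 -q node1 node2 node3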

Do you have any idea how to fix this problem, and maybe why the cluster rebooted after adding the new node?
I can provide the whole syslog via PM.

Regards,
Alex
 
Hi,

are the nodes all on the same Proxmox VE version?

If yes, can you post the exact version you are using?
Code:
pveversion -v

maybe why the cluster rebooted after adding the new node?
I guess you lost quorum during the join process.
The reboot only happens on quorum loss if you have HA enabled on at least one guest.
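You can check whether HA is currently enabled for any guest with, for example:
Code:
# lists HA-managed resources and their current state
ha-manager status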
 
I will check the version! Thank you for this hint.
(Checked it, the new node is on 6.2 and the old nodes are on 6.1)

We had 7 nodes in the cluster and now have 8. Can you imagine why we would lose quorum when adding a new node?

Regards,
Alex
 
(Checked it, the new node is on 6.2 and the old nodes are on 6.1)
This should be fine, but I would recommend upgrading all nodes to 6.2.
There were some bugs in corosync, and to rule them out as the cause, please perform an upgrade.
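As a rough sketch, the upgrade itself is the usual apt-based procedure, run node by node:
Code:
# on each node, with the appropriate package repository configured
apt update
apt full-upgrade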

Proxmox VE 6 does not use multicast anymore; UDP unicast is used now.
Are the nodes all on the same switch?
Can you send the corosync config so we can check it for errors?

Code:
cat /etc/pve/corosync.conf
 
Hello, the nodes aren't on the same switch; they are split across two DCs with multiple fiber links in between.

We will do the upgrade as soon as possible. As there were no major corosync changes between 6.1 and 6.2 (checked in the changelogs), we installed the new node on the newest version. I can send you the log output of the cluster join via private message if that helps.

Corosync config:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: xx-prox1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 172.19.1.130
  }
  node {
    name: xx-prox15
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 172.19.1.144
  }
  node {
    name: xx-prox32
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.1.161
  }
  node {
    name: xx-prox9
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 172.19.1.138
  }
  node {
    name: xx-prox23
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.19.1.24
  }
  node {
    name: xx-prox31
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 172.19.1.32
  }
  node {
    name: xx-prox32
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.19.1.33
  }

  node {
    name: xx-prox-quorum
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.19.1.254
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xx-prox
  config_version: 18
  interface {
    bindnetaddr: 172.19.1.161
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
The config looks good.

I guess it is a latency problem.
I would recommend monitoring your network and checking whether there are many latency spikes.
Corosync is very latency sensitive.
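As a simple sketch (using one of the ring0 addresses from your corosync.conf as an example target), a longer-running ping between the nodes already shows whether the max/mdev values spike:
Code:
# run between cluster nodes for a while and watch the max/mdev values
ping -c 600 -i 1 172.19.1.24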
 
In this screenshot you can see a behavior for which we could not find a reason so far. It looks like the IP addresses of the nodes change when a new member is added to the cluster.

[Screenshot: image (1).jpg]
 
You still haven't checked/provided the output of:

> pveversion -v

Adding a node to the cluster does not change the IP configuration of a node; please check again what you are seeing.
 
Hi Tom, I see the following output in the log:
[Screenshot: 1593094221397.png]

Here is the output of pveversion -v:

Proxmox 6.1 nodes (all except the new one)
Code:
root@xx-xx1-prox31:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-5
pve-kernel-helper: 6.1-5
pve-kernel-4.15: 5.4-14
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-6
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Proxmox 6.2 (the new node that was added to the cluster)
Code:
root@xx-xx2-prox1:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
config_version: 18

You have quite a high config version for 8 nodes; what manual changes were done here?

I see the following output in the log:

IIRC, this can be ignored; it's more a logging issue than anything else. If you want to get rid of it, you should be able to do so by ordering the node entries in corosync.conf by their ID.
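As a sketch based on the config you posted, the nodelist would then start like this (edit /etc/pve/corosync.conf and remember to increase config_version when changing it):
Code:
nodelist {
  node {
    name: xx-prox32
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.1.161
  }
  node {
    name: xx-prox23
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.19.1.24
  }
  # ... remaining entries follow in ascending nodeid order (3 through 8) ...
}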

libknet1: 1.14-pve1

It'd be good to get that to match the newer node (1.15-pve1 at the time of writing); it's responsible for the underlying cluster communication transport stack, and there are a bunch of fixes included in the newer version.
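After the upgrade you can quickly verify that all nodes match, for example:
Code:
# should report the same libknet1 version on every node
pveversion -v | grep libknet1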

after adding a new node to a 7 node proxmox cluster

How was this node added? Webinterface, CLI, manually?

What's the current status of the cluster? Did you perform any actions (reboots, etc.)?
Is it functional (pvecm status), and do you "just" get those log messages?
 
You have quite a high config version for 8 nodes; what manual changes were done here?

No manual changes; we only added and removed nodes from the cluster.

It'd be good to get that to match the newer node (1.15-pve1 at the time of writing); it's responsible for the underlying cluster communication transport stack, and there are a bunch of fixes included in the newer version.

I will do an upgrade asap.

How was this node added? Webinterface, CLI, manually?

The node was added via CLI.

What's the current status of the cluster? Did you perform any actions (reboots, etc.)?
While adding the node to the cluster we didn't perform any reboot manually, but due to an unidentified problem the cluster rebooted by itself.

Code:
root@xx-prox32:~# pvecm status
Cluster information
-------------------
Name:             xx-xxx-prox
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Jul  3 10:32:16 2020
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000004
Ring ID:          1.2907
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   8
Highest expected: 8
Total votes:      8
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.19.1.161
0x00000002          1 172.19.1.24
0x00000003          1 172.19.1.254
0x00000004          1 172.19.1.33 (local)
0x00000005          1 172.19.1.32
0x00000006          1 172.19.1.138
0x00000007          1 172.19.1.144
0x00000008          1 172.19.1.130
 
