[SOLVED] Proxmox 6: corosync 3 problem caused by unfinished bonding configuration

Han Boetes

Active Member
Jun 21, 2018
On 3 of our 7 cluster members I had /etc/hosts entries like this:

Code:
10.10.10.100 host100 # normal VLAN, this host
10.10.60.100 host100 # corosync VLAN, this host
10.10.60.101 host101 # corosync VLAN, another host

etc., etc.

After removing the normal VLAN line from the hosts file and restarting corosync on all hosts, the weird problems we had with corosync disappeared. This was not a (noticeable) problem with Proxmox 5, but it is with Proxmox 6. Maybe this gotcha helps somebody else.
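For reference, this is roughly what the cleaned-up hosts file looks like afterwards; the node name now resolves only to its corosync-VLAN address (hostnames and addresses follow the abstraction above):

Code:
# duplicate 10.10.10.100 line removed, so host100 resolves unambiguously
10.10.60.100 host100 # corosync VLAN, this host
10.10.60.101 host101 # corosync VLAN, another host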
 
hmm - could you post your corosync.conf?

In any case - thanks for sharing a solution to your problem!
 
Well, my joy is short-lived. It's running into problems again. Back to the drawing board. Do you spot anything out of place in the corosync.conf?

Code:
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: batman
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.60.9
  }
  node {
    name: gaston
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.60.45
  }
  node {
    name: habocp1
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.60.230
  }
  node {
    name: laureline
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.10.60.17
  }
  node {
    name: ravian
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.60.14
  }
  node {
    name: redbaron
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.60.47
  }
  node {
    name: rehicp1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.60.70
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: haxis
  config_version: 20
  interface {
    bindnetaddr: 10.10.60.230
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}


 
Looks OK from a quick glance (I guess the mismatch between host100 and the node names in the corosync.conf comes from an abstraction in the first post?)

What kind of troubles are you experiencing?
(`journalctl` output for a timeframe where they occur would be helpful)
What's the output of `pveversion -v` ?
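For a concrete timeframe, something along these lines works (the timestamps here are just placeholders):

Code:
journalctl -u corosync --since "2019-09-18 10:00" --until "2019-09-18 10:30"
journalctl -u pve-cluster --since "2019-09-18 10:00" --until "2019-09-18 10:30"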
 
Here is the `pveversion -v` output; I will post the journalctl output when it happens.
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
 
Here are some typical logs from when a single host is having a problem. After I restart all corosync processes it's usually quiet for, say, half an hour, and then this begins. The missing host in this case crashed during the night, and it's our main production server, so I had to get up and reboot it in the middle of the night.
Code:
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:51 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523900) was formed. Members
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:51 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523904) was formed. Members
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:51 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523908) was formed. Members
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:51 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523912) was formed. Members
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:51 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523916) was formed. Members
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:51 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:51 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:52 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523920) was formed. Members
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:52 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Sep 18 10:12:52 rehicp1 corosync[3214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 18 10:12:52 rehicp1 corosync[3214]:   [TOTEM ] Process pause detected for 2171 ms, flushing membership messages.

about 30 similar messages snipped

Sep 18 10:12:52 rehicp1 corosync[3214]:   [TOTEM ] Process pause detected for 2871 ms, flushing membership messages.
Sep 18 10:12:52 rehicp1 corosync[3214]:   [TOTEM ] Process pause detected for 2872 ms, flushing membership messages.
Sep 18 10:12:53 rehicp1 corosync[3214]:   [TOTEM ] A new membership (1:523924) was formed. Members
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [CPG   ] downlist left_list: 0 received
Sep 18 10:12:53 rehicp1 corosync[3214]:   [QUORUM] Members[6]: 1 2 3 5 6 7
 
I just had to restart the main production server's corosync again.
Code:
Sep 18 10:15:20 redbaron corosync[27598]:   [KNET  ] host: host: 7 has no active links
Sep 18 10:15:20 redbaron corosync[27598]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 18 10:15:20 redbaron corosync[27598]:   [KNET  ] host: host: 5 has no active links
Sep 18 10:15:20 redbaron corosync[27598]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Sep 18 10:15:20 redbaron corosync[27598]:   [KNET  ] host: host: 6 has no active links
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] link: host: 2 link: 0 is down
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] link: host: 1 link: 0 is down
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] host: host: 2 has no active links
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 18 10:15:21 redbaron corosync[27598]:   [KNET  ] host: host: 1 has no active links
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] rx: host: 3 link: 0 is up
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 18 10:15:24 redbaron corosync[27598]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:15:26 redbaron systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Sep 18 10:15:26 redbaron systemd[1]: corosync.service: Failed with result 'signal'.
^C
root@redbaron ~ # systemctl status corosync
root@redbaron ~ # systemctl start corosync
root@redbaron ~ # journalctl -f -u corosync
-- Logs begin at Wed 2019-09-18 00:27:54 CEST. --
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 3 has no active links
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 3 has no active links
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 has no active links
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 has no active links
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:28:40 redbaron corosync[25878]:   [KNET  ] host: host: 2 has no active links
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 3 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 7 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 6 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 5 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Sep 18 10:28:42 redbaron corosync[25878]:   [KNET  ] pmtud: PMTUD link change for host: 6 link:
 
The main server redbaron has 10G NICs, and there was some noticeable packet loss going on. It seems to be a networking issue.
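One way to make such loss visible is to check the error/drop counters on the bond and its slaves and to flood-ping a peer over the corosync VLAN; a rough sketch (bond0 and eno1 are example interface names, not necessarily mine):

Code:
# per-interface error and drop counters
ip -s link show bond0
ethtool -S eno1 | grep -iE 'err|drop'
# flood ping a peer on the corosync VLAN (needs root); prints % loss at the end
ping -f -c 10000 10.10.60.230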
 
Figured it out: I had set up channel bonding, which always appeared to work, but in reality it caused a few percent of packet loss under load. Only corosync 3 really has problems with that. I had to enable the bonding on the Cisco switch as well.
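For anyone running into the same thing: with 802.3ad (LACP) bonding, both ends have to be configured, otherwise the bond appears to work but silently drops packets under load. A sketch of what a matching setup could look like; the interface names, bond options, and switch ports are examples, not my exact configuration:

Code:
# /etc/network/interfaces on the Proxmox node (ifupdown/ifenslave syntax)
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer2+3

And the corresponding side on a Cisco IOS switch:

Code:
! aggregate the two ports into one LACP port-channel
interface Port-channel1
 switchport mode trunk
interface range GigabitEthernet1/0/1 - 2
 channel-group 1 mode active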
 
Thanks for sharing the solution!
Please mark the thread as 'SOLVED'; it might help others in a similar situation.
 
