Corosync strange behaviour

Mar 8, 2021
Hello,

Today I detected some strange behaviour.
I have a 3-node cluster with two rings configured, each on dedicated network ports.

On each node I can check it with corosync-cfgtool -n:

Code:
Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 (172.16.120.9->172.16.120.10) enabled connected mtu: 1397
   LINK: 1 (172.16.121.9->172.16.121.10) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 (172.16.120.9->172.16.120.8) enabled connected mtu: 1397
   LINK: 1 (172.16.121.9->172.16.121.8) enabled connected mtu: 1397

Everything looks fine, but corosync-cfgtool -s shows this (Link ID 1 is missing):
Code:
LINK ID 0
    addr    = 172.16.120.9
    status:
        nodeid:   1:    connected
        nodeid:   2:    localhost
        nodeid:   3:    connected

and on another host:
Code:
LINK ID 0
    addr    = 172.16.120.8
    status:
        nodeid:   1:    localhost
        nodeid:   2:    connected
        nodeid:   3:    connected
LINK ID 1
    addr    = 172.16.121.8
    status:
        nodeid:   1:    localhost
        nodeid:   2:    disconnected
        nodeid:   3:    connected

So there is something strange with ring 1.

Any ideas?
 
Please provide the output of pveversion -v, the corosync config in /etc/pve/corosync.conf and the one in /etc/corosync/corosync.conf on that node.
 
corosync.conf is identical on all nodes, both in /etc/pve and in /etc/corosync. I tested it with diff.
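For illustration only, this is the kind of diff check meant here (paths from this thread; the hostname in the second command is just an example):

Code:
# compare the pmxcfs copy with the local corosync copy on one node
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf
# compare the local copy against the copy on another node (hostname is an example)
ssh pve-08 cat /etc/corosync/corosync.conf | diff /etc/corosync/corosync.conf -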

One more thing: I changed the network cards from dual-port to quad-port, so there are new MAC addresses.
I don't know if that matters.

Code:
root@pve-09:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-3
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-3
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
Code:
root@pve-09:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-08
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.120.8
    ring1_addr: 172.16.121.8
  }
  node {
    name: pve-09
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.120.9
    ring1_addr: 172.16.121.9
  }
  node {
    name: pve-10
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.120.10
    ring1_addr: 172.16.121.10
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PROD-PI
  config_version: 4
  rrp_mode: passive
  interface {
    bindnetaddr: 172.16.120.8
    ringnumber: 0
  }
  interface {
    bindnetaddr: 172.16.121.8
    ringnumber: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
The changed MAC addresses should not be the issue, as long as nothing between the nodes filters traffic based on them and still has the old ones.
Have you already restarted Corosync on that node? systemctl restart corosync.service
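As a purely hypothetical side check (not something requested in this thread), a peer that still caches the old MAC for a ring address could be inspected and cleared; the address and interface name below are only examples:

Code:
# show the cached neighbour entry for a ring 1 peer address
ip neigh show 172.16.121.9
# drop cached entries on the ring 1 interface so they are re-learned
ip neigh flush dev ens1f0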
 
Does pveversion output the same on all nodes? (Just to be sure.)

Regarding "I changed network cards from dual to quadport so there are new MAC-Addresses": did you also change the IP addresses when switching NICs?

It would be nice to get the output of:
  • from one node: cat /etc/pve/corosync.conf
  • from those two, or all, nodes: ip addr
 
No, the IPs stay the same.
pveversion is identical on all nodes; I checked it again with diff.

Here is the IP config:
Code:
PVE-08:
-------
7: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 08:f1:ea:8c:e3:97 brd ff:ff:ff:ff:ff:ff
    inet 172.16.120.8/24 brd 172.16.120.255 scope global eno4
       valid_lft forever preferred_lft forever
    inet6 fe80::af1:eaff:fe8c:e397/64 scope link
       valid_lft forever preferred_lft forever
8: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:f2:b3:1b:03:c4 brd ff:ff:ff:ff:ff:ff
    inet 172.16.121.8/24 brd 172.16.121.255 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::9af2:b3ff:fe1b:3c4/64 scope link
       valid_lft forever preferred_lft forever

PVE-09:
-------
7: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 08:f1:ea:8c:dd:97 brd ff:ff:ff:ff:ff:ff
    inet 172.16.120.9/24 brd 172.16.120.255 scope global eno4
       valid_lft forever preferred_lft forever
    inet6 fe80::af1:eaff:fe8c:dd97/64 scope link
       valid_lft forever preferred_lft forever
8: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:f2:b3:1b:03:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.121.9/24 brd 172.16.121.255 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::9af2:b3ff:fe1b:300/64 scope link
       valid_lft forever preferred_lft forever

PVE-10:
-------
7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:a8:2a:1c:6a:a7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.120.10/24 brd 172.16.120.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::3ea8:2aff:fe1c:6aa7/64 scope link
       valid_lft forever preferred_lft forever
8: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 98:f2:b3:1b:0d:d0 brd ff:ff:ff:ff:ff:ff
    inet 172.16.121.10/24 brd 172.16.121.255 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet6 fe80::9af2:b3ff:fe1b:dd0/64 scope link
       valid_lft forever preferred_lft forever
 
I have the same 'problem'. Maybe it's just a display issue, as the Proxmox VE Administration Guide notes:
"Even if all links are working, only the one with the highest priority will see corosync traffic."
 
Code:
root@MCHPLPX01:~# corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0
    addr    = 172.16.1.1
    status:
        nodeid:   1:    localhost
        nodeid:   2:    connected
        nodeid:   3:    connected
        nodeid:   4:    connected


Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: MCHPLPX01
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.16.1.1
    ring1_addr: 172.16.40.1
  }
  node {
    name: MCHPLPX02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.1.2
    ring1_addr: 172.16.40.2
  }
  node {
    name: MCHPLPX03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.3
    ring1_addr: 172.16.40.3
  }
  node {
    name: MCHPLPX04
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.1.4
    ring1_addr: 172.16.40.4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: prod
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Code:
root@MCHPLPX01:~# corosync-cfgtool -n
Local node ID 4, transport knet
nodeid: 1 reachable
   LINK: 0 (172.16.1.1->172.16.1.4) enabled connected mtu: 1397
   LINK: 1 (172.16.40.1->172.16.40.4) enabled connected mtu: 9093

nodeid: 2 reachable
   LINK: 0 (172.16.1.1->172.16.1.3) enabled connected mtu: 1397
   LINK: 1 (172.16.40.1->172.16.40.3) enabled connected mtu: 9093

nodeid: 3 reachable
   LINK: 0 (172.16.1.1->172.16.1.2) enabled connected mtu: 1397
   LINK: 1 (172.16.40.1->172.16.40.2) enabled connected mtu: 9093


Code:
Jul 13 12:51:42 MCHPLPX01 systemd[1]: Stopped Corosync Cluster Engine.
Jul 13 12:51:42 MCHPLPX01 systemd[1]: Starting Corosync Cluster Engine...
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [quorum] crit: quorum_initialize failed: 2
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [quorum] crit: can't initialize service
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [confdb] crit: cmap_initialize failed: 2
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [confdb] crit: can't initialize service
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: start cluster connection
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [dcdb] crit: cpg_initialize failed: 2
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [dcdb] crit: can't initialize service
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [status] notice: start cluster connection
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [status] crit: cpg_initialize failed: 2
Jul 13 12:51:42 MCHPLPX01 pmxcfs[1618]: [status] crit: can't initialize service
Jul 13 12:51:42 MCHPLPX01 corosync[4257]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Jul 13 12:51:42 MCHPLPX01 corosync[4257]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 13 12:51:42 MCHPLPX01 corosync[4257]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 13 12:51:42 MCHPLPX01 corosync[4257]:   [TOTEM ] totemknet initialized
Jul 13 12:51:42 MCHPLPX01 corosync[4257]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QB    ] server name: cmap
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QB    ] server name: cfg
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QB    ] server name: cpg
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [WD    ] Watchdog not enabled by configuration
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [WD    ] resource load_15min missing a recovery key.
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [WD    ] resource memory_used missing a recovery key.
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [WD    ] no resources configured.
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QB    ] server name: votequorum
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QB    ] server name: quorum
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [TOTEM ] Configuring link 0
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [TOTEM ] Configured link number 0: local addr: 172.16.1.1, port=5405
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [TOTEM ] Configuring link 1
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [TOTEM ] Configured link number 1: local addr: 172.16.40.1, port=5406
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 4 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 0)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QUORUM] Sync members[1]: 4
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QUORUM] Sync joined[1]: 4
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [TOTEM ] A new membership (4.34) was formed. Members joined: 4
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [QUORUM] Members[1]: 4
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 13 12:51:43 MCHPLPX01 systemd[1]: Started Corosync Cluster Engine.
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:43 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 has no active links
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 1 link: 1 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 2 link: 1 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 8885
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 8885
Jul 13 12:51:45 MCHPLPX01 corosync[4257]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [QUORUM] Sync members[4]: 1 2 3 4
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [QUORUM] Sync joined[3]: 1 2 3
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [TOTEM ] A new membership (1.38) was formed. Members joined: 1 2 3
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [QUORUM] This node is within the primary component and will provide service.
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [QUORUM] Members[4]: 1 2 3 4
Jul 13 12:51:47 MCHPLPX01 corosync[4257]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: update cluster info (cluster name  prod, version = 4)
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: node has quorum
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: members: 1/1635, 2/1630, 3/1603, 4/1618
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: starting data syncronisation
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: received sync request (epoch 1/1635/00000003)
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: members: 1/1635, 2/1630, 3/1603, 4/1618
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: starting data syncronisation
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: received all states
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: leader is 1/1635
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: synced members: 1/1635, 2/1630, 3/1603
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: waiting for updates from leader
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: received sync request (epoch 1/1635/00000003)
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [dcdb] notice: all data is up to date
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: received all states
Jul 13 12:51:48 MCHPLPX01 pmxcfs[1618]: [status] notice: all data is up to date
 


Hi, I know this is an old-ish thread, but did you ever find a solution to the weird corosync-cfgtool -s output on the one node?

I have a cluster with 14 nodes and three rings enabled. The cluster has been running for more than a year, and new nodes have been added as needed. Today I started upgrading the nodes from 6.4-6 to 6.4-13. The first two upgraded nodes seem OK, and corosync-cfgtool -s shows all three links, e.g. (IP addresses changed):
Code:
root@pve113:~# corosync-cfgtool -s
Local node ID 11, transport knet
LINK ID 0
        addr    = 1.2.3.4
        status:
                nodeid:   1:    connected
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   4:    connected
                nodeid:   5:    localhost
                nodeid:   6:    connected
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
                nodeid:  10:    connected
                nodeid:  11:    connected
                nodeid:  12:    connected
                nodeid:  13:    connected
                nodeid:  14:    connected
LINK ID 1
        addr    = 5.6.7.8
        status:
                nodeid:   1:    connected
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   4:    connected
                nodeid:   5:    localhost
                nodeid:   6:    connected
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
                nodeid:  10:    connected
                nodeid:  11:    disconnected
                nodeid:  12:    connected
                nodeid:  13:    connected
                nodeid:  14:    connected
LINK ID 2
        addr    = 9.10.11.12
        status:
                nodeid:   1:    connected
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   4:    connected
                nodeid:   5:    localhost
                nodeid:   6:    connected
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
                nodeid:  10:    connected
                nodeid:  11:    disconnected
                nodeid:  12:    connected
                nodeid:  13:    connected
                nodeid:  14:    connected

On the third node corosync-cfgtool -s shows:
Code:
root@pve114:~# corosync-cfgtool -s
Local node ID 14, transport knet
LINK ID 0
        addr    = 1.2.3.5
        status:
                nodeid:   1:    connected
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   4:    connected
                nodeid:   5:    connected
                nodeid:   6:    localhost
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
                nodeid:  10:    connected
                nodeid:  11:    connected
                nodeid:  12:    connected
                nodeid:  13:    connected
                nodeid:  14:    connected
I am not 100% sure the output was OK before the node was upgraded. I have done many upgrades and patch runs on this cluster before and never seen this behaviour. The cluster seems OK apart from this output.

If I run corosync-cfgtool -n on one of the other nodes, it looks like all is OK:
Code:
root@pve113:~# corosync-cfgtool -n
Local node ID 11, transport knet

<....removed entries for all the other nodes....>

nodeid: 14 reachable
   LINK: 0 (1.2.3.4->1.2.3.5) enabled connected mtu: 1397
   LINK: 1 (5.6.7.8->5.6.7.9) enabled connected mtu: 1397
   LINK: 2 (9.10.11.12->9.10.11.13) enabled connected mtu: 1397

I can also ping between the nodes on all three rings.
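As an illustration, the ping check over all three rings could look like this (using the placeholder peer addresses for node 14 shown above):

Code:
# ping the node 14 ring addresses once per ring from pve113
for ip in 1.2.3.5 5.6.7.9 9.10.11.13; do ping -c 3 "$ip"; done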

Any pointers to what can cause this?

BR
Bjørn
 
Following up on my previous comment: I removed the node that was only showing one ring from the cluster, reinstalled it, and joined it to the cluster again, but it is still showing the same output from corosync-cfgtool as before. I am not sure if I should, or whether it is safe to, go ahead with upgrading the remaining 11 nodes before understanding why this happens. I have 3 other clusters running but have not seen this before.

/bjørn
 
It seems to be cosmetic, indeed.

But I don't understand why it only shows up on this one node so far. I did a restart of corosync to verify that all links come up, and it looks good.

I guess corosync-cmaptool is the old name? Running corosync-cmapctl on the node (nodeid 14):
Code:
root@pve114:~# corosync-cmapctl -m stats | grep node14
stats.knet.node14.link0.connected (u8) = 1
stats.knet.node14.link0.down_count (u32) = 0
stats.knet.node14.link0.enabled (u8) = 1
stats.knet.node14.link0.latency_ave (u32) = 0
stats.knet.node14.link0.latency_max (u32) = 0
stats.knet.node14.link0.latency_min (u32) = 4294967295
stats.knet.node14.link0.latency_samples (u32) = 0
stats.knet.node14.link0.mtu (u32) = 65535
stats.knet.node14.link0.rx_data_bytes (u64) = 0
stats.knet.node14.link0.rx_data_packets (u64) = 0
stats.knet.node14.link0.rx_ping_bytes (u64) = 0
stats.knet.node14.link0.rx_ping_packets (u64) = 0
stats.knet.node14.link0.rx_pmtu_bytes (u64) = 0
stats.knet.node14.link0.rx_pmtu_packets (u64) = 0
stats.knet.node14.link0.rx_pong_bytes (u64) = 0
stats.knet.node14.link0.rx_pong_packets (u64) = 0
stats.knet.node14.link0.rx_total_bytes (u64) = 0
stats.knet.node14.link0.rx_total_packets (u64) = 0
stats.knet.node14.link0.rx_total_retries (u64) = 0
stats.knet.node14.link0.tx_data_bytes (u64) = 2011454
stats.knet.node14.link0.tx_data_errors (u32) = 0
stats.knet.node14.link0.tx_data_packets (u64) = 916
stats.knet.node14.link0.tx_data_retries (u32) = 0
stats.knet.node14.link0.tx_ping_bytes (u64) = 0
stats.knet.node14.link0.tx_ping_errors (u32) = 0
stats.knet.node14.link0.tx_ping_packets (u64) = 0
stats.knet.node14.link0.tx_ping_retries (u32) = 0
stats.knet.node14.link0.tx_pmtu_bytes (u64) = 0
stats.knet.node14.link0.tx_pmtu_errors (u32) = 0
stats.knet.node14.link0.tx_pmtu_packets (u64) = 0
stats.knet.node14.link0.tx_pmtu_retries (u32) = 0

It only shows link0 for the local node. If I run the same command on another node in the cluster, it shows stats for all three links on node14.

I have one cluster running PVE 7 (corosync 3.1.5-pve1), and it seems like this has changed there as you suggest: corosync-cmapctl -m stats shows all links on all nodes.

/bjørn
 
