[SOLVED] How to resolve .pm04 corosync[1871]: [KNET ] host: host: 3 has no active links

hello,

i have 4 nodes, .-pm01 .. .-pm04, all running Virtual Environment 8.2.2.

2 known system differences:

1. .pm01 .. .pm03 have valid subscriptions, while .pm04 does not have one yet (it will get one in the future).

2. the kernels differ:
on .pm01 .. .pm03, uname -a: Linux .pm03 6.5.13-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-1 (2024-02-05T13:50Z) x86_64 GNU/Linux
on .pm04, uname -a: Linux .pm04 6.8.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) x86_64 GNU/Linux

--

findings:
on .pm01 .. .pm03:

pvecm status
Cluster information
-------------------
Name: testcluster
Config Version: 13
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri May 3 16:58:15 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1.3a0
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 *.34
0x00000002 1 *.33 (local)
0x00000003 1 *.35

-----

on ms-pm04:
pvecm status
Cluster information
-------------------
Name: testcluster
Config Version: 13
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri May 3 16:59:36 2024
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 4.3c4
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000004 1 *.36 (local)

systemctl status corosync says:

May 03 17:11:46 .pm04 corosync[1835]: [KNET ] host: host: 3 has no active links
May 03 17:11:46 .pm04 corosync[1835]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 03 17:11:46 .pm04 corosync[1835]: [KNET ] host: host: 3 has no active links
May 03 17:11:46 .pm04 corosync[1835]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
May 03 17:11:46 .pm04 corosync[1835]: [QUORUM] Sync members[1]: 4
May 03 17:11:46 .pm04 corosync[1835]: [QUORUM] Sync joined[1]: 4
May 03 17:11:46 .pm04 corosync[1835]: [TOTEM ] A new membership (4.3c9) was formed. Members joined: 4
May 03 17:11:46 .pm04 corosync[1835]: [QUORUM] Members[1]: 4
May 03 17:11:46 .pm04 corosync[1835]: [MAIN ] Completed service synchronization, ready to provide service.
May 03 17:11:46 .pm04 systemd[1]: Started corosync.service - Corosync Cluster Engine.


initially it worked and i was able to migrate a vm from .pm02 to .pm04 and i could start it on .pm04.

/etc/corosync/corosync.conf and /etc/pve/corosync.conf have identical content and seem to be the same on all 4 nodes.
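
A byte-for-byte comparison is one way to turn "seem to be the same" into a definite yes or no. This is only a minimal sketch; the helper name compare_conf is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# Report whether two copies of a config file are byte-identical.
# Prints "identical" or "different" (compare_conf is a hypothetical helper).
compare_conf() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

On a single node it could be called as compare_conf /etc/corosync/corosync.conf /etc/pve/corosync.conf. Comparing across nodes would additionally need the remote copy fetched first, which is left out here.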

/var/log/syslog on .pm04 states:

2024-05-03T16:40:14.438441+02:00 .pm04 pmxcfs[1619]: [quorum] crit: quorum_initialize failed: 2
2024-05-03T16:40:14.438484+02:00 .pm04 pmxcfs[1619]: [quorum] crit: can't initialize service
2024-05-03T16:40:14.438501+02:00 .pm04 pmxcfs[1619]: [confdb] crit: cmap_initialize failed: 2
2024-05-03T16:40:14.438514+02:00 .pm04 pmxcfs[1619]: [confdb] crit: can't initialize service
2024-05-03T16:40:14.438528+02:00 .pm04 pmxcfs[1619]: [dcdb] crit: cpg_initialize failed: 2
2024-05-03T16:40:14.438549+02:00 .pm04 pmxcfs[1619]: [dcdb] crit: can't initialize service
2024-05-03T16:40:14.438564+02:00 .pm04 pmxcfs[1619]: [status] crit: cpg_initialize failed: 2
2024-05-03T16:40:14.438578+02:00 .pm04 pmxcfs[1619]: [status] crit: can't initialize service

it looks to me like the nodes are now unable to communicate with each other.

maybe someone can give me a hint to resolve this issue.

thanks in advance, gustav
 
First off, please use [code][/code] tags to make CLI output and file contents more readable :)

Code:
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.30.34
0x00000002 1 192.168.30.33 (local)
0x00000003 1 192.168.30.35
versus
Code:
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000004 1 192.168.30.36 (local)

This indicates a network problem between node 4 (the one with the IP ending in .36) and the other three nodes.
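
The numbers in the two listings follow from how votequorum counts: with 4 expected votes a partition needs more than half of them, so 3. The three connected nodes reach that, while node 4 on its own holds only 1 vote and gets "Activity blocked". A small sketch of that arithmetic, with function names that are purely illustrative:

```shell
# Votes needed for quorum: integer half of the expected votes, plus one.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Arguments: total votes seen in this partition, expected votes.
# Prints "yes" if the partition is quorate, otherwise "no".
is_quorate() {
    if [ "$1" -ge "$(quorum_needed "$2")" ]; then
        echo "yes"
    else
        echo "no"
    fi
}
```

With the values from the outputs above, is_quorate 3 4 describes the three-node side and is_quorate 1 4 describes node 4.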

If you check the corosync logs on the individual nodes with journalctl -u corosync, you should see in more detail how the connection falls apart.
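
To get a quick overview from a long journal, the "has no active links" messages can be counted per host. A rough sketch, assuming the log line format shown elsewhere in this thread; the function name is made up:

```shell
# Summarise which knet hosts were reported without an active link.
# Reads corosync log lines on stdin, prints "host <id>: <count>" per host.
no_active_links_summary() {
    grep 'has no active links' |
        sed -n 's/.*host: host: \([0-9][0-9]*\) has no active links.*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "host %s: %s\n", $2, $1 }'
}
```

It would be fed from the journal, for example: journalctl -u corosync --no-pager | no_active_links_summary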

Check if they can ping each other on this network. Did you configure a specific MTU?
Is the network for Corosync shared with other services that might take up a lot of bandwidth and congest it? Services that typically congest a network include shared storage, backup, live migration, ...
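
For the ping and MTU check, a plain ping only proves that small packets get through. A sketch of a ping that also exercises a given MTU, assuming IPv4 and the Linux iputils ping (-M do sets "don't fragment"); the function names are illustrative:

```shell
# ICMP payload that exactly fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping a peer with fragmentation forbidden.
# Arguments: peer address, MTU to test.
check_path_mtu() {
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}
```

From node 4 this could be run against one of the other nodes, for example check_path_mtu 192.168.30.33 1500. If the ping fails at the configured MTU but works with a smaller one, that points to an MTU mismatch somewhere on the path.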
 
journalctl -u corosync:

Code:
May 03 14:34:15 .pm04 corosync[2951]:   [MAIN  ] Corosync Cluster Engine  starting up
May 03 14:34:15 .pm04 corosync[2951]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
May 03 14:34:15 .pm04 corosync[2951]:   [TOTEM ] Initializing transport (Kronosnet).
May 03 14:34:15 .pm04 corosync[2951]:   [TOTEM ] totemknet initialized
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] pmtud: MTU manually set to: 0
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
May 03 14:34:15 .pm04 corosync[2951]:   [QB    ] server name: cmap
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync configuration service [1]
May 03 14:34:15 .pm04 corosync[2951]:   [QB    ] server name: cfg
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 03 14:34:15 .pm04 corosync[2951]:   [QB    ] server name: cpg
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
May 03 14:34:15 .pm04 corosync[2951]:   [WD    ] Watchdog not enabled by configuration
May 03 14:34:15 .pm04 corosync[2951]:   [WD    ] resource load_15min missing a recovery key.
May 03 14:34:15 .pm04 corosync[2951]:   [WD    ] resource memory_used missing a recovery key.
May 03 14:34:15 .pm04 corosync[2951]:   [WD    ] no resources configured.
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
May 03 14:34:15 .pm04 corosync[2951]:   [QUORUM] Using quorum provider corosync_votequorum
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 03 14:34:15 .pm04 corosync[2951]:   [QB    ] server name: votequorum
May 03 14:34:15 .pm04 corosync[2951]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 03 14:34:15 .pm04 corosync[2951]:   [QB    ] server name: quorum
May 03 14:34:15 .pm04 corosync[2951]:   [TOTEM ] Configuring link 0
May 03 14:34:15 .pm04 corosync[2951]:   [TOTEM ] Configured link number 0: local addr: 10.10.151.243, port=5405
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 2 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 1 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] host: host: 3 has no active links
May 03 14:34:15 .pm04 corosync[2951]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
May 03 14:34:15 .pm04 corosync[2951]:   [QUORUM] Sync members[1]: 4
May 03 14:34:15 .pm04 corosync[2951]:   [QUORUM] Sync joined[1]: 4
May 03 14:34:15 .pm04 corosync[2951]:   [TOTEM ] A new membership (4.5) was formed. Members joined: 4
May 03 14:34:15 .pm04 corosync[2951]:   [QUORUM] Members[1]: 4
May 03 14:34:15 .pm04 corosync[2951]:   [MAIN  ] Completed service synchronization, ready to provide service.

the .36 nic is also used for shared storage, but there does not seem to be much traffic on it.

.pm01 .. .pm03 work fine.
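
The startup log above reports 10.10.151.243 as the local address of link 0, while the membership listing quoted earlier in the thread shows 192.168.30.36 for this node. Whether that difference matters, or is only an artifact of how the output was anonymised, cannot be told from the thread. The sketch below only pulls the reported address out of the log so it can be compared with the ring0_addr in corosync.conf; the function name is made up.

```shell
# Print the local address corosync reports for each configured knet link.
# Reads corosync log lines on stdin, prints "link <n>: <addr>".
corosync_local_addrs() {
    sed -n 's/.*Configured link number \([0-9][0-9]*\): local addr: \([^,]*\),.*/link \1: \2/p'
}
```

Example use: journalctl -u corosync --no-pager | corosync_local_addrs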
 
the strange thing is:

neither "pvecm nodes" nor "pvecm status" seems to be aware of .pm04; at least this node is not listed.


but

/etc/corosync/corosync.conf and /etc/pve/corosync.conf do contain .pm04, and the web interface still displays .pm04 under datacenter -> cluster.

i guess this is because /etc/pve/nodes/.pm04 still exists on .pm01 .. .pm03 (i had installed, joined and removed .pm04 once before). should i remove this directory and its contents?
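
To put what corosync.conf lists next to what pvecm nodes reports, the node entries can be pulled out of the file. A minimal sketch, assuming the usual layout of node { ... } blocks with name:, nodeid: and ring0_addr: keys; the function name is made up and nothing here modifies the file.

```shell
# List "nodeid name ring0_addr" for every node block in a corosync.conf.
# Argument: path to the config file (read-only).
list_corosync_nodes() {
    awk '
        $1 == "node" && $2 == "{"      { in_node = 1; name = id = addr = ""; next }
        in_node && $1 == "name:"       { name = $2 }
        in_node && $1 == "nodeid:"     { id = $2 }
        in_node && $1 == "ring0_addr:" { addr = $2 }
        in_node && $1 == "}"           { print id, name, addr; in_node = 0 }
    ' "$1"
}
```

Example use: list_corosync_nodes /etc/pve/corosync.conf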


i think something is wrong with .pm04 and i would like to remove it from the cluster (and re-add it later after a fresh reinstall), but i don't know how to do this as long as pvecm does not know about it (except by deleting it from corosync.conf, but i do not feel brave enough to do this).

questions:

is it safe to remove .pm04 from corosync.conf?
is it safe to delete /etc/pve/nodes/.pm04?

any advice?
 