How to add a 2nd Corosync Link

Hi,

Currently that's only possible by editing the configuration: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster

There's work underway to allow adding, editing and removing links through the GUI.
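For reference, adding a second link boils down to giving every node a ring1_addr in /etc/pve/corosync.conf and bumping the config version. A rough sketch (addresses taken from this thread; adapt to your setup):

```
nodelist {
  node {
    name: cluster5-node01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.30.212
    ring1_addr: 10.10.51.1    # new, second link
  }
  # ... add the same ring1_addr line for the other nodes ...
}

totem {
  cluster_name: cluster5
  config_version: 4           # must be increased on every change
  # ...
}
```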

I just did that, and it seemed to work. The 2nd link is being displayed in the GUI now, too, as is the new configuration ID/version.
journalctl -b -u corosync looked good.

However, to test it I did an "ifconfig vmbr0 0.0.0.0" on the interface ring0 runs on.
After a few moments it lost quorum and got fenced?

Should it not still be alive via ring1/link1?
Should I not see the other nodes' IPs, like 10.10.51.1, 10.10.51.2, 10.10.51.3?


root@cluster5-node02:~# pvecm status
Cluster information
-------------------
Name: cluster5
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Feb 17 16:06:26 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1.38
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.30.212
0x00000002 1 172.16.30.214
0x00000003 1 172.16.30.213 (local)
 
However, to test it I did an "ifconfig vmbr0 0.0.0.0" on the interface ring0 runs on.
After a few moments it lost quorum and got fenced?

ifup/ifdown is not really a good test for corosync, even though it should work with corosync 3. This specific case actually looks like a bug that only shows up with kernels newer than 5.0, which we also found a few weeks ago and which should be gone with the next corosync/kronosnet upgrade.

You can check the link state with corosync-cfgtool -sb
 
Normally it should still stay alive, though. Check the link state to see if everything shows up?


ssh cluster5-node01 "corosync-cfgtool -sb"
Printing link status.
Local node ID 1
LINK ID 0
addr = 172.16.30.212
status = 333


ssh cluster5-node02 "corosync-cfgtool -sb"
Printing link status.
Local node ID 3
LINK ID 0
addr = 172.16.30.213
status = 333
LINK ID 1
addr = 10.10.51.2
status = 13n


ssh cluster5-node03 "corosync-cfgtool -sb"
Printing link status.
Local node ID 2
LINK ID 0
addr = 172.16.30.214
status = 333
LINK ID 1
addr = 10.10.51.3
status = 1n3


How shall I test it properly then?
 
And check what the log on the fenced node (or another node) says regarding corosync - it should talk about both links after they got configured.


Feb 17 15:56:28 cluster5-node01 corosync[3077]: [KNET ] link: host: 3 link: 0 is down
Feb 17 15:56:28 cluster5-node01 corosync[3077]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 17 15:56:28 cluster5-node01 corosync[3077]: [KNET ] host: host: 3 has no active links
Feb 17 15:56:28 cluster5-node01 corosync[3077]: [TOTEM ] Token has not been received in 1237 ms
Feb 17 15:56:29 cluster5-node01 corosync[3077]: [TOTEM ] A processor failed, forming new configuration.
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [TOTEM ] A new membership (1.34) was formed. Members left: 3
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [TOTEM ] Failed to receive the leave message. failed: 3
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [CPG ] downlist left_list: 1 received
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [CPG ] downlist left_list: 1 received
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: members: 1/3054, 2/3133
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: starting data syncronisation
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [QUORUM] Members[2]: 1 2
Feb 17 15:56:31 cluster5-node01 corosync[3077]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: cpg_send_message retried 1 times
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: members: 1/3054, 2/3133
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: starting data syncronisation
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: received sync request (epoch 1/3054/00000004)
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: received sync request (epoch 1/3054/00000004)
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: received all states
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: leader is 1/3054
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: synced members: 1/3054, 2/3133
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: start sending inode updates
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: sent all (0) updates
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: all data is up to date
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [dcdb] notice: dfsm_deliver_queue: queue length 5
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: received all states
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: all data is up to date
Feb 17 15:56:31 cluster5-node01 pmxcfs[3054]: [status] notice: dfsm_deliver_queue: queue length 87


=> it's only talking about link: 0, right?
 
Feb 17 15:56:28 cluster5-node01 corosync[3077]: [KNET ] host: host: 3 has no active links
=> it's only talking about link: 0, right?

Yes, it seems kronosnet (the transport technology corosync uses) had not "seen" the new link yet.
But the corosync output in your other posts says it does now; you could re-check the current logs.
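For context, knet's default "passive" mode sends traffic to each peer over only one link at a time: the highest-priority connected link, falling back when it goes down (per the knet_link_priority description in corosync.conf(5)). A toy sketch of that selection, not actual knet code:

```python
def best_link(links):
    """Pick the link knet would use for a peer in passive mode.

    links: list of (link_id, priority, connected) tuples.
    Returns the id of the highest-priority connected link, or None
    when no link is up (the "host ... has no active links" case,
    after which the peer drops out of the membership).
    """
    up = [(link_id, prio) for link_id, prio, connected in links if connected]
    if not up:
        return None
    # Assumption: a higher knet_link_priority value wins.
    return max(up, key=lambda entry: entry[1])[0]

# Both links up: traffic stays on the preferred link 0.
print(best_link([(0, 2, True), (1, 1, True)]))   # -> 0
# Link 0 down: knet fails over to link 1.
print(best_link([(0, 2, False), (1, 1, True)]))  # -> 1
```

This is why the logs above only mention "best link: 0" until the second link is actually known to knet: failover can only happen once the link exists from knet's point of view.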
 
I restarted corosync on all nodes now. Now I am "only" left with link enabled:0 on nodeid 1:


root@cluster5-node01:~# /usr/sbin/corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
addr = 172.16.30.212
status:
nodeid 1: link enabled:1 link connected:1
nodeid 2: link enabled:1 link connected:1
nodeid 3: link enabled:1 link connected:1
LINK ID 1
addr = 10.10.51.1
status:
nodeid 1: link enabled:0 link connected:1
nodeid 2: link enabled:1 link connected:1
nodeid 3: link enabled:1 link connected:1


On node01 I increased the config version to trigger a reload. It seems to be aware of link1, but won't enable it?

Feb 17 16:53:59 cluster5-node01 corosync[192434]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 17 16:53:59 cluster5-node01 corosync[192434]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 17 16:53:59 cluster5-node01 corosync[192434]: [KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Feb 17 16:53:59 cluster5-node01 corosync[192434]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 17 16:53:59 cluster5-node01 corosync[192434]: [KNET ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
Feb 17 16:53:59 cluster5-node01 corosync[192434]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [CFG ] Config reload requested by node 1
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configuring link 0
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configured link number 0: local addr: 172.16.30.212, port=5405
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configuring link 1
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configured link number 1: local addr: 10.10.51.1, port=5406

Any idea how to enable that link?
 
No need to, as this is just a display issue - a node will only ever use one link as loopback to itself. The displayed output will hopefully improve with the next corosync version, to avoid this confusion.
 
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configuring link 0
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configured link number 0: local addr: 172.16.30.212, port=5405
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configuring link 1
Feb 17 16:59:51 cluster5-node01 corosync[192434]: [TOTEM ] Configured link number 1: local addr: 10.10.51.1, port=5406

They are already up and configured now :) I'd still avoid ifdown testing until you have the following versions running:
Code:
corosync: 3.0.3-pve1
libknet1: 1.14-pve1
 
Okay, thank you.

I even added a 3rd ring now:

root@cluster5-node02:~# /usr/sbin/corosync-cfgtool -s
Printing link status.
Local node ID 3
LINK ID 0
addr = 172.16.30.213
status:
nodeid 1: link enabled:1 link connected:1
nodeid 2: link enabled:1 link connected:1
nodeid 3: link enabled:1 link connected:1
LINK ID 1
addr = 10.10.51.2
status:
nodeid 1: link enabled:1 link connected:1
nodeid 2: link enabled:1 link connected:1
nodeid 3: link enabled:0 link connected:1
LINK ID 2
addr = 10.10.50.2
status:
nodeid 1: link enabled:1 link connected:1
nodeid 2: link enabled:1 link connected:1
nodeid 3: link enabled:0 link connected:1
 
No, although I need to research what happened with the work I referred to (off the top of my head it was handled by a colleague who left us that year for some virtual-reality pastures).

Anyhow, the UI side might not be _that_ hard, but the API backend is also still missing support for this.

It would be great to open an enhancement request over at https://bugzilla.proxmox.com/ so we can keep better track of this feature request.

W.r.t. implementation details: I'd probably add explicit "add link" and "remove link" endpoints with some basic checks, like whether a (working) link is left when one gets removed, or whether all network addresses are reachable/configured on the nodes, and so on. Whether we should also have an "edit existing link" endpoint is IMO a bit questionable; most of the time it's probably safer to add a new link with the new desired addresses and then delete the old one once everything works out.
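As a purely hypothetical illustration of the kind of safety check such a "remove link" endpoint could perform (names and data shapes made up, not actual Proxmox API code):

```python
def can_remove_link(links, link_id):
    """Refuse to remove a link unless at least one other connected link remains.

    links: list of dicts like {"id": 0, "connected": True}, one per
    configured corosync link. Removing the last working link would cut
    the node off from the cluster, so the check rejects that.
    """
    remaining = [link for link in links if link["id"] != link_id]
    return any(link["connected"] for link in remaining)

links = [{"id": 0, "connected": True}, {"id": 1, "connected": False}]
print(can_remove_link(links, 1))  # -> True: working link 0 remains
print(can_remove_link(links, 0))  # -> False: only a down link would be left
```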
 
Made that feature request...
https://bugzilla.proxmox.com/show_bug.cgi?id=6147
 