[SOLVED] Corosync redundancy - corosync.conf

Mihai

Renowned Member
Dec 22, 2015
I have an existing 3-node cluster that was originally created in Proxmox 5.x.

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

I would like to add corosync redundancy and make the appropriate changes to corosync.conf.

The corosync.conf in this older install is slightly different than what is given in the newest documentation.

Specifically in the following lines:

Existing older corosync totem section:

Code:
  interface {
    bindnetaddr: 10.10.2.0
    ringnumber: 0
  }

New corosync totem section in the 7.x documentation:

Code:
  interface {
    linknumber: 0
  }

I have 3 rings with each host having their own ringX_addr set.

In the totem section, would it be safe to remove bindnetaddr and ringnumber and replace them with linknumber?

Thanks!
 
Here's the original corosync:

Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 14
  interface {
    bindnetaddr: 10.10.2.0
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Here's the new corosync I wanted to change it to:

Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.2.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 14
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

When I made the change I got the following errors:

Code:
May 04 09:30:27 VMHost2 systemd[1]: Started Corosync Cluster Engine.
May 04 09:30:27 VMHost2 corosync[3231]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 04 09:30:27 VMHost2 corosync[3231]:   [QB    ] server name: quorum
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configuring link 0
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configured link number 0: local addr: 10.10.2.16, port=5405
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configuring link 1
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configured link number 1: local addr: 10.10.1.16, port=5406
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configuring link 2
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] Configured link number 2: local addr: 10.10.0.16, port=5407
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 1 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 04 09:30:27 VMHost2 corosync[3231]:   [KNET  ] host: host: 2 has no active links
May 04 09:30:27 VMHost2 corosync[3231]:   [QUORUM] Sync members[1]: 3
May 04 09:30:27 VMHost2 corosync[3231]:   [QUORUM] Sync joined[1]: 3
May 04 09:30:27 VMHost2 corosync[3231]:   [TOTEM ] A new membership (3.b66) was formed. Members joined: 3

Code:
May 11 15:22:53 VMHost2 pmxcfs[3204]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 14)
May 11 15:22:53 VMHost2 corosync[3231]:   [CFG   ] Config reload requested by node 3
May 11 15:22:53 VMHost2 corosync[3231]:   [TOTEM ] new config has different address for link 0 (addr changed from 10.10.0.13 to 10.10.2.13). Internal value was NOT changed.
May 11 15:22:53 VMHost2 corosync[3231]:   [CFG   ] Cannot configure new interface definitions: To reconfigure an interface it must be deleted and recreated. A working interface needs to be available to corosync at all times
May 11 15:22:53 VMHost2 pmxcfs[3204]: [dcdb] crit: corosync-cfgtool -R failed with exit code 7#010
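
A rejected reload like the one above can be picked out of the journal automatically. This is only a minimal sketch: it assumes the TOTEM message keeps the wording shown in this log (other corosync versions may phrase it differently), and the function name `find_address_changes` is made up for illustration.

```python
import re

# Wording copied from the log in this thread; treat it as an assumption
# for any other corosync version.
ADDR_CHANGE = re.compile(
    r"new config has different address for link (\d+) "
    r"\(addr changed from (\S+) to (\S+)\)"
)

def find_address_changes(journal_text):
    """Return (link, old_addr, new_addr) for every rejected address change."""
    return [
        (int(link), old, new)
        for link, old, new in ADDR_CHANGE.findall(journal_text)
    ]

sample = (
    "May 11 15:22:53 VMHost2 corosync[3231]:   [TOTEM ] new config has different "
    "address for link 0 (addr changed from 10.10.0.13 to 10.10.2.13). "
    "Internal value was NOT changed."
)
print(find_address_changes(sample))
```

Any hit means the address of an already configured link was edited in place, which corosync refuses to apply on a live reload.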

I didn't notice anything breaking but I re-applied the original corosync configuration.

Does anyone have any idea what the issue may be?
 
the message tells you what you need to do - you need to remove a link, reload config on all nodes, then add the newly configured link and reload again. obviously this only works with multiple links, you always need to have at least one link up and running.
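
The first half of that procedure (removing a link) amounts to dropping the matching ringX_addr lines from the node list. A minimal sketch of that edit on the config text, assuming one ringX_addr entry per line as in the configs posted in this thread; `remove_link` is a hypothetical helper, not part of any Proxmox or corosync tool:

```python
import re

def remove_link(conf_text, link):
    """Drop every ring<link>_addr line from a corosync.conf text."""
    pattern = re.compile(rf"^[ \t]*ring{link}_addr:.*\n?", re.MULTILINE)
    return pattern.sub("", conf_text)

conf = """\
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.13
    ring1_addr: 10.10.1.13
  }
"""

# Stage 1: link 0 removed; link 1 stays so the cluster keeps a working link.
stage1 = remove_link(conf, 0)
print(stage1)
```

Stage 2 is then the same file with the corrected ring0_addr line added back. Both stages need their own config_version increase, and the sketch only touches text; it does not write to /etc/pve or reload anything.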
 
Hi Fabian, thank you for the information, but I don't fully understand the procedure:

  • Which link(s) do I need to remove? Do I need to remove the same link on each node? I thought link 0 was the same address on both configurations: 10.10.2.16, so I am confused.
  • How do I remove the link?
  • I guess we currently only have 1 link operational? I wanted to add 2 more links, so I don't know how I can remove this existing link without breaking the cluster.
 
if I understand you correctly, you want to change ring0_addr of the last node? then you'd need to remove that address, reload the config (this marks that link down obviously on all nodes, so you need another ring/link to remain operational), then add the new address (again as ring0_addr), reload the config, and the link should be usable again.

also one thing to keep in mind - if you change anything in the config, you should always also bump the config_version by 1 before writing it/copying it to /etc/pve/corosync.conf . the config_version is used to detect whether the config is new, so this is pretty important!
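
Since a forgotten version bump is easy to miss, the increment can be done mechanically. A minimal sketch, assuming the file contains exactly one `config_version:` line as in the configs in this thread; `bump_config_version` is a hypothetical helper name:

```python
import re

def bump_config_version(conf_text):
    """Return conf_text with config_version increased by exactly 1."""
    pattern = re.compile(r"^([ \t]*config_version:[ \t]*)(\d+)", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(int(m.group(2)) + 1), conf_text
    )
    if count != 1:
        # Refuse to guess if the line is missing or duplicated.
        raise ValueError(f"expected exactly one config_version line, found {count}")
    return new_text

totem = "totem {\n  cluster_name: SP-Cluster1\n  config_version: 14\n}\n"
print(bump_config_version(totem))
```

The function works on a string only; writing the result to the .new file and moving it into place stays a manual step.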
 
Actually now that you ask, I did not provide the complete story properly. Let me start again.

This was the original corosync.conf:

Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.2.13

  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 14
  interface {
    bindnetaddr: 10.10.2.0
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Then I tried to add two more links and this is the current corosync.conf:

Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 14
  interface {
    bindnetaddr: 10.10.2.0
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

There are three errors in this configuration:
  1. Node vmhost3, where the ring0 address and ring2 address are identical.
  2. I am missing the other rings in the totem section.
  3. I am not using the new corosync.conf setting.
Now when I try to change the corosync.conf to fix vmhost3 address, it gives the error mentioned above.
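
The first error (one address reused for two rings on the same node) is the kind of typo a small check can catch before the file is copied into place. A minimal sketch, assuming the simple `node { ... }` layout used in this thread; `parse_nodes` and `find_duplicate_addresses` are hypothetical helper names:

```python
import re
from collections import defaultdict

def parse_nodes(conf_text):
    """Return {node_name: {link_number: address}} from the nodelist section."""
    nodes = {}
    for block in re.findall(r"\bnode\s*\{(.*?)\}", conf_text, re.DOTALL):
        name = re.search(r"name:\s*(\S+)", block).group(1)
        nodes[name] = {
            int(num): addr
            for num, addr in re.findall(r"ring(\d+)_addr:\s*(\S+)", block)
        }
    return nodes

def find_duplicate_addresses(nodes):
    """Report every address that one node uses for more than one link."""
    problems = []
    for name, links in nodes.items():
        used_by = defaultdict(list)
        for link, addr in links.items():
            used_by[addr].append(link)
        for addr, link_ids in used_by.items():
            if len(link_ids) > 1:
                problems.append(f"{name}: {addr} is used by links {sorted(link_ids)}")
    return problems

sample = """\
nodelist {
  node {
    name: VMHost4
    nodeid: 1
    ring0_addr: 10.10.2.14
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    ring0_addr: 10.10.0.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}
"""
print(find_duplicate_addresses(parse_nodes(sample)))
```

The check only looks for duplicates within a node; it does not verify that the addresses exist on the host or that each ring uses a consistent subnet.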

I don't know how to fix these issues.

How can I test if another link is operational before I bring another down?
 

thanks, that makes the situation more clear :)

> There are three errors in this configuration
>   1. Node vmhost3 where the ring0 address and ring2 address are identical.
>   2. I am missing the other rings in totem section
>   3. I am not using the new corosync.conf setting

yeah, 1. should definitely be fixed
2. is not an issue, the interface section is entirely optional for knet, you can remove it altogether (all the relevant info is contained in the node list anyway)
3. I assume you mean you didn't bump the config_version?

> Now when I try to change the corosync.conf to fix vmhost3 address, it gives the error mentioned above.

yeah, because corosync doesn't allow live-reloading addresses of configured rings/links. you need to live-reload removal of a link, and then live-reload adding the link again with the new address(es).

> I don't know how to fix these issues.

  1. verify with corosync-cfgtool -s that at least one link other than the one you need to reconfigure is connected on each node.
  2. remove the wrong address from the third node and bump the config version (in one edit/copy!), then verify that all nodes reloaded the config (it should say so in the journal output of corosync).
  3. add the right address back to the third node and bump the config version again (in one edit/copy), then verify that all nodes reloaded the config.
  4. verify with corosync-cfgtool again that all three links are operational.

> How can I test if another link is operational before I bring another down?

see above ;) if anything goes wrong, refer to https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf (and the part below about how to reconfigure corosync if the config is messed up and no quorum can be established). if you have HA enabled and active, it might be prudent to disarm it before working on the corosync config, as losing quorum means nodes getting fenced.
 
> 3. I assume you mean you didn't bump the config_version?

Here I meant that this wording does not exist in the newest wiki:

Code:
bindnetaddr: 10.10.2.0

ringnumber: 0

But you did answer this question and you said it was not necessary, so I will remove this from the conf file.

Running corosync-cfgtool -s at this moment, here is the result:

Code:
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.10.2.16
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
LINK ID 1 udp
        addr    = 10.10.1.16
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
LINK ID 2 udp
        addr    = 10.10.0.16
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
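
Output in this shape can also be checked by a script before a link is taken down. A minimal sketch, assuming the `corosync-cfgtool -s` layout shown above (a `LINK ID n` header followed by `nodeid: n: status` lines); `link_status` and `healthy_links` are hypothetical helper names:

```python
import re

def link_status(cfgtool_output):
    """Return {link_id: {nodeid: status}} parsed from `corosync-cfgtool -s` text."""
    links = {}
    current = None
    for line in cfgtool_output.splitlines():
        header = re.match(r"\s*LINK ID (\d+)", line)
        if header:
            current = int(header.group(1))
            links[current] = {}
            continue
        entry = re.match(r"\s*nodeid:\s*(\d+):\s*(\S+)", line)
        if entry and current is not None:
            links[current][int(entry.group(1))] = entry.group(2)
    return links

def healthy_links(links):
    """Links where every peer is 'connected' (the local node shows 'localhost')."""
    return [
        link
        for link, peers in sorted(links.items())
        if peers and all(s in ("connected", "localhost") for s in peers.values())
    ]

sample = """\
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 10.10.2.16
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
LINK ID 1 udp
        addr    = 10.10.1.16
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
"""
print(healthy_links(link_status(sample)))
```

Before removing link 0, the list returned on each node should contain at least one link other than 0.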

Ok I think I understand what I need to do. P.S. I did read and know how to properly edit corosync.conf, but now I realize that I did forget to bump up the version number in the configuration.

Can you please confirm that the following steps are correct?

Step 1. Disarm HA

systemctl stop pve-ha-crm

Step 2. Prepare new corosync.conf.new by removing ring0 address.

Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 15
  ip_version: ipv4
  secauth: on
  version: 2
}

Step 3. Bump up version as per instructions and copy it to corosync.conf.

Step 4. Prepare new corosync.conf.new with the correct ring0 address.


Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14   
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.2.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 16
  ip_version: ipv4
  secauth: on
  version: 2
}

Step 5. Bump up version as per instructions and copy it to corosync.conf.

Step 6. Prepare new corosync.conf.new by removing ring2 address.


Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14   
    ring1_addr: 10.10.1.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.2.13
    ring1_addr: 10.10.1.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 17
  ip_version: ipv4
  secauth: on
  version: 2
}

Step 7. Bump up version as per instructions and copy it to corosync.conf.

Step 8. Prepare new corosync.conf.new with the correct ring2 address.


Code:
nodelist {
  node {
    name: VMHost2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.2.16
    ring1_addr: 10.10.1.16
    ring2_addr: 10.10.0.16
  }
  node {
    name: VMHost4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.14   
    ring1_addr: 10.10.1.14
    ring2_addr: 10.10.0.14
  }
  node {
    name: vmhost3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.2.13
    ring1_addr: 10.10.1.13
    ring2_addr: 10.10.0.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SP-Cluster1
  config_version: 18
  ip_version: ipv4
  secauth: on
  version: 2
}

Step 9. Bump up corosync version as per instructions.

Step 10. Re-enable HA

systemctl start pve-ha-crm

Success?

Thanks very much for your help!
 
disarming HA works like this:
- first, on all nodes, stop pve-ha-lrm service (and verify it's stopped)
- second, on all nodes, stop pve-ha-crm service (and verify it's stopped)

after you're done, do the inverse:
- first, on all nodes, start pve-ha-lrm
- second, on all nodes, start pve-ha-crm
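
The ordering matters more than the commands themselves, so here is a small sketch that only builds the sequence as a checklist; it does not execute anything. The node names come from this thread, and `disarm_commands` / `rearm_commands` are hypothetical helper names:

```python
def disarm_commands(nodes):
    """LRM is stopped on every node first, and only then CRM on every node."""
    steps = [(node, "systemctl stop pve-ha-lrm") for node in nodes]
    steps += [(node, "systemctl stop pve-ha-crm") for node in nodes]
    return steps

def rearm_commands(nodes):
    """Start LRM on every node first, then CRM, as described above."""
    steps = [(node, "systemctl start pve-ha-lrm") for node in nodes]
    steps += [(node, "systemctl start pve-ha-crm") for node in nodes]
    return steps

nodes = ["VMHost2", "VMHost4", "vmhost3"]
for node, command in disarm_commands(nodes):
    # Checking that each service really stopped is still a manual step.
    print(f"{node}: {command}")
```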

I think steps 6-8 are not necessary (I'd check after step 5, if all nodes say all links are connected and the IPs are correct, you can skip them). in step 2, it's probably enough to remove the ring0_addr entry from vmhost3 instead of all nodes, but it doesn't really make much of a difference I guess.
 
Fabian,

Thank you so much for your help, I was able to successfully re-add the correct corosync addresses and clean up the config file.

You were right, steps 6-8 were not necessary.

To summarize for anyone else needing help:

Step 1. Disarm HA (on all nodes: stop pve-ha-lrm first, then pve-ha-crm)
Code:
systemctl stop pve-ha-lrm
Code:
systemctl stop pve-ha-crm

Step 2. Remove incorrect corosync link and increase version numbers as per instructions

Make backup:
Code:
cp corosync.conf corosync.conf.bak

Prepare new corosync config file:
Code:
cp corosync.conf corosync.conf.new

Make the appropriate changes; in my case, remove the ring0 addresses from all nodes.

Increase config version number: config_version: 14 -> 15
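
Before copying the new file over, it can help to confirm that only the intended lines differ from the backup. A minimal sketch using Python's standard difflib; the two config snippets are shortened stand-ins for corosync.conf.bak and corosync.conf.new, and `changed_lines` is a hypothetical helper name:

```python
import difflib

def changed_lines(old_text, new_text):
    """Return only the removed ('-') and added ('+') lines between two texts."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
        )
    )
    # Skip the two file-header lines and the '@@' hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]

old = """\
node {
    name: vmhost3
    ring0_addr: 10.10.0.13
    ring1_addr: 10.10.1.13
}
totem {
  config_version: 14
}
"""
new = """\
node {
    name: vmhost3
    ring1_addr: 10.10.1.13
}
totem {
  config_version: 15
}
"""
for line in changed_lines(old, new):
    print(line)
```

For this step the expected result is the removed ring0_addr lines plus exactly one config_version change; anything else points to an accidental edit.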

Step 3. Copy new corosync config file over and check logs and status of connected links


Code:
cp corosync.conf.new corosync.conf
corosync-cfgtool -s


Step 4. Re-add corrected corosync link and increase version numbers as per instructions

Prepare new corosync config file (in my case, just edit the existing corosync.conf.new).

Add ring0 addresses on all nodes.

Increase config version number: config_version: 15 -> 16

Step 5. Copy new corosync config file over and check logs and status of connected links


Code:
cp corosync.conf.new corosync.conf
corosync-cfgtool -s

Step 6. Re-arm HA

Code:
systemctl start pve-ha-lrm
Code:
systemctl start pve-ha-crm

Success!
 