Corosync IPv6 requires node ID be specified

Dan Schaper

Member
May 22, 2018
I'm trying to set up a cluster of Proxmox 5.2-1 hosts over IPv6. My understanding is that IPv6 relies on multicast by design, so it should be a good fit for getting quorum established. I'm setting up via the CLI because the assisted setup in the web interface fails with a UPID error; I'm guessing the UPID handling isn't quite ready for bare IPv6 addresses, as the string just grows too long.
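For context, the CLI steps I'm using are roughly the following (cluster name is mine, the IPv6 address is redacted the same way as in the config further down):

Code:
# on the first node (jean-luc): create the cluster
pvecm create Pi-hole

# on the joining node (benjamin): join using the first node's IPv6 address
pvecm add 2a01:xxxx:xxxx:xxxx::2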

The "parent" Proxmox shows that there are two nodes in the cluster on the web interface. But there is a failure on the joining node.

Edit: the man page for corosync.conf states that the totem section needs a nodeid when IPv6 is used. I could try editing that file by hand, though it is set sticky/read-only and I'd like to avoid breaking anything.
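If I'm reading corosync.conf(5) right, what it wants is roughly this inside the totem section (the value here is just an example for this node):

Code:
totem {
  # per corosync.conf(5), nodeid is optional for IPv4 but required for IPv6
  nodeid: 2
}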

corosync.service error on joining node:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2018-05-27 02:30:25 CEST; 9s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 26051 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
 Main PID: 26051 (code=exited, status=8)
      CPU: 29ms

May 27 02:30:25 benjamin systemd[1]: Starting Corosync Cluster Engine...
May 27 02:30:25 benjamin corosync[26051]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
May 27 02:30:25 benjamin corosync[26051]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide servic
May 27 02:30:25 benjamin corosync[26051]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd ups
May 27 02:30:25 benjamin corosync[26051]: error   [MAIN  ] parse error in config: An IPV6 network requires that a node ID be specified.
May 27 02:30:25 benjamin corosync[26051]: error   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1308.
May 27 02:30:25 benjamin systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
May 27 02:30:25 benjamin systemd[1]: Failed to start Corosync Cluster Engine.
May 27 02:30:25 benjamin systemd[1]: corosync.service: Unit entered failed state.
May 27 02:30:25 benjamin systemd[1]: corosync.service: Failed with result 'exit-code'.

corosync config on joining node:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: benjamin
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 2a01:xxxx:xxxx:xxxx::2
  }
  node {
    name: jean-luc
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 2a01:xxxx:xxxx:xxxx::2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Pi-hole
  config_version: 2
  interface {
    bindnetaddr: 2a01:xxxx:xxxx:xxxx::2
    ringnumber: 0
  }
  ip_version: ipv6
  secauth: on
  version: 2
}
 
Verified that multicast routing is configured at least semi-correctly and that communication is happening; I can set up a ULA segment and Open vSwitch over GRE to verify further. The nodes do join, but after being listed in each other's membership they do not show as reachable. I'm not sure whether this is a partial success during the join that then breaks once corosync exits on the invalid configuration file. Results below (the rough invocation I used follows them):

Code:
2a01:xxxx:xxxx:329e::2 :   unicast, xmt/rcv/%loss = 12/12/0%, min/avg/max/std-dev = 0.383/0.412/0.425/0.015
2a01:xxxx:xxxx:329e::2 : multicast, xmt/rcv/%loss = 12/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

2a01:yyyy:yyyy:13a2::2 :   unicast, xmt/rcv/%loss = 11/11/0%, min/avg/max/std-dev = 0.325/0.423/0.459/0.035
2a01:yyyy:yyyy:13a2::2 : multicast, xmt/rcv/%loss = 11/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
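For reference, that is omping output; I ran it on both nodes at roughly the same time, along the lines of:

Code:
# run simultaneously on both nodes, listing both ring addresses
omping -c 12 -i 1 2a01:xxxx:xxxx:329e::2 2a01:yyyy:yyyy:13a2::2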
 
Sorry for the frequent posts; I'd edit new information into the earlier ones, but I'm not sure what the forum standards are. Not trying to bump my posts.

I've modified the corosync.conf file that is shared among the members: added `transport: udpu` to switch to unicast while testing, and added `nodeid: 1` to the totem section. This fixes the initial error that kept the joining member from starting, but a new error has appeared. I can keep adjusting these settings by hand, but I'd rather this were handled for the user during setup. A rough sketch of the changed totem section is below, followed by the new failure.
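Roughly, the shared totem section now looks like this (the config_version bump is assumed; everything else mirrors what I posted above):

Code:
totem {
  cluster_name: Pi-hole
  config_version: 3
  interface {
    bindnetaddr: 2a01:xxxx:xxxx:xxxx::2
    ringnumber: 0
  }
  ip_version: ipv6
  # added: nodeid is required for IPv6 per corosync.conf(5)
  nodeid: 1
  secauth: on
  # added: unicast UDP transport while testing
  transport: udpu
  version: 2
}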

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2018-05-27 18:17:48 CEST; 1min 56s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 17210 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 17210 (code=exited, status=20)
      CPU: 44ms

May 27 18:17:48 benjamin corosync[17210]: info    [WD    ] no resources configured.
May 27 18:17:48 benjamin corosync[17210]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
May 27 18:17:48 benjamin corosync[17210]: notice  [QUORUM] Using quorum provider corosync_votequorum
May 27 18:17:48 benjamin corosync[17210]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
May 27 18:17:48 benjamin corosync[17210]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
May 27 18:17:48 benjamin corosync[17210]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
May 27 18:17:48 benjamin systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
May 27 18:17:48 benjamin systemd[1]: Failed to start Corosync Cluster Engine.
May 27 18:17:48 benjamin systemd[1]: corosync.service: Unit entered failed state.
May 27 18:17:48 benjamin systemd[1]: corosync.service: Failed with result 'exit-code'.
 
Code:
2a01:xxxx:xxxx:329e::2 : unicast, xmt/rcv/%loss = 12/12/0%, min/avg/max/std-dev = 0.383/0.412/0.425/0.015
2a01:xxxx:xxxx:329e::2 : multicast, xmt/rcv/%loss = 12/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

2a01:yyyy:yyyy:13a2::2 : unicast, xmt/rcv/%loss = 11/11/0%, min/avg/max/std-dev = 0.325/0.423/0.459/0.035
2a01:yyyy:yyyy:13a2::2 : multicast, xmt/rcv/%loss = 11/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
that output shows that multicast is 100% not working
 
Yes, that's why I changed the transport to unicast. These are two servers about 0.5 ms apart, but for redundancy they sit on different network segments; I didn't want everything in the same domain, so that the whole cluster wouldn't fail if a switch or a router needed to be serviced. The web setup does need to add the nodeid to the totem configuration when IPv6 is used. The last issue, the quorum failure, came down to the server names not being configured the way the documentation describes; my error on that. I'll set up an authoritative DNS server for the internal network, configure the cluster to use DNS names instead of raw IPv6 addresses on the ring, point bindnetaddr at the IP address instead, and see if that works. A rough sketch of what I have in mind is below.
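Something like this nodelist is the plan once the internal DNS is up (the hostnames here are just placeholders):

Code:
nodelist {
  node {
    name: benjamin
    nodeid: 2
    quorum_votes: 1
    # placeholder name; will resolve via the internal authoritative DNS
    ring0_addr: benjamin.internal.example
  }
  node {
    name: jean-luc
    nodeid: 1
    quorum_votes: 1
    ring0_addr: jean-luc.internal.example
  }
}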
 
