[SOLVED] unable to create cluster in multicast-hostile environment

gomu

New Member
Dec 2, 2020
Hello,

I'm hitting more or less the same issue as in Add Node 3 - Waiting for Quorum, except that it happens when adding the second node to the cluster.
First I followed the basic instructions (pvecm add firstNodeIPAddress) and got stuck here:
Code:
pvecm add X.X.X.X
Please enter superuser (root) password for 'X.X.X.X':
                                                         Password for root@X.X.X.X: ********************
Establishing API connection with host 'X.X.X.X'
The authenticity of host 'X.X.X.X' can't be established.
X509 SHA256 key fingerprint is 3C:4...01:38.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1607066482.sql.gz'
waiting for quorum...
Then I realized that not only are my two hosts not on the same /24 (class C) network, but also that my hosting provider (OVH) is known not to allow multicast traffic.

So I started by adding transport: udpu to my corosync config:
Code:
--- /etc/corosync/corosync.conf.orig    2020-11-30 21:19:18.291843119 +0100
+++ /etc/corosync/corosync.conf 2020-11-30 21:21:20.572697431 +0100
@@ -23,6 +23,7 @@
 }

 totem {
+  transport: udpu
   cluster_name: lcoovhclup002
   config_version: 2
   interface {

And it helped... only once. I managed to create the cluster, but I rolled back to redo it in a clean way.
Since then, all my attempts have failed, even when preparing corosync.conf accordingly BEFORE joining the cluster:
Code:
--- /etc/pve/corosync.conf      2020-12-03 21:53:29.000000000 +0100
+++ /etc/pve/corosync.conf.new  2020-12-03 21:55:58.000000000 +0100
@@ -18,13 +18,14 @@

 totem {
   cluster_name: lcoovhclup001
-  config_version: 1
+  config_version: 2
   interface {
-    bindnetaddr: X.X.X.X
+    bindnetaddr: 0.0.0.0
     ringnumber: 0
   }
   ip_version: ipv4
   secauth: on
+  transport: udpu
   version: 2
 }
Only then did I run pvecm add X.X.X.X.
I always end up stuck at waiting for quorum...

I stopped the firewall -> same thing (I took care to add both hosts to the management IPSet anyway). My pveversion -v output is below.
Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-30-pve)
pve-manager: 5.4-15 (running version: 5.4-15/d0ec33c6)
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-42
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-56
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

My whole corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: nsXXXX
    nodeid: 1
    quorum_votes: 1
    ring0_addr: X.X.X.X
  }
  node {
    name: nsYYYY1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: Y.Y.Y.Y
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: lcoovhclup001
  config_version: 5
  interface {
    bindnetaddr: 0.0.0.0
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}
 
Hello,

First of all, I suggest you upgrade to the latest version, Proxmox VE 6 [1], since Proxmox VE 5.x is already EOL [2].

To upgrade from Proxmox VE 5.x to Proxmox VE 6.3.x, follow our guide [3].

waiting for quorum...
I assume this is caused by a network issue [4]. However, if you need to stay on PVE 5, could you try testing with omping [5]? Also check the corosync logs with journalctl -u corosync; maybe you will get a hint :)



[1] https://forum.proxmox.com/threads/proxmox-ve-6-3-available.79687/
[2] https://pve.proxmox.com/wiki/FAQ
[3] https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0
[4] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
[5] https://pve.proxmox.com/wiki/Multicast_notes#Using_omping_to_test_multicast
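
For reference, the checks could look roughly like this (X.X.X.X and Y.Y.Y.Y stand in for your two node addresses; the omping command must be run on both nodes at the same time):
Code:
# unicast/multicast connectivity test between the two nodes (run simultaneously on both)
omping -c 10000 -i 0.001 -F -q X.X.X.X Y.Y.Y.Y

# follow the corosync log live while reproducing the join attempt
journalctl -u corosync -f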
 
Hello, and thanks for your answer!
Actually, I plan to upgrade in place right after the setup works well.
I double-checked [4] (again ;-)).
I also looked at [5] during my troubleshooting. It confirmed that multicast should be avoided in my setup, given the 100% loss metric on multicast packets. I ran it again for the record:
Code:
root@firstNode_X :~# omping -c 10000 -i 0.001 -F -q X.X.X.X Y.Y.Y.Y
Y.Y.Y.Y : waiting for response msg
Y.Y.Y.Y : waiting for response msg
Y.Y.Y.Y : waiting for response msg
Y.Y.Y.Y : joined (S,G) = (*, 232.43.211.234), pinging
Y.Y.Y.Y : waiting for response msg
Y.Y.Y.Y : server told us to stop

Y.Y.Y.Y :   unicast, xmt/rcv/%loss = 9099/9099/0%, min/avg/max/std-dev = 0.142/0.260/1.647/0.072
Y.Y.Y.Y : multicast, xmt/rcv/%loss = 9099/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

root@secondNode_Y :~# omping -c 10000 -i 0.001 -F -q X.X.X.X Y.Y.Y.Y
X.X.X.X : waiting for response msg
X.X.X.X : joined (S,G) = (*, 232.43.211.234), pinging
X.X.X.X : given amount of query messages was sent

X.X.X.X :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.151/0.268/1.848/0.071
X.X.X.X : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
I don't see anything weird there (although I wonder whether 0.260 and 0.268 are in ms or in s).
Also, I have no clue what the 232.43.211.234 host is.
A plain omping X.X.X.X Y.Y.Y.Y shows dist=1.
I also ran the same tests with the short and the long hostnames, with no difference.
 
Here are the corosync logs after a fresh restart; I don't see any hint (but I'm a beginner).
On the first node:
Code:
Dec 04 08:23:42 firstNodeX systemd[1]: Starting Corosync Cluster Engine...
Dec 04 08:23:42 firstNodeX corosync[4797]:  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [MAIN  ] Please migrate config file to nodelist.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [MAIN  ] Please migrate config file to nodelist.
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] The network interface [X.X.X.X] is now up.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] The network interface [X.X.X.X] is now up.
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [QB    ] server name: cmap
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [QB    ] server name: cfg
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [QB    ] server name: cpg
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [WD    ] Watchdog not enabled by configuration
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [WD    ] resource load_15min missing a recovery key.
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [WD    ] resource memory_used missing a recovery key.
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [WD    ] no resources configured.
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [QUORUM] Using quorum provider corosync_votequorum
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Dec 04 08:23:42 firstNodeX systemd[1]: Started Corosync Cluster Engine.
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [QB    ] server name: votequorum
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec 04 08:23:42 firstNodeX corosync[4797]: info    [QB    ] server name: quorum
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] adding new UDPU member {X.X.X.X}
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] adding new UDPU member {Y.Y.Y.Y}
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [TOTEM ] A new membership (X.X.X.X:16) was formed. Members joined: 1
Dec 04 08:23:42 firstNodeX corosync[4797]: warning [CPG   ] downlist left_list: 0 received
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [QUORUM] Members[1]: 1
Dec 04 08:23:42 firstNodeX corosync[4797]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QB    ] server name: cmap
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QB    ] server name: cfg
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QB    ] server name: cpg
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [WD    ] Watchdog not enabled by configuration
Dec 04 08:23:42 firstNodeX corosync[4797]:  [WD    ] resource load_15min missing a recovery key.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [WD    ] resource memory_used missing a recovery key.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [WD    ] no resources configured.
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QUORUM] Using quorum provider corosync_votequorum
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QB    ] server name: votequorum
Dec 04 08:23:42 firstNodeX corosync[4797]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QB    ] server name: quorum
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] adding new UDPU member {X.X.X.X}
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] adding new UDPU member {Y.Y.Y.Y}
Dec 04 08:23:42 firstNodeX corosync[4797]:  [TOTEM ] A new membership (X.X.X.X:16) was formed. Members joined: 1
Dec 04 08:23:42 firstNodeX corosync[4797]:  [CPG   ] downlist left_list: 0 received
Dec 04 08:23:42 firstNodeX corosync[4797]:  [QUORUM] Members[1]: 1
Dec 04 08:23:42 firstNodeX corosync[4797]:  [MAIN  ] Completed service synchronization, ready to provide service.

On the second node:
Code:
Dec 04 08:28:48 ns324801 corosync[6032]:  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Dec 04 08:28:48 ns324801 corosync[6032]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Dec 04 08:28:48 ns324801 corosync[6032]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Dec 04 08:28:48 ns324801 corosync[6032]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Dec 04 08:28:48 ns324801 corosync[6032]: warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Dec 04 08:28:48 ns324801 corosync[6032]: warning [MAIN  ] Please migrate config file to nodelist.
Dec 04 08:28:48 ns324801 corosync[6032]:  [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Dec 04 08:28:48 ns324801 corosync[6032]:  [MAIN  ] Please migrate config file to nodelist.
Dec 04 08:28:48 ns324801 corosync[6032]: notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Dec 04 08:28:48 secondNodeX corosync[6032]:  [TOTEM ] Initializing transport (UDP/IP Unicast).
Dec 04 08:28:48 secondNodeX corosync[6032]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [TOTEM ] The network interface [Y.Y.Y.Y] is now up.
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Dec 04 08:28:48 secondNodeX corosync[6032]:  [TOTEM ] The network interface [Y.Y.Y.Y] is now up.
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [QB    ] server name: cmap
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [QB    ] server name: cfg
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [QB    ] server name: cpg
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Dec 04 08:28:48 secondNodeX corosync[6032]: warning [WD    ] Watchdog not enabled by configuration
Dec 04 08:28:48 secondNodeX corosync[6032]: warning [WD    ] resource load_15min missing a recovery key.
Dec 04 08:28:48 secondNodeX corosync[6032]: warning [WD    ] resource memory_used missing a recovery key.
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [WD    ] no resources configured.
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [QUORUM] Using quorum provider corosync_votequorum
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Dec 04 08:28:48 secondNodeX systemd[1]: Started Corosync Cluster Engine.
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [QB    ] server name: votequorum
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec 04 08:28:48 secondNodeX corosync[6032]: info    [QB    ] server name: quorum
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [TOTEM ] adding new UDPU member {X.X.X.X}
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [TOTEM ] adding new UDPU member {Y.Y.Y.Y}
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:868) was formed. Members joined: 2
Dec 04 08:28:48 secondNodeX corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 08:28:48 secondNodeX corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 08:28:48 secondNodeX corosync[6032]:  [QB    ] server name: cmap
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Dec 04 08:28:48 secondNodeX corosync[6032]:  [QB    ] server name: cfg
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Dec 04 08:28:48 secondNodeX corosync[6032]:  [QB    ] server name: cpg
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Dec 04 08:28:48 secondNodeX corosync[6032]:  [WD    ] Watchdog not enabled by configuration
Dec 04 08:28:48 secondNodeX corosync[6032]:  [WD    ] resource load_15min missing a recovery key.
Dec 04 08:28:48 secondNodeX corosync[6032]:  [WD    ] resource memory_used missing a recovery key.
Dec 04 08:28:48 secondNodeX corosync[6032]:  [WD    ] no resources configured.
Dec 04 08:28:48 secondNodeX corosync[6032]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec 04 08:28:48 secondNodeY corosync[6032]:  [QUORUM] Using quorum provider corosync_votequorum
Dec 04 08:28:48 secondNodeY corosync[6032]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec 04 08:28:48 secondNodeY corosync[6032]:  [QB    ] server name: votequorum
Dec 04 08:28:48 secondNodeY corosync[6032]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec 04 08:28:48 secondNodeY corosync[6032]:  [QB    ] server name: quorum
Dec 04 08:28:48 secondNodeY corosync[6032]:  [TOTEM ] adding new UDPU member {X.X.X.X}
Dec 04 08:28:48 secondNodeY corosync[6032]:  [TOTEM ] adding new UDPU member {Y.Y.Y.Y}
Dec 04 08:28:48 secondNodeY corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:868) was formed. Members joined: 2
Dec 04 08:28:48 secondNodeY corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 08:28:48 secondNodeY corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 08:28:48 secondNodeY corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
 
The second node then loops on this continuously:
Code:
Dec 04 12:26:55 ns324801 corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40772) was formed. Members
Dec 04 12:26:55 ns324801 corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40772) was formed. Members
Dec 04 12:26:55 ns324801 corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:26:55 ns324801 corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:26:55 ns324801 corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:55 ns324801 corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:26:55 ns324801 corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:26:55 ns324801 corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:57 ns324801 corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40776) was formed. Members
Dec 04 12:26:57 ns324801 corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40776) was formed. Members
Dec 04 12:26:57 ns324801 corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:26:57 ns324801 corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:26:57 ns324801 corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:57 ns324801 corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:26:57 ns324801 corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:26:57 ns324801 corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:58 ns324801 corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40780) was formed. Members
Dec 04 12:26:58 ns324801 corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40780) was formed. Members
Dec 04 12:26:58 ns324801 corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:26:58 secondNodeY corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:26:58 secondNodeY corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:58 secondNodeY corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:26:58 secondNodeY corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:26:58 secondNodeY corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:59 secondNodeY corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40784) was formed. Members
Dec 04 12:26:59 secondNodeY corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40784) was formed. Members
Dec 04 12:26:59 secondNodeY corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:26:59 secondNodeY corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:26:59 secondNodeY corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:26:59 secondNodeY corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:26:59 secondNodeY corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:26:59 secondNodeY corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:27:01 secondNodeY corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40788) was formed. Members
Dec 04 12:27:01 secondNodeY corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40788) was formed. Members
Dec 04 12:27:01 secondNodeY corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:27:01 secondNodeY corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:27:01 secondNodeY corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:27:01 secondNodeY corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:27:01 secondNodeY corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:27:01 secondNodeY corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:27:02 secondNodeY corosync[6032]: notice  [TOTEM ] A new membership (Y.Y.Y.Y:40792) was formed. Members
Dec 04 12:27:02 secondNodeY corosync[6032]: warning [CPG   ] downlist left_list: 0 received
Dec 04 12:27:02 secondNodeY corosync[6032]: notice  [QUORUM] Members[1]: 2
Dec 04 12:27:02 secondNodeY corosync[6032]:  [TOTEM ] A new membership (Y.Y.Y.Y:40792) was formed. Members
Dec 04 12:27:02 secondNodeY corosync[6032]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 04 12:27:02 secondNodeY corosync[6032]:  [CPG   ] downlist left_list: 0 received
Dec 04 12:27:02 secondNodeY corosync[6032]:  [QUORUM] Members[1]: 2
Dec 04 12:27:02 secondNodeY corosync[6032]:  [MAIN  ] Completed service synchronization, ready to provide service.
 
Hi, thanks for the outputs :)


-- Have you checked your network configuration and `/etc/hosts` files?
-- Are your nodes all on the same PVE version?
-- Can you check whether you can connect via UDP ports 5404 and 5405? (A rough sketch follows below.)
-- Check the time on all nodes using the timedatectl command.
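
A rough sketch of the UDP check with netcat (assuming the netcat variant on your nodes accepts these flags, and that corosync is stopped on the listening node so the port is free; shown for 5405, repeat for 5404):
Code:
# on the receiving node: free the port, then listen on UDP 5405
systemctl stop corosync
nc -l -v -p 5405 -u

# on the other node: send a test string to the receiver's address (Y.Y.Y.Y here)
echo "test 5405" | nc -u Y.Y.Y.Y 5405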
 
Hi, thanks for the outputs :)
No, thank YOU for your support ;)
-- Have you checked your network configuration and `/etc/hosts` files?
Yep, they seem fine.
-- Are your nodes all on the same PVE version?
Yes, the pveversion -v output is exactly the same. I actually applied all available updates from the pve-no-subscription channel on both servers.
-- Can you check whether you can connect via UDP ports 5404 and 5405?
How do I do that? I will try with nc -u.

-- Check the time on all nodes using the timedatectl command.
Code:
root@firstNodeX:~# timedatectl
      Local time: Fri 2020-12-04 14:28:48 CET
  Universal time: Fri 2020-12-04 13:28:48 UTC
        RTC time: Fri 2020-12-04 13:28:48
       Time zone: Europe/Paris (CET, +0100)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no
Code:
root@secondNodeY:~# timedatectl
      Local time: Fri 2020-12-04 14:28:48 CET
  Universal time: Fri 2020-12-04 13:28:48 UTC
        RTC time: Fri 2020-12-04 13:28:48
       Time zone: Europe/Paris (CET, +0100)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no
 
Hello, nc -l -v -p 5405 -v -u wasn't working at first:

Code:
root@ns324801:~# nc -l -v -p 5405 -v -u
retrying local 0.0.0.0:5405 : Address already in use

So I had to stop corosync first, and then it worked:

Code:
root@secondNodeY:~# systemctl stop corosync
root@secondNodeY:~# nc -l -v -p 5405 -v -u
listening on [any] 5405 ...
connect to [Y.Y.Y.Y] from firstNodeX.fqdn [X.X.X.X] 37402
typed from the other end
^C sent 0, rcvd 25
root@secondNodeY:~#
Meanwhile, I was doing this on the other node:
Code:
root@firstNodeX:~# echo typed from the other end | nc -u secondNodeY 5405
^C
root@firstNodeX:~#

However, it worked in both directions and on both ports 5404 and 5405... as long as the firewall was stopped! I must have done something wrong with it.
With the firewall disabled, the nodes do connect and the cluster is healthy :D... as long as I leave the firewall disabled.
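
For completeness, the firewall can also be toggled from the CLI for this kind of test; a minimal sketch with the standard pve-firewall subcommands (adjust to your own workflow):
Code:
# temporarily stop the Proxmox firewall while testing
pve-firewall stop

# confirm it is really inactive
pve-firewall status

# turn it back on once done
pve-firewall start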

I thought that adding the IP addresses of my nodes to the management IPSet would be enough (remember, the two nodes are not on the same IP network, so they cannot talk to each other "natively"...):
Code:
root@secondNodeY:~# cat /etc/pve/firewall/cluster.fw
[OPTIONS]

enable: 0

[IPSET management]

A.B.C.D # my home IP address for management
E.F.G.H # a public host in case my home IP changes...
!Y.Y.Y.0/24 # ovh local network for secondNodeY
Y.Y.Y.Y # secondNodeY
!X.X.X.0/24 # ovh local network for firstNodeX
X.X.X.X # firstNodeX

And it does enable inter-node communication for SSH and port 8006... any idea?
BTW, I added the whole /24 network of each node as "nomatch" so that no other OVH customer can reach my nodes ;)
 
Okay, I succeeded in joining the cluster with the following firewall config:
Code:
[OPTIONS]

enable: 1

[IPSET management]

A.B.C.D # my home IP address for management
E.F.G.H # a public host in case my home IP changes...
!Y.Y.Y.0/24 # ovh local network for secondNodeY
Y.Y.Y.Y # secondNodeY
!X.X.X.0/24 # ovh local network for firstNodeX
X.X.X.X # firstNodeX

[RULES]

IN ACCEPT -i vmbr0 -source +management -p udp -dport 5404:5405 -log nolog # Corosync channel
IN ACCEPT -i vmbr0 -p tcp -dport 80 -log nolog # challenges Let's Encrypt

First rule "Corosync channel" was the missing one.
Second rule allows Let's Encrypt certificates to be renewed flawlessly.

Thanks for your help ! As soon as I reproduce in production I will mark the thread as solved.
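
For anyone following along, a quick sanity check to confirm the join actually worked should be something like this (standard PVE/corosync commands):
Code:
# cluster membership and quorum as seen by PVE (expect both nodes listed and "Quorate: Yes")
pvecm status
pvecm nodes

# corosync's own view of votequorum
corosync-quorumtool -s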
 
Success!
Just one more issue: I had to restart corosync after patching /etc/pve/corosync.conf with transport, config_version and bindnetaddr as shown in my first post. BTW, I think the bottom of https://pve.proxmox.com/wiki/Multicast_notes#Troubleshooting needs an update:
  • Carefully read the entire corosync.conf(5) and votequorum(5) manpages.
  • create the cluster as usual
  • if needed, bring the initial node into quorate state with "pvecm e 1"
  • if needed, edit /etc/pve/corosync.conf (remember to increase the version number!); it will later be copied automatically by one of the PVE services to the local /etc/corosync/corosync.conf on each node.
  • in the totem{} stanza, add "transport: udpu"
  • pre-add the nodes to the nodelist{} stanza.
  • on each node: systemctl restart corosync (if this command does not work, use killall -9 corosync)
  • then, on each node: /etc/init.d/pve-cluster restart
  • add the nodes
What I mean is: there is no need to pre-add the nodes nor to restart pve-cluster if you add transport: udpu before adding the nodes; they will inherit the unicast option nicely!
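
In other words, the whole unicast setup boils down to something like this (a sketch of the sequence that worked for me; X.X.X.X is the first node's address, and config_version must be bumped to the current value plus one):
Code:
# on the FIRST node: edit /etc/pve/corosync.conf, add "transport: udpu" inside totem { }
# and increase config_version, then restart corosync so the change is applied
systemctl restart corosync

# on the node to be ADDED: join using the first node's address
pvecm add X.X.X.X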
 
