[SOLVED] Node leaves the cluster

lc63

Jun 8, 2018
Hello,

I have a cluster whose nodes run Proxmox VE 5.3-1, and I am trying to add a new node running 5.4-1.

The new node joins the cluster for a few minutes only, then leaves it for an unknown reason.
On the other nodes, 'pvecm nodes' does not show the new node, but it still appears in /etc/pve/corosync.conf.

Here is 'pveversion -v' from the new node:

proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-27
pve-cluster: 5.0-37
pve-container: 2.0-38
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-21
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

And from one of the existing 5.3 nodes:
root@ns3047:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Any ideas?
 
Most commonly this is due to a multicast issue in the network - please run both omping commands (they need to be run on all nodes in parallel) described in:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
and provide the output
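
For reference, the two tests from that section of the guide look roughly like this (run them on all nodes at the same time; the node names are placeholders):
Bash:
    # short burst test - should report no multicast loss
    omping -c 10000 -i 0.001 -F -q node1 node2 node3
    # longer test (~10 minutes) - catches IGMP querier/snooping timeouts
    omping -c 600 -i 1 -q node1 node2 node3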

if omping shows that multicast works - check the journal for entries from corosync and pmxcfs (pve-cluster)
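
For example (assuming the default unit names on PVE 5.x):
Bash:
    # corosync membership changes and pmxcfs (pve-cluster) messages since boot
    journalctl -b -u corosync -u pve-cluster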

Hope this helps!
 
(I preferred to use ssmping, because I didn't want to install git, gcc, make, etc. on the node.)
ssmping shows correct multicast responses between the nodes.
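
(For anyone else trying this, a rough sketch of the ssmping test I ran - the package comes straight from the Debian repos, and the IP is a placeholder:)
Bash:
    # on one node: start the responder
    ssmpingd
    # on another node: probe it - it reports unicast and multicast replies side by side
    ssmping <ip-of-first-node>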

In fact, as soon as I started ssmping, the node came back into the cluster. I don't know why, but it seems stable.

Thanks for the clue!
 
hm? omping can easily be installed via apt-get - no need to build it from source.

if it stays in the cluster this is indeed odd - but as said in the documentation, check your IGMP snooping and multicast querier settings
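
(On the Proxmox host itself - assuming the cluster traffic runs over vmbr0 - the bridge's current settings can be read from sysfs:)
Bash:
    # 1 = enabled, 0 = disabled
    cat /sys/class/net/vmbr0/bridge/multicast_snooping
    cat /sys/class/net/vmbr0/bridge/multicast_querier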
 
You're right, omping has a deb package, I didn't see that.

IGMP snooping is handled at switch level by my hosting provider (OVH); I have no control over that. Maybe a bug at OVH?
 
Maybe a bug at OVH
Sadly I don't have first-hand experience with OVH - but from what I've read (here and elsewhere) you need a vRack in order to use multicast with them...

You can also try to use unicast transport - but it's probably best to ask OVH's support!
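
(For reference, on corosync 2.x unicast is enabled by adding a transport line to the totem section of /etc/pve/corosync.conf and bumping config_version - a rough sketch only, all values below are placeholders for your existing ones:)
Code:
    totem {
      version: 2
      # keep your existing cluster name
      cluster_name: mycluster
      # must be incremented on every change
      config_version: 5
      # switch from multicast to UDP unicast
      transport: udpu
      interface {
        ringnumber: 0
        # keep your existing bindnetaddr
        bindnetaddr: 10.0.0.0
      }
    }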
 
Right, the cluster is in a vRack.
It seems it was the initialization of multicast that posed a problem. I'll ask OVH whether their multicast is unstable.

Thank you for your responses!
 
You're welcome!

Please report back what the solution was (since we have quite a few users at OVH who are running into issues like that)!
Thanks!
 
As I mentioned, the node came back into the cluster as soon as I started ssmping, as if an initial exchange of multicast packets had been necessary.
I don't know the reason, but it is now stable.
 

Did you resolve this? I have a single node, but with an IPv4 NAT + IPv6 routed configuration, and the latter leads to the same problem. I tried to add
Bash:
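    # assumption: these lines go in the vmbr0 stanza of /etc/network/interfaces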
    post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

in order to disable IGMP snooping on my side, and this is the behavior: every time I first need to ping6 #IpV6:my:proxy:neighbor from outside, then the container's connection starts to work and can reach the internet for 30-60 minutes. Then it goes down again.
 