[SOLVED] Node leaves the cluster

lc63

Jun 8, 2018
Hello,

I have a cluster whose nodes run Proxmox VE 5.3-1, and I am trying to add a new node running 5.4-1.

The new node joins the cluster for a few minutes only, then leaves it for an unknown reason.
On the other nodes, 'pvecm nodes' does not show the new node, but it still appears in /etc/pve/corosync.conf.

Here is 'pveversion -v' from the new node:

proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-27
pve-cluster: 5.0-37
pve-container: 2.0-38
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-21
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

And from one of the existing 5.3 nodes:
root@ns3047:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Any ideas?
 
Most commonly this is due to a multicast issue in the network - please run both omping commands (they need to be run on all nodes in parallel) described in:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
and provide the output
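
For reference, the two tests from that section of the guide look roughly like this (run them on all nodes at the same time; the node names are placeholders):
Bash:
    # short burst test - should report no multicast loss
    omping -c 10000 -i 0.001 -F -q node1 node2 node3
    # longer test (~10 minutes) - catches IGMP querier/snooping timeouts
    omping -c 600 -i 1 -q node1 node2 node3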

if omping shows that multicast works - check the journal for entries from corosync and pmxcfs (pve-cluster)
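
For example (assuming the default unit names on PVE 5.x):
Bash:
    # corosync membership changes and pmxcfs (pve-cluster) messages since boot
    journalctl -b -u corosync -u pve-cluster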

Hope this helps!
 
(I preferred to use ssmping, because I didn't want to install git, gcc, make, etc. on the node.)
ssmping shows correct multicast responses between the nodes.
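
(For anyone else trying this, a rough sketch of the ssmping test I ran - the package comes straight from the Debian repos, and the IP is a placeholder:)
Bash:
    # on one node: start the responder
    ssmpingd
    # on another node: probe it - it reports unicast and multicast replies side by side
    ssmping <ip-of-first-node>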

In fact, as soon as I started ssmping, the node came back into the cluster. I don't know why, but it seems stable.

Thanks for the clue!
 
hm? omping can easily be installed via apt-get - no need to build it from source.

if it stays in the cluster this is indeed odd - but as said in the documentation, check your IGMP snooping and multicast querier settings
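
(On the Proxmox host itself - assuming the cluster traffic runs over vmbr0 - the bridge's current settings can be read from sysfs:)
Bash:
    # 1 = enabled, 0 = disabled
    cat /sys/class/net/vmbr0/bridge/multicast_snooping
    cat /sys/class/net/vmbr0/bridge/multicast_querier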
 
You're right, omping has a deb package, I didn't see that.

IGMP snooping is handled at switch level by my hosting provider (OVH); I have no control over that. Maybe a bug at OVH?
 
Maybe a bug at OVH
Sadly I don't have first-hand experience with OVH - but from what I've read (here and elsewhere) you need a vRack in order to use multicast with them...

You can also try to use unicast transport - but it's probably best to ask OVH's support!
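
(For reference, on corosync 2.x unicast is enabled by adding a transport line to the totem section of /etc/pve/corosync.conf and bumping config_version - a rough sketch only, all values below are placeholders for your existing ones:)
Code:
    totem {
      version: 2
      # keep your existing cluster name
      cluster_name: mycluster
      # must be incremented on every change
      config_version: 5
      # switch from multicast to UDP unicast
      transport: udpu
      interface {
        ringnumber: 0
        # keep your existing bindnetaddr
        bindnetaddr: 10.0.0.0
      }
    }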
 
Right, the cluster is in a vRack.
It seems it was the initialization of multicast that posed a problem. I'll ask OVH whether their multicast is unstable.

Thank you for your responses!
 
You're welcome!

Please report back what the solution was (since we have quite a few users at OVH who are running into issues like that)!
Thanks!
 
As I mentioned, the node came back into the cluster as soon as I started ssmping, as if an initial exchange of multicast packets had been necessary.
I don't know the reason, but it is now stable.
 

Did you resolve this? I have a single node, but with an IPv4 NAT + IPv6 routed configuration, and the latter leads to the same problem. I tried to add
Bash:
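    # assumption: these lines go in the vmbr0 stanza of /etc/network/interfaces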
    post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

in order to disable IGMP snooping on my side, and this is the behavior: every time I first need to ping6 #IpV6:my:proxy:neighbor from outside, then the container's connection starts to work and can reach the internet for 30-60 minutes. Then it goes down again.
 