[SOLVED] Node leaves the cluster

lc63

New Member
Jun 8, 2018
Hello,

I have a cluster whose nodes run version 5.3-1.
I am trying to add a new node running version 5.4-1.

The new node joins for only a few minutes, then leaves the cluster for an unknown reason.
On the other nodes, 'pvecm nodes' does not show the new node, but it does appear in /etc/pve/corosync.conf.
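
For reference, this is roughly how I compared the two views (nothing special, just the standard commands):

Bash:
    # corosync's current view of the cluster membership
    pvecm nodes
    # the node list as stored in the cluster configuration
    cat /etc/pve/corosync.conf

pveversion -v on the new node: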

proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-27
pve-cluster: 5.0-37
pve-container: 2.0-38
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-21
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

And pveversion -v on an existing node:
root@ns3047:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Any ideas?
 
Most commonly this is due to a multicast issue in the network. Please run both omping commands (they need to be run on all nodes in parallel) described in:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
and provide the output.
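
For convenience, these are the two tests from that section of the guide (replace node1 node2 node3 with your actual node names, and start each command on all nodes at the same time):

Bash:
    # short burst test - expect (close to) 0% multicast loss
    omping -c 10000 -i 0.001 -F -q node1 node2 node3
    # ~10 minute test - checks that multicast still works after the IGMP snooping timeout
    omping -c 600 -i 1 -q node1 node2 node3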

If omping shows that multicast works, check the journal for entries from corosync and pmxcfs (pve-cluster).
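
Something like this should show the relevant messages (assuming the systemd journal; adjust the time range to when the node dropped out):

Bash:
    # corosync membership messages and pmxcfs (pve-cluster) log entries
    journalctl -u corosync -u pve-cluster --since "1 hour ago"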

Hope this helps!
 
(I preferred to use ssmping, because I didn't want to install git, gcc, make, etc. on the node.)
ssmping shows correct multicast responses between the nodes.
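
In case it helps someone else, this is roughly how the test is run with the Debian ssmping package (the address is a placeholder):

Bash:
    # on one node: start the reflector
    ssmpingd
    # on another node: ping it - multicast and unicast replies are reported separately
    ssmping <address-of-first-node>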

In fact, as soon as I started ssmping, the node came back into the cluster. I don't know why, but it seems stable.

Thanks for the clue!
 
Hm? omping can easily be installed via apt-get - no need to build it from source.

If it stays in the cluster, that is indeed odd - but, as said in the documentation, check your IGMP snooping and multicast querier settings.
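
For reference, installing omping and checking the bridge-side settings on the hosts looks like this (assuming the cluster traffic runs over a bridge called vmbr0):

Bash:
    # omping is packaged in the Debian/Proxmox repositories
    apt-get install omping
    # IGMP snooping / querier state of the bridge (1 = enabled, 0 = disabled)
    cat /sys/class/net/vmbr0/bridge/multicast_snooping
    cat /sys/class/net/vmbr0/bridge/multicast_querier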
 
You're right, omping has a deb package; I didn't see that.

IGMP snooping is handled at the switch level by my hosting provider (OVH), so I have no control over it. Maybe a bug at OVH?
 
Maybe a bug at OVH
Sadly I don't have first-hand experience with OVH - but from what I've read (here and elsewhere) you need a vRack in order to use multicast with them...

You can also try to use unicast transport - but it's probably best to ask OVH's support!
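
A rough sketch of what unicast transport looks like in the totem section of /etc/pve/corosync.conf - cluster_name, config_version and bindnetaddr below are placeholders for your existing values, the only new line is transport: udpu, and config_version must be increased whenever the file is edited:

Code:
    totem {
      cluster_name: mycluster
      config_version: 5
      interface {
        bindnetaddr: 10.0.0.0
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: on
      transport: udpu
      version: 2
    }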
 
Right, the cluster is in a vRack.
It seems it was the initialization of multicast that caused the problem. I'll ask OVH whether their multicast is unstable.

Thank you for your responses!
 
You're welcome!

Please report back what the solution was (since we have quite a few users at OVH who run into issues like that)!
Thanks!
 
As I mentioned, the node came back into the cluster as soon as I started ssmping, as if a first exchange of multicast packets had been necessary.
I don't know the reason, but it is now stable.
 

Did you resolve this? I have a single node, but with an IPv4 NAT + routed IPv6 configuration, and the latter leads to the same problem. I tried to add
Bash:
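    # (in /etc/network/interfaces, on the vmbr0 stanza) enable a multicast querier
    # and disable IGMP snooping on the bridge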
    post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

in order to disable IGMP snooping on my side, and this is the behavior: I need to ping6 #IpV6:my:proxy:neighbor from outside every time; then the container's connection starts to work and it can reach the internet for 30-60 minutes. Then it goes down again.
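
Purely as a guess at a stopgap (it does not fix the underlying snooping/NDP issue): since the manual ping6 temporarily restores connectivity for you, it could be automated on the host, e.g. via cron - <ipv6-neighbor> stands for the address you currently ping by hand:

Bash:
    # /etc/cron.d/keep-ndp-alive - re-ping the IPv6 neighbor every 15 minutes
    */15 * * * * root ping6 -c 3 <ipv6-neighbor> >/dev/null 2>&1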
 
