PVE Cluster fails every time

ThinkPrivacy

Member
Sep 22, 2016
I have been pulling my hair out today and have reinstalled Proxmox on 3 different servers about 4 times each so far, because not only does clustering not work, I can't seem to revert back either.

I have 3 servers (nodes) running Proxmox 4.2-23

All 3 servers are on an RPN network as well as having public IP addresses.

On server 1 I type:

Code:
# pvecm create tpc1
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/urandom.
Writing corosync key to /etc/corosync/authkey.
#

On server 2 & 3 I type:

Code:
# pvecm add 10.91.150.134
The authenticity of host '10.91.150.134 (10.91.150.134)' can't be established.
ECDSA key fingerprint is ########################################.
Are you sure you want to continue connecting (yes/no)? yes
root@10.91.150.134's password:
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
waiting for quorum...

10.91.150.134 is the RPN IP of the first server where I created the cluster.

And there the system hangs. I can no longer access the web interface for servers 2 and 3 and have to reinstall Proxmox.

Multicast is enabled on the RPN:

Code:
# ifconfig eth1 | grep MTU
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
#
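
As far as I can tell, the MULTICAST flag above only shows that the interface itself is multicast-capable, not that the RPN actually forwards multicast between the nodes. A more direct end-to-end test (assuming omping is installed on all three nodes, and substituting the real RPN IPs of the other two nodes for the placeholders) is to run something like this on every node at the same time:

Code:
# apt-get install omping
# omping -c 600 -i 1 -q 10.91.150.134 <node2-rpn-ip> <node3-rpn-ip>

If multicast really works, each node should report (near) zero multicast loss; heavy multicast loss while unicast is fine usually points at IGMP snooping/querier problems on the switch.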

Output from systemctl status corosync.service:
Code:
# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: failed (Result: exit-code) since Thu 2016-09-22 19:38:07 CEST; 19min ago
  Process: 11271 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Sep 22 19:37:06 pmn2 corosync[11280]: [QB    ] server name: cmap
Sep 22 19:37:06 pmn2 corosync[11280]: [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 22 19:37:06 pmn2 corosync[11280]: [QB    ] server name: cfg
Sep 22 19:37:06 pmn2 corosync[11280]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 22 19:37:06 pmn2 corosync[11280]: [QB    ] server name: cpg
Sep 22 19:37:06 pmn2 corosync[11280]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 22 19:37:06 pmn2 corosync[11280]: [QUORUM] Using quorum provider corosync_votequorum
Sep 22 19:37:06 pmn2 corosync[11280]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Sep 22 19:37:06 pmn2 corosync[11280]: [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Sep 22 19:38:07 pmn2 corosync[11271]: Starting Corosync Cluster Engine (corosync): [FAILED]
Sep 22 19:38:07 pmn2 systemd[1]: corosync.service: control process exited, code=exited status=1
Sep 22 19:38:07 pmn2 systemd[1]: Failed to start Corosync Cluster Engine.
Sep 22 19:38:07 pmn2 systemd[1]: Unit corosync.service entered failed state.
#
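
If I read that right, corosync is complaining that the config it was handed has no nodelist at all. For comparison, on a node that has joined successfully, /etc/corosync/corosync.conf normally ends up with something roughly like this (the node names and the second RPN IP here are just placeholders based on this thread):

Code:
nodelist {
  node {
    name: pmn1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.91.150.134
  }
  node {
    name: pmn2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: <node2-rpn-ip>
  }
}

quorum {
  provider: corosync_votequorum
}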

Output from journalctl -xn:

Code:
# journalctl -xn
-- Logs begin at Thu 2016-09-22 15:57:08 CEST, end at Thu 2016-09-22 19:59:29 CEST. --
Sep 22 19:59:17 pmn2 pmxcfs[11262]: [dcdb] crit: cpg_initialize failed: 2
Sep 22 19:59:17 pmn2 pmxcfs[11262]: [status] crit: cpg_initialize failed: 2
Sep 22 19:59:23 pmn2 pmxcfs[11262]: [quorum] crit: quorum_initialize failed: 2
Sep 22 19:59:23 pmn2 pmxcfs[11262]: [confdb] crit: cmap_initialize failed: 2
Sep 22 19:59:23 pmn2 pmxcfs[11262]: [dcdb] crit: cpg_initialize failed: 2
Sep 22 19:59:23 pmn2 pmxcfs[11262]: [status] crit: cpg_initialize failed: 2
Sep 22 19:59:29 pmn2 pmxcfs[11262]: [quorum] crit: quorum_initialize failed: 2
Sep 22 19:59:29 pmn2 pmxcfs[11262]: [confdb] crit: cmap_initialize failed: 2
Sep 22 19:59:29 pmn2 pmxcfs[11262]: [dcdb] crit: cpg_initialize failed: 2
Sep 22 19:59:29 pmn2 pmxcfs[11262]: [status] crit: cpg_initialize failed: 2
#

There are no containers or virtual machines running on any of the nodes; they are fresh installations, and the only changes I have made are installing sudo, adding my user to the sudo group, and configuring eth1 for the RPN on each server.

It is driving me insane and costing me huge amounts of time. Does anyone know how I can fix this, or at least get the second and third nodes working again after pvecm add fails, so I don't have to waste so much time reinstalling every single time?

Thanks in advance.
 
Bump. Same issue. It's pretty deadly that briefly not having multicast working results in a total shutdown of the web interface and requires a reinstall.

I don't know for certain that multicast was blocked, but I modified something that may have allowed it. I now see multicast traffic from the master on the secondary:

Code:
00:30:16.437846 IP xx.xx.xxx.248.5404 > 239.192.103.27.5405: UDP, length 136

This is seen on both nodes.

(I also did http://pve.proxmox.com/wiki/Multicast_notes#Linux:_Enabling_Multicast_querier_on_bridges on the master/first node, as I think the switch does not support it by default.)
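
For reference, the knob that wiki step toggles (assuming the cluster traffic goes over vmbr0) is the bridge's querier flag in sysfs, something like:

Code:
# enable an IGMP querier on the bridge (runtime only, lost on reboot)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier

# to keep it across reboots, add a matching post-up line to the vmbr0
# stanza in /etc/network/interfaces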

I Ctrl-C'd the stuck "waiting for quorum..." and tried rerunning the pvecm add, but I get:

Code:
# pvecm add x.x.x.248
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
authentication key already exists

Of course I can't create /etc/pve/priv, it's mounted as a clustered dir. So I'm rebooting to see if that helps...

Rebooting helped somewhat: I can see the second node in the interface on the first now, and it has a green check. However, clicking on its disks tries to populate the summary/status window, and the spinner spins until it says "communication failure".

Is there a way to remove the entire cluster config and start again, in case multicast wasn't working when we started?
 
I figured I was going to reinstall both nodes anyway, so I started messing around with the /etc/pve/priv auth keys and known_hosts files, as well as removing corosync.conf on the node and re-creating with pvecm. After a few reboots, manually pruning out authkeys, and using pvecm add -f to force the add, I got various errors like "transport endpoint not connected" referring to the priv dir (a shared fs mounted at that dir, of course), but now things seem to be working. I can't tell you what I did in the end to get here. :/
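
If I had to reconstruct a cleaner version of what I was flailing at, it would look something like the following. Treat it as a sketch only (I have not re-run it as a single sequence, and you should back up /etc/pve and /etc/corosync first); the idea is to force /etc/pve writable in local mode and wipe the half-applied cluster config so pvecm can be run again:

Code:
systemctl stop pve-cluster corosync
pmxcfs -l                   # start pmxcfs in local mode so /etc/pve becomes writable
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster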

My question now is 'what test can I perform to prove it's working' for sure? (I suppose installing a CT from the other node and then migrating it might be a good test.) [EDIT: And that worked. So I think things are good.]

I wish I could tell everyone how I backed out of a failed multicast/querier pvecm create/setup and got a working cluster, but I can't. I can tell you it's possible, though, once you ensure multicast and the querier are working.

On the cluster setup wiki page, you should underline the multicast requirements: people should NOT ASSUME their networks support it, even in an open/unfirewalled configuration.

Without a querier, things will not work. Worse, once you run the cluster setup with multicast not working, you seem to paint yourself into a corner with no way back to sanity, effectively breaking all your nodes.
 
Lol, and I just forgot about this and did it to myself again. Did anyone figure out the way to back out and restart?

Though the multicast querier is ON on the master/first node, so I'm confused about why it times out waiting for quorum.

Would having the querier on both nodes cause the issue? It seemed to be on; turning one off didn't help.

I see packets coming out of the one node:

Code:
19:49:49.117635 IP 162.x.x.y.5404 > 239.192.95.45.5405: UDP, length 119

but I see no such packets on the wire at the other node. I assume this is from some IGMP problem.

This includes seeing absolutely no IGMP packets on vmbr0 (or vmbr1 or 2, some other private networks I have CTs and VMs on) from the master node where I created the cluster originally.
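
In case it helps anyone reproduce the check, this is roughly what I am running on each node (vmbr0 being the bridge that carries the cluster traffic here):

Code:
# watch for IGMP membership reports / queries on the bridge
tcpdump -n -i vmbr0 igmp

# and for the corosync multicast traffic itself
tcpdump -n -i vmbr0 udp port 5404 or udp port 5405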

Status on that node says:

Code:
Version: 6.2.0
Config Version: 2
Cluster Name: hc-nplus
Cluster Id: 24525
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: charon
Node ID: 1
Multicast addresses: 239.192.95.45
Node addresses: 127.0.0.1

Status on the other node is largely the same, except the node address is a real IP, the node ID is 2, and Quorum is "2 Activity blocked".

On the node trying to join, where I do see IGMP, I can see from ip maddr that it has added itself to the mcast group:

Code:
vmbr0: inet 239.192.95.45

But no such group is listed on the master/cluster-creation node.
 
Bump...

With things stuck in this situation, even the web console won't come up; the server is basically functionally wedged for PVE operations.
 
I just ran into the exact same situation as you, with version 5.1 (proxmox-ve_5.1-3.iso).
I assumed multicast was on and broke the second node.

I'm going to try to revert to a working configuration, as the installation in my case requires network changes (using VLANs; the vlan package is, btw, NOT included in the default install)... so a simple install takes some hours.
I'll write feedback here.
 
Good find! Check your /etc/hosts file!

The servers I'm using need an advanced network configuration (as stated on https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster#Re-installing_a_cluster_node , in the "Re-install the node" section, the first install of Proxmox should not use LACP in order to have network access). I'm also using a VLAN for the management network, and the vlan Debian package is not installed by default.
So the network configuration is at first temporary, and that is why I had an erroneous /etc/hosts file on my 2nd node.
Don't ask me why, but the 1st node had this updated correctly...
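
For anyone else hitting this: the important bit is that each node's own hostname resolves to its real cluster-facing IP, not to a loopback or a stale temporary address. Roughly like this for my node2 (node1's 10.10.10.101 is taken from the output below; node2's address here is a placeholder):

Code:
# /etc/hosts on node2
127.0.0.1       localhost
10.10.10.101    node1
10.10.10.102    node2 pvelocalhost    # pvelocalhost is the alias the installer normally adds

# quick sanity check:
getent hosts node2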

I also (like a lot of other people, apparently) did my first "pvecm add node1" from node2 with multicast not working (or was the /etc/hosts file the culprit already?):
Code:
pvecm add node1
The authenticity of host 'node1 (10.10.10.101)' can't be established.
ECDSA key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Are you sure you want to continue connecting (yes/no)? yes
root@node1's password:
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
waiting for quorum..

I'm listing here the consequences of having a bad /etc/hosts file on the 2nd node (even if it makes sense once you understand how it works, for a beginner it does not):
  • On node2:
    • service corosync status <=> failed <=> /etc/pve/corosync.conf is read only
      corosync[2138]: error [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
    • pvecm add node1
      can't create shared ssh key database '/etc/pve/priv/authorized_keys'
      detected the following error(s):
      * authentication key '/etc/corosync/authkey' already exists
      * cluster config '/etc/pve/corosync.conf' already exists
    • pvecm add node1 --force
      can't create shared ssh key database '/etc/pve/priv/authorized_keys'
      cluster not ready - no quorum?
      unable to add node: command failed (ssh node1 -o BatchMode=yes pvecm addnode node2 --force 1)
    • pvecm nodes
      Cannot initialize CMAP service
    • pvecm status
      Cannot initialize CMAP service
As soon as I corrected node2:/etc/hosts, I was able to get a working cluster by simply restarting the corosync service:

Code:
service corosync restart
service corosync status
   => active
corosync[23674]:  [QUORUM] Members[1]: 2
corosync[23674]:  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[23674]: notice  [TOTEM ] A new membership (10.10.10.101:12) was formed. Members joined: 1
corosync[23674]:  [TOTEM ] A new membership (10.10.10.101:12) was formed. Members joined: 1
corosync[23674]: notice  [QUORUM] This node is within the primary component and will provide service.
corosync[23674]: notice  [QUORUM] Members[2]: 1 2
corosync[23674]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
corosync[23674]:  [QUORUM] This node is within the primary component and will provide service.
corosync[23674]:  [QUORUM] Members[2]: 1 2
corosync[23674]:  [MAIN  ] Completed service synchronization, ready to provide service.
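
To double-check afterwards, pvecm status on either node should now report the cluster as quorate with both node IDs listed (the exact layout varies a little between versions):

Code:
pvecm status        # look for "Quorate: Yes" and two entries in the membership list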



Hope this helps
 
