Proxmox v4 Cluster - HP C3000 - Flex-10 Network - G7 blades

mjoconr

Hi All

I have a 4-blade cluster I'm trying to set up, and I cannot get past the first step after install: creating the cluster.

hosts are:
blade1 - 10.5.1.201
blade3 - 10.5.1.203
blade5 - 10.5.1.205
blade7 - 10.5.1.207

I've added these as host records in the hosts file. I'm able to ssh between the systems, they all have access to the internet, and NTP is set up to sync the time. I have purchased and installed the enterprise licenses and done the upgrades.


I ran the create cluster command
pvecm create Jazmin-Adl


I let that finish, then ran
pvecm add blade7
This took many minutes to finish.


The problem is that, from the CLI, the two nodes have never shown up together; each node only lists itself.
Code:
 root@blade7:~# pvecm nodes
 
 
 Membership information
 ----------------------
     Nodeid      Votes Name
          1          1 blade7 (local)
 
 
 root@blade5:~#  pvecm nodes
 
 
 Membership information
 ----------------------
     Nodeid      Votes Name
          2          1 blade5 (local)
 root@blade5:~#


There is no quorum
Code:
 root@blade7:~# pvecm status
 Quorum information
 ------------------
 Date:             Fri Nov 13 12:06:33 2015
 Quorum provider:  corosync_votequorum
 Nodes:            1
 Node ID:          0x00000001
 Ring ID:          1580
 Quorate:          No
 
 
 Votequorum information
 ----------------------
 Expected votes:   2
 Highest expected: 2
 Total votes:      1
 Quorum:           2 Activity blocked
 Flags:
 
 
 Membership information
 ----------------------
     Nodeid      Votes Name
 0x00000001          1 10.1.5.207 (local)
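
For reference, the "Quorum: 2" figure above is consistent with a simple majority rule, floor(expected votes / 2) + 1. The sketch below is only an illustration of that arithmetic; `quorum_votes` is a made-up helper name, not a Proxmox or corosync command. With two expected votes, both nodes must be present, so a single visible node is blocked.

```shell
# Votes needed for quorum under a simple majority rule:
# floor(expected_votes / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Show the threshold for a few cluster sizes.
for expected in 2 3 4; do
    echo "expected=$expected quorum=$(quorum_votes "$expected")"
done
```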


Attempting to add another node just gives
Code:
 root@blade1:~# pvecm add blade7
 root@blade7's password:
 unable to copy ssh ID


I have no idea what to try next. tcpdump does show traffic between the nodes. The logs below show some sort of issue.

Code:
Nov 13 11:57:08 blade5 corosync[1171]:  [MAIN  ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Nov 13 11:57:08 blade5 corosync[1171]:  [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] The network interface [10.1.5.205] is now up.
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Nov 13 11:57:08 blade5 corosync[1172]:  [QB    ] server name: cmap
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Nov 13 11:57:08 blade5 corosync[1172]:  [QB    ] server name: cfg
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Nov 13 11:57:08 blade5 corosync[1172]:  [QB    ] server name: cpg
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Nov 13 11:57:08 blade5 corosync[1172]:  [QUORUM] Using quorum provider corosync_votequorum
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Nov 13 11:57:08 blade5 corosync[1172]:  [QB    ] server name: votequorum
Nov 13 11:57:08 blade5 corosync[1172]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Nov 13 11:57:08 blade5 corosync[1172]:  [QB    ] server name: quorum
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] A new membership (10.1.5.205:1572) was formed. Members joined: 2
Nov 13 11:57:08 blade5 corosync[1172]:  [QUORUM] Members[1]: 2
Nov 13 11:57:08 blade5 corosync[1172]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] Digest does not match
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] Received message has invalid digest... ignoring.
Nov 13 11:57:08 blade5 corosync[1172]:  [TOTEM ] Invalid packet data
Nov 13 11:57:09 blade5 corosync[1165]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Nov 13 11:57:09 blade5 corosync[1172]:  [TOTEM ] Digest does not match
Nov 13 11:57:09 blade5 corosync[1172]:  [TOTEM ] Received message has invalid digest... ignoring.
Nov 13 11:57:09 blade5 corosync[1172]:  [TOTEM ] Invalid packet data
Nov 13 11:57:09 blade5 corosync[1172]:  [TOTEM ] Digest does not match
....
Nov 13 12:02:35 blade5 corosync[1172]:  [TOTEM ] A new membership (10.1.5.205:1576) was formed. Members joined: 1
Nov 13 12:02:35 blade5 corosync[1172]:  [QUORUM] This node is within the primary component and will provide service.
Nov 13 12:02:35 blade5 corosync[1172]:  [QUORUM] Members[2]: 2 1
Nov 13 12:02:35 blade5 corosync[1172]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 13 12:02:35 blade5 corosync[1172]:  [TOTEM ] Digest does not match
Nov 13 12:02:35 blade5 corosync[1172]:  [TOTEM ] Received message has invalid digest... ignoring.
....
Nov 13 12:02:46 blade5 corosync[1172]:  [TOTEM ] Digest does not match
Nov 13 12:02:46 blade5 corosync[1172]:  [TOTEM ] Received message has invalid digest... ignoring.
Nov 13 12:02:46 blade5 corosync[1172]:  [TOTEM ] Invalid packet data
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e
Nov 13 12:03:07 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e
....
Nov 13 12:03:08 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e 2f
Nov 13 12:03:08 blade5 corosync[1172]:  [TOTEM ] Retransmit List: 2c 2d 2e 2f
Nov 13 12:03:09 blade5 corosync[1172]:  [TOTEM ] A processor failed, forming new configuration.
Nov 13 12:03:10 blade5 corosync[1172]:  [TOTEM ] A new membership (10.1.5.205:1580) was formed. Members left: 1
Nov 13 12:03:10 blade5 corosync[1172]:  [TOTEM ] Failed to receive the leave message. failed: 1
Nov 13 12:03:10 blade5 corosync[1172]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 13 12:03:10 blade5 corosync[1172]:  [QUORUM] Members[1]: 2
Nov 13 12:03:10 blade5 corosync[1172]:  [MAIN  ] Completed service synchronization, ready to provide service.

Code:
 Nov 13 11:31:00 blade7 corosync[1300]:  [MAIN  ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
 Nov 13 11:31:00 blade7 corosync[1300]:  [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] The network interface [10.1.5.207] is now up.
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QB    ] server name: cmap
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync configuration service [1]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QB    ] server name: cfg
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QB    ] server name: cpg
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QUORUM] Using quorum provider corosync_votequorum
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QB    ] server name: votequorum
 Nov 13 11:31:00 blade7 corosync[1301]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
 Nov 13 11:31:00 blade7 corosync[1301]:  [QB    ] server name: quorum
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] A new membership (10.1.5.207:436) was formed. Members joined: 1
 Nov 13 11:31:00 blade7 corosync[1301]:  [QUORUM] Members[1]: 1
 Nov 13 11:31:00 blade7 corosync[1301]:  [MAIN  ] Completed service synchronization, ready to provide service.
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:00 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:01 blade7 corosync[1294]: Starting Corosync Cluster Engine (corosync): [  OK  ]
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:01 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:02 blade7 corosync[1301]:  [TOTEM ] A new membership (10.1.5.207:440) was formed. Members
 ….
 Nov 13 11:31:09 blade7 corosync[1301]:  [MAIN  ] Completed service synchronization, ready to provide service.
 Nov 13 11:31:09 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:09 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:09 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] A new membership (10.1.5.207:464) was formed. Members
 Nov 13 11:31:10 blade7 corosync[1301]:  [QUORUM] Members[1]: 1
 Nov 13 11:31:10 blade7 corosync[1301]:  [MAIN  ] Completed service synchronization, ready to provide service.
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:10 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:31:11 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:31:12 blade7 corosync[1301]:  [TOTEM ] A new membership (10.1.5.207:468) was formed. Members
 Nov 13 11:31:12 blade7 corosync[1301]:  [QUORUM] Members[1]: 1
 Nov 13 11:31:12 blade7 corosync[1301]:  [MAIN  ] Completed service synchronization, ready to provide service.
 ….
 Nov 13 11:32:55 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:55 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:55 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:56 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:57 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:58 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:32:58 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:32:58 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:32:58 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 ….
 Nov 13 11:33:06 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:33:06 blade7 corosync[1301]:  [TOTEM ] Digest does not match
 Nov 13 11:33:06 blade7 corosync[1301]:  [TOTEM ] Received message has invalid digest... ignoring.
 Nov 13 11:33:06 blade7 corosync[1301]:  [TOTEM ] Invalid packet data
 Nov 13 11:33:08 blade7 corosync[1301]:  [TOTEM ] FAILED TO RECEIVE
 Nov 13 11:33:09 blade7 corosync[1301]:  [TOTEM ] A new membership (10.1.5.207:1580) was formed. Members left: 2
 Nov 13 11:33:09 blade7 corosync[1301]:  [TOTEM ] Failed to receive the leave message. failed: 2
 Nov 13 11:33:09 blade7 corosync[1301]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
 Nov 13 11:33:09 blade7 corosync[1301]:  [QUORUM] Members[1]: 1
 Nov 13 11:33:09 blade7 corosync[1301]:  [MAIN  ] Completed service synchronization, ready to provide service.
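
"Digest does not match" means corosync received a packet it could not verify with its own key. One possible cause, and this is an assumption rather than a diagnosis of the logs above, is that the nodes hold different copies of the corosync auth key, for example after several cluster create attempts. A rough way to rule that out is to compare checksums. `same_digest` is a hypothetical helper, and the `/etc/corosync/authkey` path in the usage comment is where the key is expected to live on Proxmox VE 4.

```shell
# Succeeds when every file passed as an argument has the same SHA-256 digest.
same_digest() {
    [ "$(sha256sum "$@" | awk '{print $1}' | sort -u | wc -l)" -eq 1 ]
}

# Hypothetical usage, after copying each node's key to one machine:
#   scp root@blade5:/etc/corosync/authkey authkey.blade5
#   scp root@blade7:/etc/corosync/authkey authkey.blade7
#   same_digest authkey.blade5 authkey.blade7 && echo "keys match" || echo "keys differ"
```

If the keys match, the digest errors are more likely coming from something else on the same multicast address, or from the network itself.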
 
As I indicated I'm able to ssh between the machines.

The first machine added did not have an issue, only the second. I tried it between blade1 and blade3, then created a new cluster between blade7 and blade5, rebuilt blade1, and had the same problem. I really think the issue is related to the corosync errors.


Code:
Quorum:           2 Activity blocked

I have tried as you indicated and the following is the result.

Code:
root@blade1:/etc/pve/priv# ssh-copy-id blade7
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 2 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@blade7's password:
cat: write error: Permission denied

I tried from a totally different machine and got the same result
Code:
mike@factory:~$ ssh-copy-id root@10.1.5.207
root@10.1.5.207's password:
cat: write error: Permission denied
mike@factory:~$

Code:
root@blade7:/etc/pve/priv# ls -lah
total 2.5K
dr-x------ 2 root www-data    0 Nov 12 15:02 .
drwxr-xr-x 2 root www-data    0 Jan  1  1970 ..
-r-------- 1 root www-data 1.7K Nov 12 15:02 authkey.key
-r-------- 1 root www-data  787 Nov 12 16:33 authorized_keys
-r-------- 1 root www-data 1.8K Nov 12 17:28 known_hosts
-r-------- 1 root www-data 1.7K Nov 12 15:02 pve-root-ca.key
-r-------- 1 root www-data    3 Nov 12 17:28 pve-root-ca.srl

It seems to me that the problem is the read-only status of the pve directory. That's the one being synced by corosync, right?
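
As far as I understand it, /etc/pve is the Proxmox cluster filesystem, which is distributed over corosync and refuses writes while the node has no quorum. On Proxmox VE, root's authorized_keys normally points into /etc/pve/priv, which would fit the ssh-copy-id failure above. A small sketch to confirm whether the directory accepts writes at all; `can_write_dir` is a made-up helper name, and it briefly creates and removes a probe file.

```shell
# Succeeds when a file can actually be created in the given directory.
# A real write is attempted because the permission bits alone may not
# reflect a filesystem that is refusing writes.
can_write_dir() {
    probe="$1/.write-probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Hypothetical usage on a node:
#   can_write_dir /etc/pve && echo "writable" || echo "read-only (no quorum?)"
```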

I think I need to fix the quorum problem. The wiki is out of date for v4, so I have been unable to switch corosync to broadcast or udp.
 
So, using some notes from http://www.dvbyell.org/doc/multicast.html, I installed xorp on my local server and was able to get an IGMP querier running.

Using omping I was able to confirm multicast was working. After a reboot I get the following.

Code:
root@blade5:~#  pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         2          1 blade5 (local)
         1          1 blade7
root@blade5:~#  pvecm status
Quorum information
------------------
Date:             Fri Nov 13 18:02:33 2015
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1660
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.1.5.205 (local)
0x00000001          1 10.1.5.207
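
The omping check mentioned above could look roughly like this. The sketch only prints the command line (a dry run), so it can be pasted on each node. The node names are the ones from this thread, the count and interval are arbitrary choices, and `omping_cmd` is a made-up helper. The same omping command has to be started on all listed nodes at about the same time for the test to mean anything.

```shell
# Print the omping command line for the given nodes (dry run only).
# Options: -c packet count, -i interval in seconds, -q summary output only.
omping_cmd() {
    echo "omping -c 600 -i 1 -q $*"
}

omping_cmd blade5 blade7
# prints: omping -c 600 -i 1 -q blade5 blade7
```

Running for about ten minutes, rather than a few seconds, gives a switch with IGMP snooping but no querier time to drop the group, which is the failure mode a short test can miss.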

I was able to get three systems into the cluster, but made the mistake of having my current directory be /etc/pve when I ran the add on the fourth. The process did not finish on that one.

How do I clean it all up, both in the cluster and on the machine I messed up?
I would also like to know how to change to udp or broadcast, as I'm not sure that I'll be able to get a multicast setup going in the DC when I install for real.
 
Have you looked at
https://pve.proxmox.com/wiki/Multicast_notes
At the bottom it shows how to use unicast instead. Not sure if that is still valid for Proxmox 4 though. (I used it on my personal OVH nodes with Proxmox VE 3 a long while back.)
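
For corosync 2.x (the logs in this thread show 2.3.5), unicast is selected with "transport: udpu" in the totem section of corosync.conf, and it needs a nodelist with the node addresses. Whether the wiki steps carry over to Proxmox 4 unchanged is not confirmed here, so treat the following as a sketch. It only prints a modified copy of a config file and never touches the live one. `add_udpu_transport` and the file names under /root are hypothetical.

```shell
# Print a copy of a corosync.conf with "transport: udpu" added right after
# the line that opens the totem section. The input file is left untouched.
add_udpu_transport() {
    awk '{ print } /^totem[ \t]*[{]/ { print "  transport: udpu" }' "$1"
}

# Hypothetical usage: work on a copy, review it, and bump config_version
# by hand before putting anything back.
#   cp /etc/pve/corosync.conf /root/corosync.conf.new
#   add_udpu_transport /root/corosync.conf.new > /root/corosync.conf.udpu
```

Unicast sends a separate copy to every node, so it is usually suggested only for small clusters.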


Regarding "undoing": if you do not need any of the data or cluster configs, the cleanest way (probably the fastest too) is to do a complete reinstall of the nodes from an image from http://proxmox.com/en/downloads.
 
