Cluster not working - /etc/pve read only

chrisalavoine

Well-Known Member
Sep 30, 2009
Hi,

We have 4 of our machines in a cluster all running:

pveversion -v
proxmox-ve-2.6.32: 3.4-157 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1


/etc/pve is read only.

I recently made the following additions to /etc/pve/cluster.conf to try and improve the stability of the cluster (we've had a few failures over the past week):

<totem token="54000"/>
<totem window_size="50"/>
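(Side note, in case it matters: as far as I understand the cluster.conf schema, it only takes a single totem element, so the two settings would normally be merged into one line rather than two, e.g.

<totem token="54000" window_size="50"/>

After bumping config_version, the staged file can be sanity-checked before activation with something along the lines of

ccs_config_validate -f /etc/pve/cluster.conf.new

where cluster.conf.new is the staged copy of the config; the exact filename and flags may differ on your setup.)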

Last Saturday we replaced all of the switches that these hosts are connected to with new Dell N3048 48-port models. We've also created some LAG port-channels (802.3ad) to improve throughput; this is on both the LAN and SAN networks, as follows:

ess-prox-001 = 1 NIC to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-002 = 1 NIC to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-011 = 2 NICs bonded 802.3ad to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-014 = 2 NICs bonded 802.3ad to LAN network, 4 NICs bonded 802.3ad to SAN network
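One thing I still need to double-check after the switch swap is IGMP snooping, both on the new Dells and on the Proxmox bridges themselves. On the host side I believe the bridge setting can be inspected, and disabled for a quick test, with something like:

cat /sys/class/net/vmbr0/bridge/multicast_snooping
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping

(vmbr0 is just the default bridge name as an example; substitute whichever bridge carries the cluster traffic.)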

I have tried "pvecm expected 1" on a host to try and make /etc/pve writable, but that hasn't worked this time. I'm a little stuck so any help much appreciated.
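Would the right next step be to bounce the cluster services on one of the nodes, something along these lines? This is only a sketch, I haven't run it yet, so please correct the order or service names if they're wrong:

/etc/init.d/pve-cluster stop
/etc/init.d/cman stop
/etc/init.d/cman start
/etc/init.d/pve-cluster start
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart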

Thanks.
 
Additional info. All syslogs on all hosts are showing the following:

Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58860
Jul 9 19:29:49 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58870
Jul 9 19:29:50 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58880
Jul 9 19:29:51 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58890
Jul 9 19:29:52 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58900


Thanks,
Chris.
 
Sounds almost like your switch may be blocking multicast... I don't have the links in front of me, but there are some threads about adjusting the switch for multicast groups...
 

I thought that something like that was happening also, but this cluster has been working on and off since Saturday, and I've done some multicast tests with ssmping and omping which have all been successful from all hosts, so I don't think it's multicast.
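For anyone wanting to reproduce, by an omping test I mean roughly this, run on all four nodes at the same time:

omping -c 600 -i 1 -q ess-prox-001 ess-prox-002 ess-prox-011 ess-prox-014

The count and interval are only indicative; the idea is to keep it running long enough (around ten minutes) to get past any IGMP snooping membership timeout on the switches.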

Thanks,
Chris.
 
My cluster is back up again after restarting one of the nodes. Not sure how long it will last, but we shall see.
 
And it's down again, in the same state as before. I did notice an MTU error on one of the SAN configs (missing mtu 9000), so I'd like to restart that node once the cluster is back up after a node reboot.
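Once it's rebooted, an easy way to verify should be something like

ip link show bond1

with bond1 standing in for whatever the SAN bond interface is actually called; it ought to report mtu 9000 once the config is right.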
 
I managed to get things working OK for a few days, and then it went out of sync again. There's no quorum; each node is fine if you browse to its web interface, but /etc/pve is read only and there's no migration etc. This is in the logs on all machines:

Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:58 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53230
Jul 15 16:04:59 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53240
Jul 15 16:05:00 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53250
Jul 15 16:05:01 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53260
Jul 15 16:05:02 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53270
Jul 15 16:05:02 ess-prox-011 dlm_controld[656484]: daemon cpg_leave error retrying
Jul 15 16:05:03 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53280
Jul 15 16:05:04 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53290
Jul 15 16:05:05 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53300
Jul 15 16:05:06 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53310
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53320
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9

I'm beginning to wonder if the fact that I have 2 nodes with 802.3ad bonded links and 2 nodes with just single links to the core is causing the problem. The 2 with bonded links will obviously have more throughput, so maybe that's what's causing the corosync issues. Does anyone else have any experience with this type of problem?
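If it helps anyone compare notes, the LACP state on the bonded nodes can be checked with something like

cat /proc/net/bonding/bond0

(bond0 being a placeholder name). That should show the 802.3ad aggregator info and whether all the slaves have actually joined the same aggregator.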

Thanks,
Chris.
 
I have a similar issue. I have a 2-node cluster that I can't put together. It was working fine until I upgraded to 3.0 a while ago. I get the same error messages you do, and my setup is LACP bonds to a pair of Netgear GS716Tv2 switches, which are also connected to each other via LACP. Multicast seems OK on the switches and I can omping with no issue. For the life of me I can't get this cluster assembled. If I recreate the cluster, everything looks good using the pvecm commands (status, nodes) and the cluster is quorate, but cman always says "waiting for quorum" and those errors appear when trying to add the 2nd node.

I'm not sure if this is LACP-related or not, but it sure is frustrating.
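One thing I'm going to try next, though treat it as a guess rather than a known fix, is the two-node special case in cluster.conf, which as I understand it looks something like:

<cman two_node="1" expected_votes="1"/>

so that cman can be quorate with a single vote while the second node is being added.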
 
