Cluster not working - /etc/pve read only

chrisalavoine

Well-Known Member
Sep 30, 2009
Hi,

We have 4 of our machines in a cluster all running:

pveversion -v
proxmox-ve-2.6.32: 3.4-157 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-2.6.32-37-pve: 2.6.32-150
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1


/etc/pve is read only.

I recently made the following additions to /etc/pve/cluster.conf to try and improve the stability of the cluster (we've had a few failures over the past week):

<totem token="54000"/>
<totem window_size="50"/>
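(Side note, in case it matters: as far as I understand the cluster.conf schema, it only takes a single totem element, so the two settings would normally be merged into one line rather than two, e.g.

<totem token="54000" window_size="50"/>

After bumping config_version, the staged file can be sanity-checked before activation with something along the lines of

ccs_config_validate -f /etc/pve/cluster.conf.new

where cluster.conf.new is the staged copy of the config; the exact filename and flags may differ on your setup.)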

Last Saturday we replaced all of the switches that these hosts are connected to with new Dell N3048 48-port models. We've also created some LAG port-channels (802.3ad) to improve throughput; this is on both the LAN and SAN networks, as follows:

ess-prox-001 = 1 NIC to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-002 = 1 NIC to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-011 = 2 NICs bonded 802.3ad to LAN network, 2 NICs bonded 802.3ad to SAN network
ess-prox-014 = 2 NICs bonded 802.3ad to LAN network, 4 NICs bonded 802.3ad to SAN network
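One thing I still need to double-check after the switch swap is IGMP snooping, both on the new Dells and on the Proxmox bridges themselves. On the host side I believe the bridge setting can be inspected, and disabled for a quick test, with something like:

cat /sys/class/net/vmbr0/bridge/multicast_snooping
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping

(vmbr0 is just the default bridge name as an example; substitute whichever bridge carries the cluster traffic.)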

I have tried "pvecm expected 1" on a host to try and make /etc/pve writable, but that hasn't worked this time. I'm a little stuck so any help much appreciated.
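Would the right next step be to bounce the cluster services on one of the nodes, something along these lines? This is only a sketch, I haven't run it yet, so please correct the order or service names if they're wrong:

/etc/init.d/pve-cluster stop
/etc/init.d/cman stop
/etc/init.d/cman start
/etc/init.d/pve-cluster start
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart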

Thanks.
 
Additional info. All syslogs on all hosts are showing the following:

Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [status] crit: cpg_send_message failed: 9
Jul 9 19:29:48 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58860
Jul 9 19:29:49 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58870
Jul 9 19:29:50 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58880
Jul 9 19:29:51 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58890
Jul 9 19:29:52 ess-prox-001 pmxcfs[749147]: [dcdb] notice: cpg_join retry 58900


Thanks,
Chris.
 
Sounds almost like your switch may be blocking multicast... I don't have the links in front of me, but there are some threads about adjusting the switch for multicast groups...
 

I thought that something like that was happening also, but this cluster has been working on and off since Saturday, and I've done some multicast tests with ssmping and omping which have all been successful from all hosts, so I don't think it's multicast.
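For anyone wanting to reproduce, by an omping test I mean roughly this, run on all four nodes at the same time:

omping -c 600 -i 1 -q ess-prox-001 ess-prox-002 ess-prox-011 ess-prox-014

The count and interval are only indicative; the idea is to keep it running long enough (around ten minutes) to get past any IGMP snooping membership timeout on the switches.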

Thanks,
Chris.
 
My cluster is back up again after restarting one of the nodes. Not sure how long it will last, but we shall see.
 
And it's down again, in the same state as before. I did notice an MTU error on one of the SAN configs (missing mtu 9000), so I'd like to restart that node once the cluster is back up after a node reboot.
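Once it's rebooted, an easy way to verify should be something like

ip link show bond1

with bond1 standing in for whatever the SAN bond interface is actually called; it ought to report mtu 9000 once the config is right.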
 
I managed to get things working OK for a few days, and then it went out of sync again. There's no quorum; each node is fine if you browse to its web interface, but /etc/pve is read only and there's no migration etc. This is in the logs on all machines:

Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:57 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:04:58 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53230
Jul 15 16:04:59 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53240
Jul 15 16:05:00 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53250
Jul 15 16:05:01 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53260
Jul 15 16:05:02 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53270
Jul 15 16:05:02 ess-prox-011 dlm_controld[656484]: daemon cpg_leave error retrying
Jul 15 16:05:03 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53280
Jul 15 16:05:04 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53290
Jul 15 16:05:05 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53300
Jul 15 16:05:06 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53310
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [dcdb] notice: cpg_join retry 53320
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9
Jul 15 16:05:07 ess-prox-011 pmxcfs[813394]: [status] crit: cpg_send_message failed: 9

I'm beginning to wonder if the fact that I have 2 nodes with 802.3ad bonded links and 2 nodes with just single links to the core is causing the problem. The 2 with bonded links will obviously have more throughput, so maybe that's what's causing the corosync issues. Does anyone else have any experience with this type of problem?
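If it helps anyone compare notes, the LACP state on the bonded nodes can be checked with something like

cat /proc/net/bonding/bond0

(bond0 being a placeholder name). That should show the 802.3ad aggregator info and whether all the slaves have actually joined the same aggregator.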

Thanks,
Chris.
 
I have a similar issue. I have a 2-node cluster that I can't put together. It was working fine until I upgraded to 3.0 a while ago. I get the same error messages you do, and my setup is LACP bonds to a pair of Netgear GS716Tv2 switches, which are also connected to each other via LACP. Multicast seems OK on the switches and I can omping with no issue. For the life of me I can't get this cluster assembled. If I recreate the cluster, everything looks good using the pvecm commands (status, nodes) and the cluster is quorate, but cman always says "waiting for quorum" and those errors appear when trying to add the 2nd node.

I'm not sure if this is LACP-related or not, but it sure is frustrating.
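One thing I'm going to try next, though treat it as a guess rather than a known fix, is the two-node special case in cluster.conf, which as I understand it looks something like:

<cman two_node="1" expected_votes="1"/>

so that cman can be quorate with a single vote while the second node is being added.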
 
