The next OVH disaster has happened!

proxtest

Active Member
Mar 19, 2014
Yesterday my OVH vRack went offline! (The network design there is very stupid, too!)

"This equipment is no longer responding. We are investigating." <- If this happens, it will hurt you very much!

Ceph and Proxmox stopped working and 2 VMs crashed. After the vRack came back,
Ceph did the resync and everything looked good at first! (I was very happy not to lose any data (replica 3).)
But then Proxmox started hurting me badly! The cluster ran out of control and node1 was fencing the two other nodes even though Ceph was online! The rgmanager instances had problems connecting to each other. I had to restart all 3 nodes! HA was not working, and I was not able to start, stop or migrate VMs manually! :-(
rgmanager didn't stop on the nodes (I waited 5 minutes), so I had to killall the rgmanager tasks; then the nodes rebooted.

Now all 3 nodes are online, but on node3 I see this in boot.log:

Fri Jan 30 18:31:43 2015: Starting cluster:
Fri Jan 30 18:31:43 2015: Checking if cluster has been disabled at boot... [ OK ]
Fri Jan 30 18:31:43 2015: Checking Network Manager... [ OK ]
Fri Jan 30 18:31:43 2015: Global setup... [ OK ]
Fri Jan 30 18:31:43 2015: Loading kernel modules... [ OK ]
Fri Jan 30 18:31:43 2015: Mounting configfs... [ OK ]
Fri Jan 30 18:31:43 2015: Starting cman... [ OK ]
Fri Jan 30 18:31:52 2015: Waiting for quorum... Timed-out waiting for cluster
Fri Jan 30 18:32:30 2015: [FAILED]
....
Fri Jan 30 18:32:42 2015: cluster not ready - no quorum?


Cluster Status for cRZ @ Sat Jan 31 11:13:31 2015
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
node1pv 1 Online
node2pv 2 Online
node3pv 3 Online, Local

Version: 6.2.0
Config Version: 14465
Cluster Name: cRZ
Cluster Id: 3710
Cluster Member: Yes
Cluster Generation: 76348
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: node3pv
Node ID: 3
Multicast addresses: 239.192.14.140
Node addresses: 10.10.10.3
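
For reference, here is a minimal sketch of how the vote numbers in this status output could be checked by script. It assumes the usual simple-majority rule (more than half of the expected votes), which matches the "Quorum: 2" shown for 3 expected votes. The helper names `parse_status`, `majority` and `has_quorum` are made up for illustration and are not part of any Proxmox tool.

```python
# Hypothetical helpers, not part of Proxmox: they only read "Key: value"
# lines in the style of the status output pasted above.

def parse_status(text):
    """Collect 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def majority(expected_votes):
    # Assumption: simple majority, i.e. more than half of the expected votes.
    return expected_votes // 2 + 1


def has_quorum(fields):
    # The node is quorate when the votes it sees reach the quorum threshold.
    return int(fields["Total votes"]) >= int(fields["Quorum"])


# Values copied from the status output of node3pv above.
SAMPLE = """\
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
"""

status = parse_status(SAMPLE)
print(majority(int(status["Expected votes"])))  # 2
print(has_quorum(status))                       # True
```

With 3 nodes at 1 vote each, the cluster should stay quorate with one node missing (2 of 3 votes) and lose quorum with two missing.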


root@node1pv:~# pvecm nodes
Node Sts Inc Joined Name
1 M 76320 2015-01-30 17:48:14 node1pv
2 M 76348 2015-01-30 18:40:07 node2pv
3 M 76344 2015-01-30 18:31:41 node3pv


root@node2pv:~# pvecm nodes
Node Sts Inc Joined Name
1 M 76348 2015-01-30 18:40:07 node1pv
2 M 76340 2015-01-30 18:39:40 node2pv
3 M 76348 2015-01-30 18:40:07 node3pv



root@node3pv:~# pvecm nodes
Node Sts Inc Joined Name
1 M 76348 2015-01-30 18:40:07 node1pv
2 M 76348 2015-01-30 18:40:07 node2pv
3 M 76336 2015-01-30 18:31:41 node3pv
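
To compare the three views more easily, here is a small sketch that parses the `pvecm nodes` listings above and checks that every node sees the same members with status `M`. The function names are hypothetical. My reading, which is an assumption and not confirmed anywhere in this post, is that the "Joined" column is recorded locally by each node, which would explain why the times differ between the views.

```python
def parse_nodes(text):
    """Return {name: (status, joined)} from `pvecm nodes` style output."""
    members = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: ID Sts Inc YYYY-MM-DD HH:MM:SS Name
        if len(parts) == 6 and parts[0].isdigit():
            _node_id, status, _inc, date, time, name = parts
            members[name] = (status, f"{date} {time}")
    return members


def views_agree(views):
    """True if every view lists the same node names, all with status 'M'."""
    parsed = [parse_nodes(text) for text in views.values()]
    names = [set(p) for p in parsed]
    same_names = all(n == names[0] for n in names)
    all_members = all(s == "M" for p in parsed for s, _ in p.values())
    return same_names and all_members


# Listings copied from the three nodes above.
VIEWS = {
    "node1pv": """Node Sts Inc Joined Name
1 M 76320 2015-01-30 17:48:14 node1pv
2 M 76348 2015-01-30 18:40:07 node2pv
3 M 76344 2015-01-30 18:31:41 node3pv""",
    "node2pv": """Node Sts Inc Joined Name
1 M 76348 2015-01-30 18:40:07 node1pv
2 M 76340 2015-01-30 18:39:40 node2pv
3 M 76348 2015-01-30 18:40:07 node3pv""",
    "node3pv": """Node Sts Inc Joined Name
1 M 76348 2015-01-30 18:40:07 node1pv
2 M 76348 2015-01-30 18:40:07 node2pv
3 M 76336 2015-01-30 18:31:41 node3pv""",
}

print(views_agree(VIEWS))  # True
for viewer, text in VIEWS.items():
    # Show when each viewer thinks node3pv joined.
    print(viewer, parse_nodes(text)["node3pv"][1])
```

All three views agree on membership here; only the join times and incarnation numbers differ from node to node.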

Now my questions:

So is there no fault tolerance in Proxmox for providers like this that lack HA infrastructure?

Is it running now or not? It looks like it is, but I'm not sure!
How can I test that quorum is working on all 3 nodes?
Why does every node show different join times?
Can Proxmox communicate over different networks? Is there a fallback option?
Because the "outside" network was still working!

Regards
 