Cluster nodes cannot be seen in GUI - Cluster problem

Informatique BDA

Hello everyone,

We have two HP ProLiant DL360 G9 servers. A Proxmox cluster (version 4.4-12/e71b7a74) with two nodes (hyp1 and hyp2) was created. GlusterFS storage synchronizes the data.

The problem we encounter is the following:
- On node hyp1, the web interface displays the VMs of this node normally, but hyp2 is shown crossed out in red. The VMs are started and running. We can stop a VM, but we cannot start one; instead this message appears: cluster not ready - no quorum? (500)

- On node hyp2, the web interface displays the VMs of this node normally, but hyp1 is shown crossed out in red. The VMs are started and running. We can stop a VM, but we cannot start one; instead the same message appears: cluster not ready - no quorum? (500)

When a node is restarted, the cluster works again, but only for one or two days before the problem returns. There is a cluster issue. Why do the nodes no longer see each other?

The cluster is currently in production, so we cannot restart a node: the VMs are running and cannot be stopped. vzdump backups no longer work either (VM locked).

Can you help me find the solution?

Thank you
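
For context, cluster membership and quorum can be checked from the shell with the standard Proxmox VE / corosync tools; a minimal sketch of the usual checks (the exact output will of course depend on the cluster):

Code:
# Cluster membership and quorum as Proxmox sees it
pvecm status
pvecm nodes

# Corosync ring status and service health
corosync-cfgtool -s
systemctl status corosync pve-cluster

# Recent corosync / pmxcfs messages often show why quorum was lost
journalctl -u corosync -u pve-cluster --since "-1 hour"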

 
Thank you for your reply.

Each node has a bond with a static IP in a VLAN used for communication between the hypervisors (VLAN 21, storage). Each node also has another bond carrying several tagged VLANs for the VMs (VLAN 2, 17, etc.).
The link aggregations are managed on Alcatel OmniSwitch 6450 switches.

Configuration files of the nodes:

Code:
/etc/network/interfaces on HYP1:
…
# Bond VM production
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

# Bond Hyp connection (VLAN21 storage)
auto bond1
iface bond1 inet static
        address  172.20.199.20
        netmask  255.255.255.224
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3
        mtu 9000

# Interface VM (VLAN2 internat)
auto bond0.2
iface bond0.2 inet manual
        vlan_raw_device bond0

# Interface VM (VLAN17 servers)
auto vmbr17
iface vmbr17 inet static
        # Proxmox interface
        address 172.20.199.68
        netmask 255.255.255.192
        gateway 172.20.199.126
        bridge_ports bond0.17
        bridge_stp off
        bridge_fd 0

Code:
/etc/hosts on HYP1:

127.0.0.1      localhost
172.20.199.21    hyp2.domain.lan  hyp2
172.20.199.68    hyp1.domain.lan  hyp1

Code:
/etc/network/interfaces on HYP2:
…
# Bond VM production
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

# Bond Hyp connection (VLAN21 storage)
auto bond1
iface bond1 inet static
        address  172.20.199.21
        netmask  255.255.255.224
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3
        mtu 9000

# Interface VM (VLAN2 internat)
auto bond0.2
iface bond0.2 inet manual
        vlan_raw_device bond0

# Interface VM (VLAN17 servers)
auto vmbr17
iface vmbr17 inet static
        # Proxmox interface
        address 172.20.199.69
        netmask 255.255.255.192
        gateway 172.20.199.126
        bridge_ports bond0.17
        bridge_stp off
        bridge_fd 0

Code:
/etc/hosts on HYP2:

127.0.0.1      localhost
172.20.199.20    hyp1.domain.lan  hyp1
172.20.199.69    hyp2.domain.lan  hyp2
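
As a side note, you can confirm which addresses the nodes resolve for each other and which ring address corosync is actually bound to with standard tools; a quick sketch:

Code:
# Which IPs do the cluster hostnames resolve to on this node?
getent hosts hyp1 hyp2

# Which local ring address is corosync using, and is the ring faulty?
corosync-cfgtool -s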

When I do a test with omping on hyp1, here is the result:

# omping -c 10000 -i 0.001 -F -q hyp1 hyp2
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
hyp2 : waiting for response msg
^C
hyp2 : response message never received

When I do a test with omping on hyp2, here is the result:

# omping -c 10000 -i 0.001 -F -q hyp1 hyp2
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
hyp1 : waiting for response msg
^C
hyp1 : response message never received


When I restart hyp2, the cluster runs properly again, but for how long?

Any ideas?
 
You have to run omping on all nodes at the same time.
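
For example, something like this, started on hyp1 and hyp2 at (roughly) the same time; the second, longer run is the usual way to catch IGMP snooping timeouts on the switch:

Code:
# Quick test (~10 s), run simultaneously on both nodes
omping -c 10000 -i 0.001 -F -q hyp1 hyp2

# Longer test (~10 min) to reveal IGMP snooping / querier timeouts
omping -c 600 -i 1 -q hyp1 hyp2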
 
When I run the command on both nodes at the same time, I get the same result: waiting for response msg.

I have rebooted hyp2. The cluster is OK, but the multicast test still fails. Is that normal?
 
VLANs and multicast may require some special settings on the switch; please check that.
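
If the switch side cannot be made to forward multicast reliably (IGMP snooping enabled without an IGMP querier is a frequent cause), a documented workaround on Proxmox VE 4.x is to switch corosync to unicast. A rough sketch, assuming the default corosync.conf layout: edit /etc/pve/corosync.conf on a node that still has quorum, keep your existing values, bump config_version, then restart corosync on both nodes:

Code:
# /etc/pve/corosync.conf -- totem section (sketch; keep your existing values)
totem {
  cluster_name: yourcluster     # keep the existing cluster name
  config_version: 3             # increment the current value by 1
  version: 2
  transport: udpu               # unicast UDP instead of multicast
  interface {
    bindnetaddr: 172.20.199.0   # keep the existing ring0 network address
    ringnumber: 0
  }
}

# then, on both nodes:
systemctl restart corosync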
 
