Node is showing as red

alchemyx

New Member
Apr 2, 2014
Hello,

I have a strange issue. When I log in to node A over HTTP, node B shows in red (though I can still browse its summary, storage and so on).
If I log in to node B, A shows as red. Output from A:

Code:
root@proxmox-A:~# pvecm status
Version: 6.2.0
Config Version: 12
Cluster Name: KLASTER
Cluster Id: 9492
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0 178  
Node name: proxmox-A
Node ID: 1
Multicast addresses: 239.192.37.57 
Node addresses: 10.10.10.1 
root@proxmox-A:~# /etc/init.d/cman status
cluster is running.
root@proxmox-A:~# cat /etc/pve/.members
{
"nodename": "proxmox-A",
"version": 3,
"cluster": { "name": "KLASTER", "version": 12, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmox-A": { "id": 1, "online": 1, "ip": "10.10.10.1"},
  "proxmox-B": { "id": 2, "online": 0}
  }
}

And B:
Code:
root@proxmox-B:~# pvecm status
Version: 6.2.0
Config Version: 12
Cluster Name: KLASTER
Cluster Id: 9492
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0 178  
Node name: proxmox-B
Node ID: 2
Multicast addresses: 239.192.37.57 
Node addresses: 10.10.10.2 
root@proxmox-B:~# /etc/init.d/cman status
cluster is running.
root@proxmox-B:~# cat /etc/pve/.members
{
"nodename": "proxmox-B",
"version": 7,
"cluster": { "name": "KLASTER", "version": 12, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmox-A": { "id": 1, "online": 1, "ip": "10.10.10.1"},
  "proxmox-B": { "id": 2, "online": 1, "ip": "10.10.10.2"}
  }
}

So as you can see, from node A's point of view B seems to be down. I can ping both ways
and ssh from A to B and from B to A, but B still shows as offline. It happened after some reboots
(I was testing failover). Any idea why? These are test boxes, but I would like to know
how to fix this before going into production.
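
Unicast clearly works both ways, but since corosync here uses multicast, I also want to rule out multicast delivery problems. As far as I know, omping (apt-get install omping) is the usual tool for this: run it on both nodes at the same time and it reports unicast and multicast loss separately. This is my understanding of the invocation, not something I have run yet:

Code:
# start on both nodes at (roughly) the same time
root@proxmox-A:~# omping -c 600 -i 1 -q 10.10.10.1 10.10.10.2
root@proxmox-B:~# omping -c 600 -i 1 -q 10.10.10.1 10.10.10.2

If the multicast loss is non-zero while unicast is clean, the switch would be the prime suspect.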

Cluster config:

Code:
<?xml version="1.0"?>
<cluster config_version="12" name="KLASTER">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="0" interval="1" label="proxmox1_qdisk" tko="10" votes="1"/>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ifmib" community="password" ipaddr="1.2.3.4" name="szafa-a-b" snmp_version="2c"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox-A" nodeid="1" votes="1">
      <fence>
        <method name="fence">
          <device action="off" name="szafa-a-b" port="GigabitEthernet0/21"/>
          <device action="off" name="szafa-a-b" port="GigabitEthernet0/22"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="szafa-a-b" port="GigabitEthernet0/21"/>
        <device action="on" name="szafa-a-b" port="GigabitEthernet0/22"/>
      </unfence>
    </clusternode>
    <clusternode name="proxmox-B" nodeid="2" votes="1">
      <fence>
        <method name="fence">
          <device action="off" name="szafa-a-b" port="GigabitEthernet0/23"/>
          <device action="off" name="szafa-a-b" port="GigabitEthernet0/24"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="szafa-a-b" port="GigabitEthernet0/23"/>
        <device action="on" name="szafa-a-b" port="GigabitEthernet0/24"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="107"/>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="100"/>
  </rm>
</cluster>
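
For completeness: with the qdisk in play the cluster expects three votes, so a single node plus the quorum disk keeps quorum (which matches the Expected votes: 3 / Total votes: 3 above). To cross-check what cman itself thinks of the membership, I believe the standard tools are these (clustat should be available since rgmanager is in use via the <rm> section):

Code:
root@proxmox-A:~# cman_tool nodes      # membership as cman sees it
root@proxmox-A:~# clustat              # node and quorum-disk status
root@proxmox-A:~# ccs_config_validate  # sanity-check cluster.conf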
 
Thanks, but I am passing it through a switch that just transparently passes multicast along (so it treats it like broadcast), and nothing
really changed. I was digging and found this (I have a shared LVM VG over iSCSI):

Code:
  --- Logical volume ---
  LV Path                /dev/shared/vm-100-disk-1
  LV Name                vm-100-disk-1
  VG Name                shared
  LV UUID                Vxf4Al-32Xs-6Byz-xxE7-0dc7-ULpX-CzoVwP
  LV Write Access        read/write
  LV Creation host, time proxmox-A, 2014-04-04 19:57:32 +0200
  LV Status              NOT available
  LV Size                32.00 GiB
  Current LE             8192
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

All the other logical volumes are also NOT available. Running vgscan fixed that, but the node still shows as red. In syslog I have:

Code:
 Apr  7 10:40:17 proxmox-B pmxcfs[2267]: [status] crit: cpg_send_message failed: 9
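
If I understand correctly, error 9 from cpg_send_message is CS_ERR_BAD_HANDLE, i.e. pmxcfs has lost its connection to corosync. My assumption (not yet tested) is that restarting pmxcfs and the PVE daemons would re-establish it:

Code:
root@proxmox-B:~# /etc/init.d/pve-cluster restart   # pmxcfs itself
root@proxmox-B:~# /etc/init.d/pvedaemon restart
root@proxmox-B:~# /etc/init.d/pvestatd restart      # feeds the online/offline status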

No VMs were running on node B, so I did /etc/init.d/cman restart, but it did not help.
Going through my configs I found that IGMP snooping was not disabled on the switch, so it may be true that there was
some trouble with multicast. I also simplified my network config (changed from tagged VLANs to
one untagged VLAN); while I was doing that, the node got fenced. So I decided to reboot everything
and see if it happens again.
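
For reference, on a Cisco-style switch the two usual ways of dealing with this would look roughly as follows (my assumption; the exact CLI depends on the model and IOS version):

Code:
! either disable snooping entirely, so multicast is flooded like broadcast...
switch(config)# no ip igmp snooping
! ...or keep snooping but add a querier, so group memberships stay refreshed
switch(config)# ip igmp snooping querier

Either one should stop the switch from silently dropping the corosync multicast traffic after the membership timeout.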
Thank you for clues!