Losing network on one node of a 3-node cluster

czechsys

Renowned Member
Nov 18, 2015
Hi,

I have a strange problem with the network since some version of PVE 4.3. The setup is a 3-node cluster (one G5 and two G6 HP servers) using Open vSwitch, with a dedicated corosync ring0 configured on the same vmbr0 on all nodes and connected to the same switch.

Code:
Membership information
----------------------
    Nodeid      Votes Name
         1          1 prox0-coro0 (local)
         3          1 prox1-coro0
         2          1 prox2-coro0

Quorum information
------------------
Date:  Wed Nov 30 12:07:05 2016
Quorum provider:  corosync_votequorum
Nodes:  3
Node ID:  0x00000001
Ring ID:  1/1552
Quorate:  Yes

Votequorum information
----------------------
Expected votes:  3
Highest expected: 3
Total votes:  3
Quorum:  2  
Flags:  Quorate

Membership information
----------------------
  Nodeid  Votes Name
0x00000001  1 172.16.0.10 (local)
0x00000003  1 172.16.0.11
0x00000002  1 172.16.0.12
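
For context, the dedicated ring0 on vmbr0 corresponds to a corosync.conf roughly like the sketch below. The node names and 172.16.0.x addresses are taken from the membership output above; the cluster name, config_version and everything else are assumptions, not the actual file from this cluster.
Code:
# /etc/pve/corosync.conf (sketch only, not the real file)
totem {
  version: 2
  cluster_name: proxcluster        # assumed name
  config_version: 3                # assumed
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.0.0        # ring0 network from the output above
  }
}

nodelist {
  node {
    name: prox0
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox0-coro0        # resolves to 172.16.0.10
  }
  node {
    name: prox2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: prox2-coro0        # resolves to 172.16.0.12
  }
  node {
    name: prox1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox1-coro0        # resolves to 172.16.0.11
  }
}

quorum {
  provider: corosync_votequorum
}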

Problematic node:
Code:
proxmox-ve: 4.3-72 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-12 (running version: 4.3-12/6894c9d9)
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.24-1-pve: 4.4.24-72
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-96
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-17
pve-qemu-kvm: 2.7.0-8
pve-container: 1.0-85
pve-firewall: 2.0-31
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-1
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
openvswitch-switch: 2.6.0-2

The other 2 nodes:
Code:
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-94
pve-firmware: 1.1-10
libpve-common-perl: 4.0-80
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-14
pve-qemu-kvm: 2.7.0-8
pve-container: 1.0-81
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
openvswitch-switch: 2.5.0-1

Symptoms:
One node was updated ahead of the other two and is now marked offline:
Code:
Nov 29 06:23:00 prox1 corosync[1769]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox1 corosync[1769]: [TOTEM ] A new membership (172.16.0.11:1548) was formed. Members left: 1 2
Nov 29 06:23:02 prox1 corosync[1769]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Nov 29 06:23:02 prox1 corosync[1769]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 29 06:23:02 prox1 corosync[1769]: [QUORUM] Members[1]: 3
Nov 29 06:23:02 prox1 corosync[1769]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox1 corosync[1769]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 1 2
Nov 29 06:23:52 prox1 corosync[1769]: [QUORUM] This node is within the primary component and will provide service.
Nov 29 06:23:52 prox1 corosync[1769]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox1 corosync[1769]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 04:30:06 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c39
Nov 30 04:30:07 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c4c
Nov 30 04:30:07 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c4e
Nov 30 04:30:08 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c51
Nov 30 04:30:08 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c52
Nov 30 04:30:10 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c5b

The other nodes:
Code:
Nov 29 06:23:00 prox2 corosync[1766]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox2 corosync[1766]: [TOTEM ] A new membership (172.16.0.10:1548) was formed. Members left: 3
Nov 29 06:23:02 prox2 corosync[1766]: [TOTEM ] Failed to receive the leave message. failed: 3
Nov 29 06:23:02 prox2 corosync[1766]: [QUORUM] Members[2]: 1 2
Nov 29 06:23:02 prox2 corosync[1766]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox2 corosync[1766]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 3
Nov 29 06:23:52 prox2 corosync[1766]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox2 corosync[1766]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 03:00:26 prox2 corosync[1766]: [TOTEM ] Retransmit List: 97bd0 97bd1

Code:
Nov 29 06:23:00 prox0 corosync[1415]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox0 corosync[1415]: [TOTEM ] A new membership (172.16.0.10:1548) was formed. Members left: 3
Nov 29 06:23:02 prox0 corosync[1415]: [TOTEM ] Failed to receive the leave message. failed: 3
Nov 29 06:23:02 prox0 corosync[1415]: [QUORUM] Members[2]: 1 2
Nov 29 06:23:02 prox0 corosync[1415]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox0 corosync[1415]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 3
Nov 29 06:23:52 prox0 corosync[1415]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox0 corosync[1415]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 04:30:06 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c39
Nov 30 04:30:06 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c42
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4c
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4e
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4f
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c50
Nov 30 04:30:08 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c51
Nov 30 04:30:08 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c52
Nov 30 04:30:09 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c54
Nov 30 04:30:10 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c5a
Nov 30 04:30:10 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c5b

Datacenter Health says: Online 3 (but no Uptime in Nodes). Prox1 is marked with a red X in the Server View. Prox1 -> Summary shows "CPU usage" changing, but the graphs are empty. So every menu/info works without failure, even from the web interface of another node.
The problematic node eventually stopped responding to the SNMP check around 11:45. There are no VMs on prox1, so I can't tell whether the connection problem affects them too. Restarting snmpd didn't help resolve the monitoring. So I restarted services in this order: corosync, pve-cluster, pve-ha-lrm, pve-ha-crm, pveproxy, pvedaemon.
SNMP still isn't working, and the node is still marked with the red X.
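
From memory, the restart sequence was roughly this (just a sketch, using the standard PVE 4.x service names):
Code:
# restart cluster and GUI services on the problematic node (prox1)
systemctl restart corosync
systemctl restart pve-cluster
systemctl restart pve-ha-lrm
systemctl restart pve-ha-crm
systemctl restart pveproxy
systemctl restart pvedaemon
# re-check cluster state afterwards
pvecm status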

Any ideas for debugging? I'm not sure whether to update the other 2 nodes to the same version, because I don't want to risk problems with the VMs on the prox2 node...
 

Attachments

  • proxmox_fail.png (98.3 KB)
Hi
Are you able to log in to the "red" node (node 1)?
If yes, check the output of

pvecm nodes

If you see all three nodes, then check with

pvesm status

It could be that the problematic node is hanging on I/O, and thus not able to report its status to the rest of the cluster.
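
If it happens again, something along these lines should show whether the node is blocked on I/O (just generic checks, nothing PVE-specific assumed beyond pvesm):
Code:
# does storage enumeration hang?
pvesm status

# any processes stuck in uninterruptible sleep (D state)?
ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

# kernel complaints about hung tasks / blocked I/O
dmesg | grep -i "blocked for more than"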
 
Thank you for the reply.
I will check it next time. From what I remember, before the restart "pvecm nodes" was OK on all three nodes, while "pvesm status" was stuck on the problematic node. After the restart, snmpd started working and the HW RAID was recovering, so an I/O problem could be the cause, but the time of the problem versus the time the HW RAID recovery started doesn't look related. I will wait for the next failure to check whether those problems are connected.
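
Next time it fails I plan to correlate the timestamps, roughly like this (log paths as on a standard Debian Jessie / PVE 4.x install; the HP Smart Array tool invocation is an assumption for our controllers):
Code:
# when did corosync see the membership change / retransmits?
grep -iE "processor failed|Retransmit" /var/log/syslog | tail

# any hung-task warnings around the same time (would point to I/O)?
grep -i "blocked for more than" /var/log/kern.log

# when did the RAID controller start rebuilding? (HP Smart Array, command assumed)
hpssacli ctrl all show config detail | grep -i -A2 rebuild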