Losing network on one node from 3 node cluster

czechsys

Hi,

I have a strange network problem since some 4.3 version of PVE. The setup is a 3-node cluster (one G5 and two G6 HP servers) with Open vSwitch and a dedicated corosync ring0, configured on the same vmbr0 on all nodes and connected to the same switch.
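For reference, a minimal sketch of what such an OVS-based /etc/network/interfaces could look like on PVE 4.x (this is an assumption for illustration, not my actual config; the interface name eth0 and the 172.16.0.x addressing are guesses based on the logs below):

```text
# /etc/network/interfaces -- hypothetical sketch, not the actual config
auto lo
iface lo inet loopback

allow-vmbr0 eth0
iface eth0 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr0

auto vmbr0
iface vmbr0 inet static
    address 172.16.0.10
    netmask 255.255.255.0
    ovs_type OVSBridge
    ovs_ports eth0
```

With this layout the corosync ring0 traffic shares vmbr0 with everything else, which is relevant to the retransmit messages below.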

Code:
Membership information
----------------------
    Nodeid      Votes Name
         1          1 prox0-coro0 (local)
         3          1 prox1-coro0
         2          1 prox2-coro0

Quorum information
------------------
Date:  Wed Nov 30 12:07:05 2016
Quorum provider:  corosync_votequorum
Nodes:  3
Node ID:  0x00000001
Ring ID:  1/1552
Quorate:  Yes

Votequorum information
----------------------
Expected votes:  3
Highest expected: 3
Total votes:  3
Quorum:  2  
Flags:  Quorate

Membership information
----------------------
  Nodeid  Votes Name
0x00000001  1 172.16.0.10 (local)
0x00000003  1 172.16.0.11
0x00000002  1 172.16.0.12

Problematic node:
Code:
proxmox-ve: 4.3-72 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-12 (running version: 4.3-12/6894c9d9)
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.24-1-pve: 4.4.24-72
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-96
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-17
pve-qemu-kvm: 2.7.0-8
pve-container: 1.0-85
pve-firewall: 2.0-31
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-1
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
openvswitch-switch: 2.6.0-2

The other two nodes:
Code:
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-94
pve-firmware: 1.1-10
libpve-common-perl: 4.0-80
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-14
pve-qemu-kvm: 2.7.0-8
pve-container: 1.0-81
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
openvswitch-switch: 2.5.0-1

Symptoms:
One node was updated ahead of the other two and then got marked offline:
Code:
Nov 29 06:23:00 prox1 corosync[1769]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox1 corosync[1769]: [TOTEM ] A new membership (172.16.0.11:1548) was formed. Members left: 1 2
Nov 29 06:23:02 prox1 corosync[1769]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Nov 29 06:23:02 prox1 corosync[1769]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 29 06:23:02 prox1 corosync[1769]: [QUORUM] Members[1]: 3
Nov 29 06:23:02 prox1 corosync[1769]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox1 corosync[1769]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 1 2
Nov 29 06:23:52 prox1 corosync[1769]: [QUORUM] This node is within the primary component and will provide service.
Nov 29 06:23:52 prox1 corosync[1769]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox1 corosync[1769]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 04:30:06 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c39
Nov 30 04:30:07 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c4c
Nov 30 04:30:07 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c4e
Nov 30 04:30:08 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c51
Nov 30 04:30:08 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c52
Nov 30 04:30:10 prox1 corosync[1769]: [TOTEM ] Retransmit List: a2c5b

The other two nodes:
Code:
Nov 29 06:23:00 prox2 corosync[1766]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox2 corosync[1766]: [TOTEM ] A new membership (172.16.0.10:1548) was formed. Members left: 3
Nov 29 06:23:02 prox2 corosync[1766]: [TOTEM ] Failed to receive the leave message. failed: 3
Nov 29 06:23:02 prox2 corosync[1766]: [QUORUM] Members[2]: 1 2
Nov 29 06:23:02 prox2 corosync[1766]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox2 corosync[1766]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 3
Nov 29 06:23:52 prox2 corosync[1766]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox2 corosync[1766]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 03:00:26 prox2 corosync[1766]: [TOTEM ] Retransmit List: 97bd0 97bd1

Code:
Nov 29 06:23:00 prox0 corosync[1415]: [TOTEM ] A processor failed, forming new configuration.
Nov 29 06:23:02 prox0 corosync[1415]: [TOTEM ] A new membership (172.16.0.10:1548) was formed. Members left: 3
Nov 29 06:23:02 prox0 corosync[1415]: [TOTEM ] Failed to receive the leave message. failed: 3
Nov 29 06:23:02 prox0 corosync[1415]: [QUORUM] Members[2]: 1 2
Nov 29 06:23:02 prox0 corosync[1415]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 06:23:52 prox0 corosync[1415]: [TOTEM ] A new membership (172.16.0.10:1552) was formed. Members joined: 3
Nov 29 06:23:52 prox0 corosync[1415]: [QUORUM] Members[3]: 1 3 2
Nov 29 06:23:52 prox0 corosync[1415]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 30 04:30:06 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c39
Nov 30 04:30:06 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c42
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4c
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4e
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c4f
Nov 30 04:30:07 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c50
Nov 30 04:30:08 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c51
Nov 30 04:30:08 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c52
Nov 30 04:30:09 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c54
Nov 30 04:30:10 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c5a
Nov 30 04:30:10 prox0 corosync[1415]: [TOTEM ] Retransmit List: a2c5b

Datacenter Health says: Online 3 (but no Uptime in Nodes). Prox1 is marked with a red X in the Server View. Prox1 -> Summary shows "CPU usage" changing, but the graphs are empty. So every menu/info works without failure, even from the web interface of another node.
The problematic node eventually stops responding to SNMP checks around 11:45. There are no VMs on prox1, so I can't tell whether a connection problem affects them too. Restarting snmpd didn't bring monitoring back. So I restarted services in this order: corosync, pve-cluster, pve-ha-lrm, pve-ha-crm, pveproxy, pvedaemon.
SNMP is still not working, and the node is still marked with a red X.
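For reference, the restart sequence described above corresponds to these commands (a sketch, assuming systemd service names as shipped on PVE 4.x; run as root on the affected node):

```shell
# Restart cluster and management services in the order described above,
# on the problematic node (prox1).
systemctl restart corosync
systemctl restart pve-cluster
systemctl restart pve-ha-lrm
systemctl restart pve-ha-crm
systemctl restart pveproxy
systemctl restart pvedaemon
```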

Any debug ideas? I am not sure whether I should update the other two nodes to the same version, because I don't want to run into problems with the VMs on the prox2 node...
 

Attachments: proxmox_fail.png (98.3 KB)
Hi,
are you able to log in to the "red" node (node 1)?
If yes, check the output of

pvecm nodes

If you see all three nodes, then check with

pvesm status

It could be that the problematic node is hanging on I/O and is thus unable to report its status to the rest of the cluster.
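Put together, the check on the red node could look like this (a sketch; the dmesg grep for hung tasks is my own addition, a common way to spot an I/O stall, not something from the post above):

```shell
# Run on the node marked with the red X (prox1):
pvecm nodes    # cluster membership as seen by this node
pvesm status   # storage status; if this hangs, suspect an I/O stall

# Assumed extra check: kernel hung-task warnings often accompany I/O hangs
dmesg | grep -i "blocked for more than"
```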
 
Thank you for the reply.
I will check it next time. From what I remember, before the restarts "pvecm nodes" was OK on all three nodes, while "pvesm status" was stuck on the problematic node. After the restart, snmpd started working and a HW RAID recovery was in progress, so an I/O problem could be the cause, although the time of the problem versus the start time of the HW RAID recovery doesn't look related. I will wait for the next failure to check whether these problems are connected.
 
