Hey everyone,
We run a cluster behind two Arista DCS-7050T-64-R switches, both in the same VLAN (512).
The switches are also connected to each other.
The cluster has 9 nodes running Proxmox VE 4.4 (6 hypervisors, the rest storage nodes).
We had IGMP snooping fully disabled on the VLAN with this command:
Code:
no ip igmp snooping vlan 512
Everything worked fine for about ten days, but today it suddenly broke out of nowhere.
The issue started with the web GUI showing a red X next to every node; my guess is that it is related to the pvestatd daemon, but I'm not sure.
Several hypervisors kept hanging on the command 'pvecm nodes'; after restarting the pve-cluster service the command started working again, though the red X's remained.
It seems that whenever we fix one node by restarting the service, another node that previously worked starts showing the same issue.
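For reference, this is roughly what we do on an affected node to get it responding again (a sketch; systemd unit names as on a standard Proxmox VE 4.x install):
Code:
# restart pmxcfs, the cluster filesystem behind /etc/pve
systemctl restart pve-cluster
# restart the status daemon that feeds the web GUI node icons
systemctl restart pvestatd
# verify everything came back up
systemctl status pve-cluster pvestatd corosync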
We also tried enabling IGMP snooping with a querier address on both switches, but that did not solve it either:
switch1:
Code:
ip igmp snooping vlan 512 querier
ip igmp snooping vlan 512 querier address 1.1.1.1
switch2:
Code:
ip igmp snooping vlan 512 querier
ip igmp snooping vlan 512 querier address 2.2.2.2
Current snooping state on the switch:
Code:
5.3-CLUS01#show ip igmp snooping vlan 512
Global IGMP Snooping configuration:
-------------------------------------------
IGMP snooping : Enabled
IGMPv2 immediate leave : Enabled
Robustness variable : 2
Report flooding : Disabled
Code:
Vlan 512 :
----------
IGMP snooping : Enabled
IGMPv2 immediate leave : Default
Multicast router learning mode : pim-dvmrp
IGMP max group limit : No limit set
Recent attempt to exceed limit : No
Report flooding : Default
IGMP snooping pruning active : True
Flooding traffic to VLAN : False
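If it helps, we can also pull the learned multicast group membership and querier state from the switches; a sketch of the Arista EOS commands we would use (exact command names from memory, so they may need adjusting):
Code:
show ip igmp snooping groups vlan 512
show ip igmp snooping querier vlan 512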
We ran this command on all nodes:
Code:
omping -c 600 -i 1 -q <list of all hypervisors>
The results were:
Node11 and Node6 can't communicate with each other at all; Node11 immediately reported:
Code:
Node6: server told us to stop
All other nodes could communicate, but returned output along these lines, pointing at a possible problem with Node6 (it only completed 423 of the 600 probes):
Code:
node06 : unicast, xmt/rcv/%loss = 423/423/0%, min/avg/max/std-dev = 0.066/0.174/1.652/0.112
node06 : multicast, xmt/rcv/%loss = 423/423/0%, min/avg/max/std-dev = 0.079/0.192/1.657/0.111
node07 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.138/1.586/0.128
node07 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.148/1.609/0.129
node08 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.052/0.195/3.238/0.267
node08 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.058/0.211/3.242/0.269
node09 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.270/6.758/0.534
node09 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.067/0.285/6.804/0.536
node11 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.147/1.613/0.139
node11 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.066/0.158/1.618/0.139
This now happens on Node6, but not on the other nodes:
Code:
root@node06:~# cat /etc/pve/corosync.conf
cat: /etc/pve/corosync.conf: Transport endpoint is not connected
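That 'Transport endpoint is not connected' error makes us suspect the pmxcfs FUSE mount at /etc/pve on node06 is no longer backed by a working process. These are the extra diagnostics we can collect there if useful (a sketch using standard Proxmox/systemd tooling):
Code:
# is the FUSE mount for /etc/pve still present?
mount | grep /etc/pve
# state of the cluster filesystem and status daemons
systemctl status pve-cluster pvestatd
# recent log entries for the cluster services on this node
journalctl -u pve-cluster -u corosync --since today | tail -n 100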
Here is some output that /var/log/syslog printed earlier today (CEST):
Code:
Jun 18 13:48:26 node05 pve-firewall[3285]: firewall update time (10.002 seconds)
Jun 18 13:48:27 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26850
Jun 18 13:48:27 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 10
Jun 18 13:48:28 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26860
Jun 18 13:48:28 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 20
Jun 18 13:48:29 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26870
Jun 18 13:48:29 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 30
Jun 18 13:48:30 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26880
Jun 18 13:48:30 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 40
Jun 18 13:48:31 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26890
Jun 18 13:48:31 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 50
Jun 18 13:48:32 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26900
Jun 18 13:48:32 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 60
Jun 18 13:48:33 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26910
Jun 18 13:48:33 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 70
Jun 18 13:48:34 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26920
Jun 18 13:48:34 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 80
Jun 18 13:48:35 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26930
Jun 18 13:48:35 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 90
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26940
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 100
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_send_message retried 100 times
..........
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
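If more low-level detail would help, we can also grab the corosync ring and quorum view from each node; a sketch of what we would run (standard corosync/Proxmox CLI tools):
Code:
# ring status as corosync sees it on this node
corosync-cfgtool -s
# quorum and membership information
corosync-quorumtool -s
# Proxmox's own view of the cluster
pvecm status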
Can anyone help us debug this without having to reboot all hypervisors?
If more information is required, please let us know.