Cluster nodes going Red part 2.

RobFantini

Hello.

We continue to have the issue of cluster nodes showing red in the Datacenter section of the PVE web interface.

Since the long thread from December, the following has changed:

We do not use any shared storage besides /etc/pve.

I eliminated all NFS.

NTP/systemd-timesyncd has been configured to get time updates from our pfSense hardware.
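For reference, the time sync setup is roughly this (a minimal sketch; the pfSense address 10.1.10.1 is an assumption, use your own):
Code:
# /etc/systemd/timesyncd.conf  (the NTP server address below is an assumption)
[Time]
NTP=10.1.10.1

# apply and verify
systemctl restart systemd-timesyncd
timedatectl status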

There are no backups or heavy network traffic leading up to the red issue; I set all backups to run on Saturday.
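The Saturday-only schedule ends up as a cron line in /etc/pve/vzdump.cron; roughly like this (a sketch only, the time and options are placeholders rather than my exact job):
Code:
# /etc/pve/vzdump.cron (generated by the Datacenter backup job; values are placeholders)
PATH="/usr/sbin:/usr/bin:/sbin:/bin"

0 1 * * 6           root vzdump --all 1 --mode snapshot --quiet 1 --storage local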

I set up a central rsyslog server for the cluster nodes. This is useful for checking what leads up to the issue. So far I have not found the cause of the cluster issue.
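The forwarding setup is roughly this (a sketch; the central server IP 10.1.10.5 is hypothetical):
Code:
# on each cluster node: /etc/rsyslog.d/forward.conf
# forward everything to the central rsyslog server over TCP (@@ = TCP, @ = UDP)
*.* @@10.1.10.5:514

# on the central server: enable TCP reception in /etc/rsyslog.conf
$ModLoad imtcp
$InputTCPServerRun 514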

From the logs, the issue is always preceded by lines like this:
Code:
Jan 29 05:14:47 sys3 corosync[7088]:  [TOTEM ] A processor failed, forming new configuration.
..
Jan 29 05:14:47 dell1 corosync[7575]:  [TOTEM ] A processor failed, forming new configuration.
..
Jan 29 05:14:47 sys4 corosync[8688]:  [TOTEM ] A processor failed, forming new configuration.
..
Jan 29 05:14:48 sys5 corosync[5069]:  [TOTEM ] A processor failed, forming new configuration.
..

Shortly after:
Code:
Jan 29 05:14:48 sys5 corosync[5069]:  [MAIN  ] Corosync main process was not scheduled for 3947.5273 ms (threshold is 1840.0000 ms). Consider token timeout increase.
Jan 29 05:14:48 sys4 corosync[8688]:  [TOTEM ] A new membership (10.1.10.21:15808) was formed. Members
Jan 29 05:14:48 sys3 corosync[7088]:  [TOTEM ] A new membership (10.1.10.21:15808) was formed. Members
Jan 29 05:14:48 dell1 corosync[7575]:  [TOTEM ] A new membership (10.1.10.21:15808) was formed. Members
Jan 29 05:14:48 sys4 corosync[8688]:  [QUORUM] Members[4]: 4 2 1 3
Jan 29 05:14:48 sys3 corosync[7088]:  [QUORUM] Members[4]: 4 2 1 3
Jan 29 05:14:48 dell1 corosync[7575]:  [QUORUM] Members[4]: 4 2 1 3
Jan 29 05:14:48 sys4 corosync[8688]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 05:14:48 sys3 corosync[7088]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 05:14:48 dell1 corosync[7575]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 05:14:48 sys5 corosync[5069]:  [TOTEM ] A processor failed, forming new configuration.
Jan 29 05:14:48 sys5 corosync[5069]:  [TOTEM ] A new membership (10.1.10.21:15808) was formed. Members
Jan 29 05:14:48 sys5 corosync[5069]:  [QUORUM] Members[4]: 4 2 1 3
Jan 29 05:14:48 sys5 corosync[5069]:  [MAIN  ] Completed service synchronization, ready to provide service.
I am not sure of the exact second that the nodes go red.

I've set up rsyslog to send an email alert every time there is a 'processor failed' line. Lately this has been occurring once a day.
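The alert rule is along these lines (a sketch using rsyslog's ommail module; the mail server and addresses are placeholders):
Code:
# /etc/rsyslog.d/corosync-alert.conf (legacy syntax; server and addresses are placeholders)
$ModLoad ommail
$ActionMailSMTPServer mail.example.com
$ActionMailFrom rsyslog@example.com
$ActionMailTo admin@example.com
$template mailSubject,"corosync alert on %hostname%"
$template mailBody,"%msg%"
$ActionMailSubject mailSubject
# send at most one mail per 5 minutes
$ActionExecOnlyOnceEveryInterval 300
:msg, contains, "A processor failed" :ommail:;mailBody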


From the CLI, pvecm commands show the cluster as OK.
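The checks I mean are along these lines:
Code:
# membership and quorum as corosync/pmxcfs see it
pvecm status
pvecm nodes
# state of the cluster services themselves
systemctl status corosync pve-cluster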

After yesterday's issue I updated all 4 nodes to the latest PVE testing software. That did not fix the issue.

To make the red nodes turn green, I run this on every node:
Code:
systemctl restart pve-cluster
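To save some typing, something like this runs it across the cluster from one shell (node names taken from the logs above):
Code:
# restart pve-cluster on every node from a single shell
for h in dell1 sys3 sys4 sys5; do
    ssh root@$h systemctl restart pve-cluster
done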

I am not sure what to do next. Any suggestions?
 
I remember back then you did some mass pings and had some that took a couple of seconds (afaik 2500-3900 ms).

Did that completely disappear?
 

Yes, that issue has been dealt with. I replaced one of the motherboards to deal with the bad hardware.

Next time the red issue occurs, I'll retest.
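One way to retest would be to leave timestamped pings running to each peer and correlate them with the 'processor failed' lines; a sketch (only 10.1.10.21 appears in the logs, the other IPs are assumptions):
Code:
# leave timestamped pings running; -D prefixes each reply with a unix timestamp (iputils)
# IPs other than 10.1.10.21 are assumptions for this sketch
for ip in 10.1.10.21 10.1.10.22 10.1.10.23 10.1.10.24; do
    ping -D -i 1 "$ip" >> /var/log/ping-"$ip".log &
done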

Also, I've seen threads with others having a similar issue.
 
And I think we also figured out it's not a bog-down in your multicast, right?
And you also put QoS-related measures in place for your Proxmox cluster network on a separate subnet, right?



Do you have a monitoring solution up that continuously (1 s or 0.1 s interval) queries things like this on every node?
  • network latency of every network link
    • perhaps even omping between the nodes (see the sketch after this list)
  • CPU
  • IO per disk
  • Proxmox services

Might help narrow it down faster by allowing you to correlate that data.
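For the multicast side specifically, omping between all nodes is the usual check; a sketch (run the same command on every node at the same time; node names taken from the logs in this thread):
Code:
# run simultaneously on every node: tests multicast and unicast latency/loss
# 600 packets at a 1 s interval, i.e. about 10 minutes
omping -c 600 -i 1 -q dell1 sys3 sys4 sys5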

Edit: And maybe even query the switches (in case they are having hiccups and eating your multicasts :p).

By now I think it's going to be something really, really stupidly obvious once it's found.
 
Only our phone system and PVE are on the network. We handle over 500 phone calls per day and have not had any issues related to the main switch in a year. If the network were bad, we'd know it from phone call problems.

Despite no phone issues, I changed the switch from a Netgear GSM7300 series to a Cisco. Same PVE red issue as before.

If you are interested, I could send a copy of our central rsyslog cluster log. There is nothing at all prior to the nodes going red.

I think a default cluster software setting needs to be tweaked.
 
There is nothing at all prior to the nodes going red.
If there is nothing showing, then it will not help :p

I changed the switch from a Netgear GSM7300 series to a Cisco.
You said you also changed a couple of mainboards that had bad onboard NICs.
Any chance you also changed the cables? And if you use wall-mounted patch panels, have you checked those?
(As I said, it's probably something really stupid, given that it does not show up more often as an issue.)
And the last time I personally experienced this, it was due to the network being slow to respond.

Edit: If I remember correctly, there is a way to tune the cluster to tolerate larger network latencies so it does not miss a beat. I just cannot find it at the moment.
Pretty sure it came up during a discussion I had with @spirit regarding the actual limits of Corosync/CMAN.
 
Hi Q-wulf, it is not the cables, for two reasons: 1) the red issue continued while sys5 was offline for two weeks; 2) when red occurs, at every node's PVE web interface all the other nodes show red except the node you are connected to.

I do all backups on Saturday.
Overnight, KVMs were backed up on all nodes. There were no red issues.

Late afternoon, LXCs get backed up. Let's see if there are issues related to the LXC backups.

Before that I will install today's deb updates.
 
I'm guessing unless we get really lucky somewhere, the only way is to get out the big guns.

  1. Monitor every conceivable variable (e.g. with Zabbix).
  2. Wait for the issue to appear again.
  3. Determine the exact time from the logs.
  4. Backtrace what happened on the node(s), VM(s), and network(s) right before that in your monitoring solution.

Edit: found what @spirit and I talked about:
https://forum.proxmox.com/threads/7-nodes-osd-issue.25609/#post-128390
eventually leads to this:
http://linux.die.net/man/5/corosync.conf
and the ability to manually increase the token timeout.
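For reference, the knob in question is the totem token timeout. A sketch only, since the right value depends on the cluster (on PVE 4.x the file is /etc/pve/corosync.conf; the cluster name and the token value below are examples):
Code:
# excerpt of the totem section in /etc/pve/corosync.conf (values are examples)
totem {
  cluster_name: mycluster        # placeholder
  config_version: 4              # must be bumped whenever the file is edited
  version: 2
  secauth: on
  ip_version: ipv4
  # default token timeout is 1000 ms plus 650 ms per node above two;
  # raising it lets corosync ride out longer scheduling/network stalls
  token: 10000
  interface {
    bindnetaddr: 10.1.10.0       # cluster network from the logs
    ringnumber: 0
  }
}
After editing, corosync has to pick up the new config (e.g. restart corosync on the nodes).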

Edit: and you are sure you have no "loops" on your network, right? :p (Just came up on /r/networking, totally unrelated.)
 

I'll check the links.

As for loops, there are none. If one were to occur, the managed switch is set to shut down that physical port. Loops do occur once in a while; usually the DHCP server will report the issue.
 
