Having an odd issue

nethfel · Apr 17, 2015

Hi all,

Since the last update I ran a few days ago, I seem to be having an odd problem - at somewhat random times, in the web interface, whichever machine I log into shows green but the other two in the cluster show red. The datacenter summary shows them all as online and not estranged. I can remote into each of them. If I restart pve-cluster on all of them, everything seems to come back as it should, but it concerns me that it seems to be randomly stopping. There is no firewall between the machines. I am seeing this in the log it appears just before it must stop working:

Code:

Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6

Anyone have any clue as to what could be wrong? Luckily I'm not actually running anything on these servers yet; this is the second time within 5 days this has happened.

dietmar · Apr 17, 2015

This can happen when one node is not able to process corosync packets fast enough (maybe too much load on one node?)

nethfel · Apr 17, 2015

I'm really not sure how that is possible - these machines, all three of them:

* have no VMs yet
* Are dual quad core xeon machines (8 total physical cores)
* have 16 gigs of ram
* Are connected to each other via gigabit link on an isolated vlan for the management (the only machines on this vlan are the 3 pve machines and our internal router that interconnects everything and acts as the firewall only allowing certain people access to the web and ssh interface of the pve nodes)
* were all rebooted 2 days, 19hrs ago (the last time this happened where I just rebooted them instead of restarted pve-cluster)

I haven't even connected them to the ceph cluster yet.

This cluster has only three machines in it.

When this issue occurs - each machine:
if I open up the web interface, it shows itself as green and the other two as red

In theory I'd imagine if one machine got overloaded, the other two would still show each other as green and the overloaded node as red.

That said, I'm not willing to discount the possibility of a load issue. I've installed sysstat so I can get more load data over a period of time.

I'm also seeing this in syslog:

Code:

Apr 17 12:12:01 pve-3 pveproxy[578272]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pvedaemon[3770]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pveproxy[578273]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pveproxy[578271]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pvedaemon[3771]: ipcc_send_rec failed: Transport endpoint is not connected

nethfel · Apr 17, 2015

ps - I should also note, I tested all nodes with omping, and they all are working for multicast (sample output from pve-2):

Code:

pve-1 :   unicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.138/0.154/0.233/0.018
pve-1 : multicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.146/0.166/0.237/0.016
pve-3 :   unicast, xmt/rcv/%loss = 34/34/0%, min/avg/max/std-dev = 0.114/0.150/0.178/0.012
pve-3 : multicast, xmt/rcv/%loss = 34/34/0%, min/avg/max/std-dev = 0.123/0.157/0.190/0.013

I've also increased the window size on the totem to see if that changes anything. The default is 50, I've set it to 125 (in docs on corosync, it says you should stay within 256000/MTU (which for me is 1500), so I'm less then the max recommended)

nethfel · Apr 18, 2015

I'm still seeing the totem retransmits... All three servers during that period were showing ~99.88% idle. I'm just not seeing why there is this going on with this cluster. I don't see any errors in the switch during the timeframe specified by the syslog... I'm not quite sure how I can resolve this without a reinstall (and not using an MDADM RAID 1 on the boot drive, but it had been working fine in RAID 1 up until a few days ago just after the last update - and honestly, another set of three running the ceph cluster with RAID 1 on the boot drive isn't showing these symptoms so there is something hinky in this cluster that I'd really like to understand even though it'd be inconsequential if I wiped and re-created the systems....)

mir · Apr 18, 2015

Try changing /etc/pve/cluster.conf like this:
<totem token="54000"/>

This will increase totem timeout. (default is 10000)

nethfel · Apr 27, 2015

Well, I haven't been able to get it to resolve yet even with both the window and token parameters set.

I've tried different ethernet cards, I've checked server load (CPU load never exceeds a fraction of a single core, no processes zombied, etc.) so I know it's not a server issue. I've checked the switch, and I don't see anything in the logs and another proxmox cluster on a different vlan on the switch isn't experiencing these issues so I'm not really sure where to look next prior to doing a re-install...

mir · Apr 27, 2015

Are the MTU identical on all servers and the switch?

nethfel · Apr 27, 2015

Yes, all set to the default 1500.

I'm trying a different switch now (it's just odd though that another vlan on the original switch having a proxmox cluster isn't experiencing this issue) to eliminate the switch as the culprit - unfortunately, it takes some time for the problem to crop up so I may not have results for another several hours. Jumbo frames is disabled both on the switch and the servers themselves are only using regular frames.

I'm checking everything in the switch again to make sure there isn't something mis-configured, but so far no issues found.

mir · Apr 27, 2015

What is the brand and model of the switch which you are replacing?

nethfel · Apr 27, 2015

Right now I'm testing with an unmanaged switch. The main switch is a Cisco SG300-52. I've just gone through all of the multicast settings of the SG - if everything works on the unmanaged switch (currently this unmanaged switch only has the two proxmox servers on it and I'm monitoring from the console right there) then I will try some of the multicast settings in the 300 to see if I can clear the issue up.

UPDATE: I'm still seeing the issue when using the unmanaged switch. I'm going to try to increase the max receive buffer for the network to see if that improves matters at all, but really that would seem to just be a bandaid; I'd really like to figure out why these machines are being problematical. The only other thing I could imagine is if there is something going on with the mdadm raid on these boxes, but I can't see how the there could be a correlation between software raid and the network as we're not even close to overloading these servers (especially considering all that's installed is proxmox

mir · Apr 27, 2015

Do you have an igmp querier enabled on every vlan where your proxmox servers are communicating? Remember a quierier has to be present on every vlan. Read more here: https://supportforums.cisco.com/discussion/11464821/igmp-snooping-sg300

nethfel · Apr 27, 2015

Originally I wasn't using any IGMP snooping or queriers or any filters aside from standard vlans. I do have that all currently setup at this point - but even so - I was seeing these retransmits on an unmanaged switch as well so there has to be something with at least one of these machines...

mir · Apr 27, 2015

An unmanaged switch does not support IGMP snooping so this must be enabled on the bridge.

nethfel · Apr 28, 2015

IGMP snooping is irrelevant in this test, there are only the two proxmox machines on this switch, the unmanaged switch was not interconnected (no uplink or any other servers that would cause traffic) to anything so all traffic that existed on it were for and between these two machines so I honestly don't see how a lack of IGMP snooping would have had any effect in this test scenario.

Search

Search

Having an odd issue

nethfel

Member

dietmar

Proxmox Staff Member

nethfel

Member

nethfel

Member

nethfel

Member

mir

Famous Member

nethfel

Member

mir

Famous Member

nethfel

Member

mir

Famous Member

nethfel

Member

mir

Famous Member

nethfel

Member

mir

Famous Member

nethfel

Member