Having an odd issue

nethfel

Hi all,

Since the last update I ran a few days ago, I seem to be having an odd problem: at somewhat random times, in the web interface, whichever machine I log into shows green but the other two in the cluster show red. The datacenter summary shows them all as online and not estranged. I can remote into each of them. If I restart pve-cluster on all of them, everything seems to come back as it should, but it concerns me that it seems to be randomly stopping. There is no firewall between the machines. I am seeing this in the log just before things stop working:

Code:
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6
Apr 17 04:54:55 corosync [TOTEM ] Retransmit List: 8e84 8e85 8e86 8e87 8e8e 8e8f 8e90 8e91 8e92 8e93 8e94 8e95 8e96 8e9d 8e9e 8e9f 8ea0 8ea1 8ea2 8ea3 8ea4 8ea5 8ea6

Anyone have any clue as to what could be wrong? Luckily I'm not actually running anything on these servers yet; this is the second time this has happened within five days.
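For reference, the way I bring things back is just restarting the cluster filesystem service on each of the three nodes, something like this (sysvinit-style here; a newer systemd-based setup would use systemctl instead):
Code:
# run on every node in the cluster
service pve-cluster restart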
 
This can happen when one node is not able to process corosync packets fast enough (maybe too much load on one node?)
 
I'm really not sure how that is possible - these machines, all three of them:

* have no VMs yet
* are dual quad-core Xeon machines (8 physical cores total)
* have 16 GB of RAM
* are connected to each other via a gigabit link on an isolated management VLAN (the only machines on this VLAN are the 3 PVE machines and our internal router, which interconnects everything and acts as the firewall, only allowing certain people access to the web and SSH interfaces of the PVE nodes)
* were all rebooted 2 days, 19 hrs ago (the last time this happened, when I just rebooted them instead of restarting pve-cluster)


I haven't even connected them to the ceph cluster yet.

This cluster has only three machines in it.

When this issue occurs, each machine behaves the same way: if I open its web interface, it shows itself as green and the other two as red.

In theory I'd imagine if one machine got overloaded, the other two would still show each other as green and the overloaded node as red.

That said, I'm not willing to discount the possibility of a load issue. I've installed sysstat so I can get more load data over a period of time.
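For example, something like this should give me CPU usage and run-queue/load numbers at regular intervals (sar comes with sysstat; the interval and count below are just what I intend to use):
Code:
# CPU utilization, sampled every 10 seconds, 30 samples
sar -u 10 30
# run queue length and load averages over the same kind of window
sar -q 10 30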

I'm also seeing this in syslog:

Code:
Apr 17 12:12:01 pve-3 pveproxy[578272]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pvedaemon[3770]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pveproxy[578273]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pveproxy[578271]: ipcc_send_rec failed: Transport endpoint is not connected
Apr 17 12:12:01 pve-3 pvedaemon[3771]: ipcc_send_rec failed: Transport endpoint is not connected
 
PS - I should also note that I tested all nodes with omping, and multicast is working between all of them (sample output from pve-2):
Code:
pve-1 :   unicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.138/0.154/0.233/0.018
pve-1 : multicast, xmt/rcv/%loss = 29/29/0%, min/avg/max/std-dev = 0.146/0.166/0.237/0.016
pve-3 :   unicast, xmt/rcv/%loss = 34/34/0%, min/avg/max/std-dev = 0.114/0.150/0.178/0.012
pve-3 : multicast, xmt/rcv/%loss = 34/34/0%, min/avg/max/std-dev = 0.123/0.157/0.190/0.013
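For completeness, this is roughly the omping invocation I ran on each node (I stopped it by hand, hence the uneven packet counts; the exact count/interval I used may have differed slightly):
Code:
omping -c 600 -i 1 -q pve-1 pve-2 pve-3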

I've also increased the totem window size to see if that changes anything. The default is 50; I've set it to 125. (The corosync docs say you should stay within 256000/MTU, which for my MTU of 1500 works out to about 170, so I'm below the recommended maximum.)
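In config terms it's just the window_size option in the totem section; a minimal sketch of the change (on an older cman-based cluster the same setting would instead go as an attribute on the totem tag in /etc/pve/cluster.conf):
Code:
totem {
  # ... existing settings (version, interface, etc.) left as they are
  window_size: 125
}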
 
I'm still seeing the totem retransmits... All three servers were showing ~99.88% idle during that period, so I'm just not seeing why this is going on with this cluster. I also don't see any errors on the switch during the timeframe shown in syslog. I'm not quite sure how I can resolve this without a reinstall (and without using an MDADM RAID 1 on the boot drive). But the boot drives had been working fine in RAID 1 up until a few days ago, just after the last update, and honestly, another set of three machines running the Ceph cluster with RAID 1 on the boot drive isn't showing these symptoms. So there is something hinky in this cluster that I'd really like to understand, even though it would be inconsequential if I wiped and re-created the systems.
 
Well, I haven't been able to resolve it yet, even with both the window and token parameters set.
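For reference, the token change is the same kind of totem edit as the window_size one above; the timeout below is only an example value, not necessarily what I have configured:
Code:
totem {
  # token timeout in milliseconds (example value only)
  token: 3000
  window_size: 125
}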

I've tried different ethernet cards, and I've checked server load (CPU load never exceeds a fraction of a single core, no zombied processes, etc.), so I know it's not a server issue. I've checked the switch and don't see anything in its logs, and another Proxmox cluster on a different VLAN on the same switch isn't experiencing these issues, so I'm not really sure where to look next before doing a re-install...
 
Yes, all set to the default 1500.
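To double-check, the MTU shows up in each interface's flags line (interface names here are only examples):
Code:
ip link show eth0
ip link show vmbr0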

I'm trying a different switch now to eliminate the switch as the culprit (it's just odd, though, that another Proxmox cluster on a different VLAN on the original switch isn't experiencing this issue). Unfortunately, it takes some time for the problem to crop up, so I may not have results for another several hours. Jumbo frames are disabled on the switch, and the servers themselves are only using regular frames.

I'm checking everything in the switch again to make sure there isn't something mis-configured, but so far no issues found.
 
Right now I'm testing with an unmanaged switch. The main switch is a Cisco SG300-52. I've just gone through all of the multicast settings of the SG - if everything works on the unmanaged switch (currently this unmanaged switch only has the two proxmox servers on it and I'm monitoring from the console right there) then I will try some of the multicast settings in the 300 to see if I can clear the issue up.

UPDATE: I'm still seeing the issue when using the unmanaged switch. I'm going to try increasing the max receive buffer for the network to see if that improves matters at all, but really that seems like it would just be a bandaid; I'd really like to figure out why these machines are being problematic. The only other thing I could imagine is something going on with the mdadm RAID on these boxes, but I can't see how there could be a correlation between software RAID and the network, as we're not even close to overloading these servers (especially considering all that's installed is Proxmox ;)
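The kind of change I mean is bumping the kernel's socket receive buffer limits; the sizes below are placeholders, not values I've settled on:
Code:
# raise the default and maximum receive buffer sizes (example sizes only)
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.rmem_max=8388608
# add the same keys to /etc/sysctl.conf to make them persistent across reboots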
 
Originally I wasn't using any IGMP snooping or queriers, or any filters aside from standard VLANs. I do have all of that set up at this point, but even so, I was seeing these retransmits on an unmanaged switch as well, so there has to be something going on with at least one of these machines...
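On the node side the Linux bridge has its own multicast snooping/querier knobs, so I can check those too (vmbr0 is just the usual default bridge name; adjust as needed):
Code:
# 1 = enabled, 0 = disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping
cat /sys/class/net/vmbr0/bridge/multicast_querier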
 
IGMP snooping is irrelevant in this test: there are only the two Proxmox machines on this switch, and the unmanaged switch was not interconnected to anything (no uplink or any other servers that would cause traffic), so all traffic that existed on it was to and between these two machines. I honestly don't see how a lack of IGMP snooping would have had any effect in this test scenario.
 
