Node Restarts

Downeast Tech

New Member
Sep 20, 2016
I have a 3-node cluster set up, and two of the nodes are running new VMs with Windows Server 2008. Every few days the two nodes running the VMs restart and bring the VMs back up, while the third node stays up with no restarts. I am looking for any tips on what to check, or keywords to search for in the log files, to find out why this might be happening. I originally thought it was caused by a network issue during backup, but I disabled the backup and the nodes still restart. Thanks for any help.
 
The switch these nodes are going through is old and doesn't support multicast. I am installing a new one that supports multicast and IGMP this week. Will update when I have more info. Thanks for the help.
 
I don't really know what I am looking for in the syslog to indicate what caused the shutdown. There is a lot more in the syslog on the two nodes running the VMs, but any hints on what to search for would be greatly appreciated.

I installed a new switch that supports multicast. I am following the link for the Proxmox multicast tips. I have omping installed but can't seem to get any response from the other nodes when I use the utility. It just keeps waiting for a response. Could my switch be set up incorrectly, or is something on the nodes configured incorrectly?

Thanks for the help.
 
I just noticed that the first two nodes of my cluster (server1, server2) have the same FQDNs as our existing servers. The old servers will be replaced with these new ones, but could that have something to do with the issues? Just throwing anything I can out there to see if it helps. Thanks.
 
I have some more information on this issue.

I have tested multicast and know that it is working. I followed the wiki multicast notes and ran every test they describe. The omping 600 test ran for around 10 minutes and completed on each of the three nodes with no dropped packets.

In the syslog on each server I see the corosync [TOTEM] Retransmit List error several times each day. After the retransmit errors end, which happens at a different point each time, there is usually a TOTEM Failed to Receive error, followed by what looks like the cluster finding each member again and forming a new membership. From the information I can find on this error, it usually occurs because one processor is running slower than the others. These servers were purchased new and all have the same hardware. Also, since the error happens on all the nodes, I didn't think it was isolated to one slow processor.
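
To see how often these errors occur on each node, something like the following could tally them per day. This is only a rough sketch. The function name `count_totem_events` and the `PATTERNS` table are made up for illustration, the message fragments are assumptions based on the errors described above (the exact wording may differ between corosync versions), and it assumes the traditional syslog timestamp prefix such as `Sep 20 14:03:11`.

```python
import re
from collections import Counter

# Assumed message fragments; adjust them to match what the syslog actually shows.
PATTERNS = {
    "retransmit": re.compile(r"\[TOTEM\s*\].*Retransmit List", re.IGNORECASE),
    "failed_to_receive": re.compile(r"\[TOTEM\s*\].*Failed to receive", re.IGNORECASE),
    "new_membership": re.compile(r"\[TOTEM\s*\].*new membership", re.IGNORECASE),
}


def count_totem_events(lines):
    """Count matching TOTEM messages per (day, kind).

    Assumes each line starts with a syslog-style timestamp, so the first
    two whitespace-separated fields (month and day) identify the day.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # too short to carry a timestamp and a message
        day = " ".join(fields[:2])
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(day, kind)] += 1
    return counts


# Example use on a node (the path is an assumption):
#   with open("/var/log/syslog", errors="replace") as handle:
#       for (day, kind), n in sorted(count_totem_events(handle).items()):
#           print(day, kind, n)
```

Comparing the per-day counts from all three nodes against the times of the restarts might show whether the retransmit bursts line up with them.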

Another issue that I have noticed is with SSH between nodes. After a restart of all the nodes, the cluster seems to communicate fine, with the GUI able to send commands to each node. There are times, though, when one node will not take commands, as if there is an issue with SSH. When I connect to each node with PuTTY, I can SSH into the other nodes, and it asks me to accept the new SSH key. After I do this, the communication between nodes works like it is supposed to.

I don't know if these issues are related but any help in resolving this would be greatly appreciated. Many thanks.
 
