Node Restarts

Downeast Tech · Sep 20, 2016

I have a 3-node cluster, and 2 of the nodes have new VMs running Windows Server 2008. Every few days the two nodes running the VMs restart and bring the VMs back up, while the 3rd node stays up with no restarts. I am looking for tips on what to check, or keywords to search for in the log files, to find out why this might be happening. I originally thought it was caused by a network issue during backup, but I disabled the backup and the nodes still restart. Thanks for any help.

wosp · Sep 20, 2016

Test multicast traffic, please see https://pve.proxmox.com/wiki/Multicast_notes

If this is ok, is there enough free memory on the PVE nodes?

Downeast Tech · Sep 24, 2016

The switch these nodes go through is old and doesn't support multicast. I am installing a new one that supports multicast and IGMP this week. Will update when I have more info. Thanks for the help.

cadbury · Sep 24, 2016

Have you checked the syslog for any indication of what triggered the shutdown? Checked cron?

Downeast Tech · Oct 5, 2016

I don't really know what I am looking for in the syslog to indicate what caused the shutdown. There is a lot more in the syslog on the two nodes running the VMs, but any hints on what to search for would be greatly appreciated.
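As a starting point for the "what to search for" question, here is a minimal Python sketch that pulls out syslog lines containing keywords often associated with unexpected node restarts. The keyword list is an assumption for illustration, not an official Proxmox list, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed keywords, not an official list. Matching is case-insensitive.
KEYWORDS = ("watchdog", "fence", "quorum", "totem", "out of memory")


def find_suspect_lines(lines: Iterable[str],
                       keywords: Tuple[str, ...] = KEYWORDS) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It can be called with an open file, for example `find_suspect_lines(open("/var/log/syslog", errors="replace"))`, where `/var/log/syslog` is the usual Debian location. Looking at the last matches before each restart timestamp is likely more useful than reading the whole list.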

I installed a new switch that allows multicast, and I am following the linked Proxmox multicast notes. I have omping installed but can't get any response from the other nodes when I use the utility. It just keeps waiting for a response. Could my switch be set up incorrectly, or is something configured incorrectly on the nodes?

Thanks for the help.

Downeast Tech · Oct 5, 2016

I just noticed that the first two nodes of my cluster (server1, server2) have the same FQDN as our existing server. The old servers will be replaced with these new ones, but could that have something to do with the issues? Just throwing out anything I can to see if it helps. Thanks.

Downeast Tech · Nov 4, 2016

I have some more information on this issue.

I have tested the multicast and know that it is working. I followed the wiki multicast notes and tested everything it said. The omping 600 test completed on each of the three nodes with no dropped packets and it ran for around 10 minutes.
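For reference, a run of about 10 minutes corresponds to 600 probes sent one second apart. The Python sketch below only builds that command line. `-c`, `-i`, and `-q` are omping options; the function name is hypothetical, and `server3` is an assumed name for the third node, which is not named in this thread.

```python
import shlex
from typing import List, Sequence


def build_omping_command(nodes: Sequence[str], count: int = 600,
                         interval_s: float = 1.0) -> List[str]:
    """Build an omping argv: `count` probes, `interval_s` seconds apart, quiet output."""
    if len(nodes) < 2:
        raise ValueError("a multicast test needs at least two hosts")
    return ["omping", "-c", str(count), "-i", f"{interval_s:g}", "-q", *nodes]


# Placeholder node names; the same command is started on every node at about the same time.
command = build_omping_command(["server1", "server2", "server3"])
print(shlex.join(command))  # omping -c 600 -i 1 -q server1 server2 server3
```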

In the syslog on each server I see the corosync [TOTEM] Retransmit List error several times each day. After the retransmit errors stop, which takes a different amount of time on each occasion, there is usually a TOTEM Failed to Receive error, followed by what looks like the cluster finding each member again by forming a new membership. From the information I can find on this error, it usually occurs because one processor is running slower than the others. These servers were purchased new and all have the same hardware. Also, since the error happens on all the nodes, I didn't think it was isolated to a slow processor.
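To compare how often these messages occur on each node and on which days, one option is to count them per day. This is a minimal Python sketch, assuming the traditional syslog timestamp (month and day in the first six characters) and matching only the message fragments quoted above; the function and label names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

# Fragments of the messages described above, matched case-insensitively.
PATTERNS = {
    "retransmit": "retransmit list",
    "failed_to_receive": "failed to receive",
    "new_membership": "new membership",
}


def count_totem_events(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count corosync TOTEM events per day.

    Assumes the traditional syslog timestamp, whose first six characters
    are the month and day (for example "Nov  4").
    """
    per_day: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        lowered = line.lower()
        if "totem" not in lowered:
            continue  # skip lines that are not TOTEM messages
        day = line[:6]
        for label, needle in PATTERNS.items():
            if needle in lowered:
                per_day[day][label] += 1
    return dict(per_day)
```

Running it against the syslog of each node and lining the counts up against the restart dates may show whether the membership changes and the restarts coincide.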

Another issue that I have noticed is with SSH between nodes. After a restart of all the nodes, the cluster seems to communicate fine, with the GUI able to send commands to each node. There are times, though, when one node will not take commands, as if there were an issue with SSH. When I PuTTY into each node and SSH from there into the other nodes, it asks me to accept the new SSH key. After I do this, the communication between nodes works like it is supposed to.
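One way to spot this state without waiting for the GUI to fail is to try a non-interactive login from each node to the others. The Python sketch below uses the OpenSSH options `BatchMode` and `ConnectTimeout`; with `BatchMode=yes`, ssh fails instead of prompting, so an unaccepted host key shows up as a failed check. The function names are hypothetical, and whether this is the same failure the GUI runs into is an assumption.

```python
import subprocess
from typing import List


def build_ssh_check(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh argv that fails instead of prompting (BatchMode)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",  # trivial remote command; only the exit status matters
    ]


def ssh_works(host: str, timeout_s: int = 5) -> bool:
    """Return True if a non-interactive ssh login to `host` succeeds."""
    try:
        result = subprocess.run(build_ssh_check(host, timeout_s),
                                capture_output=True, text=True,
                                timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

Calling `ssh_works("server2")` from server1, and the reverse, after a restart would show which direction stops working.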

I don't know if these issues are related but any help in resolving this would be greatly appreciated. Many thanks.
