Restart Question

Downeast Tech · Nov 21, 2016

I have a 3-node cluster running HA with 2 VMs. They have been functioning as intended for the past week. On day 5, node1 restarted itself with no syslog entry. The VM running on node1 was automatically transferred to node3. I wasn't sure whether this was an intended maintenance schedule, so I let things be. Now, 11 days after they were implemented, node2 restarted and its VM was automatically transferred to node1. I am curious whether this is a built-in mechanism or whether something is going on with the nodes that I need to look into. The VMs have come back up within a few minutes, so there is no interruption of service, but I am just trying to cover all the bases. Thanks.
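When a node restarts with no syslog entry, the last line written before the silence is often the only clue. Below is a minimal sketch for finding it, assuming classic syslog timestamps such as "Nov 16 03:12:45" at the start of each line. The function name find_gaps, the 2-minute threshold, and the sample lines are all hypothetical and only illustrate the idea. A quiet system can also produce gaps this long, so treat a hit as a place to start reading, not as proof of a crash.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year=2016, threshold=timedelta(minutes=2)):
    """Return (line_before_gap, line_after_gap, gap_length) for each long silence."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        stamp = line[:15]  # classic syslog prefix, e.g. "Nov 16 03:12:45"
        try:
            # Syslog omits the year, so one is supplied (an assumption).
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None and when - prev_time > threshold:
            gaps.append((prev_line, line, when - prev_time))
        prev_time, prev_line = when, line
    return gaps


# Hypothetical sample lines, not taken from a real node.
SAMPLE = [
    "Nov 16 03:10:01 node1 CRON[2211]: (root) CMD (example job)",
    "Nov 16 03:11:40 node1 pmxcfs[1480]: example message",
    "Nov 16 03:16:02 node1 kernel: example first boot message",
]

for before, after, length in find_gaps(SAMPLE):
    print("silent for", length)
    print("  last line before:", before)
    print("  first line after:", after)
```

On a real node the lines would come from the syslog file rather than a list, and whatever was logged just before the gap (quorum loss, hardware errors, or nothing at all) helps narrow down the cause.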

ravib123 · Nov 22, 2016

Well,

I would wonder how you are set up for fencing. An incorrect fencing configuration can cause a lot of issues, and I would imagine this symptom could be one of them.

When I run into these issues, though, the cause is almost always hardware. What did you do to test the hardware before implementation?

You'll have to let us know what you track down, these ones are always fun.

Downeast Tech · Nov 22, 2016

I use the watchdog fencing that was automatically set up with the HA Cluster 4.x. I followed the wiki and it seems to work. If it could work better, please give me some pointers. Thanks!

Downeast Tech · Nov 22, 2016

The hardware was all new Dell PowerEdge servers, and the Netgear 10GbE switch recommended by the Proxmox wiki HA cluster setup. I have had some minor issues with nodes restarting, but they are preventable if I restart the nodes manually from time to time. I had asked the forum for help with these before, and the only suggestion was that they were multicast issues. I tested multicast several times and saw neither lost packets nor high latency.
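For anyone repeating the multicast test, here is a small sketch that checks saved test output for lost packets. It assumes the summary lines look like the ones omping prints at the end of a run (host, unicast or multicast, then xmt/rcv/%loss). That format, the function names, and the sample addresses are assumptions for illustration only. It compares the sent and received counts directly because a rounded percentage can show 0% even when a few packets went missing.

```python
import re

# Assumed summary format: "<host> : multicast, xmt/rcv/%loss = 600/597/0%, ..."
SUMMARY = re.compile(
    r"^(?P<host>\S+)\s*:\s*(?P<kind>unicast|multicast),\s*"
    r"xmt/rcv/%loss = (?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+)%"
)


def parse_summary(text):
    """Extract one record per summary line; other lines are ignored."""
    results = []
    for line in text.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            results.append({
                "host": match["host"],
                "kind": match["kind"],
                "xmt": int(match["xmt"]),
                "rcv": int(match["rcv"]),
            })
    return results


def lossy(results):
    """Return records where fewer packets were received than sent."""
    return [r for r in results if r["rcv"] < r["xmt"]]


# Hypothetical output; the addresses are documentation-range placeholders.
SAMPLE = """\
192.0.2.12 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.101/0.214/0.398/0.051
192.0.2.12 : multicast, xmt/rcv/%loss = 600/597/0%, min/avg/max/std-dev = 0.109/0.231/0.412/0.057
"""

for record in lossy(parse_summary(SAMPLE)):
    print(record["host"], record["kind"], "lost", record["xmt"] - record["rcv"])
```

A short test can look clean while a longer one does not, so it may be worth running the test for several minutes before ruling multicast out.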

ravib123 · Nov 22, 2016

Downeast Tech said:
I use the watchdog fencing that was automatically set up with the HA Cluster 4.x. I followed the wiki and it seems to work. If it could work better, please give me some pointers. Thanks!

I was mostly just pointing you in a logical direction for reviewing your config, since HA is working and handing the VMs off, and HA has the ability to force a reboot if it detects a node offline.

I don't have specific pointers because I don't know or manage your environment.
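To make the watchdog idea concrete, here is a toy model of how watchdog-based fencing can end in a reboot with nothing written to the log. It is only a sketch of the general principle: the timer is refreshed while the node is healthy, and if the refreshes stop for longer than the timeout, the hardware resets the node. The class, the simulate function, and the 60-second timeout are hypothetical and are not taken from the Proxmox code or its defaults.

```python
class Watchdog:
    """Toy watchdog timer: fires if it is not fed within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_fed = 0

    def feed(self, now):
        self.last_fed = now

    def expired(self, now):
        return now - self.last_fed >= self.timeout


def simulate(quorum_by_second, timeout=60):
    """Return the second at which the node would be reset, or None.

    `quorum_by_second` holds one boolean per second: True while the node
    sees the cluster, False while it does not. The timeout is illustrative.
    """
    dog = Watchdog(timeout)
    for now, has_quorum in enumerate(quorum_by_second):
        if has_quorum:
            dog.feed(now)  # healthy: keep refreshing the timer
        elif dog.expired(now):
            return now  # refreshes stopped for too long: hard reset
    return None


# A 30-second network blip is survived; a long outage ends in a reset.
print(simulate([True] * 10 + [False] * 30 + [True] * 10))
print(simulate([True] * 10 + [False] * 100))
```

Because the reset in this model is a hard one, the node gets no chance to log a shutdown, which would match a restart with no syslog entry. A network or switch problem that cuts a node off from the cluster for long enough could therefore look exactly like a random reboot.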

ravib123 · Nov 22, 2016

Downeast Tech said:
The hardware was all new Dell PowerEdge servers, and the Netgear 10GbE switch recommended by the Proxmox wiki HA cluster setup. I have had some minor issues with nodes restarting, but they are preventable if I restart the nodes manually from time to time. I had asked the forum for help with these before, and the only suggestion was that they were multicast issues. I tested multicast several times and saw neither lost packets nor high latency.

Dell servers? This is going to sound awful, but have you kept the BIOS up to date?

We see so many Dell BIOS issues that crash systems (Linux and Windows) and that are fixed by nothing more than BIOS/firmware updates for on-board devices.

Did you run memtest on those servers for an extended period of time? We usually do no fewer than 20 passes of memtest on servers, because I frequently see issues show up only after 10+ passes.

ravib123 · Nov 22, 2016

Downeast Tech said:
The hardware was all new Dell PowerEdge servers, and the Netgear 10GbE switch recommended by the Proxmox wiki HA cluster setup. I have had some minor issues with nodes restarting, but they are preventable if I restart the nodes manually from time to time. I had asked the forum for help with these before, and the only suggestion was that they were multicast issues. I tested multicast several times and saw neither lost packets nor high latency.

Also, I would be curious about the 10G switches: any packet loss? LACP? Most recent firmware?

Sadly, just because devices are new doesn't mean they are 100%. We see so many DOA items, from assembled systems and networking devices to laptops and parts.

We always test hardware extensively before deployment to make sure we can always rule out hardware issues. I suppose as an MSP we see a lot of volume so I'm a bit jaded.

Downeast Tech · Nov 22, 2016

I didn't do any BIOS updates. The servers have been in place and running HA for the past 5 months with only a few hiccups. I will do some memtests to see how those perform. I will also check for BIOS and firmware updates to see how those help. I appreciate the help.

ravib123 · Nov 22, 2016

Downeast Tech said:
I didn't do any BIOS updates. The servers have been in place and running HA for the past 5 months with only a few hiccups. I will do some memtests to see how those perform. I will also check for BIOS and firmware updates to see how those help. I appreciate the help.

Totally. The nice part about HA is being able to do this kind of maintenance.

If the firmware was never updated, then yeah. The things I see happen with Dell hardware are 100% BIOS/firmware based.

On the Dell side, firmware problems crop up on servers and on laptops/desktops alike, from updates so crazy you'd think Dell tested them with an ice cream cone and a 4-year-old instead of a team of professionals.

That being said, when you add up everything Dell owns, it is a wonder they get anything done at all, and some of their hardware is fantastic. I'm on a Latitude Rugged Extreme, which cost about $1600 refurbished, and I can't tell you how often I need to wash it under the sink because I spilled coffee on the keyboard. Not having a good workspace (I move from client to client) means a lot more spills.
