node instability - uncertain of cause

BloodyIron

I'm running a cluster at home. I know the recommended number is three nodes in a cluster, but I'm running two. I simply am not able to afford a third at this time, so please overlook that fact.

I have two nodes: node1 has been an absolute champ, while node2 has mucked up twice now. They are identical hardware and are only about a year old.

About a month or two ago (I forget the exact date/time), node2 buggered up real good (effectively the same as it is now) and started having kernel panics whenever I tried to migrate VMs or do anything fairly intensive. At the time I performed rigorous CPU, RAM and other hardware stress testing to try to identify a potential cause, and I found no observable hardware failure. So what did I do? I removed the node from the cluster, re-installed Proxmox VE on the host, re-added it to the cluster, and it had been operating like a champ since... until today.

I am seeing the same effects today as previously. To add a bit more detail: sometimes I see a kernel panic, sometimes I see a hard reset of node2. Since that last fix (~1-2 months ago) I have been able to safely migrate VMs back and forth online (online migration seems to be what triggers the vomiting), and all other expected functionality has appeared to be just fine.

I looked at the logs and saw some oddities with the nightly backups from last night. I'm unsure if that's the cause or just the canary. I woke up this morning to find three of my VMs stuck at GRUB, as if they had been hard-reset. Node2 was responding, which I thought was odd. I assumed that perhaps a backup had failed partway through and reset the VMs (less than ideal, but better than it could be). Then I began to interact with the VMs, and the kernel panics and node2 hard-resets started happening.

I'm sure there are logs and other details I can paste here, but I'm not certain which ones are desirable in this case.
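
In case it helps, here is roughly what I can pull together once I know what's wanted - a sketch only, assuming the standard Debian/PVE log locations on these nodes:

pveversion -v                              # package/kernel versions on both nodes
dmesg                                      # kernel messages from the current boot
less /var/log/syslog /var/log/kern.log     # system + kernel logs around the crash window
ls /var/log/vzdump/                        # per-VM backup logs from last night, if kept here
grep -i vzdump /var/log/syslog             # anything the nightly backup job logged to syslog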

I figure it would be prudent to try to find the cause of this, so that either I can fix it on my end or a fix can be made upstream, instead of simply nuking the node and re-installing it. Both nodes are fully updated as of this morning, and the effects persist.

Please advise.
 
Hi,
there are many possibilities for such things - with hard resets, I don't think a reinstall will help.

The first things I would swap are the power supply and the order of the RAM modules (are you using ECC RAM?).
I also had reboot issues with a Supermicro server which went away after a BIOS update (WTF) - look for a current BIOS, and reset the settings to defaults - perhaps something unstable like overclocking?!

Temperature issues due to old fans?

Udo
 
1) You missed the part where I said that a re-install had corrected it previously. The issues, when they were happening, were IMMEDIATE and REPEATABLE; once the re-install happened, I was unable to reproduce them whatsoever for the last two months. I'm confident this is a software issue.

2) The PSU tests fine and is less than a year old, and it's not even close to hitting its limit. You also missed the part where I said that I stress-tested the CPU and RAM and they passed.

3) It may be a BIOS issue, and I will see if there is an update, but if that were the case the stability would be far more erratic. The issue started some time last night, and since then it has been easily reproducible and happens every time. Why didn't it happen in the last two months?

I am confident this is a software issue.

Hi,
it's only a guess of mine, but if it's a software issue, why should the node reboot?
I have seen some kernel panics (with different causes), but then the node doesn't reboot - it just hangs with the kernel panic.

Reboots I have only seen with faulty PSUs (which ran stable during testing) and faulty BIOSes.

But if it makes you feel better, do a reinstall - why should the current software reboot your server? And what would be different after a reinstall?? You would be installing the same software on the same device...

Udo
 
Well, you can never be sure whether it's a SW or a HW failure until you have definite proof. Stress-testing memory and CPU does not test everything. Even high-value, brand-name servers can crash twice a day after rigorous testing, while passing their own firmware-based self-tests at every single reboot. I know, I've seen it. I've always wanted to try crash kernels, but unfortunately haven't found the time to study the process well enough to try. It's a mechanism to record dumps and enough information to help find the root cause of a specific kernel crash. It involves automatically booting a secondary "crash" kernel after the main one gives in, and that secondary kernel is responsible for collecting the data needed for the investigation.

I can only suggest trying it out yourself; a good writeup is here, for example: http://www.dedoimedo.com/computers/kdump.html - it is based on Ubuntu, but should be general enough to adapt for PVE and its RHEL-derived kernel. If you happen to try it, I'd be glad to hear how it went.
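
Roughly, on a Debian-based PVE host I'd expect the setup to look something like the following - untested on my side, and the package name, crashkernel size, and dump path are assumptions taken from stock Debian rather than anything PVE-specific:

apt-get install kdump-tools
# reserve RAM for the secondary crash kernel at boot:
# add crashkernel=128M to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
update-grub && reboot
cat /sys/kernel/kexec_crash_loaded   # should print 1 once the crash kernel is loaded
kdump-config show                    # confirm kdump-tools is armed
# after the next panic, the dump should land under /var/crash for later analysis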
 
Interesting proposition; I may try it out at some point, but doesn't Proxmox VE already collect kernel panics or other forms of failure in a dump location?

I'm trying to keep the system in the same state, so that once I know what I need to supply for troubleshooting, I can provide it in its original form.

I'm confident it's a software issue, since I can't trigger any system issues any other way (live-migrate a VM onto the host, and ~5-10 seconds later it kernel panics or hard-resets). The whole combination of variables points me to software, and I've been troubleshooting systems for about 15 years (being a sysadmin and all).
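
For what it's worth, the reproduction is nothing exotic - just an ordinary online migration onto node2, e.g. from the CLI (the VM ID below is a placeholder):

qm migrate 100 node2 --online   # ~5-10s after the VM lands on node2, the host panics or hard-resets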

 
Ahh, there's a BIOS update for these motherboards addressing stability. I'll see if I can slap it on there and see if it helps. Oddly enough, the instability is only on node2, but if the update corrects the issue for node2, I'll slap it on node1 also.

I'm open to more/alternate ideas in the meantime too :)

Thanks for the input so far, folks; sorry if I'm a bit brash :S
 
Looks like I spoke too soon. The stability issue is still here. I have opened the system and am measuring temperatures with a laser/IR thermometer while it's operating as it fails, and the temperatures are all safe, so I know it's not a thermal issue. I've also reset the BIOS to defaults, so the update should be in a clean state.

I still suspect it is software, as it crashes when trying to spin up some VMs automatically, and it does so every time. It just hard-resets, with no error. :S
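
If it's useful, I can also cross-check the temperatures in software with lm-sensors - a sketch only, assuming the sensor chip on these boards is supported:

apt-get install lm-sensors
sensors-detect    # answer the prompts so it can probe for supported chips
sensors           # read temperatures/fan speeds while reproducing the crash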
 
Your assumption that it's a software issue is just that - a poor assumption.

Recently I had two systems that were identical: dual Xeon CPUs, 128GB of ECC RAM - we're talking high-end, server-grade hardware.

One of them kept having issues similar to what you describe. I set up a serial console and logged the serial output so I could catch a kernel panic. Each time, the panic indicated a RAM error.
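
If you want to try the same thing, the setup is roughly this - assuming GRUB and the first serial port; adjust the port and speed to match your board:

# /etc/default/grub on the crashing node
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
# then run update-grub and reboot
# on a second machine connected via null-modem cable, log everything:
#   screen -L /dev/ttyS0 115200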

I swapped one bank of RAM and ran memtest for days, and all was OK. Put it back in production: same crap, kernel panic.

It did not matter what test I ran - stress tests, memtest, you name it - no test would trigger this problem. The only thing that would trigger it was a production load, same as you describe.

Wanna take a guess what the problem was?

A $1k CPU was to blame. Replaced under warranty and problem resolved.

The fact that you cannot trigger the problem using some artificial test is insignificant.

Start swapping parts between the two systems; when the problem moves, you will have identified the source of your problem. CPU, RAM, motherboard, disks, RAID card, cables, power supply - leave no component unturned.

Yes, disks can cause the problem you describe too. My 100+ WD RE3 disks are unhappy reminders of this fact; theirs was fixed with an unpublished firmware update.

Base your assertions on facts, not assumptions; otherwise you're no better than a dog chasing its tail.
 
You're right, I do need to base assertions on facts, not assumptions. When I wrote the last few posts I had no facts, as I was unable to conclusively determine the point of failure - hence "suspecting" and all the vague language I used.

Since I wrote my last post I have continued non-stop trying to diagnose the problem, going through Realtek driver modules, re-installing, and all sorts of other things. At this point I can't even get it to boot off a live USB into an Ubuntu desktop, so I've declared the system unstable and am taking it in to get it RMA'd tomorrow.

Leading up to this, I was trying to gather together what I could based on the limited information I had. I'm never against any conclusion when it comes to diagnosing a system, if it's conclusive. Earlier I had no conclusive evidence as far as I could tell, hence the speculation.

Your example sounds like a right pain in the neck; that sucks.


