All VMs power off

ravib123

Active Member
Nov 27, 2012
United States
deusmachine.com
I'm using Proxmox 2.3-12.

I am seeing an issue where all my VMs stop, and I am having trouble tracking it down.

One time I actually saw it happen: they all paused and then stopped. Otherwise I've only seen the aftermath in the GUI.

Where do I even start looking for the cause?

The environment:

One large Proxmox node (128 GB RAM, dual quad-core) and two smaller ones (single processor, 32 GB) for some test activities. Everything connects to an NFS SAN for the qcow2 files.

I should mention that occasionally one or two of the most demanding containers go down as well.

The strangest part is that not everything goes down, even though it's all on the same storage.

Any pointers would be nice.
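Not from the thread, but as a general starting point for "where do I look": on a Debian-based Proxmox 2.x node, the system log is the first stop. A minimal sketch, assuming standard syslog paths and typical kernel message wording (both are assumptions, not anything confirmed here):

```shell
# Hypothetical triage helper: scan a syslog-style file for the usual
# suspects when every VM dies at once (OOM killer, kernel panics,
# watchdog resets, NFS server timeouts, segfaults).
# Patterns are illustrative; real message wording varies.
scan_for_crash_causes() {
    grep -Ei 'oom-killer|out of memory|kernel panic|watchdog|nfs: server .* not responding|segfault' \
        "${1:-/var/log/syslog}"
}
```

Running it against /var/log/syslog and /var/log/kern.log and checking any matches against the incident timestamps narrows things down quickly.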
 
Yes,

I am able to restart them; they look to have been shut down ungracefully.

The SAN is an Openfiler NFS store (XFS).

No errors are logged on the SAN.

Also, I should note there are no performance issues even when it occurs.

- - - Updated - - -

I'm not sure if it will help, but I updated to the most recent release, 2.3-18; it included a new kernel, and a QEMU update was in the list as well.

Perhaps this will help.

If it does I will mark the post as resolved, but I am assuming it won't make a difference, so my quest for a resolution continues.

 
 
Do you feel this is relevant to the issue I'm experiencing? If so, please elaborate.

The forums are still reasonably active, commercial support is still available, and the last update to their public site is from 2012. I wouldn't say it's abandoned, just that there are very few codebase changes. That said, I've had no problems with 10GbE or InfiniBand in my use of Openfiler.

- - - Updated - - -



Thanks for that link. It might be a backup issue, though I only back up a few containers with the built-in features; I use Idera for anything that isn't a container.

I'll re-review my logs and pay more attention to backup start and stop times.
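To line backup windows up against the incidents, something along these lines could help. A sketch only: it assumes vzdump-related start/finish lines appear in a syslog-style log with "vzdump" in them, which may not match this particular setup.

```shell
# Hypothetical sketch: pull vzdump-related start/finish lines out of a
# syslog-style file so backup windows can be compared with the times
# the VMs went down. Log wording is assumed, not taken from this host.
backup_windows() {
    grep -i 'vzdump' "${1:-/var/log/syslog}" | grep -Ei 'start|finish'
}
```

If an incident timestamp consistently falls inside a backup window, that points strongly at the backup job.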
 
OK, so:

I've replaced the server (twice, actually), updated the BIOS, replaced the hard drive (single SSD, not a fresh install), replaced the 10G NIC, and memtested until I could memtest no more (not in that order, naturally). Basically I've narrowed this down to the server itself rebooting (and thus the VMs), and I see no hardware reason for such a reboot. (No fencing configured, by the way.)

I have other production servers running the same equipment that don't experience this issue.

I even went ahead and updated it to v3 for giggles.

Has anyone ever seen just a bum install with these symptoms?
 

No shortage of spare parts, just a shortage of time. This is the first chance I've had to get back to this issue, since it isn't client-related. ;D
 
I saw this once, and it was due to power-line problems/disturbances, even though the server had an APC UPS.
Replacing the APC UPS with an online model solved the problem, except that Proxmox was old (<=1.9) and NUT did not support that online model.
If your server has some integrated monitoring (like Dell DRAC or whatever it's called, Fujitsu iRMC, etc.), have a look at the logs there.
 

I like what you're thinking.

I actually recall hearing of such UPS issues, though this would be the first time I've experienced it myself. I am using a pretty budget-oriented UPS on that server.

Potentially, if the UPS is putting out improper power, the redundant power supplies might shut off to protect themselves.

The IPMI doesn't have anything logged about over/under-voltage, but I suspect that in this situation it wouldn't make it to the mainboard and would instead stay at the power-supply level. That would explain why it happens only under load, posts no errors that I can find, and tests good on the bench.
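One way to double-check the power-supply theory from the OS side, sketched under the assumption that ipmitool is installed and the BMC keeps a system event log (save it with something like `ipmitool sel elist > sel.txt`). The event wording below is illustrative only, since real SEL text varies by vendor:

```shell
# Hypothetical filter over a saved SEL dump: keep only power-related
# events. Patterns are illustrative; BMC wording differs by vendor.
power_events() {
    grep -Ei 'power supply|power unit|voltage|ac lost|ps[12]' "$1"
}
```

Even when nothing reached the mainboard sensors, a PSU-level event sometimes still lands in the SEL, so it's a cheap check before swapping hardware.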

I'll replace the UPS and let you know what I find.

Thanks for the obscure idea :D
 
Just wanted to let y'all know this issue was solved.

I replaced the unit three times; the third one worked.

Interestingly, the hardware was just unstable under load, but in a non-reproducible way.

We replaced:
Battery Backup
Power Supply
RAM
CPUs
and finally the entire barebones unit 3 times to get a good one.

Of course, we ran the full burn-in process on each unit and after each change.
 
