All VMs power off

ravib123

Active Member
Nov 27, 2012
United States
deusmachine.com
I'm using Proxmox 2.3-12.

I am seeing an issue where all my VMs stop, and I am having trouble tracking it down.

One time I actually saw it happen: they all paused and then stopped. Otherwise I've only seen the aftermath in the GUI.

Where do I even start looking for the cause?

The environment:

One large Proxmox node (128 GB RAM, dual quad-core) and two smaller ones (single processor, 32 GB) for some test activities. Everything connects to an NFS SAN for the qcow2 files.

I should mention that occasionally one or two of the most demanding containers go down as well.

The strangest part is that not everything goes down, even though it's all on the same storage.

Any pointers would be nice.
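Not from the thread, but as a general starting point for "where do I look": on a Debian-based Proxmox 2.x node, the system log is the first stop. A minimal sketch, assuming standard syslog paths and typical kernel message wording (both are assumptions, not anything confirmed here):

```shell
# Hypothetical triage helper: scan a syslog-style file for the usual
# suspects when every VM dies at once (OOM killer, kernel panics,
# watchdog resets, NFS server timeouts, segfaults).
# Patterns are illustrative; real message wording varies.
scan_for_crash_causes() {
    grep -Ei 'oom-killer|out of memory|kernel panic|watchdog|nfs: server .* not responding|segfault' \
        "${1:-/var/log/syslog}"
}
```

Running it against /var/log/syslog and /var/log/kern.log and checking any matches against the incident timestamps narrows things down quickly.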
 
Yes,

I am able to restart them; they look to have been shut down ungracefully.

The SAN is an Openfiler NFS store (XFS).

No errors are logged on the SAN.

Also, I should note there are no performance issues even when it occurs.

- - - Updated - - -

I'm not sure if it will help, but I updated to the most recent release, 2.3-18; it included a new kernel, and a QEMU update was in the list as well.

Perhaps this will help.

If it does I will mark the post as resolved, but I am assuming it won't make a difference, so my quest for a resolution continues.

 
 
Do you feel this is relevant to the issue I'm experiencing? If so, please elaborate.

The forums are still reasonably active, commercial support is still available, and the last update to their public site is from 2012. I wouldn't say it's abandoned, just that there are very few codebase changes. That said, I've had no problems with 10GbE or InfiniBand in my use of Openfiler.

- - - Updated - - -



Thanks for that link. It might be a backup issue, though I only back up a few containers with the built-in features; I use Idera for anything that isn't a container.

I'll re-review my logs and pay more attention to backup start and stop times.
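To line backup windows up against the incidents, something along these lines could help. A sketch only: it assumes vzdump-related start/finish lines appear in a syslog-style log with "vzdump" in them, which may not match this particular setup.

```shell
# Hypothetical sketch: pull vzdump-related start/finish lines out of a
# syslog-style file so backup windows can be compared with the times
# the VMs went down. Log wording is assumed, not taken from this host.
backup_windows() {
    grep -i 'vzdump' "${1:-/var/log/syslog}" | grep -Ei 'start|finish'
}
```

If an incident timestamp consistently falls inside a backup window, that points strongly at the backup job.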
 
OK, so:

I've replaced the server (twice, actually), updated the BIOS, replaced the hard drive (single SSD, not a fresh install), replaced the 10G NIC, and memtested until I could memtest no more (not in that order, naturally). Basically I've narrowed this down to the server itself rebooting (and thus the VMs), and I see no hardware reason for such a reboot. (No fencing configured, by the way.)

I have other production servers running the same equipment that don't experience this issue.

I even went ahead and updated it to v3 for giggles.

Has anyone ever seen just a bum install with these symptoms?
 

No shortage of spare parts, just a shortage of time. This is the first chance I've had to get back to this issue, since it isn't client-related. ;D
 
I saw this once, and it was due to power-line problems/disturbances, even though the server had an APC UPS.
Replacing the APC UPS with an online model solved the problem, except that Proxmox was old (<=1.9) and NUT did not support that online model.
If your server has some integrated monitoring (like Dell DRAC or whatever it's called, Fujitsu iRMC, etc.), have a look at the logs there.
 

I like what you're thinking.

I actually recall hearing of such UPS issues, though this would be the first time I've experienced it myself. I am using a pretty budget-oriented UPS on that server.

Potentially, if the UPS is putting out improper power, the redundant power supplies might shut off to protect themselves.

The IPMI doesn't have anything logged about over/under-voltage, but I suspect that in this situation it wouldn't make it to the mainboard and would instead stay at the power-supply level. That would explain why it happens only under load, posts no errors that I can find, and tests good on the bench.
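One way to double-check the power-supply theory from the OS side, sketched under the assumption that ipmitool is installed and the BMC keeps a system event log (save it with something like `ipmitool sel elist > sel.txt`). The event wording below is illustrative only, since real SEL text varies by vendor:

```shell
# Hypothetical filter over a saved SEL dump: keep only power-related
# events. Patterns are illustrative; BMC wording differs by vendor.
power_events() {
    grep -Ei 'power supply|power unit|voltage|ac lost|ps[12]' "$1"
}
```

Even when nothing reached the mainboard sensors, a PSU-level event sometimes still lands in the SEL, so it's a cheap check before swapping hardware.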

I'll replace the UPS and let you know what I find.

Thanks for the obscure idea :D
 
Just wanted to let y'all know this issue was solved.

I replaced the unit three times; the third one worked.

Interestingly, the hardware was just unstable under load, but in a non-reproducible way.

We replaced:
Battery Backup
Power Supply
RAM
CPUs
and finally the entire barebones unit 3 times to get a good one.

Of course, we ran the full burn-in process on each unit and after each change.
 
