Analysing crashes

kobuki

I have a misbehaving PVE box, used for development and some office tasks. It's not mission critical, but it runs 24x7. It routinely crashes and reboots once or twice a week, at random and with no correlation with load. I suspect a HW failure, but since I'm not at the console when it reboots, I have no idea what's wrong. I'd like to install a few crashdump tools, and I need a debug version of the kernel for that. Is one available somewhere? If not, can you suggest a way to see what happens at the moment of the crash?
 
Some things to consider:
1) Bad RAM
2) Bad ventilation, especially a failing CPU fan that lets the CPU get too hot. When this happens, the CPU automatically shuts down.
3) Bad sectors on disk can cause crashes.
 
Thanks, but I know all of these and have checked them. We even ran a RAM test overnight and no problems were detected. I'm suspecting either a bad driver or just a random HW failure. If it's a driver, I might have a chance to upgrade and fix the misbehaviour.
 
No hints in the system logs, I suppose?
Is the node on a UPS or behind power filtering? Power spikes or other power issues could also cause similar behaviour.
Otherwise, you could post the output of pveversion -v and/or your hardware details. Maybe others are using the same hardware, with or without problems, and that could help you narrow down the hardware/driver source of the crashes.

Marco
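For anyone who wants to double-check the "no hints in the logs" part, here is a minimal sketch of a log scan in Python. The patterns are illustrative assumptions, not an exhaustive or authoritative list: kernel messages differ between versions and drivers, so adjust them to what your own logs contain. The function name `find_suspect_lines` is made up for this example.

```python
import re

# Illustrative patterns only (an assumption of this sketch); real kernel
# messages vary by kernel version and driver.
SUSPECT_PATTERNS = {
    "machine check": re.compile(r"machine check|\bmce\b|hardware error", re.I),
    "thermal": re.compile(r"temperature above threshold|thermal", re.I),
    "disk I/O": re.compile(r"I/O error|medium error", re.I),
    "memory": re.compile(r"out of memory|\bEDAC\b", re.I),
    "kernel fault": re.compile(r"kernel panic|\bOops\b|BUG:", re.I),
}


def find_suspect_lines(lines):
    """Return (category, line) pairs for lines matching any pattern.

    Each line is reported once, under the first category that matches.
    """
    hits = []
    for line in lines:
        for category, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((category, line.rstrip("\n")))
                break
    return hits
```

It takes any iterable of lines, so you can pass it an open log file (on a Debian-based system that would typically be something like /var/log/kern.log or /var/log/syslog, including the rotated copies from before the last reboot) and print whatever it returns. An empty result does not rule out hardware trouble; a hard reset often leaves nothing in the logs at all.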
 
Kernel version?

Some of us have had crashes with pve-kernel > 2.6.32-27
 
So far I've kept it at 2.6.32-27 because vzdump backups hang randomly on newer ones. But because of the new OVZ vulnerability I'll need to upgrade this box too. So I thought I'd at least try to find the culprit by installing a crashdump mechanism, but that requires a kernel with symbols compiled in. There are no hints in the logs; it just restarts, possibly after a panic.
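Before installing anything, it may help to see how the running kernel is currently set up for panics. The Python sketch below only reads three standard Linux files: /proc/cmdline (for a `crashkernel=` memory reservation), /sys/kernel/kexec_crash_loaded (whether a capture kernel is loaded) and /proc/sys/kernel/panic (the reboot-on-panic timeout). The function names are invented for this example, and whether the PVE kernel in question was built with kdump support is not something this sketch can tell you.

```python
def crashkernel_setting(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def read_text(path):
    """Read a small text file; return None if it is missing or unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def crash_capture_report(cmdline, crash_loaded, panic_timeout):
    """Summarise the panic/crash-capture setup as a list of text lines."""
    reserved = crashkernel_setting(cmdline or "")
    lines = ["crashkernel reservation: " + (reserved or "not set")]

    if crash_loaded is None:
        lines.append("capture kernel: unknown")
    else:
        state = "loaded" if crash_loaded == "1" else "not loaded"
        lines.append("capture kernel: " + state)

    try:
        timeout = int(panic_timeout)
    except (TypeError, ValueError):
        lines.append("on panic: unknown")
    else:
        if timeout == 0:
            lines.append("on panic: no automatic reboot")
        elif timeout > 0:
            lines.append("on panic: reboot after %d s" % timeout)
        else:
            lines.append("on panic: immediate reboot")
    return lines


if __name__ == "__main__":
    report = crash_capture_report(
        read_text("/proc/cmdline"),
        read_text("/sys/kernel/kexec_crash_loaded"),
        read_text("/proc/sys/kernel/panic"),
    )
    for line in report:
        print(line)
```

If the report says "no automatic reboot" and the box still restarts by itself, that would point away from a kernel panic and towards a hardware reset (power, watchdog, overheating). Note also that capturing a dump and analysing it are separate steps: debug symbols are mainly needed for the analysis afterwards.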
 
Hi,
a long time ago I had trouble with one server; after replacing the redundant power supply, everything worked without issues.
On another server (Supermicro), the frequent reboots stopped after a BIOS/firmware upgrade...

About memtest: do you use ECC RAM, and is it enabled in the BIOS? Were there any memory-related messages in the logs beforehand?


Udo
 
Nah, it's AMD desktop HW, no redundant PSU, no ECC RAM. Yeah, not ideal, but it should not just randomly crash like this. And honestly, I'd rather have them upgrade this box to real server HW instead of debugging crashes like this...
 
It could also crash with Windows or another desktop OS installed, if run 24x7. The fastest way to find and fix the cause could be to get another identical, well-behaved machine and swap components one at a time until you find the culprit. But if you just used a different desktop machine, it could easily have no trouble at all. I used a simple desktop PC for a while in the beginning, just to learn, and had no problems at all.

Marco
 
TBH, I'm running some other PVE boxes on commodity desktop HW without issues. This particular one is the only AMD machine. The rest is all more or less recent Intel, including servers that are almost all Xeons. I'm not sure if the RHEL6 kernel is a better fit for Intels or not, just noting.
 
Hi,
I assume that AMD is not the reason: most of my VE hosts are AMD based, and none of my issues had anything to do with AMD (power supply, Supermicro and so on).

Udo
 
If PVE runs on a workstation, it is worth reviewing the BIOS configuration.

I have had very good results with the following BIOS settings (on several Asus workstation motherboards; I always use an Intel processor, but I guess that makes no relevant difference):
1- UEFI: enabled
2- Performance: enabled (PVE doesn't work well with power saving)
3- C1E: disabled
4- Any option that performs power saving: disabled (check each option in your BIOS)

Best regards
Cesar
 
Thanks for the hints. I might try disabling power-saving functions and c1e, though on Intel boards I have zero problems using them. I think UEFI only plays a role on boot until the kernel takes over, but will check next time.
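To compare before and after changing the BIOS options, one could look at which idle states the kernel actually exposes. This is a rough Python sketch that reads the cpuidle entries under /sys/devices/system/cpu/cpu0/cpuidle; it assumes the running kernel provides that sysfs directory, and the function name `list_idle_states` is made up for this example. BIOS-level features such as C1E will not necessarily show up there, so treat the output as a hint only.

```python
import os


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return one dict per cpuidle state, or an empty list if none are exposed."""
    try:
        entries = os.listdir(cpuidle_dir)
    except OSError:
        return []

    def read_field(state, field):
        # Each state directory holds small text files such as "name" and "usage".
        try:
            with open(os.path.join(cpuidle_dir, state, field)) as fh:
                return fh.read().strip()
        except OSError:
            return None

    # Keep only stateN directories and sort them numerically (state2 before state10).
    names = [e for e in entries if e.startswith("state") and e[5:].isdigit()]
    names.sort(key=lambda e: int(e[5:]))

    return [
        {
            "state": state,
            "name": read_field(state, "name"),
            "usage": read_field(state, "usage"),
        }
        for state in names
    ]


if __name__ == "__main__":
    for info in list_idle_states():
        print(info["state"], info["name"], "entered", info["usage"], "times")
```

If deeper states disappear from the list (or their usage counters stop growing) after the BIOS change, the setting took effect as far as the kernel is concerned.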
 
