Q: more verbose logging, intermittent crashing Proxmox host, any suggestions?

fortechitsolutions

Renowned Member
Jun 4, 2008
Hi all, this may be a bit out-of-scope (more of a linux-y question than Proxmox) but just in case.

I've got a modest-sized 2-node Proxmox cluster, deployed several months ago for a client:
-- new Supermicro gear, nice modern CPUs and lots of CPU and RAM (dual-socket Intel Xeon, 48 HT CPU cores, 128 GB RAM per server chassis)
-- fibre-attached SAN disk array for shared storage
-- Proxmox 3.4.x, latest stable, installed
-- a tiny iSCSI target on a separate device (Synology) as the 'quorum vote tie breaker' for the Proxmox cluster, to prevent split-brain

Weird thing: twice in the last 2 weeks, one of these 2 hosts has simply tanked. The first crash was 2 weeks ago, late Saturday night / just before Sunday AM; the following week it crashed again, ~7 days + 45 min later. It is not yet clear whether there is a real time periodicity to the thing. I did have the same batch of VMs running on the host.

The first time this happened, there was a hint of a suspect DIMM in the Supermicro. The vendor removed and replaced the suspect part on Monday. We had a similar error/crash a few months ago with a different DIMM, and a remove-and-replace cycle then as well. That time, things appeared to improve after replacing the 'bad' DIMM.

So, after this first crash, about a week ago: I ran Memtest for 48 hours and things looked clean, so I booted back into Proxmox and migrated VMs back from node2 to re-balance them across node1 and node2 in the Proxmox cluster. Then ~5 days later, Saturday night, we got another crash.

Basic pattern for the crash:
- console claims the host is online, but we can't ping the IP address, and there is no access via ssh, http or otherwise.
- console is the Supermicro IPMI remote KVM console; I can attach, but there is no response to keypresses and the screen is blank (black).
- I can send a reboot / power-off and the host reboots fine, going through a normal POST / boot sequence.
- nothing at all is logged in messages or other logs on the server: no trace of a kernel panic, no smoking gun to give me a hint. There are enough 'housekeeping' messages logged that I can pretty easily pin down the crash time, simply based on when the normal chatty log behaviour stops and then resumes after the reboot.
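Pinning down the crash time from the log silence can be sketched roughly as below. This is only a minimal sketch: it assumes classic syslog-style lines that begin with a timestamp such as "Mar  7 23:41:02" in an English locale, and the function name `largest_log_gap` is made up for illustration. Syslog timestamps carry no year, so one has to be supplied.

```python
from datetime import datetime, timedelta


def largest_log_gap(lines, year):
    """Return (gap, last_before, first_after) for the longest silence in the log.

    Assumes each useful line starts with a 15-character syslog timestamp,
    e.g. "Mar  7 23:41:02". Lines that do not parse are skipped.
    """
    stamps = []
    for line in lines:
        try:
            stamps.append(
                datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            )
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped message)

    best = (timedelta(0), None, None)
    # Compare each timestamp with the next one and keep the widest gap.
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > best[0]:
            best = (cur - prev, prev, cur)
    return best
```

Feeding it the lines of the messages log (for example `largest_log_gap(open("/var/log/messages"), 2015)`, with the year being whatever applies) would give the last timestamp before the hang and the first one after the reboot.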

So I'm a bit baffled. After the crash this most recent weekend there was no unhappy Supermicro DIMM message. Just to be somewhat more sure, I ran memtest for 24 hrs, then booted into Proxmox and ran stress-test apps in userspace as root (the "stress" app, then the "memtester" app). I had the thing pounding all CPUs with 'stress' for 24 hrs: load was pinned and there were no problems. I similarly stressed RAM with 'memtester' for another 24 hrs, and it seems happy as a clam. No apparent issues with CPU or RAM tolerating hard work.

I am wondering if there are easy tunables I can tweak to crank up the verbosity of logging on the host, maybe just for a week or so, so that if things go south again there is a better chance of me catching a hint about what is going wrong.
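As a starting point, here is a minimal sketch that only reads (does not change) a handful of kernel settings that bear on console log level and on whether a lockup or oops turns into a panic. The list of knobs is an assumption on my part: these sysctl names exist in mainline Linux, but whether each one is present on the Proxmox 3.4 kernel depends on how that kernel was built, so missing entries are reported rather than treated as errors.

```python
from pathlib import Path

# Assumed candidate knobs; availability depends on the kernel build.
KNOBS = [
    "kernel/printk",           # first field is the console log level
    "kernel/panic",            # seconds to wait before rebooting after a panic (0 = never)
    "kernel/panic_on_oops",    # 1 = escalate an oops to a panic
    "kernel/softlockup_panic", # 1 = panic when a soft lockup is detected
    "kernel/hung_task_panic",  # 1 = panic when a hung task is detected
    "kernel/sysrq",            # whether magic SysRq is enabled
]


def read_knobs(root="/proc/sys", knobs=KNOBS):
    """Return {knob: current value, or None if not present/readable}."""
    values = {}
    for knob in knobs:
        try:
            values[knob] = (Path(root) / knob).read_text().strip()
        except OSError:
            values[knob] = None
    return values


if __name__ == "__main__":
    for knob, value in read_knobs().items():
        shown = value if value is not None else "(not present)"
        print(f"{knob.replace('/', '.')} = {shown}")
```

Knowing the current values at least shows whether a hang could ever produce a panic message at all with the settings as they stand.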

For now, I am just migrating "some but not as many" VMs back onto the 'not well trusted' host, to see if we can get a week of uptime out of the thing without crashes. Note this host is not 'super busy': RAM use was sitting around 40-50% and CPU utilization was modest (20%-ish, maybe?).

But to be a bit more proactive, I figured I would ask if anyone has any hints about debugging / more verbose logging that I can set up during this time, to improve the chances of catching a hint or error message, if such a thing is generated.
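One low-tech thing that could run in the meantime is a heartbeat logger: append a timestamp and the load averages to a file every few seconds and force each line to disk, so the last line written narrows down when the host froze and what the load looked like just before. This is only a sketch under assumptions: the log path, the interval and the function names are placeholders, and it will not capture a kernel message that never reaches disk.

```python
import os
import time


def write_heartbeat(path):
    """Append one timestamped line with the 1/5/15 min load averages."""
    load1, load5, load15 = os.getloadavg()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}\n"
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # push the line to disk before the next beat
    return line


def run(path="/var/log/heartbeat.log", interval=10, beats=None):
    """Write a heartbeat every `interval` seconds; beats=None means run forever.

    The default path is a hypothetical placeholder.
    """
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path)
        count += 1
        if beats is None or count < beats:
            time.sleep(interval)
```

Calling `run()` from a screen session or an init script would keep it going for the week; the fsync on every line costs a little I/O but means the final entry before a hang is much more likely to survive.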


Any suggestions are appreciated.

Thanks!


Tim
 
