Hi all, this may be a bit out of scope (more of a Linux-y question than a Proxmox one), but just in case.
I've got a modest-sized Proxmox 2-node cluster, deployed for a client several months ago:
-- new Supermicro gear, nice modern CPUs and plenty of CPU and RAM (dual-socket Intel Xeon, 48 hyper-threaded cores and 128 GB RAM per server chassis)
-- shared fibre channel SAN disk array for shared storage
-- Proxmox 3.4.x, latest stable, installed
-- a tiny iSCSI target on a separate device (Synology) used as the quorum-vote tie-breaker for the Proxmox cluster, to prevent split-brain
The weird thing, over the last 2 weeks: one of these 2 hosts simply tanks. Two weeks ago it was late Saturday night / just before Sunday AM; the following week it crashed roughly 7 days + 45 minutes later. It's not entirely clear yet whether there is real 'time periodicity' to the thing. The same batch of VMs was running on the host both times.
The first time this happened, there was a hint of a suspect DIMM in the Supermicro. The vendor removed and replaced the suspect part on the Monday. We had a similar error/crash a few months ago with a different DIMM, and a remove-and-replace cycle then as well; that time things did appear to improve after the 'bad' DIMM was swapped out.
So, after that first crash about a week ago: I ran memtest for 48 hours, everything looked clean, I booted back into Proxmox and migrated VMs back from node2 to re-balance them across node1 and node2 in the Proxmox cluster. Then ~5 days later - Saturday night - we got another crash.
Basic pattern for the crash:
- the console claims the host is online, but we can't ping its IP address; no SSH, HTTP, or other access.
- the console is a Supermicro remote IPMI KVM console; I can attach, but there is no response to keypresses and the screen is blank/black.
- I can send a reboot / power-off and bring the host back up fine; it goes through a normal POST / boot sequence.
- nothing at all is logged in messages or the other logs on the server: no trace of a kernel panic, no smoking gun to give me a hint. There are enough 'housekeeping' messages logged that I can pretty easily pin down the crash time, simply based on when the normal chatty log behaviour stops and then resumes after the reboot (roughly the approach sketched below).
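For what it's worth, this is roughly how I pin the crash window down (assuming the standard Debian syslog location, /var/log/syslog, on this box):

    # each boot drops a "Linux version ..." kernel marker into syslog;
    # the timestamped line just before that marker is the last thing the
    # host managed to log before it went away
    grep -n -B 2 'Linux version' /var/log/syslog /var/log/syslog.1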
So I'm a bit baffled. After the crash this most recent weekend there was no unhappy Supermicro DIMM message, and just to be more sure I ran memtest for 24 hours, then booted into Proxmox and ran userspace stress tests as root (the "stress" app and then the "memtester" app; invocations roughly as sketched below). 'stress' had the thing pounding all CPUs for 24 hours - load was pinned and no problems. I similarly stressed the RAM with 'memtester' for another 24 hours - it seems happy as a clam. No apparent issue with the CPU or RAM tolerating hard work.
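Roughly the invocations, from memory - the core count and memory size here are approximations for this box, not exact values:

    # peg all 48 logical cores for 24 hours
    stress --cpu 48 --timeout 86400

    # then hammer a large chunk of RAM as root (leaving headroom for the OS);
    # with no loop count given, memtester runs until interrupted
    memtester 100G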
I am wondering if there are easy tunables I can tweak just to crank up logging verbosity on the host, maybe just for a week or so, so that if things go south there is a better chance of me catching a hint about what is going wrong.
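The sort of thing I have in mind so far is below (the sysctls are standard kernel knobs; the IP addresses, interface name and MAC are placeholders for a second 'log catcher' box, not my real values) - but I'd welcome better ideas:

    # make the kernel as chatty as possible on the console
    sysctl -w kernel.printk="7 4 1 7"

    # panic (and so reboot and leave a trace) instead of hanging silently
    sysctl -w kernel.softlockup_panic=1
    sysctl -w kernel.hung_task_panic=1
    sysctl -w kernel.panic=30

    # ship kernel messages to another box over UDP via netconsole, in case
    # the final messages never make it to the local disk
    # (something needs to listen on UDP 6666 on the receiving box)
    modprobe netconsole netconsole=6665@192.168.1.21/eth0,6666@192.168.1.22/aa:bb:cc:dd:ee:ff

    # and/or duplicate all syslog traffic to a remote rsyslog server
    echo '*.* @192.168.1.22:514' > /etc/rsyslog.d/remote.conf
    service rsyslog restart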
For now, I am just migrating "some but not as many" VMs back onto the not-well-trusted host and seeing if we can get a week of uptime out of the thing without a crash. Note this host is not super busy: RAM use was sitting around 40-50% and CPU utilization was modest (20%-ish, maybe?).
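The migrations themselves are nothing special - just normal live migration back onto the suspect node, via the GUI or the CLI (the VM ID and node name here are made up):

    qm migrate 101 node1 -online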
But figuring I should be a bit more proactive, I thought I would ask whether anyone has hints on debugging / more verbose logging - something I can also do during this window - to improve the chances of catching an error message, if one is generated at all.
Any suggestions are appreciated.
Thanks!
Tim