total system freeze

May 16, 2020
Antwerp, Belgium
commandline.be
The Proxmox system froze completely: no logs were written, there was no network, and not even a failed write to disk was recorded.
The first log entries I see from after the freeze, when I was trying to figure out what had happened to the server, are those of the host booting up.

* The only recent change to the system was the introduction of huge pages. I cannot remember a freeze before that.
* The other notable point is a VM which I suspected of having caused problems before, though never an entire freeze.

So the questions are:

  • What can cause a full system freeze, and how can I keep track of possible root causes?
  • Can huge pages cause a system freeze on a system with memory to spare?
  • Can a VM with 3 vCPUs assigned cause a freeze?
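For the huge pages question, one thing worth checking is how much memory the huge page pool actually reserves, since statically reserved huge pages are not available for ordinary allocations. The following is a minimal sketch that parses the huge page fields of `/proc/meminfo`; the function name `hugepage_stats` and the sample numbers are made up for illustration and are not taken from the affected host.

```python
def hugepage_stats(meminfo_text):
    """Return the HugePages_* counters and Hugepagesize (in kB) as integers."""
    stats = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            # The first token after the colon is the numeric value.
            stats[key] = int(rest.split()[0])
    return stats


# Illustrative sample only; these values are assumptions, not real data.
sample = """MemTotal:       65843072 kB
HugePages_Total:    8192
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

stats = hugepage_stats(sample)
# Memory set aside for the huge page pool, in kB.
reserved_kb = stats["HugePages_Total"] * stats["Hugepagesize"]
print(stats, reserved_kb)
```

On the host itself, the same function could be fed `open("/proc/meminfo").read()`. Comparing the reserved amount against `MemTotal` shows how much memory is really left "to spare" for everything else.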
 
Hi,

These are the worst kind of errors to debug.
Most of the time, such freezes are related to the hardware. If the cause is software, you usually still get some output on the screen (IPMI). Did you see anything?

Not really: "the Proxmox system froze completely, no logs were written, no network, not even a failed write to disk."
What I did find is one disk, an older SATA drive, which is showing ECC errors such as those below. In my experience this does not cause a freeze.

syslog:Jun 16 00:13:41 pvx smartd[2544]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
syslog:Jun 16 01:43:41 pvx smartd[2544]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 47
syslog:Jun 16 02:13:41 pvx smartd[2544]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
syslog:Jun 16 08:38:04 pvx smartd[2432]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 87
syslog:Jun 16 09:08:04 pvx smartd[2432]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 87 to 60
syslog:Jun 16 09:38:05 pvx smartd[2432]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 59
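To keep track of such attribute changes over a longer period, the smartd lines can be pulled out of the syslog with a small script. This is only a rough sketch: the regular expression is written against the line format shown above, and the names `PATTERN` and `attribute_changes` are hypothetical helpers, not part of smartmontools.

```python
import re

# Matches smartd "SMART Usage Attribute ... changed from X to Y" lines
# in the format of the excerpt above.
PATTERN = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+) .*?"
    r"SMART Usage Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def attribute_changes(lines):
    """Yield (device, attribute name, old value, new value) per matching line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m["dev"], m["name"], int(m["old"]), int(m["new"])


# Two lines copied from the log excerpt above.
sample = [
    "syslog:Jun 16 08:38:04 pvx smartd[2432]: Device: /dev/sdb [SAT], "
    "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 87",
    "syslog:Jun 16 09:08:04 pvx smartd[2432]: Device: /dev/sdb [SAT], "
    "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 87 to 60",
]

changes = list(attribute_changes(sample))
# The single biggest jump is usually the most interesting entry to look at.
largest = max(changes, key=lambda c: abs(c[3] - c[2]))
print(changes, largest)
```

Feeding it the whole syslog instead of `sample` would give a per-device history of the attribute, which is easier to compare against the time of the freeze than the raw log.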

This disk was not formatted or in use in any way when the freeze occurred. I have only since formatted it as swap and assigned it as a swap device; watch and learn, I guess. Though I am now thinking of simply removing the disk.

I noticed this happened after about three days of uptime. Because of that, I removed a few boot-time kernel parameters.

See also https://forum.proxmox.com/threads/erroneous-vm-setting-caused-a-system-fail.71138/
 
