Some VMs crash when the system corrects a RAM error

chchang

Well-Known Member
Feb 6, 2018
I run two Proxmox servers, each with 8 GB of ECC RAM. One of them has been having problems lately; dmesg shows that the system automatically corrected a memory error:

Code:
[15327239.518589] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[15327239.518590] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[15327239.518591] {4}[Hardware Error]: event severity: corrected
[15327239.518592] {4}[Hardware Error]:  Error 0, type: corrected
[15327239.518592] {4}[Hardware Error]:  fru_text: CorrectedErr
[15327239.518593] {4}[Hardware Error]:   section_type: memory error
[15327239.518594] {4}[Hardware Error]:   node: 0 device: 1
[15327239.518594] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[15327239.518596] ghes_edac: Internal error: Can't find EDAC structure

The Proxmox host itself stays up, but one of the VMs on it hits a kernel panic during the correction.

I don't know why: the other VMs are not affected by the correction, only one specific VM, and they all run the same OS (Ubuntu 14.04).

I will replace the memory for further testing, but I'm still curious why a corrected error would make a VM kernel panic.
 
Are you sure the RAM is otherwise intact? The "corrected" error may be only a symptom of a stick that is failing in general, with some other data corruption causing the crash you're seeing. Either way, I'd replace the RAM stick ASAP. Failing hardware is never a good thing, even if ECC can extend its life a bit.
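
If you want a rough idea of whether the errors keep hitting the same module before you swap it, you could tally the corrected-error records from dmesg per node/device. Below is a minimal sketch, assuming the log lines look like the ones posted above; the function name `count_corrected_errors` and the regular expressions are my own illustration, not part of any Proxmox or kernel tool, and other kernels or firmware may print different fields.

```python
import re
from collections import Counter

# Assumed line shape (taken from the log in this thread):
#   [15327239.518594] {4}[Hardware Error]:   node: 0 device: 1
RECORD_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")
LOCATION_RE = re.compile(r"node:\s*(\d+)\s+device:\s*(\d+)")


def count_corrected_errors(dmesg_text):
    """Tally corrected hardware-error records per (node, device).

    Hypothetical helper: it only understands the fields visible in the
    log above and silently skips every other line.
    """
    counts = Counter()
    corrected = False  # severity of the record currently being read
    for line in dmesg_text.splitlines():
        match = RECORD_RE.search(line)
        if not match:
            continue
        body = match.group(2).strip()
        if body.startswith("event severity:"):
            # A new record starts; remember whether it was corrected.
            corrected = body.split(":", 1)[1].strip() == "corrected"
            continue
        location = LOCATION_RE.search(body)
        if location and corrected:
            key = (int(location.group(1)), int(location.group(2)))
            counts[key] += 1
    return counts


# Three lines copied from the log in the first post.
SAMPLE = """\
[15327239.518591] {4}[Hardware Error]: event severity: corrected
[15327239.518593] {4}[Hardware Error]:   section_type: memory error
[15327239.518594] {4}[Hardware Error]:   node: 0 device: 1
"""

print(count_corrected_errors(SAMPLE))
```

On a real host you would pass in the full dmesg output instead of `SAMPLE`, for example by saving `dmesg` to a file and reading it. If one node/device pair dominates and the count keeps growing, that points at a single failing stick rather than a one-off event.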