[SOLVED] TrueNAS Crashing

Tanman

New Member
Oct 15, 2023
I am fairly new to Proxmox and virtualization in general, and have had some trouble getting Proxmox to run consistently.
Over the past week or two my Proxmox host has been rebooting and the whole system crashing. I narrowed the issue down to poor power delivery, since the system was not on a battery backup.
During one of the crashes the system froze completely (I could not even enter commands at the host's console), and I was forced to hard restart it.
Now only TrueNAS is crashing, and I'm seeing some hardware errors in syslog.

Oct 14 21:53:47 proxmox kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: Uncorrected, software restartable error.
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: CPU:16 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: Error Addr: 0x0000000338e45e80
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: IPID: 0x001000b000000000
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
Oct 14 21:53:47 proxmox kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Oct 14 21:53:47 proxmox kernel: mce: Uncorrected hardware memory error in user-access at 338e45e80
Oct 14 21:53:47 proxmox kernel: Memory failure: 0x338e45: recovery action for unsplit thp: Ignored
Oct 14 21:53:47 proxmox kernel: mce: Memory error not recovered
Oct 14 21:53:47 proxmox kernel: sda: sda1 sda2
Oct 14 21:53:47 proxmox kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Oct 14 21:53:47 proxmox kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Oct 14 21:53:47 proxmox kernel: sdc: sdc1 sdc2
Oct 14 21:53:47 proxmox systemd[1]: 100.scope: Deactivated successfully.
Oct 14 21:53:47 proxmox systemd[1]: 100.scope: Consumed 43min 40.429s CPU time

I assume this comes from my forced restart and that I have corrupted some system files. Does Proxmox have a feature like Windows' "sfc /scannow" for checking system integrity?
If my assumption is wrong, which is quite possible, what else can I do to clear up this error and keep TrueNAS from crashing?
I can provide any other details needed. Thanks in advance for the help.
 
The log talks about memory errors. Did you run memtest86+ overnight to check whether you have a bad RAM module? That is a common cause of general system instability.
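For anyone reading the log above: the "Error Addr" and the "Memory failure" page number refer to the same physical location. A minimal sketch of that arithmetic in Python, assuming 4 KiB pages (the usual x86-64 default); the function name `error_page_frames` is made up for this illustration and is not part of any Proxmox or kernel tool:

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the low 12 bits are the offset within the page


def error_page_frames(log_text):
    """Return the page frame number for every 'Error Addr' found in kernel log text."""
    addrs = re.findall(r"Error Addr: (0x[0-9a-fA-F]+)", log_text)
    # Shifting the physical address right by PAGE_SHIFT drops the in-page offset.
    return [int(addr, 16) >> PAGE_SHIFT for addr in addrs]


sample = (
    "Oct 14 21:53:47 proxmox kernel: [Hardware Error]: "
    "Error Addr: 0x0000000338e45e80"
)
print([hex(pfn) for pfn in error_page_frames(sample)])  # prints ['0x338e45']
```

The result, 0x338e45, matches the page named in the "Memory failure: 0x338e45" line, so both messages describe one event at one physical address. If the same page kept showing up across crashes, that would point more strongly at a specific RAM location.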
 
I have not run any memtest86+ passes. As far as I can tell these errors were not showing up before the hard reset, but I will do some runs to be sure.
 
I ran memtest over the past 24 hours and got through 11 passes with 0 errors, so the RAM is likely good.
 
It ended up being corrupted system files. Fixed with debsums. Posting this for anyone who may run into the same issue.
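The exact commands used were not posted, so the following is only a rough sketch of how a debsums check could be scripted. It assumes the `debsums` package is installed, that `debsums -c` prints one changed file path per line, and that `dpkg -S <path>` prints "package: path". The helper names are made up for this example:

```python
import subprocess


def parse_changed_files(debsums_output):
    """Split `debsums -c` output (assumed: one path per line) into a list of paths."""
    return [line.strip() for line in debsums_output.splitlines() if line.strip()]


def changed_files():
    """Run `debsums -c` and return files whose checksum differs from the package record."""
    # debsums may exit non-zero when it finds mismatches, so do not raise on that.
    result = subprocess.run(
        ["debsums", "-c"], capture_output=True, text=True, check=False
    )
    return parse_changed_files(result.stdout)


def owning_package(path):
    """Return the name of the package that owns `path`, or None if dpkg does not know it."""
    result = subprocess.run(
        ["dpkg", "-S", path], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    # Output looks like "package: /path"; keep only the part before the first colon.
    return result.stdout.split(":", 1)[0]
```

Run as root, something like `{owning_package(p) for p in changed_files()}` gives the set of packages with modified files, and each of those can then be reinstalled with `apt install --reinstall <package>`. Keep in mind that debsums only covers files shipped by packages that include checksums, and edited configuration files are not corruption.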
 
