Proxmox random crashes. Please help!

en4ble · Aug 9, 2023

This is a new build. Tried both PVE 7.2 and 8.0 with the same results.

The system crashes randomly, even when it is not under heavy load.

Specs:

24 x AMD Ryzen 9 5900X 12-Core Processor (1 Socket)
Kernel Version Linux 6.2.16-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-7 (2023-08-01T11:23Z)
PVE Manager Version pve-manager/8.0.4/d258a813cfa6b390
Gigabyte X570S AORUS Elite

Tried flashing the board with the newest firmware, with the same effect: just a random crash. This is the snippet from before it rebooted:

Tried different things, such as CPU/power and memory stress tests. I managed to crash it once with a memory test but couldn't replicate it.

The system mainly runs LXC containers, with a cron job to trim them once a week.

#!/bin/bash

# Path to fstrim
FSTRIM=/sbin/fstrim

# List of host volumes to trim, separated by spaces
# e.g. TRIMVOLS="/mnt/a /mnt/b /mnt/c"
TRIMVOLS="/"

## Trim all LXC containers ##
echo "LXC CONTAINERS"
# pct list prints a header row; keep only rows that start with a numeric ID
for i in $(/sbin/pct list | awk '/^[0-9]/ {print $1}'); do
    echo "Trimming Container $i"
    /sbin/pct fstrim "$i" 2>&1 | logger -t "pct fstrim [$$]"
done
echo ""

## Trim host volumes ##
echo "HOST VOLUMES"
# $TRIMVOLS is left unquoted on purpose so it splits on spaces
for i in $TRIMVOLS; do
    echo "Trimming $i"
    "$FSTRIM" -v "$i" 2>&1 | logger -t "fstrim [$$]"
done

Journal error log:
root@pve04:/var/log/journal/a31d1bf8dc6941c59cdbb78748c91267# journalctl -p err -f
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 1: bc800800060c0859
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: TSC 0 ADDR da6954e40 MISC d012000000000000 IPID 100b000000000
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1691612813 SOCKET 0 APIC 8 microcode a20120a
Aug 09 15:27:02 pve04 smartd[1978]: Device: /dev/nvme1, number of Error Log entries increased from 44 to 45
Aug 09 15:27:03 pve04 smartd[1978]: Device: /dev/nvme2, number of Error Log entries increased from 44 to 45
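
The status word after "Bank 1:" in the first mce line can be read bit by bit. What follows is only a minimal sketch, assuming the usual x86 machine-check status layout (bit 63 = valid, bit 61 = uncorrected, bit 57 = processor context corrupt) and AMD's poison flag at bit 43. The helper name mce_bit is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Sketch: read single bits out of a machine-check status word.
# mce_bit is a hypothetical helper name.
# Assumed layout: 63 = valid, 61 = uncorrected, 57 = processor context
# corrupt (x86 MCA); 43 = poison (AMD-specific).

mce_bit() {
    # $1 = status as hex (optional 0x prefix), $2 = bit number 0..63
    hex=${1#0x}
    # Left-pad to 16 hex digits, then split into two 32-bit halves so the
    # shell never has to handle a value above the signed 64-bit range.
    hex=$(printf '%16s' "$hex" | tr ' ' '0')
    hi=$(printf '%s' "$hex" | cut -c1-8)
    lo=$(printf '%s' "$hex" | cut -c9-16)
    if [ "$2" -ge 32 ]; then
        echo $(( (0x$hi >> ($2 - 32)) & 1 ))
    else
        echo $(( (0x$lo >> $2) & 1 ))
    fi
}

status=bc800800060c0859
for bit in 63 61 57 43; do
    echo "bit $bit = $(mce_bit "$status" "$bit")"
done
```

For bc800800060c0859 this reports bits 63, 61 and 43 set and bit 57 clear, i.e. a valid, uncorrected error carrying the poison flag.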

Tried reseating the memory and made sure everything fits correctly; all good there.

LXC container config:
arch: amd64
cores: 4
hostname: n1-grp7-10.1.70.2-16127-hdd1
memory: 7178
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.1.70.1,hwaddr=0A:41:E9:5D

D:81,ip=10.1.70.2/24,rate=4,tag=70,type=veth
onboot: 0
ostype: ubuntu
rootfs: local-lvm:vm-7001-disk-0,size=240G
swap: 0
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: a
lxc.cap.drop:
lxc.cgroup2.devices.allow: b 7:* rwm
lxc.cgroup2.devices.allow: c 10:237 rwm
lxc.mount.entry: /dev/loop0 dev/loop0 none bind,create=file 0 0
lxc.mount.entry: /dev/loop1 dev/loop1 none bind,create=file 0 0
lxc.mount.entry: /dev/loop2 dev/loop2 none bind,create=file 0 0
lxc.mount.entry: /dev/loop3 dev/loop3 none bind,create=file 0 0
lxc.mount.entry: /dev/loop4 dev/loop4 none bind,create=file 0 0
lxc.mount.entry: /dev/loop5 dev/loop5 none bind,create=file 0 0
lxc.mount.entry: /dev/loop6 dev/loop6 none bind,create=file 0 0
lxc.mount.entry: /dev/loop7 dev/loop7 none bind,create=file 0 0
lxc.mount.entry: /dev/loop8 dev/loop8 none bind,create=file 0 0
lxc.mount.entry: /dev/loop9 dev/loop9 none bind,create=file 0 0
lxc.mount.entry: /dev/loop10 dev/loop10 none bind,create=file 0 0
--- and so on, all the way to loop99 ---
lxc.mount.entry: /dev/loop99 dev/loop99 none bind,create=file 0 0
lxc.mount.entry: /dev/loop-control dev/loop-control none bind,create=file 0

SVM is enabled. I read on one forum someone talking about disabling C-states?! Not sure what this affects.

I also have not tried running VMs instead of LXC.

Disk info (all thin):

Any suggestions would be greatly appreciated. Thank You in advance.

en4ble · Aug 10, 2023

@flames
I've tried your method with the settings as suggested in your reply to this post.
SVM = enable (virtualization aka VT-x in intel world)
IOMMU = enable (default = auto on most x570)
Power idle control = Typical current idle (cstate 6 disabled on some x570)

This is with a 5900X and a Gigabyte X570S AORUS Elite (upgraded to the newest BIOS).

The system started rebooting more often than with just the C-states disabled altogether.

Looking at the journal, it's strange that it's always the same core (CPU 4).

Code:

root@pve04:/var/log/journal/a31d1bf8dc6941c59cdbb78748c91267# journalctl -p err -f
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 1: bc800800060c0859
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: TSC 0 ADDR da6954e40 MISC d012000000000000 IPID 100b000000000
Aug 09 15:27:00 pve04 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1691612813 SOCKET 0 APIC 8 microcode a20120a
Aug 09 15:27:02 pve04 smartd[1978]: Device: /dev/nvme1, number of Error Log entries increased from 44 to 45
Aug 09 15:27:03 pve04 smartd[1978]: Device: /dev/nvme2, number of Error Log entries increased from 44 to 45


Aug 10 12:17:12 pve04 kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 0: bc00080001010135
Aug 10 12:17:12 pve04 kernel: mce: [Hardware Error]: TSC 0 ADDR c719e0834 MISC d012000000000000 IPID 1000b000000000
Aug 10 12:17:12 pve04 kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1691687825 SOCKET 0 APIC 8 microcode a20120a
Aug 10 12:17:15 pve04 smartd[2000]: Device: /dev/nvme1, number of Error Log entries increased from 50 to 51
Aug 10 12:17:15 pve04 smartd[2000]: Device: /dev/nvme2, number of Error Log entries increased from 50 to 51
Aug 10 14:04:35 pve04 pveproxy[26794]: got inotify poll request in wrong process - disabling inotify
Aug 10 14:43:17 pve04 pvedaemon[38024]: command '/usr/bin/termproxy 5900 --path /vms/102 --perm VM.Console -- /usr/bin/dtach -A /var/run/dtach/vzctlconsole102 -r winch -z lxc-console -n 102 -e -1' failed: exit code 1
Aug 10 14:43:17 pve04 pvedaemon[2364]: <root@pam> end task UPID:pve04:00009488:000D5EA2:64D53DCB:vncproxy:102:root@pam: command '/usr/bin/termproxy 5900 --path /vms/102 --perm VM.Console -- /usr/bin/dtach -A /var/run/dtach/vzctlconsole102 -r winch -z lxc-console -n 102 -e -1' failed: exit code 1


Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647165] [Hardware Error]: Uncorrected, software restartable error.

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647174] [Hardware Error]: CPU:4 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647187] [Hardware Error]: Error Addr: 0x000000182acb2800

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647195] [Hardware Error]: IPID: 0x001000b000000000

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647202] [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647216] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Aug 10 14:55:26 pve04 kernel: mce: Uncorrected hardware memory error in user-access at 182acb2800
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Uncorrected, software restartable error.
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: CPU:4 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Error Addr: 0x000000182acb2800
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: IPID: 0x001000b000000000
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Aug 10 14:55:26 pve04 kernel: Memory failure: 0x182acb2: Sending SIGBUS to stressapptest:39043 due to hardware memory corruption
Aug 10 14:55:26 pve04 kernel: Memory failure: 0x182acb2: recovery action for dirty LRU page: Recovered
Aug 10 14:58:32 pve04 pveproxy[43356]: got inotify poll request in wrong process - disabling inotify
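
To check whether the machine-check records really always name the same core, the kernel log can be tallied per CPU. A minimal sketch, assuming the "mce: [Hardware Error]: CPU N: Machine Check" line format seen in the journal here; count_mce_by_cpu is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count "Machine Check" records per CPU from kernel log text on stdin.
# count_mce_by_cpu is a hypothetical name; the pattern is taken from the
# "mce: [Hardware Error]: CPU 4: Machine Check: ..." lines in this thread.

count_mce_by_cpu() {
    sed -n 's/.*\[Hardware Error\]: CPU \([0-9][0-9]*\): Machine Check.*/\1/p' |
        sort -n | uniq -c | awk '{ print "CPU " $2 ": " $1 }'
}

# Typical use on a live host (kernel messages of the current boot):
#   journalctl -k -b 0 --no-pager | count_mce_by_cpu
```

The decoded "CPU:4 (19:21:2) MC0_STATUS" lines are deliberately not matched, so each event is counted once.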

Memory used on this board is:
G.Skill RipJaws V Series 128GB (4 x 32GB) 288-Pin SDRAM PC4-21300 DDR4 2666 CL18-18-18-38 1.20V Quad Channel Desktop Memory Model F4-2666C18Q-128GVK

PS. I don't see this module on the qualified vendor list for Vermeer (5900X).

From the Bugzilla bug I've read, some people are downgrading their memory speed; I haven't tried that yet. Any other suggestions would be really appreciated. Not really sure what else to do here; thinking of moving to VMware to see if that helps?!

en4ble · Aug 10, 2023

Actually this is what I received when testing the RAM with stressapptest:

Code:

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647165] [Hardware Error]: Uncorrected, software restartable error.

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647174] [Hardware Error]: CPU:4 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647187] [Hardware Error]: Error Addr: 0x000000182acb2800

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647195] [Hardware Error]: IPID: 0x001000b000000000

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647202] [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.

Message from syslogd@pve04 at Aug 10 14:55:26 ...
 kernel:[ 9501.647216] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Aug 10 14:55:26 pve04 kernel: mce: Uncorrected hardware memory error in user-access at 182acb2800
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Uncorrected, software restartable error.
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: CPU:4 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Error Addr: 0x000000182acb2800
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: IPID: 0x001000b000000000
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
Aug 10 14:55:26 pve04 kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Aug 10 14:55:26 pve04 kernel: Memory failure: 0x182acb2: Sending SIGBUS to stressapptest:39043 due to hardware memory corruption
Aug 10 14:55:26 pve04 kernel: Memory failure: 0x182acb2: recovery action for dirty LRU page: Recovered
Aug 10 14:58:32 pve04 pveproxy[43356]: got inotify poll request in wrong process - disabling inotify
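
The "Memory failure: 0x182acb2" value is the page number of the reported "Error Addr". A small sketch of that arithmetic, assuming 4 KiB pages (a shift by 12); the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: relate "Error Addr" to the page number in the "Memory failure" line.
# Assumes 4 KiB pages (shift by 12), the normal x86-64 base page size.
# addr_to_pfn and addr_to_gib are hypothetical helper names.

addr_to_pfn() {
    # $1 = physical address in hex, with 0x prefix
    printf '0x%x\n' $(( $1 >> 12 ))
}

addr_to_gib() {
    # Whole GiB below the address (integer division by 2^30)
    echo $(( $1 >> 30 ))
}

addr_to_pfn 0x000000182acb2800   # prints 0x182acb2
addr_to_gib 0x000000182acb2800   # prints 96
```

So the failing page sits a little above the 96 GiB mark of the physical address space. That alone does not identify a DIMM, since channel interleaving spreads addresses across modules.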

en4ble · Aug 10, 2023

Also testing a combination of disabling Precision Boost Overdrive and Core Performance Boost, plus RAM downgraded from 2666 to 2400 (with the other C-state options modified).
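
Whether the lower RAM speed actually took effect can be checked from the running host. A minimal sketch, assuming dmidecode reports a "Configured Memory Speed:" field per module (older releases call it "Configured Clock Speed:", so both are matched); configured_speeds is a made-up helper name.

```shell
#!/bin/sh
# Sketch: summarise the speed each populated DIMM is actually running at.
# Assumes the field is named "Configured Memory Speed:" or, on older
# dmidecode releases, "Configured Clock Speed:".
# configured_speeds is a hypothetical helper name.

configured_speeds() {
    # Reads `dmidecode -t memory` output on stdin; empty slots report
    # "Unknown" and are skipped.
    grep -E 'Configured (Memory|Clock) Speed:' |
        grep -v 'Unknown' |
        sed 's/^[[:space:]]*//' | sort | uniq -c
}

# On the host (needs root):
#   dmidecode -t memory | configured_speeds
```

With four matching modules at the reduced setting, a single line with a count of 4 would be expected.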

flames · Aug 11, 2023

hm... this is strange. i still have many of those systems running (ryzen 5950x in my case) with boards gigabyte x570 aorus pro and x570s master. everything is good so far (now mixed pve 7.4.16 opt-in kernel 6.2.16-4 and pve 8.0.4 kernel 6.2.14-4).

asking just in case, you have latest bios (or at least gigabyte x570s F4 with AGESA V2 1.2.0.7, and x570[non s] F36 with AGESA V2 1.2.0.7)?

flames · Aug 11, 2023

btw. half of my x570/ryzen59xx systems do not even have ecc unbuffered ram. but i always carefully select ram modules that match each other, if not in a set of 4x32gb... running some kingston value unbuffered ecc also, those had channel sync issues, that is why i try to match them myself (trying different modules, always have 16++ modules in stock, run memtest and a winpe burn in test for at least 48 hours. if no issues, sys is going staging, then after a week productive)
only once had a broken cpu, but it had different logs and wasn't even running for more than 5 min w/o any burn in test running

en4ble · Aug 11, 2023

flames said:
hm... this is strange. i still have many of those systems running (ryzen 5950x in my case) with boards gigabyte x570 aorus pro and x570s master. everything is good so far (now mixed pve 7.4.16 opt-in kernel 6.2.16-4 and pve 8.0.4 kernel 6.2.14-4).

asking just in case, you have latest bios (or at least gigabyte x570s F4 with AGESA V2 1.2.0.7, and x570[non s] F36 with AGESA V2 1.2.0.7)?

Flames! Thanks for the response, really. Yeah, I flashed it with the newest BIOS; it was 2 revs behind.

en4ble · Aug 11, 2023

flames said:
btw. half of my x570/ryzen59xx systems do not even have ecc unbuffered ram. but i always carefully select ram modules that match each other, if not in a set of 4x32gb... running some kingston value unbuffered ecc also, those had channel sync issues, that is why i try to match them myself (trying different modules, always have 16++ modules in stock, run memtest and a winpe burn in test for at least 48 hours. if no issues, sys is going staging, then after a week productive)
only once had a broken cpu, but it had different logs and wasn't even running for more than 5 min w/o any burn in test running

So I've noticed that my RAM is NOT on the supported list for that CPU, BUT it's on the supported list for the motherboard.

What I have done so far is lower the memory from 2666 to 2400 (I noticed the supported list for that CPU didn't really show anything at 2666). I have disabled Precision Boost Overdrive and Core Performance Boost as well.

Now I don't really know what actually helped, the RAM at 2400 OR the disabled boost features, but the system has been up for 17 hrs, which is the new record.

PS. Side note: I just realized that my personal PC has the same CPU (5900X) on an asrock 550m legend, and I actually had similar issues where the PC would just shut down randomly. A BIOS upgrade didn't help until I removed the XMP settings from my RAM; it's been stable from that point, with no other changes.

So is this a coincidence, OR are those higher-end CPUs super touchy about RAM?! I don't know, but those two really feel heavily related.

en4ble · Aug 11, 2023

@flames may I ask what RAM sku you are running on your systems?! Thanks in advance again.

flames · Aug 12, 2023

Sure, each machine has 4x 32 GB, never mixed with different modules, and tested to match as best as i can with the least possible time wasted.

KSM32ED8/32HC (ECC unbuff 3200MHz) <-- not tested yet
KVR32N22D8/32 <-- cheap, always in stock locally and more stable than KCP! my current go to modules
KCP432ND8/32 (3200MHz)
KSM26ED8/32MF (ECC unbuff, 2666MHz)
KVR26N19D8/32 (2666MHz, cheap)
KCP426ND8/32 (2666MHz)

Older machines got RAM that was available, G.Skill, HyperX, what ever, even with RGB sh... (the corona delivery issues, eat what you get or die).
To be honest, I never looked into the compatibility list of the CPUs or Mainboards. Only specs matter.

also, i am running all bios settings at default, except ofc. the following:
SVM = enable (virtualization aka VT-x in intel world)
IOMMU = enable (default = auto on most x570)
Power idle control = Typical current idle (cstate 6 disabled on some x570)
Power on after power loss / Always on

which means RAM timings and voltages are always default.
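
The idle states the kernel actually exposes can be listed from sysfs after changing such settings. A minimal sketch, assuming a cpuidle driver is loaded so that /sys/devices/system/cpu/cpu0/cpuidle exists; list_idle_states is a made-up helper name, and its optional path argument exists only so it can be pointed at test data. The names shown are ACPI-level (POLL, C1, C2, ...) and need not map one-to-one to the hardware C6 state, and whether the "Typical current idle" option changes this list at all is not assumed here.

```shell
#!/bin/sh
# Sketch: print the idle states the kernel's cpuidle layer exposes for CPU 0.
# list_idle_states is a hypothetical helper name.

list_idle_states() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$base"/state*; do
        # Skip silently if the directory (or the glob) does not resolve
        [ -r "$d/name" ] || continue
        echo "$(basename "$d"): $(cat "$d/name")"
    done
}

list_idle_states
```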

to add on that matter, i found the Asus WS x570 ACE mainboard has no Power Idle Control / C-state 6 disable setting, but is stable all the time with KCP and KVR modules. guess they figured out that cstate 6 still has issues and disabled it by default. had no ASRock boards yet and no specced-down chipsets like 520/550 etc, because they lack PCIe lanes (with amd 5900/5950x there is no onboard graphics, so need gpu [4 lanes] + sas/sata HBA for ceph OSDs [8 lanes] + 10gbe intel nic [8 lanes] + 2x small nvme drives for os in zfs mirroring [2x 4 lanes]).
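
Adding up the lanes listed above gives the total such a build needs. A quick sketch of the arithmetic; the variable names are only illustrative.

```shell
#!/bin/sh
# Sketch: sum the PCIe lanes from the device list in the post.
gpu=4                 # graphics card
hba=8                 # SAS/SATA HBA for the Ceph OSDs
nic=8                 # 10GbE NIC
nvme=$((2 * 4))       # two NVMe drives, 4 lanes each
total=$(($gpu + $hba + $nic + $nvme))
echo "PCIe lanes needed: $total"   # prints "PCIe lanes needed: 28"
```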

another plus of the WS x570 ACE is the built in IPMI module, which saves a port on the kvmoip switch or a separate pikvm
