Memory Issues and Constant Crashing - Replaced Everything and Looking for Ideas

drjaymz@

Member
Jan 19, 2022
Recently I have had many issues, from a corrupt SSD to regular crashing. More recently, running things like apt upgrade turns up checksum errors in the downloaded files. So you might think: that sounds like dodgy RAM? I ran memtest all day with no issues - and the problem persisted.
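
A quick way to separate a bad mirror from local corruption (a sketch - the package name is arbitrary) is to hash the same cached .deb repeatedly: a consistent but wrong hash points at the download, while hashes that change between runs point at RAM or the bus.

Code:
# re-download any installed package into the apt cache
apt-get clean
apt-get install --reinstall --download-only coreutils
# hash the same files five times; the output should be identical every run
for i in 1 2 3 4 5; do sha256sum /var/cache/apt/archives/*.deb; done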

I then swapped in the RAM from another system. Same thing.
I then swapped the entire board, including the CPU, and replaced the SSD with an M.2 drive. Same thing.
I have 3 of these machines, so I installed again from scratch on the 3rd machine. Same thing.

So at this point all hardware has been changed in every combination and I still have issues, which is extremely perplexing. I didn't have issues until 8.4, but the issues I encounter really do still feel like memory issues.

Now the machine identifies as
pve1 kernel: Hardware name: HP HP EliteDesk 800 G3 DM 65W/829A, BIOS P21 Ver. 02.38 04/21/2021

There is a BIOS update dated 2024 but no real information of any use. If there were a systemic problem with this machine I am certain I'd find it via Google, but I can't really find anything useful.
The RAM appears to be DDR4, Speed: 2400 MT/s, Manufacturer: Hynix/Hyundai, Part Number: LX8GDDR4T2400; with 3 machines I have 6 sticks. That's not no-name and as far as I know should be fine.
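
(Those figures appear to be dmidecode output; for anyone comparing, a minimal check of what the board actually negotiated - the configured speed can differ from the stick's rated speed:)

Code:
dmidecode --type memory | grep -E 'Speed|Manufacturer|Part Number'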

All problems appear when starting a particular VM, which is Home Assistant OS, but as I have seen issues when that isn't running, I'm working on the basis that it's not related - it just happens to use a large enough chunk of resources to make a masked problem noticeable.

I booted from a bootable Windows disk and ran various diagnostics, and it runs perfectly fine all day, thrashing the disk or memory, no issues at all.

So I don't really know what my question is; it just looks like this machine will not run Proxmox without falling over, and I have no idea why. Having replaced everything and reinstalled from a freshly downloaded ISO, nothing of the original setup remains.
Only the PSU hasn't been changed, so I would try that, and also the BIOS upgrade, which unfortunately can only be installed from Windows.

I am really asking if anyone has any ideas what else to try. I find it surprising, but with 3 completely independent machines, whatever the problem is, it's not a fault on one machine - it must be a fundamental incompatibility.

Code:
root@pve1:/var/log# journalctl | grep "BUG:"
Jun 16 12:32:40 pve1 kernel: BUG: unable to handle page fault for address: 000000000002a7c0
Jun 16 12:39:24 pve1 kernel: BUG: Bad page state in process kworker/u8:8 pfn:2f75a2
Jun 16 12:39:26 pve1 kernel: BUG: Bad page state in process zstd pfn:337434
Jun 16 12:39:30 pve1 kernel: BUG: Bad page state in process vma pfn:35309e
Jun 16 12:39:33 pve1 kernel: BUG: Bad page state in process vma pfn:39e49a
Jun 16 12:39:41 pve1 kernel: BUG: Bad page state in process zstd pfn:290c3
Jun 16 12:39:43 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 16 12:44:01 pve1 kernel: BUG: scheduling while atomic: vma/4716/0x00000000
Jun 16 13:09:14 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 22 16:47:35 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1bdcb4
Jun 22 16:47:35 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1c2743
Jun 22 16:47:37 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1ced99
Jun 22 16:47:38 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1d351c
Jun 22 16:47:41 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1e189e
Jun 22 16:47:42 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1e5f07
Jun 22 16:47:43 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1ed6b9
Jun 22 16:47:44 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1eff53
Jun 22 16:59:44 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 29 01:00:59 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:340c59
Jun 29 01:01:00 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:344d0b
Jun 29 01:01:02 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:39e49a
Jun 29 01:01:03 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3ae598
Jun 29 01:01:04 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3b70cd
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3bfc15
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3c4330
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3c6f86
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process kvm pfn:3c6d0c
Jun 29 01:01:06 pve1 kernel: BUG: Bad page state in process kvm pfn:3d038f
Jun 29 01:01:06 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3d0ea9
Jul 06 01:00:10 pve1 kernel: BUG: Bad page state in process tar pfn:344d0b
Jul 06 01:00:30 pve1 kernel: BUG: Bad page state in process tar pfn:39e49a
Jul 06 10:53:28 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1b9111
Jul 06 10:53:31 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1c9290
Jul 06 10:53:31 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1c7672
Jul 06 10:53:34 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1dc232
Jul 06 10:53:38 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1efa50
Jul 06 10:53:43 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1f64a4
Jul 06 10:53:48 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:215894
Jul 06 10:53:48 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:21a88a
Jul 06 10:55:16 pve1 kernel: BUG: Bad page state in process pvedaemon worke pfn:1888af
Jul 06 10:55:16 pve1 kernel: BUG: unable to handle page fault for address: 0000000008000000
Jul 06 10:55:22 pve1 kernel: BUG: Bad page state in process pve-firewall pfn:163e30
Jul 06 10:55:54 pve1 kernel: BUG: Bad page map in process pveproxy worker pte:200000000000 pmd:11601a067
Jul 06 10:55:54 pve1 kernel: BUG: Bad rss-counter state mm:000000009949e7bb type:MM_SWAPENTS val:-1
Jul 06 10:55:54 pve1 kernel: BUG: Bad page state in process pveproxy pfn:227e4a
Jul 06 10:55:55 pve1 kernel: BUG: Bad page state in process pveproxy pfn:227ff2
Jul 06 10:55:55 pve1 kernel: BUG: Bad page state in process pveproxy pfn:2286e1
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:2288fd
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228d5c
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228e46
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228e7c
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy worker pfn:1864ff
Jul 06 10:55:57 pve1 kernel: BUG: Bad page state in process pveproxy pfn:104e64
Jul 06 10:55:58 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228f6c
Jul 13 18:06:10 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jul 13 18:06:46 pve1 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [kworker/u8:3:168]
Jul 13 18:06:56 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jul 13 18:10:57 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:228760
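
For completeness, the same journal can be grepped for hardware machine-check events, which a clean memtest pass doesn't rule out - a generic check, nothing specific to this box:

Code:
journalctl -k | grep -iE 'mce|machine check|edac'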
 
I can't really believe it, but I have 3 of these and they are all unstable. It must have been a bad batch - but good enough to pass power-on self-test.
 
All problems appear when starting a particular VM, which is Home Assistant OS
If I understand correctly: when not running that VM (or any VM) you have no issues.

Could you please show the output for:
Code:
qm config <vmid> # replacing <vmid> with the actual id number of the VM


I can't really believe it, but I have 3 of these and they are all unstable.
You probably should stop buying lottery tickets!
 

I thought it was that VM because that VM would trip up; then in the logs you'll see a load of KVM exceptions, followed by some seg faults, and then a bunch of memory page issues. I think it's simply that more was happening in that VM, so it was more likely to hit the problem. However, as noted, when I ran apt install on the command line [on the Proxmox host] it was finding checksum errors in the downloaded files - which to me is an indication of corruption, and that is more often than not memory related (it could be memory timing). These days you don't have to faff with memory timings, because the memory is detected automatically and all should be well.
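
The whole sequence is easier to see by pulling every error-level kernel message in order, rather than grepping for BUG alone - a sketch, reading the previous boot (assuming the journal is persistent) since the box has since restarted:

Code:
journalctl -k -b -1 -p err -o short-monotonic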

I did update the BIOS on the machines and it didn't make any difference; in fact, doing so required a bootable USB, and even that managed to crash, so something is very wrong. I tried Memtest86 and disk diagnostics etc. and they ran fine. Normally when diagnosing issues you'd go back to the minimal configuration - these are basically a board with CPU and memory and not much else: default BIOS settings and away you go. The fact that I have more than one of them meant that I could swap anything, or even the complete machine. Over the last 35 years that approach has served me well, and I'd pretty much always figure out whether the memory or something else was at fault - once the problem follows you from one machine to another, whilst it's possible they both have the same issue, it's very, very unlikely. Yet here we are.

I'll include the requested machine config, but there's nothing odd about it. I also mounted the disk independently and checked that it was ok.
On Sunday I woke up to find the entire disk corrupted, so basically I wiped it all again. The conclusion is that, no matter how unlikely it is, all the machines I have share some incompatibility.


Code:
root@pve:~# qm config 101
agent: 1
bios: ovmf
boot: order=scsi0
cores: 2
cpu: host
description: <div align='center'><a href='https://Helper-Scripts.com' target='_blank' rel='noopener noreferrer'><img src='https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/misc/images/logo-81x112.png'/></a>%0A%0A  # Home Assistant OS%0A%0A  <a href='https://ko-fi.com/D1D7EP4GF'><img src='https://img.shields.io/badge/&#x2615;-Buy me a coffee-blue' /></a>%0A  </div>
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,size=4M
localtime: 1
memory: 4096
meta: creation-qemu=9.2.0,ctime=1742573478
name: haos15.0
net0: virtio=02:4A:61:A0:E1:75,bridge=vmbr0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-1,cache=writethrough,discard=on,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=b51d08f0-69a6-468f-a7f7-677849889055
tablet: 0
tags: community-script
vmgenid: 1e5eb0b6-9b4d-4bb6-aa81-0ebc2dcffe7d
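
One detail from the log and config taken together: the "Bad page state" lines name iou-wrk threads, which are io_uring workers, and scsi0 above sets no aio mode, so it presumably gets the io_uring default. A hypothetical experiment - not something I have verified - would be switching the disk to threaded async I/O, just to rule that path in or out:

Code:
qm set 101 --scsi0 local-lvm:vm-101-disk-1,cache=writethrough,discard=on,size=32G,ssd=1,aio=threads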
 
Still appears unanswered directly:

Have you tested this?
 
I said that I still had issues running apt install, independent of the VM. It also borked when running backups - in fact, that is where it most often happens.
I installed different hardware yesterday evening and am running the same potentially problematic VM, so we will see if that falls over. I no longer have access to the previous logs.

I have now put it down to an issue with the hardware I was using - despite it being a branded and well-supported piece of hardware, often used with Proxmox and home labs.

At work, I have 40 or so Proxmox / PBS hosts across 5 sites. I don't see this type of issue, and if I did, the hardware would quickly be swapped out. Happy to move on; I will come back if the VM does indeed crash again on the new hardware.
 
I said that I still had issues running apt install.
Sorry, but that did not say directly whether the VM was running at that time.

In my experience mini pcs can suffer from the following (in order):
  • Bad chipsets/buses
  • Bad thermals
  • Bad PSUs
  • Bad network hardware/NICs

Often, something simple like too many peripherals connected at once can trigger some (or all) of the above. What have you got connected/attached while running PVE? (Include all USB, PCI & SATA devices etc.)
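
A quick way to enumerate all of that from the PVE host, using standard tools:

Code:
lsusb                          # USB devices
lspci                          # PCI devices
lsblk -o NAME,TRAN,MODEL,SIZE  # block devices and their transport (sata/nvme/usb)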

Consider:
I booted from a bootable Windows disk and ran various diagnostics, and it runs perfectly fine all day, thrashing the disk or memory, no issues at all.
But not when running PVE itself with no VM running. So start with what is physically (HW) different in those 2 scenarios, then move on to SW differences. You could try installing vanilla Debian 12 (which PVE is based on) on the system and then try "thrashing the disk & memory" all day. If that works without issue, consider the above again, etc.
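
A minimal sketch of such a thrash run, assuming stress-ng is available (apt install stress-ng); --verify makes the memory workers check their own data, which is exactly the kind of corruption being chased here:

Code:
stress-ng --vm 2 --vm-bytes 75% --hdd 2 --verify --timeout 24h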

Anyway good luck.
 