Recently I have had many issues, from a corrupt SSD to regular crashes. More recently, running things like apt upgrade reports checksum errors in the downloaded files. So you might think that sounds like dodgy RAM? I ran memtest all day and it found no issues. Same thing.
I then swapped in the RAM from another system. Same thing.
I then swapped the entire board, including the CPU, and replaced the SSD with an M.2 drive. Same thing.
I have 3 of these machines, so I installed again from scratch on the 3rd machine. Same thing.
So at this point all the hardware has been changed in every combination and I still have issues, which is extremely perplexing. I didn't have issues until Proxmox 8.4. But the issues I encounter really do still feel like memory issues.
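One cheap check for the checksum errors is to hash the same downloaded file several times on the affected machine and see whether the digest ever changes. The following is only a minimal sketch: the function names are made up for illustration, and since repeated reads may be served from the page cache, identical digests do not prove the hardware is fine. Differing digests for an unchanged file, on the other hand, would point at corruption on the machine rather than at the download.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, rounds=5):
    """Hash the same file `rounds` times and return the set of digests seen.

    More than one element in the result means the same bytes hashed
    differently between runs.
    """
    return {sha256_of(path) for _ in range(rounds)}
```

Calling `distinct_digests()` on a large file, for example a downloaded .deb or an ISO, and checking that the returned set has exactly one element is the whole test.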
Now the machine identifies as:
pve1 kernel: Hardware name: HP HP EliteDesk 800 G3 DM 65W/829A, BIOS P21 Ver. 02.38 04/21/2021
There is a BIOS update dated 2024, but no information of any real use accompanies it. If there were a systemic problem with that machine I am certain I'd find it via Google, but I can't find anything useful.
The RAM appears to be DDR4 Speed: 2400 MT/s Manufacturer: Hynix/Hyundai Part Number: LX8GDDR4T2400. With 3 machines I have 6 sticks. That's not a no-name brand and as far as I know it should be OK.
The problems all appear when starting one particular VM, Home Assistant OS, but since I have also seen issues when it isn't running, I'm working on the basis that it's not related; it just happens to use a large enough chunk of resources to make a masked problem noticeable.
I booted from a Windows bootable disk and ran various diagnostics, and they ran perfectly fine all day, thrashing the disk and memory, with no issues at all.
So I don't really know what my question is. It just looks like this machine will not run Proxmox without falling over and I have no idea why. Having replaced everything and reinstalled from a freshly downloaded image, nothing of the original setup remains.
Only the PSU hasn't been changed, so I will try that next, along with the BIOS upgrade, which unfortunately can only be installed from Windows.
I am really asking if anyone has any ideas what else to try. I find it surprising that with 3 completely independent machines showing the same problem, it cannot be a fault on one machine; it must be a fundamental incompatibility.
root@pve1:/var/log# journalctl | grep "BUG:"
Jun 16 12:32:40 pve1 kernel: BUG: unable to handle page fault for address: 000000000002a7c0
Jun 16 12:39:24 pve1 kernel: BUG: Bad page state in process kworker/u8:8 pfn:2f75a2
Jun 16 12:39:26 pve1 kernel: BUG: Bad page state in process zstd pfn:337434
Jun 16 12:39:30 pve1 kernel: BUG: Bad page state in process vma pfn:35309e
Jun 16 12:39:33 pve1 kernel: BUG: Bad page state in process vma pfn:39e49a
Jun 16 12:39:41 pve1 kernel: BUG: Bad page state in process zstd pfn:290c3
Jun 16 12:39:43 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 16 12:44:01 pve1 kernel: BUG: scheduling while atomic: vma/4716/0x00000000
Jun 16 13:09:14 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 22 16:47:35 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1bdcb4
Jun 22 16:47:35 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1c2743
Jun 22 16:47:37 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1ced99
Jun 22 16:47:38 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1d351c
Jun 22 16:47:41 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1e189e
Jun 22 16:47:42 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1e5f07
Jun 22 16:47:43 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1ed6b9
Jun 22 16:47:44 pve1 kernel: BUG: Bad page state in process iou-wrk-2210 pfn:1eff53
Jun 22 16:59:44 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 29 01:00:59 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:340c59
Jun 29 01:01:00 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:344d0b
Jun 29 01:01:02 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:39e49a
Jun 29 01:01:03 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3ae598
Jun 29 01:01:04 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3b70cd
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3bfc15
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3c4330
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3c6f86
Jun 29 01:01:05 pve1 kernel: BUG: Bad page state in process kvm pfn:3c6d0c
Jun 29 01:01:06 pve1 kernel: BUG: Bad page state in process kvm pfn:3d038f
Jun 29 01:01:06 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:3d0ea9
Jul 06 01:00:10 pve1 kernel: BUG: Bad page state in process tar pfn:344d0b
Jul 06 01:00:30 pve1 kernel: BUG: Bad page state in process tar pfn:39e49a
Jul 06 10:53:28 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1b9111
Jul 06 10:53:31 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1c9290
Jul 06 10:53:31 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1c7672
Jul 06 10:53:34 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:1dc232
Jul 06 10:53:38 pve1 kernel: BUG: Bad page state in process CPU 1/KVM pfn:1efa50
Jul 06 10:53:43 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:1f64a4
Jul 06 10:53:48 pve1 kernel: BUG: Bad page state in process iou-wrk-2119 pfn:215894
Jul 06 10:53:48 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:21a88a
Jul 06 10:55:16 pve1 kernel: BUG: Bad page state in process pvedaemon worke pfn:1888af
Jul 06 10:55:16 pve1 kernel: BUG: unable to handle page fault for address: 0000000008000000
Jul 06 10:55:22 pve1 kernel: BUG: Bad page state in process pve-firewall pfn:163e30
Jul 06 10:55:54 pve1 kernel: BUG: Bad page map in process pveproxy worker pte:200000000000 pmd:11601a067
Jul 06 10:55:54 pve1 kernel: BUG: Bad rss-counter state mm:000000009949e7bb type:MM_SWAPENTS val:-1
Jul 06 10:55:54 pve1 kernel: BUG: Bad page state in process pveproxy pfn:227e4a
Jul 06 10:55:55 pve1 kernel: BUG: Bad page state in process pveproxy pfn:227ff2
Jul 06 10:55:55 pve1 kernel: BUG: Bad page state in process pveproxy pfn:2286e1
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:2288fd
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228d5c
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228e46
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228e7c
Jul 06 10:55:56 pve1 kernel: BUG: Bad page state in process pveproxy worker pfn:1864ff
Jul 06 10:55:57 pve1 kernel: BUG: Bad page state in process pveproxy pfn:104e64
Jul 06 10:55:58 pve1 kernel: BUG: Bad page state in process pveproxy pfn:228f6c
Jul 13 18:06:10 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jul 13 18:06:46 pve1 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [kworker/u8:3:168]
Jul 13 18:06:56 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jul 13 18:10:57 pve1 kernel: BUG: Bad page state in process CPU 0/KVM pfn:228760
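The log above is easier to read once it is grouped. Some page frame numbers come up on more than one day: pfn 39e49a on Jun 16, Jun 29 and Jul 06, and pfn 344d0b on Jun 29 and Jul 06. The sketch below counts the BUG lines by kind and lists the pfns that recur across days. It assumes the journalctl line format shown above and 4 KiB pages (the x86-64 default) for the pfn-to-address conversion; the function names and the embedded sample are only for illustration. Whether recurring pfns mean anything is not something the log alone can settle.

```python
import re
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, the x86-64 default

# Matches lines such as:
#   Jun 16 12:39:33 pve1 kernel: BUG: Bad page state in process vma pfn:39e49a
BUG_RE = re.compile(
    r"^(\w{3} \d{2}) \d{2}:\d{2}:\d{2} \S+ kernel: (?:watchdog: )?BUG: (.*)$"
)
PFN_RE = re.compile(r"pfn:([0-9a-f]+)")
# The text before any of these separators is used as the "kind" of the report.
KIND_SPLIT_RE = re.compile(r" in process | for address|: | - | mm:")


def pfn_to_phys(pfn_hex):
    """Convert a hex page frame number to a physical byte address."""
    return int(pfn_hex, 16) * PAGE_SIZE


def summarize(lines):
    """Return (count of BUG kinds, pfns seen on more than one day)."""
    kinds = Counter()
    pfn_days = {}
    for line in lines:
        match = BUG_RE.match(line.strip())
        if not match:
            continue
        day, message = match.groups()
        kinds[KIND_SPLIT_RE.split(message, maxsplit=1)[0]] += 1
        pfn = PFN_RE.search(message)
        if pfn:
            days = pfn_days.setdefault(pfn.group(1), [])
            if day not in days:
                days.append(day)
    recurring = {p: d for p, d in pfn_days.items() if len(d) > 1}
    return kinds, recurring


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Jun 16 12:39:33 pve1 kernel: BUG: Bad page state in process vma pfn:39e49a
Jun 16 12:39:43 pve1 kernel: BUG: unable to handle page fault for address: 0000000000029e40
Jun 29 01:01:00 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:344d0b
Jun 29 01:01:02 pve1 kernel: BUG: Bad page state in process iou-wrk-2150 pfn:39e49a
Jul 06 01:00:10 pve1 kernel: BUG: Bad page state in process tar pfn:344d0b
Jul 06 01:00:30 pve1 kernel: BUG: Bad page state in process tar pfn:39e49a
Jul 13 18:06:46 pve1 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [kworker/u8:3:168]
""".splitlines()

if __name__ == "__main__":
    kinds, recurring = summarize(SAMPLE)
    for kind, count in kinds.most_common():
        print(f"{count:3d}  {kind}")
    for pfn, days in recurring.items():
        print(f"pfn {pfn} (phys {hex(pfn_to_phys(pfn))}) seen on: {', '.join(days)}")
```

To run it on the full journal instead of the sample, pass the lines from `sys.stdin` to `summarize()` and pipe the output of `journalctl` into the script.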