newbie: Am I correct in assuming I have defective hardware?

proxbash

New Member
May 25, 2025
I'm very much a newbie learning my way around Proxmox VE and PBS. Last month I bought a Minisforum MS-01 from Amazon and set up Proxmox VE on it with a couple of VMs. Then I bought a second MS-01 and got the two working as a 2-node cluster. I'm only having problems with the second unit. It runs small workloads OK, like an online migration of a Linux VM with 4GB RAM and a 32GB disk. But with a bigger workload, such as a VM with a 75GB disk, the migration to the failing MS-01 (the new one) never finishes. Sometimes the target hangs hard; usually the sender detects the error and everything rolls back. Poking around in the logs, I find MCE errors being logged only by this failing unit.

This is typical output during a big migration:
Code:
root@ms-01-20250512:~# journalctl -k -f | grep -iE 'mce|cmci|error'
May 25 09:22:24 ms-01-20250512 kernel: CPU12 BANK1 CMCI storm detected
May 25 09:22:24 ms-01-20250512 kernel: CPU16 BANK1 CMCI storm detected
May 25 09:22:24 ms-01-20250512 kernel: mce: [Hardware Error]: Machine check events logged
May 25 09:22:24 ms-01-20250512 kernel: mce: [Hardware Error]: Machine check events logged
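
To see whether the storms cluster on particular cores or banks, the log lines above can be tallied. This is only a rough sketch: `count_cmci_storms` is a made-up helper name, and it assumes the kernel messages keep the exact "CPUn BANKn CMCI storm detected" wording shown in this output.

```shell
# Summarize "CMCI storm detected" kernel messages per CPU and bank.
# Reads journal or dmesg text on stdin and prints "<count> <CPU> <BANK>" lines.
# Hypothetical helper; relies on the message wording seen in the log above.
count_cmci_storms() {
    grep -oE 'CPU[0-9]+ BANK[0-9]+ CMCI storm detected' \
        | awk '{ print $1, $2 }' \
        | sort | uniq -c \
        | awk '{ print $1, $2, $3 }'
}

# Usage on the affected node (kernel messages from the current boot):
#   journalctl -k -b | count_cmci_storms
```

If the same CPU/bank pairs keep showing up across several failed migrations, that is a useful detail to include in a return or support request.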

If I set up ZFS replication for the guests so they migrate quickly, the replication runs for a while, generates errors like the ones above, then hangs the target node hard, requiring a power-off/reset.

I disabled the pve-enterprise repo and enabled pve-no-subscription, ran apt update/apt upgrade, and installed the Intel microcode package. I also upgraded the BIOS to 1.26. Both units have the same hardware and, as near as I can tell, the same software too.

To compare the MS-01 that can receive a big replication with the one that can't, I put this together to confirm I get the same output on each unit, and I do: only the serial numbers differ. In the BIOS I didn't touch any overclocking settings and only disabled Secure Boot.
Code:
dmidecode -s system-serial-number
pveversion -v
grep -i microcode /proc/cpuinfo | uniq
dmesg | grep -i microcode
dpkg -l | grep intel-microcode
dmidecode -t bios
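
Rather than comparing the output by eye, the same commands could be captured to a file on each node and diffed. A minimal sketch, assuming bash and root access: `snapshot_node` and `compare_snapshots` are hypothetical helper names, the serial number is left out on purpose since it always differs, and the dmesg timestamps are stripped because they would otherwise differ on every boot.

```shell
# Capture the comparison commands from the list above into one file.
# Hypothetical helper; run as root on each node.
snapshot_node() {
    out="${1:-/tmp/$(hostname)-snapshot.txt}"
    {
        echo "== pveversion =="
        pveversion -v
        echo "== microcode (cpuinfo) =="
        grep -i microcode /proc/cpuinfo | uniq
        echo "== microcode (dmesg) =="
        # Drop the leading "[   12.345678] " timestamp so boots compare cleanly.
        dmesg | grep -i microcode | sed 's/^\[[^]]*\] *//'
        echo "== intel-microcode package =="
        dpkg -l | grep intel-microcode
        echo "== BIOS =="
        dmidecode -t bios
    } > "$out" 2>&1
    echo "$out"
}

# Show a unified diff of two snapshot files; exit status 0 means identical.
compare_snapshots() {
    diff -u "$1" "$2"
}

# Usage: run snapshot_node on both nodes, copy one file across, then
#   compare_snapshots good-node.txt bad-node.txt
```

An empty diff supports the claim that both units run the same software and firmware, which points further toward hardware.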

What irks me a little is that I can run memtest86 for 5+ passes without any errors. But I've heard that these errors can come from L2 cache problems on the CPU that memtest won't show? I would feel more grounded if I had something like memtest86 that fails, instead of only my real workload, which leaves open the question of whether I've installed and configured everything correctly. You never know what you don't know.

Thanks for any comments!
 
Could you check the temperature on both units? The MS-01 is known to have bad thermal paste:

https://forums.servethehome.com/ind...ompatibility-thread.42785/page-30#post-415137
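
One way to check temperatures without installing anything is to read the kernel's thermal zones from sysfs while a big migration is running. This is just a sketch: `print_thermal_zones` is a made-up helper name, and which zones show up (if any) depends on the platform, so lm-sensors' `sensors` command may give a fuller picture.

```shell
# Print each thermal zone's name, type and temperature in degrees C.
# sysfs reports temperatures in millidegrees Celsius.
# The root directory is a parameter only so the function can be tried on sample data.
print_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        millideg=$(cat "$zone/temp")
        awk -v z="$(basename "$zone")" -v t="$ztype" -v m="$millideg" \
            'BEGIN { printf "%s %s %.1fC\n", z, t, m / 1000 }'
    done
}

# Usage: sample every 2 seconds on both nodes during a migration
#   while sleep 2; do print_thermal_zones; done
```

If the failing unit runs noticeably hotter than the good one under the same load, that would fit the thermal paste theory.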

And please do not use liquid metal:

https://forums.servethehome.com/ind...orum-ms-01-heating-problem.43519/#post-416292

With liquid metal we are seeing -20C, and with premium paste -10C. Once you do this, though, the warranty will be gone.

If you are still in the return window, you could return it to Amazon, because the warranty from Minisforum is exactly like their thermal paste.
 