Kernel error after fresh install on Intel NUC; PVE freezes

julio91

New Member
Apr 19, 2023
Hi there, I am new to this forum but have been using Proxmox for quite some time now on several devices. I bought a used Intel NUC and encountered an error when running PVE 7.4. It works well with 6.4, though. I have already sunk countless hours into this problem. I hope you can help me out. I would be really grateful.

Description:
After installing Proxmox VE on an Intel NUC (NUC7i5BNK), the system boots fine but freezes after a while. Sometimes I still have access to the web UI for a short while, but mostly it freezes completely. It is a fresh install with no VMs running. The freeze happens after a rather short time (usually less than an hour) and without any activity on the system. I tested the RAM with MemTest86+ several times and once let it run for several passes overnight. It did not report any errors.
The error does not occur with PVE 6.4, which has now been running for several days without issues. I attached the dmesg log below, which shows several boot and error cycles.

Hardware:
Intel NUC7i5BNK 32 GB Configuration


Steps done so far:
  • Installed Proxmox 7.4 --> error
    • 5.15 kernel --> error
    • opt-in pve-kernel-6.2 --> error
    • Reset BIOS to default settings --> error
    • Updated BIOS --> error
    • Re-downloaded, reflashed ISO and installed dozens of times --> error
  • Installed Proxmox 7.2 --> error
  • MemTest86+ (the one provided with the ISO) --> passed
  • Installed Proxmox 6.4 --> working!
  • Installed Debian Desktop (latest) instead of PVE --> working!
 

Anyone please? I am the kind of person who really tries to search the entire internet and tries out stuff before posting a thread but I really do not know what to do here anymore.
Currently I am "stuck" with pve 6.4 and that works but it does not seem like a long term solution ;)
 
I noticed in the log a page fault which cascaded through things, have you tested the memory on the NUC? [You can access it via the boot iso, advanced section on 7.4, I think.]
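For anyone who wants to check a dmesg dump for the kind of faults mentioned above, a rough Python sketch could tally the suspicious lines. The pattern list and the name count_fault_lines are my own assumptions and are not taken from the attached log, so adjust them to whatever your log actually contains.

```python
import re
from collections import Counter

# Assumed patterns: common kernel message fragments, not copied from the
# log attached to this thread. Extend or replace them as needed.
PATTERNS = {
    "page_fault": re.compile(r"unable to handle page fault", re.IGNORECASE),
    "segfault": re.compile(r"segfault at", re.IGNORECASE),
    "gpf": re.compile(r"general protection fault", re.IGNORECASE),
    "mce": re.compile(r"machine check|hardware error", re.IGNORECASE),
}


def count_fault_lines(log_text: str) -> Counter:
    """Count how many log lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, save the log first (for example with dmesg > dmesg.txt) and pass the file contents to count_fault_lines. Many hits spread over unrelated processes would fit a hardware problem better than a single misbehaving program, though the counts alone prove nothing.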
 
Yes. I let it run several times for one full pass each, and once overnight for several passes. I also suspected the memory, but plain Debian and PVE 6.4 work like a charm. Can it still be a memory problem then?
 
It's all the page faults, seg faults and so on in the log which stand out to me. Personally, I'd say something is up with paging, and that'll be memory.
Okay, a surefire way of ruling it out would be to test the setup with a different set of RAM sticks, if possible, to see if it makes a difference.
While memory testing usually captures things well, I've been a bit burnt by it not catching a set of faulty RAM fairly recently. As such, I tend to be a bit more suspicious and untrusting of the results; if it says it's bad, though, then it will be.

You could, as an experiment, also test it with the 6.2 kernel (pinned at the top of the forum somewhere.) Just in case it is a feature flag which is at fault. I must be tired, I keep missing your already-tested steps.

Note: With strange faults, sometimes you have to test everything, and that takes a long time, unfortunately.
 
Hi, thank you for your reply. I was a bit distracted lately, and the testing took a while.

I have one additional 8 GB memory stick from a different brand. The ones already installed are 16 GB each, from the same brand and of the same model.
I swapped the sticks around and tried all possible combinations. For each one I booted and waited to see whether an error occurred. For some combinations I did a second test run. These are the results.

1683873427769.png

To me it points to a faulty RAM No. 2.
BUT
- Why is it not faulty when I run it by itself? Is the stick itself maybe OK but does not pair well with the others? How do I make sure that this does not happen when I buy a single replacement stick?
- Why does the error only occur with PVE 7.4?

It would be awesome if someone could shed some light on this situation. Thanks in advance.
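The swap test described above can be laid out systematically. The following Python sketch lists every way to populate the two SO-DIMM slots with three sticks and then looks for a stick that is present in every failing configuration. The label "RAM 1" and both function names are assumptions made for illustration, and the actual results have to be filled in from the table.

```python
from itertools import combinations

# Stick labels; "RAM 1" is an assumed name for the other 16 GB stick.
STICKS = ["RAM 1", "RAM 2", "foreign RAM"]
SLOTS = 2  # the NUC has two SO-DIMM slots


def configurations(sticks, slots=SLOTS):
    """Every way to populate 1..slots slots, ignoring slot order."""
    configs = []
    for n in range(1, slots + 1):
        configs.extend(combinations(sticks, n))
    return configs


def common_to_failures(results):
    """Return the sticks that are present in every failing configuration.

    results maps a configuration tuple to True if the system froze.
    An empty set is returned when nothing failed.
    """
    failing = [set(cfg) for cfg, froze in results.items() if froze]
    if not failing:
        return set()
    return set.intersection(*failing)
```

With three sticks this gives six configurations (three single sticks and three pairs). A stick that shows up in every failing pair but runs fine alone is a hint at a pairing or timing problem rather than a plainly dead stick.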
 
The main thing is you have identified a cause, which is RAM stick 2, so that's excellent progress. My thought is the timings might be laxer when the stick is on its own, so a tiny fault here and there won't cause major problems. In the case of two sticks, where timing has to be correct at all times, I suspect it is not working within the tolerances required for it to function; so, it's okay in single channel mode, but when in dual channel mode it dies hard.
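To put numbers on the single channel versus dual channel difference, here is a small Python sketch of the theoretical peak transfer rate. It assumes the standard 64-bit (8 byte) DDR4 data bus per channel; the function name is made up for this example, and real-world throughput is lower than these peaks.

```python
def peak_bandwidth_gb_s(transfers_mt_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_mt_s: data rate in mega-transfers per second (2133 for DDR4-2133)
    channels:       1 for single channel, 2 for dual channel
    bus_bytes:      bytes moved per transfer and channel (assumed 8, i.e. 64 bit)
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


single = peak_bandwidth_gb_s(2133, channels=1)
dual = peak_bandwidth_gb_s(2133, channels=2)
print(f"single channel: {single:.1f} GB/s, dual channel: {dual:.1f} GB/s")
```

This only shows that the memory controller moves twice as much data when both channels are active. It does not prove the timing theory, but it fits the idea that a marginal stick gets pushed harder in a pair.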

Now, as to avoiding it... good question, this is one of those things where you have 2 choices:
1) Buy a new pair in a kit, make sure they are fine, and return for replacement if not.
2) Buy a single stick with the right timings, and cross your fingers.

As for the differences between 7.4 and earlier versions, my guess is an efficiency gain somewhere that drives the RAM sticks harder; page table improvements would do it. That sort of thing tends to come with kernel upgrades and backports, and the kernel definitely changed between 6.4 and 7.2/7.4.
 
Hi,
thanks a lot for your continuous help.

Hmm, the main reason I bought this NUC was the 32 GB of RAM. That is a bummer. OK, I guess I have to pour more money into it :/
The installed sticks are the following. They have identical markings (not sure if that means they are a matched pair).
- Crucial CT16G4SFD8213.C16FH1 16GB DDR4-2133 SODIMM 1.2V CL15

I have very limited knowledge about memory.
Regarding your second option: do you have any advice on what to look out for?

Would it be an option to try out other values here, or is this rather a dead end?

signal-2023-05-12-120659_002.jpeg


Thanks in advance and best regards
 
I had a problem with memory myself and ended up going for a pair of G.Skill Trident Z RGB 32GB sticks. Those things just work, so a 2 x 16GB kit would probably work on your end. Or at least, let me put it this way, I've not run into a problem with better memory.

Having said that, ECC RAM does work better, but then you need the hardware and budget to handle it.

Now, to profiles, have you tried slower? I downclocked mine for stability: 3600 to 2400.
 
The only setting I could use that influenced the clock speed was the "Memory multiplier". Setting it to 12 resulted in 1600 MHz, but that did not work. Setting it to 14 strangely did not change the resulting frequency and did not work either; 16 was the original value.
I also played around with the tRASmin value, but to no avail. Is there anything else I could try if there is no direct option to underclock the RAM or change its voltage?
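The multiplier values can be sanity-checked with a bit of arithmetic. Both data points (12 giving 1600 MHz and 16 being the stock DDR4-2133 setting) fit a step of about 133.33 MHz per multiplier unit. That step is inferred from those two numbers and is an assumption, not something read from the BIOS; the names below are made up as well.

```python
from fractions import Fraction

# Assumed step per multiplier unit: 400/3 MHz (about 133.33 MHz),
# inferred from 12 -> 1600 and 16 -> 2133.
REF_MHZ = Fraction(400, 3)


def data_rate_mhz(multiplier: int) -> float:
    """Expected memory data rate for a given multiplier, under the assumed step."""
    return float(multiplier * REF_MHZ)


for m in (12, 14, 16):
    print(m, round(data_rate_mhz(m)))
```

Under that assumption, a multiplier of 14 should land at roughly 1867 MHz. Since the displayed frequency did not change, the BIOS may simply not have applied that value.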

Maybe I will just keep the other RAM (marked in the table as "foreign RAM") and run the NUC with 1x16 GB Crucial and 1x8 GB. Are there any strong objections to this? For a start, 24 GB might be enough for me, and I would then use RAM 2 on its own in the other NUC. So basically I would swap RAM 2 with the foreign RAM permanently. I could still replace the RAM later on.
 
At the moment, I'm inclined to say do what works. It shouldn't hurt; it will only restrict how many things you can run, or limit the size of the ARC if you are using ZFS. As for the other stick, it is 16 GB and can easily be used in a variety of other projects.
Actually, if you were looking at testing Proxmox Backup Server, you could use it to run that, and do some training or use it properly.
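As a rough idea of what the smaller RAM size means for ZFS, here is a short Python sketch. It assumes the common OpenZFS-on-Linux default in which the ARC may grow to about half of the installed RAM unless zfs_arc_max is set; treat that fraction as an assumption and check your own system. The function name is made up for this example.

```python
def default_arc_max_gib(ram_gib: float, fraction: float = 0.5) -> float:
    """Estimated ARC upper limit in GiB.

    fraction is the assumed default share of RAM the ARC may use (0.5 here);
    an explicit zfs_arc_max setting would override it.
    """
    return ram_gib * fraction


for ram in (32, 24):
    print(f"{ram} GiB RAM -> ARC limit of about {default_arc_max_gib(ram):.0f} GiB")
```

So going from 32 GB to 24 GB would lower the estimated ARC ceiling from about 16 GiB to about 12 GiB, which matters mostly if the VMs need that memory at the same time.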
 
