[SOLVED] Problem with Intel NUC and Proxmox, troubleshooting

thrusty

Member
Nov 14, 2018
I have Proxmox running on an Intel NUC NUC8i5BEH with 4 VMs. From time to time, Proxmox stops responding. This shows up as no longer being able to log in via SSH; instead I get this error message: -bash: /etc/profile: input/output error

Furthermore, the WebUI can no longer be opened.

The VMs continue to run, but some processes inside them crash. So Proxmox is not completely dead; I just cannot access it anymore.

After a hard reset and restart of Proxmox, everything runs normally again.

In the syslog I found a hint of the Intel MDS bug (Zombieload, CVE-2018-12130).

Code:
pve kernel: [    0.165869] MDS CPU bug present and SMT on, data leak possible.
pve kernel: [    0.378694] pci 0000:00:1d.6: ASPM: current common clock configuration is broken, reconfig

Is it possible that this bug could lead to such behavior?

I would like to post the syslog here so that someone with a clue could take a look at it, but it is far too long.

I'm hoping someone can give me some advice on how to fix this problem.
Thanks for any help.
 
The relevant parts of the syslog would be around the time when it 'crashes'; you can post that between CODE tags to make it readable (three dots in the edit toolbar -> Insert "Code").
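If it helps, something like this pulls just that window out of the journal (the timestamps are placeholders; adjust them to the actual time of a hang):
Code:
journalctl --since "2020-03-29 02:30" --until "2020-03-29 03:30"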

Code:
pve kernel: [    0.165869] MDS CPU bug present and SMT on, data leak possible.
pve kernel: [    0.378694] pci 0000:00:1d.6: ASPM: current common clock configuration is broken, reconfig

Is it possible that this bug could lead to such behavior?

MDS is a security bug first and foremost, and does not cause crashes like you're seeing. The second line is new to me, but seems entirely unrelated.

Instead I get this error message: -bash: /etc/profile: input/output error

Could it be a hardware fault? RAM, disks, etc...? Also, check for free space on all drives to make sure you're not running out - full disks can often cause weird behaviour like this.
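A quick first pass from the shell could look roughly like this (a sketch; it assumes the boot disk shows up as /dev/sda and that the smartmontools package is installed):
Code:
# free space and inodes on all mounted filesystems
df -h
df -i
# overall SMART health and the raw attribute table of the SSD
smartctl -H /dev/sda
smartctl -A /dev/sda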
 
The relevant parts of the syslog would be around the time when it 'crashes'; you can post that between CODE tags to make it readable (three dots in the edit toolbar -> Insert "Code").

Thanks for the quick answer.
Unfortunately I can't say for sure when this behaviour occurs, which makes it difficult to identify a specific area in the syslog. I will have to observe this for a while.

MDS is a security bug first and foremost, and does not cause crashes like you're seeing. The second line is new to me, but seems entirely unrelated.

I thought so, but I wasn't sure. I had been hoping a little that I'd found the trigger of the error. ;)

Could it be a hardware fault? RAM, disks, etc...? Also, check for free space on all drives to make sure you're not running out - full disks can often cause weird behaviour like this.

Space should be sufficient; local and local-lvm both have about 80% free space.
I can test the RAM tonight with Memtest86+ and report back.
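If booting into Memtest86+ turns out to be inconvenient, I might also try a rough in-OS test with memtester first (it can only exercise memory that is currently free, so it is weaker than a boot-time test):
Code:
apt install memtester
# lock and test 4 GiB of free RAM, 2 passes
memtester 4G 2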
 
Yesterday I swapped the RAM for a new module, in the hope that bad memory addresses were causing the problem. But obviously this did not solve it: last night the strange behaviour occurred again, and I can't find the cause. A lack of storage space can't be it either; the hard disk (an SSD) still has enough free space. The problem occurs irregularly, sometimes after a few hours, sometimes after a few days.
In the syslog I can't find any anomaly that points to this behaviour. Are there other log files where I might find a clue?
As I have already written, Proxmox runs on an Intel NUC. Could it be that I need to change specific settings in the NUC's BIOS?

Is it possible that the NUC itself has a fault? And how could I test that?
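One more thing I can try in the meantime: if the systemd journal is stored persistently (Storage=persistent in /etc/systemd/journald.conf), I should be able to read the errors of the boot before the hard reset:
Code:
journalctl --list-boots
# previous boot, priority err and worse
journalctl -b -1 -p 3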
 
I have the same NUC model and have no issues other than the well-documented “Detected Hardware Unit Hang” issue with the integrated e1000e NIC.
Perhaps check your syslog for any occurrences of “Detected Hardware Unit Hang”, where the network briefly loses connectivity before returning ... I don't notice any loss of connectivity to the host when this occurs, though, so this may or may not be your cause.

Take a look at the following post along with the workaround (not fix)...

https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-4#post-302173
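As far as I remember, the workaround discussed there boils down to disabling hardware offloading on the NIC with ethtool, roughly like this (eno1 is the interface name on my box, yours may differ, and it costs some CPU):
Code:
ethtool -K eno1 tso off gso off gro off
# to persist it on Debian, add a post-up line to the iface stanza
# in /etc/network/interfaces:
#   post-up /sbin/ethtool -K eno1 tso off gso off gro off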
 
Thanks for the answer.
There is no "Detected Hardware Unit Hang" entry in my syslog. There are a few "e1000e" entries, but I don't seem to have a problem with those; I think they are normal messages from system startup.

Code:
Mar 29 09:44:54 pve kernel: [    2.828230] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
Mar 29 09:44:54 pve kernel: [    2.828231] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mar 29 09:44:54 pve kernel: [    2.828380] i801_smbus 0000:00:1f.4: SMBus using PCI interrupt
Mar 29 09:44:54 pve kernel: [    2.828550] e1000e 0000:00:1f.6: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
......

.....
Mar 29 09:44:54 pve kernel: [    3.040254] e1000e 0000:00:1f.6 0000:00:1f.6 (uninitialized): registered PHC clock
Mar 29 09:44:54 pve kernel: [    3.111898] e1000e 0000:00:1f.6 eth0: (PCI Express:2.5GT/s:Width x1) 94:c6:91:af:e5:c4
Mar 29 09:44:54 pve kernel: [    3.111899] e1000e 0000:00:1f.6 eth0: Intel(R) PRO/1000 Network Connection
Mar 29 09:44:54 pve kernel: [    3.112019] e1000e 0000:00:1f.6 eth0: MAC: 13, PHY: 12, PBA No: FFFFFF-0FF
Mar 29 09:44:54 pve kernel: [    3.112783] e1000e 0000:00:1f.6 eno1: renamed from eth0
 
There are a lot of variables to chase here. Maybe add some kind of external monitoring like Zabbix or Monit. You could also install Netdata and configure it to persist data locally. That would give you some metrics to eyeball around the time the problem occurs. Have you considered that temperature might be the cause?
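For a quick temperature check without setting up full monitoring, lm-sensors is usually enough (package name as on Debian):
Code:
apt install lm-sensors
# detect available sensor chips once, then read them
sensors-detect
sensors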
 
Hello and thanks for your help.
For reference, this is the hardware I'm running:
Intel NUC NUC8i5BEH
SSD Kingston A400 SA400S37/480G
RAM Kingston HyperX HX424S14IB/16
Proxmox V 6.1-8 with 4 running VMs

Thanks for the tip about the monitoring. I'm gonna have to read up on this. I don't know how to monitor certain services at this point.

I have read about timing problems with the Kingston A400 SSD series. Maybe that problem is related to my system failures?
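If those A400 problems should turn out to be firmware-related (I am only guessing here), I could at least check which firmware revision my drive is running; smartctl prints it in the identify section:
Code:
smartctl -i /dev/sda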

This morning the system failed again. No access to the WebUI.
Via SSH I can still run commands like
Code:
# df -hl
and get output.
But if I enter
Code:
# service pveproxy restart
then I get
Code:
-bash: /usr/sbin/services: input/output error
The behavior inside the VMs is similar.
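Next time it happens I want to try reading the kernel log with shell built-ins only, since binaries on the broken disk fail with input/output errors but an already-open bash keeps working (just an idea, not tested yet):
Code:
# read and echo are bash built-ins, so nothing has to be loaded from disk
while read -r line; do echo "$line"; done < /dev/kmsg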
 
After some time I replaced the SSD, and the problem is gone. Apparently some sectors on the SSD had failed or were only partially working.
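For anyone who finds this thread later: a dying SSD like this should also show up in its SMART attributes. On the old drive, something like this would presumably have shown non-zero reallocated or pending sector counts:
Code:
smartctl -A /dev/sda | grep -Ei 'reallocat|pending|uncorrect'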
 
