Intel NUC NUC7i7BNH Crashes Periodically

sea3pea0 · Feb 12, 2019

I'm running Proxmox 5.3-9 with two VMs, a Windows Server and Ubuntu 18.04. The Ubuntu 18.04 VM, which is used solely to run Docker, has been crashing periodically: sometimes once a week, sometimes every 2-3 days. Usually a crash also makes the local network unusable until I unplug the network cable from the NUC, after which the network works again instantly. Right now the Ubuntu VM is crashed, but in a rare circumstance I can still access all the services running in its Docker containers. However, I can't SSH in, and the Proxmox console for the VM shows:

About half of the time the Ubuntu VM crashes, since it brings the network down, the only way I can recover is by turning the machine off and back on. I have the logs from the Ubuntu VM piped to a syslog server on my NAS. Here are some logs from there related to the present crash:

Code:

2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.986903] ---[ end trace 8b0e882c8a395aec ]---
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.986654] CR2: ffff93f7c6456d80
2019-02-12 09:07:39 Alert pea_palace kern kernel [86472.986298] RIP: kmem_cache_alloc+0x81/0x1b0 RSP: ffffb4b586c43e80
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.985383] Code: 84 1c 5a 49 83 78 10 00 4d 8b 30 0f 84 00 01 00 00 4d 85 f6 0f 84 f7 00 00 00 49 63 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 40 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.985013] R13: 00007fff24db7fbe R14: 0000000000409295 R15: 0000000000000000
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.984646] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000c
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.984281] RBP: 00007fff24db75b0 R08: 0000000000000000 R09: 0000000000000000
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.983918] RDX: 000000000040bbd4 RSI: 0000000000000800 RDI: 0000000000000002
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.983560] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000407bdd
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.983096] RSP: 002b:00007fff24db7460 EFLAGS: 00000206 ORIG_RAX: 0000000000000070
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.982849] RIP: 0033:0x407bdd
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.982547] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.982288] do_syscall_64+0x73/0x130
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.982002] sys_setsid+0x7c/0x110
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.981692] sched_autogroup_create_attach+0x3f/0x130
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.981420] sched_create_group+0x27/0x80
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.981143] ? sched_create_group+0x27/0x80
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.980916] Call Trace:
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.980551] CR2: ffff93f7c6456d80 CR3: 00000005a2a8a000 CR4: 00000000000006f0
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.980231] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.979740] FS: 000000000040bb98(0000) GS:ffff9391ffc00000(0000) knlGS:0000000000000000
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.979371] R13: ffff9391e9016a80 R14: ffff93f7c6456d80 R15: ffff9391e9016a80
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.979004] R10: ffff938cf207ea20 R11: 0000000000000000 R12: 00000000014080c0
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.978631] RBP: ffffb4b586c43eb0 R08: ffff9391ffc27480 R09: 0000000000000000
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.978257] RDX: 00000000000013c7 RSI: 00000000014080c0 RDI: 0000000000027480
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.977871] RAX: ffff93f7c6456d80 RBX: ffff93f7c6456d80 RCX: 00000000000013c8
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.977558] RSP: 0018:ffffb4b586c43e80 EFLAGS: 00010282
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.977261] RIP: 0010:kmem_cache_alloc+0x81/0x1b0
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.976643] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.976129] CPU: 0 PID: 20443 Comm: s6-supervise Tainted: G D W 4.15.0-45-generic #48-Ubuntu
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.973397] Modules linked in: veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc aufs overlay input_leds serio_raw joydev cdc_acm shpchp qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic virtio_scsi e1000 usbhid hid psmouse i2c_piix4 floppy pata_acpi
2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.973134] Oops: 0000 [#691] SMP PTI
2019-02-12 09:07:39 Information pea_palace kern kernel [86472.972850] PGD 57d93f067 P4D 57d93f067 PUD 0
2019-02-12 09:07:39 Alert pea_palace kern kernel [86472.972576] IP: kmem_cache_alloc+0x81/0x1b0
2019-02-12 09:07:39 Alert pea_palace kern kernel [86472.972185] BUG: unable to handle kernel paging request at ffff93f7c6456d80
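
A note on reading the log above: it is listed newest-first, as exported by the syslog server, and the header `Oops: 0000 [#691]` carries a counter. The kernel increments the number after `#` on every oops, so this entry is the 691st oops since that boot and not the first failure; the `D` and `W` flags on the `Tainted:` line likewise record an earlier oops and warning. The earliest oops after boot is usually the more informative one to look for. The following is a minimal Python sketch, not a definitive tool, that puts such lines back in kernel-time order and pulls out the counter and faulting address. The function names and the `SAMPLE` list (three lines copied from the log above) are illustrative assumptions.

```python
import re

# Kernel timestamp in square brackets, e.g. "[86472.986903]".
KERNEL_TS = re.compile(r"\[\s*(\d+\.\d+)\]")
# Oops header, e.g. "Oops: 0000 [#691] SMP PTI" (error code, then oops counter).
OOPS_HEADER = re.compile(r"Oops: (\w+) \[#(\d+)\]")
# Faulting address reported by the page-fault handler.
FAULT_ADDR = re.compile(r"unable to handle kernel paging request at (\w+)")

# Three lines copied from the log in this thread, in their exported (newest-first) order.
SAMPLE = [
    "2019-02-12 09:07:39 Warning pea_palace kern kernel [86472.973134] Oops: 0000 [#691] SMP PTI",
    "2019-02-12 09:07:39 Alert pea_palace kern kernel [86472.972576] IP: kmem_cache_alloc+0x81/0x1b0",
    "2019-02-12 09:07:39 Alert pea_palace kern kernel [86472.972185] BUG: unable to handle kernel paging request at ffff93f7c6456d80",
]


def sort_by_kernel_time(lines):
    """Return the lines that carry a kernel timestamp, oldest first."""
    stamped = []
    for line in lines:
        match = KERNEL_TS.search(line)
        if match:
            stamped.append((float(match.group(1)), line))
    stamped.sort(key=lambda pair: pair[0])
    return [line for _, line in stamped]


def summarize_oops(lines):
    """Extract the oops error code, oops counter, and faulting address if present."""
    summary = {}
    for line in sort_by_kernel_time(lines):
        match = OOPS_HEADER.search(line)
        if match:
            summary["error_code"] = match.group(1)
            summary["oops_count"] = int(match.group(2))
        match = FAULT_ADDR.search(line)
        if match:
            summary["fault_address"] = match.group(1)
    return summary
```

Calling `summarize_oops(SAMPLE)` returns a dictionary with the keys `error_code`, `oops_count`, and `fault_address`. Running the same functions over the full syslog history for one boot and looking at the entry with the lowest `oops_count` would be one way to find where the trouble started.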

I've been dealing with this issue for a while, and I'd really like to figure out how to prevent these crashes. If someone has some insight, I'd be grateful for help sorting this out. If there's more information I can provide, I'm happy to do so; just let me know. Thanks in advance!

mailinglists · Feb 13, 2019

Might be a hardware error.
Did you test your hardware for defects?
Did you let memtest run for a while?
The network probably becomes unusable because, when the node crashes, it keeps resending the same frames and floods the network.

sea3pea0 · Mar 25, 2019

My system was up for a few weeks and finally crashed the same way again this morning. Nothing noteworthy that I know of happened before the crash. I ran memtest for 8 passes without an error. I'm not sure what to try next.

I installed Proxmox as a UEFI OS, so I'm thinking of throwing a Hail Mary at my NUC by installing Proxmox as a legacy OS to see if this makes any difference.

loomes · Mar 25, 2019

I had a similar problem some weeks ago with my NUC7i5BNH: random crashes (same network problems as you described), sometimes after a few hours, sometimes after days.
It was a BIOS problem. I think it was version 70 or 71. After updating to 72 (the newest one back then) the error never came back.
So if you don't have the current BIOS, try the newest one.

sea3pea0 · Mar 26, 2019

I was on revision 72 when the current crash happened. I just updated to revision 76. I've been having this problem ever since I got my NUC 6 months to a year ago. I started out by installing Proxmox on Debian, then I tried using the Proxmox installer, with a couple of fresh installs. I had a Windows Server VM running for a while that seemed to make the processor work way too much, especially during Windows updates for some reason. I noticed that it crashed more frequently while the Windows VM was running.

@loomes how did you install Proxmox? Did you use the installer? Did you install as a legacy OS or UEFI? Did you make any significant changes to the BIOS?

NdK73 · Mar 26, 2019

I'd bet on CPU overheating when under load. Or bad PSU that can't keep power stable when maximum current is required for extended periods.
Have you tried cpuburn?

sea3pea0 · Mar 26, 2019

I will try that, but since I removed the Windows VM I haven't noticed the NUC under any significant load, yet it still crashed. I've never used cpuburn before. What should I do while it runs? Should I use another monitoring tool to watch for specific symptoms?

NdK73 · Mar 27, 2019

cpuburn stresses the CPU to highlight power and thermal issues.
There shouldn't be anything else running. Better if you run it from runlevel 1 with read-only filesystems (or from a USB key), to avoid corruption.
If something goes wrong the hard way, it'll lock up the machine.
You have to let it run for at least a day (unless it highlights a problem sooner). Run one instance per core.
You can monitor the CPU temperature and verify voltages using lm-sensors, if your motherboard is compatible.
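
As a rough illustration of the monitoring step, here is a minimal Python sketch that logs temperatures while a stress test runs. It is a swapped-in approach, not lm-sensors itself: it reads the Linux sysfs thermal interface (`/sys/class/thermal/thermal_zone*/temp`, reported in millidegrees Celsius), so it covers temperatures only and not voltages. Which zones exist, and whether any of them corresponds to the CPU package on this NUC, is an assumption to check on the actual machine. The function names `read_zone_temps` and `log_temps` are hypothetical.

```python
import glob
import os
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone under base."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as handle:
                millidegrees = int(handle.read().strip())
        except (OSError, ValueError):
            # Zone is missing, unreadable, or returned something non-numeric: skip it.
            continue
        temps[os.path.basename(zone)] = millidegrees / 1000.0
    return temps


def log_temps(interval_s=5.0, samples=3, base="/sys/class/thermal"):
    """Print a timestamped temperature reading every interval_s seconds, samples times."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # flush=True so the last reading reaches the terminal or file before a lockup.
        print(stamp, read_zone_temps(base), flush=True)
        time.sleep(interval_s)
```

Calling `log_temps(interval_s=5.0, samples=720)` from a second console would log for about an hour. Redirecting the output to a file on a USB key or to another machine means the last reading before a lockup is still available afterwards.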
