[SOLVED] Multiple crashes across many days: (cause was a mystery). BAD power supply.

GarrettB · Aug 16, 2018

Hey everyone, this is my first post! I've had one node running 24/7 (with probably weekly restarts) for about four months or so, and have not had any problems.

Today, I have had two crashes. Last night, I updated all of the VMs with Ubuntu 18.04 updates that rolled out.

I thought maybe it was related to today's Proxmox updates that I installed, but those were at 9:46AM and the crash occurred around 9:22AM or so.

I looked through all the Proxmox logs (syslog, kernel, apt, auth, etc.) from right before the crash and found nothing. I looked through the logs of all 7 VMs and found nothing there either. Two are web servers running nginx that basically serve holding pages. The others are a MySQL server, a mail server, a pfSense server, an nginx reverse proxy, and a Pi-hole server.

The second crash was at 5:55PM or so. Turned it back on around 7:22PM and it's been running since.

The charts on the node Summary page in Proxmox show no sign of high CPU, RAM, or I/O usage before the crashes. I am running Proxmox 5.2-5.

As far as hardware, it's pretty simple with a Gigabyte motherboard using an AMD FX8350 processor with 32GB of RAM and two SSDs running normally without RAID.

The symptoms when I find it crashed are: the power light on the power button is still lit, but the entire unit is "off". No fans are running and no network cards are blinking. If I press the power button, it won't force off. I have to switch off the power supply, switch it back on, and then power on the unit.

It is mains-powered through a battery backup unit; I checked the unit and it has no problems.

I did read about setting up a kernel log, but I thought I would post and see if anyone would like to troubleshoot with me. I'm thinking there is something I might find where I haven't looked, because I don't know it's there.

EDIT: Just crashed again at 9:55pm and I was there when it happened. It just shuts down.

GarrettB · Aug 16, 2018

UPDATE: I witnessed the network card going down and coming back up. I have three NICs total, one on the motherboard which is used to connect to Proxmox, and two in slots. The one used for incoming internet to the machine is in the first slot and showing this in the kernel log:

Aug 16 04:58:59 pve kernel: [25260.749570] hrtimer: interrupt took 17000 ns
Aug 16 08:31:15 pve kernel: [37996.736871] r8169 0000:02:00.0 enp2s0: link down
Aug 16 08:31:15 pve kernel: [37996.737014] vmbr2: port 1(enp2s0) entered disabled state
Aug 16 08:31:22 pve kernel: [38003.848007] r8169 0000:02:00.0 enp2s0: link up
Aug 16 08:31:22 pve kernel: [38003.848486] vmbr2: port 1(enp2s0) entered blocking state
Aug 16 08:31:22 pve kernel: [38003.848489] vmbr2: port 1(enp2s0) entered forwarding state
Aug 16 08:31:52 pve kernel: [38033.369083] r8169 0000:02:00.0 enp2s0: link down
Aug 16 08:31:52 pve kernel: [38033.369123] vmbr2: port 1(enp2s0) entered disabled state
Aug 16 08:31:59 pve kernel: [38040.426214] r8169 0000:02:00.0 enp2s0: link up
Aug 16 08:31:59 pve kernel: [38040.426703] vmbr2: port 1(enp2s0) entered blocking state
Aug 16 08:31:59 pve kernel: [38040.426706] vmbr2: port 1(enp2s0) entered forwarding state
Aug 16 08:32:11 pve kernel: [38052.695699] r8169 0000:02:00.0 enp2s0: link down
Aug 16 08:32:11 pve kernel: [38052.695737] vmbr2: port 1(enp2s0) entered disabled state
Aug 16 08:32:18 pve kernel: [38059.697711] r8169 0000:02:00.0 enp2s0: link up
Aug 16 08:32:18 pve kernel: [38059.698205] vmbr2: port 1(enp2s0) entered blocking state
Aug 16 08:32:18 pve kernel: [38059.698207] vmbr2: port 1(enp2s0) entered forwarding state
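The link flaps in the log above can be summarized with a short script. This is only a sketch: it assumes the kernel lines keep exactly the shape shown in this post (driver name, PCI address, interface, then "link up" or "link down"), and the names LINK_RE and link_outages are made up for illustration.

```python
import re

# Matches lines such as:
# Aug 16 08:31:15 pve kernel: [37996.736871] r8169 0000:02:00.0 enp2s0: link down
LINK_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] \S+ \S+ (?P<iface>\w+): link (?P<state>up|down)"
)

def link_outages(lines):
    """Pair each 'link down' with the next 'link up' on the same interface.

    Returns (iface, uptime_when_down, seconds_down) tuples, using the
    kernel uptime stamp in brackets rather than the wall-clock time.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue  # bridge state lines and unrelated messages are ignored
        iface, t = m["iface"], float(m["uptime"])
        if m["state"] == "down":
            down_since[iface] = t
        elif iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, round(t - start, 3)))
    return outages

# Lines copied from the kernel log in this post.
sample = [
    "Aug 16 08:31:15 pve kernel: [37996.736871] r8169 0000:02:00.0 enp2s0: link down",
    "Aug 16 08:31:15 pve kernel: [37996.737014] vmbr2: port 1(enp2s0) entered disabled state",
    "Aug 16 08:31:22 pve kernel: [38003.848007] r8169 0000:02:00.0 enp2s0: link up",
    "Aug 16 08:31:52 pve kernel: [38033.369083] r8169 0000:02:00.0 enp2s0: link down",
    "Aug 16 08:31:59 pve kernel: [38040.426214] r8169 0000:02:00.0 enp2s0: link up",
    "Aug 16 08:32:11 pve kernel: [38052.695699] r8169 0000:02:00.0 enp2s0: link down",
    "Aug 16 08:32:18 pve kernel: [38059.697711] r8169 0000:02:00.0 enp2s0: link up",
]

for iface, start, seconds in link_outages(sample):
    print(f"{iface} was down for {seconds} s (kernel uptime {start})")
```

On this excerpt the script reports three outages on enp2s0 of roughly 7 seconds each, all within about a minute.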

It seems I have a hardware problem. I wonder if there is a way to determine what might be behind this behavior?

At the moment, it is working fine again, and there was no shutdown overnight.

Faris Raouf · Aug 16, 2018

Could it be a temperature issue?

The "no fans running" -- has this happened every time? Because that's more than a crash. I would expect fans to only shut down if the system board lost power, or thought it was shut down.

GarrettB · Aug 16, 2018

Yes, the fans don't run at all. The board is losing power.

I understand the opinions regarding Realtek NICs. I have them. Before Proxmox, I used the r8168 driver without issue on the same machine which ran Xenserver.

With the Proxmox install, I was prepared to load up the r8168 driver. However, the NICs worked just fine with the r8169 driver so I left it be.

But I've isolated what is causing the shutdown of the main NIC: it occurs when cron jobs on multiple VMs run simultaneously, ramping up the traffic on the NIC, and that causes a complete crash.

I read this, but I think my symptoms appeared because of the newest VM web server, which went live a day ago.

For now, if someone is willing to provide a response regarding whether there are steps that can be taken to try the different r8168 drivers, that would be great. I will probably also start looking for a different NIC. This machine was a development machine but it's turned into a production one and I'm working through a list of redundancy needs anyway.

Thanks

GarrettB · Aug 16, 2018

I just found an error in how my interfaces file was set up. It was causing a "Failed to start Raise network interfaces" message.

It was (this is just part of the interfaces file):

auto enp3s0
iface enp3s0 inet static

auto vmbr0
iface vmbr0 inet static
address 192.168.1.42
netmask 255.255.255.0
gateway 192.168.1.1
bridge_ports enp3s0
bridge_stp off
bridge_fd 0

And now is:

auto enp3s0
iface enp3s0 inet manual

auto vmbr0
iface vmbr0 inet static
address 192.168.1.42
netmask 255.255.255.0
gateway 192.168.1.1
bridge_ports enp3s0
bridge_stp off
bridge_fd 0

The networking service now starts successfully. Could this error also have contributed to the problem with the NIC in question? That NIC has these settings:

auto enp2s0
iface enp2s0 inet manual

auto vmbr2
iface vmbr2 inet manual
bridge_ports enp2s0
bridge_stp off
bridge_fd 0
bridge_vlan_aware yes

The two bridges use separate NICs, and enp2s0 is on the 10.0.0.x subnet.
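The mistake fixed above (a bridge member left as "inet static" instead of "inet manual") can be checked for mechanically. The following is a rough sketch, not a full parser for the interfaces file: it only looks at "iface" and "bridge_ports" lines, and the function name bridge_port_problems is hypothetical.

```python
def bridge_port_problems(text):
    """Flag bridge ports whose own iface stanza uses a method other than 'manual'.

    Returns a list of (bridge, port, method) tuples. Ports with no iface
    stanza of their own (or the keyword 'none') are not flagged.
    """
    methods = {}   # interface name -> method (static, manual, dhcp, ...)
    ports = {}     # bridge name -> list of member ports
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "iface" and len(words) >= 4:
            current = words[1]
            methods[current] = words[3]
        elif words[0] in ("bridge_ports", "bridge-ports") and current:
            ports[current] = words[1:]
    return [
        (bridge, port, methods[port])
        for bridge, members in ports.items()
        for port in members
        if methods.get(port, "manual") != "manual"
    ]

# The "before" and "after" stanzas from this post.
BEFORE = """
auto enp3s0
iface enp3s0 inet static

auto vmbr0
iface vmbr0 inet static
address 192.168.1.42
netmask 255.255.255.0
gateway 192.168.1.1
bridge_ports enp3s0
bridge_stp off
bridge_fd 0
"""
AFTER = BEFORE.replace("iface enp3s0 inet static", "iface enp3s0 inet manual")

print(bridge_port_problems(BEFORE))
print(bridge_port_problems(AFTER))
```

Run against the real file with something like bridge_port_problems(open("/etc/network/interfaces").read()). The enp2s0/vmbr2 stanza above passes this check, since enp2s0 is already "manual".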

GarrettB · Aug 17, 2018

It has crashed three more times today. Once, it was only 20 minutes after the most recent reboot. I have watched the temperatures and the CPU does not go over 45C.

I tried resetting the BIOS, no difference. I checked the CMOS battery which was fine, and reset it at the same time.

I also watched voltages because I'm suspecting there is a hardware issue. However, the voltages seem normal.

On one boot-up I selected Memory Test from the grub menu, and the screen went blank. Is this normal? I thought memory tests worked through a sequence of things on the screen. Maybe I'm recalling Windows memory tests.

On one particular shutdown tonight, the fans were still running. Everything was rock solid until yesterday. I'm stumped. I've checked crash folders for the VMs and they are empty or have items from months ago when I was first installing.

GarrettB · Aug 17, 2018

I can't get kdump-tools to install...how do I get past this error?

Code:

Aug 17 09:09:35 pve kdump-tools[2039]: Starting kdump-tools: Unknown type (Reserved) while parsing /sys/firmware/memmap/17/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/15/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/5/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/13/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/18/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/16/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Unknown E820 type) while parsing /sys/firmware/memmap/6/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/14/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/12/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/2/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/10/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: Unknown type (Reserved) while parsing /sys/firmware/memmap/19/type. Please report this as bug. Using RANGE_RESERVED now.
Aug 17 09:09:35 pve kdump-tools[2039]: ELF core (kcore) parse failed
Aug 17 09:09:35 pve kdump-tools[2039]: Cannot load /var/lib/kdump/vmlinuz
Aug 17 09:09:35 pve kdump-tools[2039]: failed to load kdump kernel ... failed!
Aug 17 09:09:35 pve kdump-tools: failed to load kdump kernel

GarrettB · Aug 20, 2018

Can someone confirm that kdump-tools works with 4.15.18-1-pve? I have read through the Debian man page, and see that bug patches were provided for the error above, for Ubuntu distros.

I see that Proxmox is compiled with kexec flags, so any hints on what to do about this error are appreciated, thanks.

Stoiko Ivanov · Aug 20, 2018

* memtest86+ usually does generate output on the display (just make sure not to select an entry with serial console)
* maybe try to run without the UPS (just to rule it out as the source of the problems)?
* you could try to gather the console output by configuring a netconsole - see https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log
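For the netconsole suggestion: netconsole sends kernel messages as UDP datagrams to another machine, so the receiving side only needs something listening on a UDP port. Below is a minimal receiver sketch to run on that second machine. It assumes 6666 is the target port configured for netconsole on the Proxmox host (check your own settings), and the function names are hypothetical.

```python
import socket

def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams (port is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_lines(sock, count, timeout=5.0):
    """Collect `count` datagrams as (sender_ip, text) pairs.

    Raises socket.timeout if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    lines = []
    for _ in range(count):
        data, (sender, _port) = sock.recvfrom(65535)
        lines.append((sender, data.decode("utf-8", errors="replace").rstrip("\n")))
    return lines

def demo():
    """Loopback self-check: send one datagram to our own listener."""
    listener = open_listener("127.0.0.1", 0)  # port 0 lets the OS pick a free port
    port = listener.getsockname()[1]
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"test kernel message\n", ("127.0.0.1", port))
    received = receive_lines(listener, 1)
    sender.close()
    listener.close()
    return received

print(demo())
```

For real use, call open_listener() and then receive_lines() in a loop, writing each line to a file so the last messages before a crash are kept on the second machine.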

GarrettB · Aug 20, 2018

Stoiko Ivanov said:
* memtest86+ usually does generate output on the display (just make sure not to select an entry with serial console)
* maybe try to run without the UPS (just to rule it out as the source of the problems)?
* you could try to gather the console output by configuring a netconsole - see https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

Thanks, I will have to look into the memory test more then. I have tried netconsole but am having problems: https://forum.proxmox.com/threads/netconsole-logging-works-but-stops-at-nic-enabling.46382/

Stoiko Ivanov · Aug 20, 2018

* Just to be sure - this is a single node you have running? (not part of a HA-cluster)?
* if possible maybe try to get the netconsole output on another NIC (should the nic-driver be the source of the crash).
* if the mainboard has a serial port you could set it up as a console and see whether you get some output there

GarrettB · Aug 21, 2018

Stoiko Ivanov said:
* Just to be sure - this is a single node you have running? (not part of a HA-cluster)?
* if possible maybe try to get the netconsole output on another NIC (should the nic-driver be the source of the crash).
* if the mainboard has a serial port you could set it up as a console and see whether you get some output there

Yes, this is a one-node setup.

I changed to another NIC and was getting no logging through netconsole, but this shows in the syslog:

Aug 20 14:53:24 pve kernel: netconsole: network logging stopped on interface enp4s0 as it is joining a master device.

I attempted to manually set up netconsole, but could not get the "enabled" file in /sys/kernel/config/netconsole... to equal 1. So, couldn't get it to work.

But I also opened the case and took a look at the heat sink. It does appear it may have moved since installation, maybe 3/16", so I moved it back that much. It still shut down a little while later. I moved on to removing RAM cards (I have four and took out the most recent two). It seems to be stable with two. With four it shuts down pretty quickly (within 20-30 minutes).

Here is what sensors returned:

fam15h_power-pci-00c4
Adapter: PCI adapter
power1: 80.06 W (crit = 125.19 W)

it8620-isa-0228
Adapter: ISA adapter
in0: +1.21 V (min = +0.00 V, max = +3.06 V)
in1: +1.49 V (min = +0.00 V, max = +3.06 V)
in2: +2.00 V (min = +0.00 V, max = +3.06 V)
in3: +2.00 V (min = +0.00 V, max = +3.06 V)
in4: +1.99 V (min = +0.00 V, max = +3.06 V)
in5: +1.18 V (min = +0.00 V, max = +3.06 V)
in6: +2.23 V (min = +0.00 V, max = +3.06 V)
3VSB: +3.26 V (min = +0.00 V, max = +6.12 V)
Vbat: +3.14 V
fan1: 1088 RPM (min = 0 RPM)
fan2: 722 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
fan4: 0 RPM (min = 0 RPM)
fan5: 675000 RPM (min = 0 RPM)
temp1: +35.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: +42.0°C (low = +127.0°C, high = +127.0°C) sensor = thermal diode
temp3: +28.0°C (low = +127.0°C, high = +127.0°C) sensor = Intel PECI
temp4: +45.0°C
temp5: +45.0°C
temp6: +45.0°C
intrusion0: ALARM

k10temp-pci-00c3
Adapter: PCI adapter
temp1: +28.2°C (high = +70.0°C)
       (crit = +80.0°C, hyst = +77.0°C)

Fan #5 always jumps up to that rpm, and then shows 750 or so.
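Readings like that fan5 value can be picked out of the sensors text automatically. The sketch below assumes the output has the same layout as the dump in this post; the 10000 RPM cutoff is an arbitrary assumption for "physically implausible", and the names parse_sensors and implausible_fans are made up.

```python
import re

# label, optional plus sign, number, then one of the units seen in this post
READING_RE = re.compile(
    r"^(?P<label>\w+):\s+\+?(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>RPM|V|W|°C)"
)

def parse_sensors(text):
    """Map (chip, label) -> (value, unit) for every numeric reading."""
    readings = {}
    chip = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line and not line.startswith("("):
            chip = line  # chip header such as "it8620-isa-0228"
            continue
        m = READING_RE.match(line)
        if m:
            readings[(chip, m["label"])] = (float(m["value"]), m["unit"])
    return readings

def implausible_fans(readings, max_rpm=10000):
    """Return (chip, label, rpm) for fan readings above max_rpm (assumed cutoff)."""
    return [
        (chip, label, value)
        for (chip, label), (value, unit) in readings.items()
        if unit == "RPM" and value > max_rpm
    ]

# A shortened copy of the sensors output from this post.
SAMPLE = """\
it8620-isa-0228
Adapter: ISA adapter
in1: +1.49 V (min = +0.00 V, max = +3.06 V)
fan1: 1088 RPM (min = 0 RPM)
fan5: 675000 RPM (min = 0 RPM)
temp1: +35.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
intrusion0: ALARM
k10temp-pci-00c3
Adapter: PCI adapter
temp1: +28.2°C (high = +70.0°C)
(crit = +80.0°C, hyst = +77.0°C)
"""

readings = parse_sensors(SAMPLE)
print(implausible_fans(readings))
```

Keying by (chip, label) matters here because both it8620 and k10temp report a "temp1".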

I appreciate the response today. It's been pretty frustrating not being able to get kdump or netconsole to work. The only thing I can see is a bad RAM card...but even with two cards in instead of 4, I still get a blank screen on memtest86. Weird.

GarrettB · Aug 21, 2018

After troubleshooting RAM cards and running memory tests (some of which were just resulting in a shutdown), I started watching voltages a little closer. "in1" above is the DRAM voltage and was a little bit variable and lower than the 1.5v recommended. I upped the voltage in BIOS settings and watched it. It was still fluctuating a little. And then there were additional shutdowns.

I took out 2 RAM cards after one shutdown, and re-installed them with the other 2 for 4 total, and went into the BIOS. The BIOS screen said it was corrupted, and reinstalled the BIOS. After boot-up, I adjusted the DRAM voltage to be 1.6v, and it appears to be at a fairly steady 1.54v. I also noticed after booting and going back into the BIOS, that the time is being preserved after a boot.

Let's hope this is it.

GarrettB · Aug 21, 2018

It turned out to be an overheating power supply!

Thank you for your help. I have learned a lot in the process.

Stoiko Ivanov · Aug 22, 2018

You're welcome (not that I did anything) - Glad the cause is something comparatively easy to swap!

AlexLup · Aug 23, 2018

GarrettB said:
Here is what sensors returned: [...]

Can I ask what tool this is? How did you manage to install it ?

GarrettB · Aug 24, 2018

AlexLup said:
Can I ask what tool this is? How did you manage to install it ?

I assume you are referring to the "sensors" command: it comes with the "lm-sensors" package. I use Ubuntu so here is the related guide. It works broadly across many distros.

I will also add something I learned recently which is to watch in the console window with this command:

Code:

watch -n 1 sensors

This is better than running "sensors" repeatedly. The "-n 1" refreshes the output every 1 second. See here.
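One limitation of watching the screen is that the last readings are gone once the machine loses power. A possible variation, sketched below, appends each reading to a file and forces it to disk so the final values before a shutdown have a better chance of surviving. The function names and the log path are hypothetical, and run() assumes the "sensors" command is installed.

```python
import os
import subprocess
import time
from datetime import datetime

def format_snapshot(sensors_text, when):
    """Collapse one `sensors` dump into a single timestamped log line."""
    fields = [" ".join(line.split()) for line in sensors_text.splitlines() if ":" in line]
    return when.isoformat(timespec="seconds") + " | " + "; ".join(fields)

def append_durably(path, line):
    """Append a line, then flush and fsync so it is written out immediately."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())

def run(path, interval=1.0):
    """Poll `sensors` until interrupted with Ctrl+C (not called automatically)."""
    while True:
        out = subprocess.run(
            ["sensors"], capture_output=True, text=True, check=True
        ).stdout
        append_durably(path, format_snapshot(out, datetime.now()))
        time.sleep(interval)

# Example of the line format, using two readings from this thread.
print(format_snapshot("in1: +1.49 V\nfan1: 1088 RPM\n", datetime(2018, 8, 21, 9, 0, 0)))
```

To start logging, call run() with a path of your choice, for example run("/root/sensors.log"). After an unexpected shutdown, the tail of that file shows the voltages and temperatures from the last second or so before power was lost.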

AlexLup · Aug 24, 2018

Yes, that's what I meant, thank you so much for this.

GarrettB · Aug 24, 2018

After last night's update it appears I never installed the headers package for pve... Possibly why kdump was failing.
