Random Reboots PVE 4.1

I'm really getting desperate. Over the last year, PVE 3.4 and later 4.x (installed on top of Debian) worked fine on this machine, and now it suddenly crashes without warning.
 
Please post a list of your hardware, including the BIOS version of the motherboard.
 
CX430M power supply 1 year old
Supermicro C7P67 with i5-2400 (BIOS R 2.0)
4 * Kingston 8GB ValueRAM (brand new)
5 * WD20EADS/EZRX + 1 WD40EFRX
2 * MX200 SSD 250GB
1 extra cheap SATA controller ASM1062
1 Intel quad 82571EB Gbit pcie card

All connected to an APC Smart-UPS 1500. I run a bond of 2 Realtek adapters to my workstation.

I would like to get crash logging working. Following the Proxmox manual gives me this error:

root@solo-prox-01:~# modprobe netconsole netconsole=@172.16.240.10/,@172.16.240.112/
modprobe: ERROR: could not insert 'netconsole': Device or resource busy
 
FYI, the system doesn't go down when it's being tortured with prime95, and I have already replaced the memory, so it wasn't the memory (or I have very bad luck). It goes down when I encrypt huge amounts of data, untar large archives, move data around, or just run my backups (Proxmox backup). I have already replaced the swap on ZFS with a native Linux swap on the SSDs.
 
Does crashdump now create files in /var/crash?

In my experience, netconsole only works in about half of all cases. I switched to crashdump entirely, because it yields the best results.

@tom: Please provide pve-kernel-dbg kernels so that proper debugging is possible.
 
@LnxBil
root@solo-prox-01:~# ls /var/crash/
kexec_cmd

But since the one Windows VM I have has been down, we have a 'historic' uptime:

root@solo-prox-01:~# uptime
15:24:24 up 1 day, 7:36, 3 users, load average: 1.83, 2.24, 2.19

As I mentioned, there is no vmlinuz in /boot.
 
So please configure and test your kdump. vmlinuz is not important for finding the real cause; the crash dump (the written dmesg) is.
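Before testing, it can help to confirm that a capture kernel is actually armed. The following is only a rough sketch: kdump_ready is a made-up helper name, and the two default paths are the standard Linux locations rather than anything specific to Proxmox.

```shell
#!/bin/sh
# Sketch: report whether a kdump capture kernel appears to be armed.
# kdump_ready is a hypothetical helper. Both paths are parameters so the
# check can be tried on sample files; the defaults are the usual locations.
kdump_ready() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"

    # Memory for the capture kernel has to be reserved at boot time.
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= on kernel command line"
        return 1
    fi

    # The kernel reports 1 here once a crash kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi

    echo "ready"
}
```

Calling kdump_ready without arguments checks the live system. Once it prints "ready", a deliberate test crash (echo c > /proc/sysrq-trigger as root, which takes the host down immediately) should leave a dump under /var/crash if kdump-tools is set up correctly.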
 
I know but I still need a debug kernel to give kdump a chance of doing anything, right?

root@solo-prox-01:~# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
Active: active (exited) since Thu 2016-03-03 17:08:33 CET; 4s ago
Process: 14814 ExecStop=/etc/init.d/kdump-tools stop (code=exited, status=0/SUCCESS)
Process: 14831 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 14831 (code=exited, status=0/SUCCESS)

Mar 03 17:08:32 solo-prox-01 kdump-tools[14831]: Starting kdump-tools: /etc/default/kdump-tools: DEBUG_KERNEL does not exist: /vmlinuz ... failed!
Mar 03 17:08:33 solo-prox-01 kdump-tools[14831]: loaded kdump kernel.
 
No, kdump dumps the memory and the kernel ring buffer. You only need the debug kernel for the crash tool, to analyze the dump further.

And create a simple link or configure your kdump correctly:

Code:
root@pvelocalhost:~# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
   Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
   Active: active (exited) since Do 2016-03-03 17:14:34 CET; 8s ago
  Process: 1879 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 1879 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump-tools.service

Mär 03 17:14:33 pvelocalhost kdump-tools[1879]: Starting kdump-tools: /etc/default/kdump-tools: DEBUG_KERNEL does not exist: /vmlinuz ... failed!
Mär 03 17:14:34 pvelocalhost kdump-tools[1879]: loaded kdump kernel.

root@pvelocalhost:~# cd /

root@pvelocalhost:/# ln -s /boot/vmlinuz-4.2.8-1-pve /vmlinuz

root@pvelocalhost:/# service kdump-tools restart

root@pvelocalhost:/# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
   Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
   Active: active (exited) since Do 2016-03-03 17:15:00 CET; 1s ago
  Process: 2320 ExecStop=/etc/init.d/kdump-tools stop (code=exited, status=0/SUCCESS)
  Process: 2337 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 2337 (code=exited, status=0/SUCCESS)

Mär 03 17:15:00 pvelocalhost kdump-tools[2337]: Starting kdump-tools: loaded kdump kernel.


and please use the CODE tag for any console output.
 
@LnxBil It crashed again during backups (the qm locks and tmp files are there to prove it) and of course there is no dmesg file in /var/crash/.

I uploaded the syslog. On March 4 around 1:30 you can see a storm of errors, delays and warnings before the reboot.
 

I don't know for sure, but maybe the Proxmox people can shed some light on this:

Code:
Mar  4 02:29:39 solo-prox-01 pve-ha-lrm[3431]: loop take too long (63 seconds)

Does this mean the watchdog is active and the machine got reset?

The syslog shows that the system is slowing down immensely and it cannot do anything in time.
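To line these stalls up with the time of the reboot, the warnings can be pulled out of the syslog. This is just a sketch: lrm_stalls is a made-up helper name, and it assumes the classic syslog layout shown above, with a 15-character timestamp at the start of each line. It does not answer the watchdog question by itself.

```shell
#!/bin/sh
# Sketch: list every pve-ha-lrm "loop take too long" warning in a syslog
# file, printing the timestamp and the reported duration.
# lrm_stalls is a hypothetical helper; $1 is the path to the syslog file.
lrm_stalls() {
    # \1 = the first 15 characters (timestamp), \2 = the number of seconds
    sed -n 's/^\(.\{15\}\) .*pve-ha-lrm\[[0-9]*]: loop take too long (\([0-9]*\) seconds).*/\1 \2s/p' "$1"
}
```

For example, lrm_stalls /var/log/syslog prints one line per warning, so a run of growing durations right before the reboot is easy to spot.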
 
The system has been running fine since Friday night without making backups. I'm still waiting for a solution that lets the backups run without trashing the system. This worked fine in the Debian netinstall based setup I had before.
 
What you can do is take snapshots (zvol-based VMs) and transfer them via zfs send to another machine. This is IMHO the much better way of doing backups, because you do not copy data that has already been copied. You have to copy the <vm>.conf files by hand, but it works great and saves a lot of time. The snapshot itself calls the same QEMU agent hooks and is therefore as consistent as the ordinary backup method (always use the QEMU agent!!)
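A minimal sketch of the transfer step, assuming the snapshots already exist and that the other machine is reachable over ssh. replicate_cmd is a made-up helper that only prints the pipeline instead of running it, and every dataset, snapshot and host name in the comments is a placeholder, not something from this thread.

```shell
#!/bin/sh
# Sketch: print the zfs send pipeline for one zvol snapshot.
# replicate_cmd is a hypothetical helper; nothing is executed, the command
# is only printed so it can be reviewed first.
replicate_cmd() {
    dataset="$1"     # source zvol, e.g. rpool/data/vm-100-disk-1 (placeholder)
    snap="$2"        # snapshot to send, e.g. snap2 (placeholder)
    target="$3"      # ssh target, e.g. root@backuphost (placeholder)
    dest="$4"        # dataset on the receiving side (placeholder)
    prev="${5:-}"    # optional earlier snapshot that the receiver already has

    if [ -n "$prev" ]; then
        # Incremental: send only the changes between prev and snap.
        echo "zfs send -i ${dataset}@${prev} ${dataset}@${snap} | ssh ${target} zfs receive ${dest}"
    else
        # Full stream, needed for the very first transfer.
        echo "zfs send ${dataset}@${snap} | ssh ${target} zfs receive ${dest}"
    fi
}
```

The first run is a full send; later runs pass the previous snapshot as the fifth argument, which is where the time saving comes from.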
 
At some point the benefit of having a prefab system is gone. Proxmox was nice in the past, but in the last year I had to rebuild all my OpenVZ machines to work in KVM/LXC, and the migration didn't work as described because I had apparently used the wrong (default) compression for the backups before installing the new Proxmox. I had a lot of issues with the Proxmox installer (wrong resolution, no keyboard). And now there is this issue (on a still plain vanilla install) that 'can be solved better' using my own scripts and adapted config files... What's next? Building my own debug kernel? Setting up Debian with ZFS myself (since Proxmox on top of Debian always worked fine in the past)? I have seen Gentoo setups that were easier to maintain and install than what I have now...

I've ordered some parts to build 2 nodes at home and will see whether SmartOS can deliver a more straightforward and maintainable solution.
 
Hi,

I've noticed the same issue with the latest version. Did you find a solution?

Thanks !

We have 15 hardware nodes at 4 locations and have had no random reboots due to Proxmox. One system had an incorrectly sized motherboard: pressing down the clip to remove a network cable would cause the system to stop.



I'd like to know if anyone else has this issue and, if so, please send your hardware specs as Tom asked some posts ago. It would be good to know what is causing the original poster's issue.