Proxmox Host - random freeze(?)

Semmo

May 27, 2019
Hi there!

I'm new to Proxmox and installed it on my new root server (i7-3770, 32GB, 4x6TB) with ZFS RAID10. But the host randomly freezes (I think it's a freeze, more on that later). It becomes unresponsive over SSH and the web interface. A KVM console shows no output when it is connected in that state (I have never had a KVM connected at the moment it happens). The soft reboot (CTRL+ALT+DEL) doesn't work. Only a hard reset gets the machine back.

This has happened about 6 times in the last 1.5 weeks. I have no idea where to look for the cause. I checked syslog, kern.log and user.log, but there are no entries that help. Just normal stuff like the replication job right before the "freeze".

I also had my hoster run a full hardware check. Nothing was found. I checked the RAM with memtest myself, and all is fine.

Does anyone have an idea where I should look to find the problem?

Thanks in advance!

Version:
CPU(s)
8 x Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (1 Socket)

Kernel Version
Linux 4.15.18-14-pve #1 SMP PVE 4.15.18-39 (Wed, 15 May 2019 06:56:23 +0200)

PVE Manager Version
pve-manager/5.4-6/aa7856c5


May 28 13:59:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 13:59:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:00:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:00:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:01:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:01:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:02:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:02:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:03:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:03:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:04:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:04:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:05:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:05:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:06:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:06:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 14:07:00 proxmox systemd[1]: Starting Proxmox VE replication runner...

May 28 14:07:00 proxmox systemd[1]: Started Proxmox VE replication runner.

May 28 15:23:20 proxmox systemd[1]: Started Create list of required static device nodes for the current kernel.

May 28 15:23:20 proxmox systemd[1]: Starting Create Static Device Nodes in /dev...

May 28 15:23:20 proxmox systemd[1]: Mounted POSIX Message Queue File System.

May 28 15:23:20 proxmox systemd[1]: Mounted Debug File System.

May 28 15:23:20 proxmox systemd[1]: Mounted Huge Pages File System.

May 28 15:23:20 proxmox systemd[1]: Started Remount Root and Kernel File Systems.

May 28 15:23:20 proxmox systemd[1]: Starting Load/Save Random Seed...

May 28 15:23:20 proxmox systemd[1]: Starting Flush Journal to Persistent Storage...

May 28 15:23:20 proxmox systemd[1]: Starting udev Coldplug all Devices...

May 28 15:23:20 proxmox systemd-modules-load[1249]: Inserted module 'iscsi_tcp'

May 28 15:23:20 proxmox systemd[1]: Started udev Coldplug all Devices.

May 28 15:23:20 proxmox systemd[1]: Starting udev Wait for Complete Device Initialization...

May 28 15:23:20 proxmox systemd[1]: Started Flush Journal to Persistent Storage.

May 28 15:23:20 proxmox systemd[1]: Started Load/Save Random Seed.

May 28 15:23:20 proxmox systemd[1]: Mounted RPC Pipe File System.

May 28 15:23:20 proxmox keyboard-setup.sh[1248]: cannot open file /tmp/tmpkbd.wR8TZX

May 28 15:23:20 proxmox systemd-modules-load[1249]: Inserted module 'ib_iser'

May 28 12:51:19 proxmox rrdcached[7166]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1559033479.415439

May 28 13:51:19 proxmox rrdcached[7166]: flushing old values

May 28 13:51:19 proxmox rrdcached[7166]: rotating journals

May 28 13:51:19 proxmox rrdcached[7166]: started new journal /var/lib/rrdcached/journal/rrd.journal.1559044279.415405

May 28 13:51:19 proxmox rrdcached[7166]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1559037079.415429

May 28 15:23:20 proxmox kernel: [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved

May 28 15:23:20 proxmox kernel: [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable

May 28 15:23:20 proxmox kernel: [ 0.000000] MTRR default type: uncachable

May 28 15:23:20 proxmox kernel: [ 0.000000] MTRR fixed ranges enabled:

May 28 15:23:20 proxmox kernel: [ 0.000000] 00000-9FFFF write-back
 
I’ve also seen something similar on my home lab Intel NUC8 - Host freezes and I cannot SSH to it (or even ping it)... My NUC8 is also headless so can’t check the terminal screen at the point of crashing.

Only way to recover it is to do a full hard reboot of the host and, once it comes back up, I also don’t seem to be able to find anything obvious in the logs, other than the gap of output between the freeze time and the reboot time...

Very keen to understand if there's any debugging that can be turned on, or any other option to get to the bottom of this?

Thanks!
 
Hello,

If it's a kernel panic, the logs won't help. You can enable kdump as described at https://forum.proxmox.com/threads/kernel-panicking-cannot-enable-crash-dumps.28116/ or https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log. Alternatively, if it's not reproducible with an earlier kernel, you could help by bisecting (albeit tedious), see https://ldpreload.com/blog/git-bisect-run.

If it's frozen but not a kernel panic, the most common causes are DoS or resource exhaustion, e.g. memory pressure causing a swap cascade in the kernel, which should eventually recover. In these cases, you would need to install some kind of utility that records system resources before and during the event so you can determine the offending process.
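
As a rough illustration of that kind of monitoring, here is a minimal sketch that does not rely on any particular tool. It appends a timestamped line with load, available memory, free swap and the process with the largest resident memory to a log file. The log path, the 10-second interval and the function names are assumptions made for this example, not something from Proxmox itself.

```shell
#!/bin/sh
# Sketch only: LOG path, INTERVAL and the function names are illustrative assumptions.
LOG="${LOG:-/var/log/freeze-watch.log}"

snapshot() {
    # 1, 5 and 15 minute load averages
    load=$(cut -d' ' -f1-3 /proc/loadavg)
    # Available memory and free swap, both in kB
    mem=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    swap=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    # Process with the largest resident set size (name:kB)
    top=$(ps -eo rss=,comm= 2>/dev/null |
        awk '$1 > max {max = $1; name = $2} END {print name ":" max}')
    printf '%s load=%s mem_avail_kb=%s swap_free_kb=%s top_rss=%s\n' \
        "$(date '+%F %T')" "$load" "$mem" "$swap" "$top"
}

watch_loop() {
    # Runs until killed; sync after each line so the last entries
    # have a chance of surviving a hard reset.
    while true; do
        snapshot >> "$LOG"
        sync
        sleep "${INTERVAL:-10}"
    done
}
```

Calling `watch_loop &` from a root shell (or from a small service unit) starts the logging. After the next freeze and reset, the last lines of the log show what the host looked like shortly before it stopped responding.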

Cheers
 
Thank you for this tip! I can't get kdump to run :( but maybe it's just too late today and I should try again later.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/6/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/4/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/22/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/2/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/20/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: Unknown type (Reserved) while parsing /sys/firmware/memmap/19/type. Please report this as bug. Using RANGE_RESERVED now.

May 29 00:42:50 proxmox kdump-tools[17474]: ELF core (kcore) parse failed

May 29 00:42:50 proxmox kdump-tools[17474]: Cannot load /boot/vmlinuz-4.15.18-14-pve

May 29 00:42:50 proxmox kdump-tools[17474]: failed to load kdump kernel ... failed!

May 29 00:42:50 proxmox systemd[1]: Started Kernel crash dump capture service.
 
Yes, I'm seeing the same thing trying to enable kdump-tools... I did the following on the host that is experiencing the freezing / crash issue and the service reports it "failed to load kdump kernel ... failed!"

1) apt-get install kdump-tools
2) Create current kernel symlink:

cd /
ln -s /boot/pve/vmlinuz vmlinuz

3) Configure /etc/default/kdump-tools as follows:


# kdump-tools configuration
# ---------------------------------------------------------------------------
# USE_KDUMP - controls kdump will be configured
# 0 - kdump kernel will not be loaded
# 1 - kdump kernel will be loaded and kdump is configured
# KDUMP_SYSCTL - controls when a panic occurs, using the sysctl
# interface. The contents of this variable should be the
# "variable=value ..." portion of the 'sysctl -w ' command.
# If not set, the default value "kernel.panic_on_oops=1" will
# be used. Disable this feature by setting KDUMP_SYSCTL=" "
# Example - also panic on oom:
# KDUMP_SYSCTL="kernel.panic_on_oops=1 vm.panic_on_oom=1"
#

USE_KDUMP=1
KDUMP_COREDIR="/var/crash"
KDUMP_SYSCTL="kernel.panic_on_oops=1 kernel.panic_on_unrecovered_nmi=1"
DEBUG_KERNEL=/vmlinuz
MAKEDUMP_ARGS="-c --message-level 7 -d 11,31"

4) nano /etc/default/grub:

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"
GRUB_CMDLINE_LINUX=""

# Disable os-prober, it might add menu entries for each guest
GRUB_DISABLE_OS_PROBER=true

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Disable generation of recovery mode menu entries
GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

5) update-grub
6) reboot
7) Check kdump service status:

root@<HOST>:/# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2019-05-29 16:05:32 +08; 27min ago
Process: 1127 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 1127 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kdump-tools.service

May 29 16:05:32 <HOST> kdump-tools[1127]: Starting kdump-tools: Unknown type (Unknown E820 type) while parsing /sys/firmware/memmap/15/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: Unknown type (Reserved) while parsing /sys/firmware/memmap/13/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: Unknown type (Reserved) while parsing /sys/firmware/memmap/11/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: Unknown type (Reserved) while parsing /sys/firmware/memmap/16/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: Unknown type (Reserved) while parsing /sys/firmware/memmap/22/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: Unknown type (Reserved) while parsing /sys/firmware/memmap/20/type. Please report this as bug. Using RANGE_RESERVED now.
May 29 16:05:32 <HOST> kdump-tools[1127]: ELF core (kcore) parse failed
May 29 16:05:32 <HOST> kdump-tools[1127]: Cannot load /boot/vmlinuz-4.15.18-12-pve
May 29 16:05:32 <HOST> kdump-tools[1127]: failed to load kdump kernel ... failed!

May 29 16:05:32 <HOST> systemd[1]: Started Kernel crash dump capture service.
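
Before digging further into the "failed to load kdump kernel" message, it may be worth confirming that the crashkernel= setting from step 4 actually reached the running kernel and that memory was reserved for it. The sketch below only reads standard kernel interfaces (/proc/cmdline and two files under /sys/kernel). The helper name has_crashkernel is made up for this example, and passing these checks does not by itself explain or fix the load failure shown above.

```shell
#!/bin/sh
# Sketch: has_crashkernel is a hypothetical helper name used only in this example.
has_crashkernel() {
    # $1: a kernel command line string
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null || true)
if has_crashkernel "$cmdline"; then
    echo "crashkernel= is present on the running kernel command line"
else
    echo "crashkernel= is missing from the running kernel command line"
fi

# kexec_crash_size:   bytes reserved for the crash kernel (0 = nothing reserved)
# kexec_crash_loaded: 1 if a crash kernel is currently loaded, otherwise 0
for f in /sys/kernel/kexec_crash_size /sys/kernel/kexec_crash_loaded; do
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$f" "$(cat "$f")"
    fi
done
```

With the service output above, kexec_crash_loaded would be expected to read 0. If kexec_crash_size is also 0, the reservation itself did not happen and the boot loader configuration would be the first place to look. On Debian-based systems, `kdump-config show` from kdump-tools prints a similar summary.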
 
@Semmo - Unfortunately not... I haven't had a chance to try alternatives after the kdump method didn't work as expected (as detailed above)... Did you manage to capture logs with the netconsole method, using a second PVE host to send the logs to? I haven't tried that yet, but it would be my next test...

Will report back if I can get netconsole working. Please do post here if you manage to capture any logs which indicate the cause of the problem.
 
It's worth noting, though, that there appear to be multiple recent posts on the forums about PVE nodes hanging/freezing where only a reboot solves the issue, so I'm hoping someone can replicate the issue and grab the kernel crash logs for this.
 
@Semmo - No further crashes as yet on my server but I've now eventually managed to get the netconsole method working to catch test Kernel Panic logs on a remote (target) server. I've written up what I did below so hopefully, if you can also get this logging, we have double the chance of capturing some logs!

The following guide seems out of date and assumes that the NIC on the source (local) server is named eth0. On my server the NIC is eno1, but changing the GRUB config line to use eno1 is of no real use, because eno1 is taken over later in the boot process when it is added to the "vmbr0" interface as a slave port. It is therefore better to load netconsole manually after boot rather than during boot.

https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

Here is what I did to turn on netconsole log capture:

On TARGET server - I used a Debian 9 machine which is on the same subnet as the physical NIC interface of the sending source server
(1) Make a directory for the netconsole logs to be stored in (as below)
mkdir /var/log/netconsole
(2) Set up rsyslog - Create a new file /etc/rsyslog.d/01-netconsole-collector.conf with the following content (Note: the "stop" line at the end is correct and should be included. The Proxmox wiki lists ~ there instead, which rsyslog now appears to treat as deprecated)
# Start UDP server on port 5555
$ModLoad imudp
$UDPServerRun 5555
# Define templates
$template NetconsoleFile,"/var/log/netconsole/%fromhost-ip%.log"
$template NetconsoleFormat,"%rawmsg%"
# Accept endline characters (unfortunatelly these options are global)
$EscapeControlCharactersOnReceive off
$DropTrailingLFOnReception off
# Store collected logs using templates without local ones
:fromhost-ip, !isequal, "127.0.0.1" ?NetconsoleFile;NetconsoleFormat
# Discard logs match the rule above
& stop
(3) Restart rsyslog
systemctl restart rsyslog
Enable Logging on SOURCE server -
(1) Rather than editing GRUB config, I simply start netconsole from the SOURCE server shell - This would need to be re-run every time the host reboots to re-enable logging:

modprobe netconsole netconsole=5555@<SOURCE IP>/<SOURCE INTERFACE>,5555@<TARGET IP>/<TARGET MAC> loglevel=7
(2) For example, my command looks like the following, as I specify the vmbr0 bridge interface rather than the physical NIC - Not sure which would work best for you!
modprobe netconsole netconsole=5555@192.168.1.201/vmbr0,5555@192.168.1.205/<TARGET MAC ADDRESS OF 192.168.1.205> loglevel=7
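
Since this command has to be retyped after every reboot, a small helper can assemble the parameter string and look up the target MAC. This is only a sketch: the function names are invented for this example, port 5555 is taken from the commands above, and the MAC lookup assumes the TARGET is on the same subnet and answers ping.

```shell
#!/bin/sh
# Sketch: netconsole_param and target_mac are hypothetical helper names.

# Build the netconsole= module parameter.
# $1 source IP, $2 source interface, $3 target IP, $4 target MAC
netconsole_param() {
    printf 'netconsole=5555@%s/%s,5555@%s/%s\n' "$1" "$2" "$3" "$4"
}

# Print the MAC address this host has learned for an IP on the local subnet.
target_mac() {
    # One ping so the neighbour table has an entry for the target
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
    ip neigh show "$1" |
        awk '{for (i = 1; i < NF; i++) if ($i == "lladdr") {print $(i + 1); exit}}'
}
```

With the addresses from the example above, `modprobe netconsole "$(netconsole_param 192.168.1.201 vmbr0 192.168.1.205 "$(target_mac 192.168.1.205)")"` would then load the module, and `dmesg | grep netconsole` shows whether the kernel accepted the settings.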

The SOURCE server should now be logging to the TARGET server. You can check that it will (hopefully) capture kernel panic logs by forcing a kernel panic on the SOURCE server. CAUTION: this crashes the SOURCE server, so you'll have to (A) manually restart/power cycle it and (B) once it comes back up, re-enable logging as per "Enable Logging on SOURCE server" above.

Force a kernel panic on SOURCE server:

sysctl -w kernel.sysrq=1
echo c > /proc/sysrq-trigger

Check the kernel panic was caught on the TARGET server:
(1) Check the log file:
/var/log/netconsole/<SOURCE IP>.log
 
I experienced this today after upgrading from kernel `4.15.18-14-pve` to `4.15.18-15-pve`.

I downgraded the kernel and it has been stable for a few hours now. Not sure what I could grab log-wise, as it was just a full hard lock.
 
@n1nj4888
Thanks for the detailed description! But unfortunately my host is a single root server and I have no access to another machine in the same network. I don't think this is possible with a VPN connection. Or is it possible?

I'm still trying to get kdump running locally, but with no luck.

The crashes continue for me, so no "surprise" fix for my host.
 
I have poor hardware, '07:20:55 up 11 days, 8:36, 5 users, load average: 0.51, 0.72, 0.68'
 
The Proxmox team doesn't fix their wiki, but the hidden PVE docs, which you can reach via the Help button, are very well documented!
@n1nj4888

@lhorace:
Not sure what you mean by this? The wiki is out of date yes but are you saying there is more up to date info in the actual Proxmox server help files?

Regardless, I’ve documented and got netconsole working now so will just wait and see whether I get another freeze/crash and what the logs say...
 
Can’t you spin up another machine (or a raspberry pi or something) on the same network or is it a single remote server you have access to? Personally, I just spun up a VM (on another server of course) to act as the netconsole logger...
 
I haven't had a crash since enabling netconsole, so I will have to wait until one occurs again (if at all!). My server is Intel i5 based, so I'm not sure whether the proposed C-state BIOS fix would be of any use for Intel?

I have the same issue with my Intel Pentium G3258. The problem only occurs when idle.
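
For anyone wanting to see which idle states the kernel actually uses before changing BIOS settings, the standard cpuidle sysfs files can be listed as sketched below. The function name list_cstates is made up for this example, and whether limiting C-states helps with these freezes is unconfirmed in this thread.

```shell
#!/bin/sh
# Sketch: list_cstates is a hypothetical helper name.

# Print the idle states (C-states) the kernel exposes for one CPU.
# $1: cpuidle directory, defaults to that of CPU 0
list_cstates() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s: %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

list_cstates

# Current limit of the intel_idle driver, if that driver is loaded
if [ -r /sys/module/intel_idle/parameters/max_cstate ]; then
    printf 'intel_idle.max_cstate = %s\n' \
        "$(cat /sys/module/intel_idle/parameters/max_cstate)"
fi
```

As an experiment rather than a confirmed fix, the kernel parameters `intel_idle.max_cstate=1` and `processor.max_cstate=1` limit deep idle states without touching the BIOS, which could help tell an idle-related hang apart from other causes.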
 
It is a single remote server and I only have access to that machine. So the only option would be to rent another server and try a VLAN :/
 
