Troubleshooting a Kernel Panic

redmop

Well-Known Member
Feb 24, 2015
I could use some help here. I'm having a kernel panic on Proxmox 3.4 that has been happening since 3.2. It happens on either Saturday or Sunday, but not every week. Whether it happens in any given week seems pretty random.

There are only 3 active KVM machines. The only thing actively running is an internal encrypted backup in a CentOS 5 VM. The other two are Ubuntu 14.04 VMs that just run DNS, DHCP, and Samba, and they are idling. No users are logged in at that time. The internal backup worked perfectly when the machine was bare metal.

I'm not all that great at troubleshooting kernel panics. What information is needed? I have captures of /var/log from the host and the CentOS 5 VM, and a screencap of part of the kernel dump via iLO.

I will rebuild if I have to, but I want to avoid that if at all possible, as I don't have another host to run from.
 
/var/log/syslog

Jul 4 20:34:58 cb-prox1 simplesnapwrap[937994]: Running: /sbin/zfs send -I tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T18:34:19__ tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T20:34:55__

Jul 4 20:45:01 cb-prox1 /USR/SBIN/CRON[939327]: (root) CMD (/usr/sbin/zfSnap -a 24h -p prox1_frequent_ -r tank)

Jul 4 20:48:38 cb-prox1 rrdcached[4264]: flushing old values

Jul 4 20:48:38 cb-prox1 rrdcached[4264]: rotating journals

Jul 4 20:48:38 cb-prox1 rrdcached[4264]: started new journal /var/lib/rrdcached/journal/rrd.journal.1436064518.813773

Jul 4 20:48:38 cb-prox1 rrdcached[4264]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1436057318.814930

Jul 4 20:48:42 cb-prox1 sensord: Chip: acpitz-virtual-0

Jul 4 20:48:42 cb-prox1 sensord: Adapter: Virtual device

Jul 4 20:48:42 cb-prox1 sensord: temp1: 8.3 C

Jul 4 20:48:42 cb-prox1 sensord: Chip: power_meter-acpi-0

Jul 4 20:48:42 cb-prox1 sensord: Adapter: ACPI interface

Jul 4 20:48:42 cb-prox1 sensord: Chip: coretemp-isa-0000

Jul 4 20:48:42 cb-prox1 sensord: Adapter: ISA adapter

Jul 4 20:48:42 cb-prox1 sensord: Physical id 0: 53.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 0: 53.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 1: 52.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 2: 47.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 3: 50.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 4: 52.0 C

Jul 4 20:48:42 cb-prox1 sensord: Core 5: 49.0 C

Jul 4 20:55:14 cb-prox1 pvestatd[4880]: status update time (6.162 seconds)

Jul 4 21:00:01 cb-prox1 /USR/SBIN/CRON[941148]: (root) CMD (/usr/sbin/zfSnap -a 48h -p prox1_hourly_ -r tank)

Jul 4 21:00:01 cb-prox1 /USR/SBIN/CRON[941150]: (root) CMD (/usr/sbin/zfSnap -a 24h -p prox1_frequent_ -r tank)

Jul 4 21:00:01 cb-prox1 /USR/SBIN/CRON[941149]: (root) CMD (/usr/local/bin/zfs_health_check.sh)

Jul 4 21:15:01 cb-prox1 /USR/SBIN/CRON[942939]: (root) CMD (/usr/sbin/zfSnap -a 24h -p prox1_frequent_ -r tank)

Jul 4 21:17:01 cb-prox1 /USR/SBIN/CRON[943190]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

Jul 4 21:18:42 cb-prox1 sensord: Chip: acpitz-virtual-0

Jul 4 21:18:42 cb-prox1 sensord: Adapter: Virtual device

Jul 4 21:18:42 cb-prox1 sensord: temp1: 8.3 C

Jul 4 21:18:42 cb-prox1 sensord: Chip: power_meter-acpi-0

Jul 4 21:18:42 cb-prox1 sensord: Adapter: ACPI interface

Jul 4 21:18:42 cb-prox1 sensord: Chip: coretemp-isa-0000

Jul 4 21:18:42 cb-prox1 sensord: Adapter: ISA adapter

Jul 4 21:18:42 cb-prox1 sensord: Physical id 0: 53.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 0: 47.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 1: 48.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 2: 46.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 3: 44.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 4: 53.0 C

Jul 4 21:18:42 cb-prox1 sensord: Core 5: 50.0 C

Jul 4 21:22:27 cb-prox1 kernel: ------------[ cut here ]----

Jul 6 07:04:26 cb-prox1 kernel: imklog 5.8.11, log source = /proc/kmsg started.

Jul 6 07:04:26 cb-prox1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="3944" x-info="http://www.rsyslog.com"] start
 
/var/log/syslog

Jul 4 20:34:55 cb-prox1 simplesnapwrap[937994]: Will use tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T18:34:19__ as basis.

Jul 4 20:34:55 cb-prox1 simplesnapwrap[937994]: Making snapshot tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T20:34:55__

Jul 4 20:34:55 cb-prox1 simplesnapwrap[937994]: Running: /sbin/zfs snapshot tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T20:34:55__

Jul 4 20:34:58 cb-prox1 simplesnapwrap[937994]: zfs exited successfully.

Jul 4 20:34:58 cb-prox1 simplesnapwrap[937994]: Sending incremental stream back to tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T18:34:19__

Jul 4 20:34:58 cb-prox1 simplesnapwrap[937994]: Running: /sbin/zfs send -I tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T18:34:19__ tank/vm-102-disk-1@__simplesnap_prox1_2015-07-04T20:34:55__

Jul 6 07:04:26 cb-prox1 kernel: imklog 5.8.11, log source = /proc/kmsg started.

Jul 6 07:04:26 cb-prox1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="3944" x-info="http://www.rsyslog.com"] start
 
Try uploading the whole file to pastebin.com.
There is nothing related to a kernel panic in the logs you provided here.
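
Even when the panic itself is missing from syslog, the file still shows when the host stopped logging and when it came back. Here is a rough Python sketch that pairs the last timestamp before each "imklog ... started" line with the restart time. The function names are made up for illustration, and the year is an assumption, because classic syslog timestamps do not carry one. Clean reboots show up too, so treat each pair as a candidate, not proof of a crash.

```python
import re
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Classic syslog timestamps look like "Jul 4 21:22:27" (no year).
STAMP = re.compile(
    r"\b(" + "|".join(MONTHS) + r") +(\d{1,2}) (\d{2}):(\d{2}):(\d{2})\b")


def stamps(line, year):
    """Return every syslog-style timestamp found in the line, in order."""
    return [datetime(year, MONTHS.index(m.group(1)) + 1, int(m.group(2)),
                     int(m.group(3)), int(m.group(4)), int(m.group(5)))
            for m in STAMP.finditer(line)]


def find_outages(lines, year=2015):
    """Return (last_logged, logging_resumed) pairs, one per logger start."""
    outages = []
    last_seen = None
    for line in lines:
        found = stamps(line, year)
        if not found:
            continue
        if "imklog" in line and "started" in line:
            # A crash can cut a line short, leaving the restart message
            # fused onto it; then the first stamp is the last sign of life.
            before = found[0] if len(found) > 1 else last_seen
            if before is not None:
                outages.append((before, found[-1]))
        last_seen = found[-1]
    return outages

# Hypothetical usage on the host:
#   with open("/var/log/syslog", errors="replace") as f:
#       for before, after in find_outages(f):
#           print(before, "->", after, "down for", after - before)
```

A long gap that starts on a weekend evening and ends with a manual power cycle is the kind of pair worth lining up against the backup schedule.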
 
What brand is the physical server?

Have you applied all BIOS updates?

BIOS updates carry bug fixes (microcode) for newer processors.


Looking at your logs, I see

Code:
Jul  6 07:04:26 cb-prox1 kernel: Your BIOS is broken and requested that x2apic be disabled.
Jul  6 07:04:26 cb-prox1 kernel: This will slightly decrease performance.
 
Do a real crash analysis of the NEXT crash using kdump (http://www.thegeekstuff.com/2014/05/kdump is written for RH-based distros, but it works similarly on Debian). syslog and messages normally never help, because a kernel crash does not get written to the log files. You should use netconsole to capture the real output remotely (or a serial console, if the machine still has one).

You should set up kdump on all your physical machines in advance, so that you get the benefit when you need it.
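
For the receiving end of netconsole, any machine on the same network that can listen on UDP will do. Below is a minimal Python sketch of such a listener. It assumes netconsole on the Proxmox host is pointed at this machine and uses its default target port 6666. The function names are hypothetical, and in practice you would also write the output to a file.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive(sock, count=None):
    """Yield (sender_ip, text) per datagram; stop after `count` if given."""
    seen = 0
    while count is None or seen < count:
        data, (ip, _port) = sock.recvfrom(65535)
        seen += 1
        # Kernel messages are not guaranteed to be valid UTF-8.
        yield ip, data.decode("utf-8", errors="replace").rstrip("\n")

# Hypothetical usage on the receiving machine (runs until interrupted):
#   with open_listener() as sock:
#       for ip, text in receive(sock):
#           print(f"{ip}: {text}", flush=True)
```

UDP gives no delivery guarantee, so this complements kdump and does not replace it. It still often catches the panic text that never reaches the disk.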
 
As explained here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497/comments/31
You have a Gen8 server, so:
1) Open /etc/default/grub
2) Add intremap=no_x2apic_optout inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT
3) Close the file and run: sudo update-grub
4) Reboot the server.

It is further explained that you might need to add intel_idle.max_cstate=0 as well, but try without it first and add it only if you still have problems. It also requires sudo update-grub and a reboot.
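
If you have several hosts to change, the edit to /etc/default/grub can be scripted. This is only a sketch: add_kernel_params is a made-up helper name, and it assumes the file has a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. It only rewrites text; writing the file back and running update-grub is still up to you.

```python
import re

# Matches the double-quoted value of GRUB_CMDLINE_LINUX_DEFAULT.
CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_params(grub_text, params):
    """Return grub_text with each missing parameter appended inside the
    quotes of GRUB_CMDLINE_LINUX_DEFAULT; existing parameters are kept."""
    def patch(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = CMDLINE.subn(patch, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text

# Hypothetical usage (as root, after backing up the file):
#   path = "/etc/default/grub"
#   with open(path) as f:
#       text = f.read()
#   with open(path, "w") as f:
#       f.write(add_kernel_params(text, ["intremap=no_x2apic_optout"]))
#   # then run update-grub and reboot
```

Running it twice leaves the file unchanged the second time, so it is safe to repeat if you later add intel_idle.max_cstate=0 to the list.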
 
ProLiant ML350p Gen8

BIOS P72 (latest)

I have some 385p G8 units that gave me grief with random panics. They came from the factory with Broadcom-chipset NICs. I changed to Intel and that all disappeared. YMMV, but they have now been stable for over 90 days.

Your timeline sounds like mine.


Edit: dug through some logs, and it looks like for me it was the tg3 driver, which was being used for the Broadcom chips.
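
To check whether your own NICs are on tg3, you can read the driver symlinks under sysfs. A small Python sketch follows; nic_drivers is a hypothetical helper, and it relies on the usual Linux layout where /sys/class/net/<interface>/device/driver links to the bound driver. Bridges, loopback, and other virtual interfaces have no such link and come back as None.

```python
import os


def nic_drivers(sys_net="/sys/class/net"):
    """Map each network interface to its kernel driver name, or None."""
    drivers = {}
    for name in sorted(os.listdir(sys_net)):
        link = os.path.join(sys_net, name, "device", "driver")
        if os.path.islink(link):
            # The link target ends in the driver name, e.g. ".../drivers/tg3".
            drivers[name] = os.path.basename(os.readlink(link))
        else:
            drivers[name] = None
    return drivers

# Hypothetical usage on the host:
#   for nic, driver in nic_drivers().items():
#       print(nic, driver)
```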
 
Well, we got another week. Looks like it's fixed. Thanks everyone, for your help. :)
 
