random kernel panics, possibly caused by e1000e module

olaszfiu

Renowned Member
Feb 23, 2015
Hello,
I'm experiencing random kernel panics on a 6-node Proxmox VE cluster, recently updated to 3.4 (they also occurred with version 3.3).
My current PVE config is:

# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1


I managed to capture a screenshot of one of those kernel panics (see attached netconsole.zip).
It seems to me they occur during interrupt handling in the e1000e module.
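
For anyone who wants to capture panics the same way: netconsole streams kernel messages over UDP to another host. A minimal setup looks roughly like this (the addresses, interface, and MAC below are placeholders for your own nodes, not my actual values).

On the panicking node:

modprobe netconsole netconsole=6666@192.168.0.10/eth0,6666@192.168.0.20/aa:bb:cc:dd:ee:ff

On the collecting host (exact flags depend on the netcat variant):

nc -u -l 6666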

Each node has:

- 2x Intel 82574L NICs (module: e1000e).
The first NIC is unused; the second one is bridged on vmbr0, where several VMs are attached, each using a different VLAN tag.
The VMs' virtual disks are hosted on a replicated GlusterFS volume.

- 2x Intel 82576 NICs (module: igb).
Both NICs are bonded to bond0.
This is a dedicated interface used for client-server Gluster communication; a rough sketch of this layout in /etc/network/interfaces follows below.
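
For context, this layout in /etc/network/interfaces looks roughly like the following (interface names, addresses, and the bond mode here are illustrative placeholders rather than my exact values):

auto bond0
iface bond0 inet static
        address 10.0.0.10
        netmask 255.255.255.0
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.10
        netmask 255.255.255.0
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0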

Unfortunately, the kernel panics are not easily reproducible.
Sometimes they happen on boot, a few minutes after all the VMs have started.
A few times they happened during VM live migration.
They have also occurred while I was installing the virtio network driver inside a Windows 8.1 VM.


Do you have any clue what the problem could be?

Thanks,
Rosario
 

Attachments

  • lspci.txt (4.1 KB)
  • netconsole.zip (4.8 KB)
  • cpuinfo.zip (1.3 KB)

I had issues using e1000 network drivers in KVM Windows VMs; they would just freeze. I only noticed it on 3.3 and 3.4. Changing the drivers to virtio (network and disk) worked for me (running 3.4).
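
For an existing VM the NIC model can be switched to virtio from the CLI, e.g. (vmid 100 is a placeholder, and note this assigns a new MAC address unless you specify one):

qm set 100 -net0 virtio,bridge=vmbr0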

Stephen
 
Thanks Stephen,
but unfortunately the kernel panics occur on the hypervisors themselves, which have physical Intel network cards. The VMs already use virtio drivers and work fine.
Cheers, Rosario
 
Thanks Stephen, but the info in that thread doesn't seem related to my case.


As a further test, today I moved the vmbr0 bridge on top of an Intel 82576 NIC (instead of the Intel 82574L NIC), and after a while, once all the VMs had started, I got the same kind of kernel panic (see attached netconsole-igb.zip).
This time the "Fatal exception in interrupt" seems to come from the "igb" kernel module.
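
For reference, which kernel module drives a given port can be double-checked with ethtool (eth2 here is a placeholder for the bridge port):

ethtool -i eth2

The "driver:" line of the output names the module; lspci -k shows the same information per PCI device.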


This is getting really complicated to debug...
Could any of the Proxmox VE developers have a look at this, please?


Thanks, Rosario
 

Attachments

  • netconsole-igb.zip (4.4 KB)
It looks like disabling GRO (generic receive offload) on the bridged NIC (eth2) has mitigated the problem.
I added the following to /etc/network/interfaces:

pre-up /sbin/ethtool -K eth2 gro off

No kernel panics observed since yesterday.
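
To verify that the setting actually took effect after a reboot:

ethtool -k eth2 | grep generic-receive-offload

which should report "generic-receive-offload: off".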
 
