Random crashing version 3.3

NSW

Hi,

I am getting some random crashing with both a new install and an updated install. I've searched and can't really find any good information on it. If someone out there has any ideas, I would greatly appreciate the help. This node is part of a small cluster that was being updated from 3.2-4 to 3.3-5.

Below is the most recent crash that I have actually been able to capture.

Code:
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:2717!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/kernel/uevent_seqnum
CPU 0 
Modules linked in: netconsole ip_set vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit vhost_net xt_dscp tun macvtap macvlan nfnetlink_log nfnetlink ipt_REJECT kvm_amd ip_tables kvm dlm configfs vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding 8021q garp ipv6 fuse snd_pcsp snd_pcm snd_page_alloc snd_timer snd serio_raw i2c_piix4 fam15h_power k10temp amd64_edac_mod edac_mce_amd edac_core soundcore shpchp ext3 mbcache jbd sg ata_generic pata_acpi mpt2sas raid_class usb_storage igb i2c_algo_bit bnx2 pata_atiixp i2c_core dca scsi_transport_sas ahci [last unloaded: scsi_wait_scan]

Pid: 7620, comm: kvm veid: 0 Not tainted 2.6.32-34-pve #1 042stab094_7 Supermicro H8DGU-LN4/H8DGU-LN4
RIP: 0010:[<ffffffff81472c49>]  [<ffffffff81472c49>] skb_segment+0x709/0x740
RSP: 0018:ffff8800282037f0  EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff8804246cbe40 RCX: ffff88042bba0d40
RDX: 000000000000004d RSI: ffff88042d135882 RDI: ffff880423962882
RBP: ffff8800282038a0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff880423962800 R11: 0000000000000000 R12: 00000000000005ee
R13: 0000000000000000 R14: ffff880429ea3e80 R15: 000000000000004d
FS:  00007f18615a2900(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000e9f3000 CR3: 000000100a715000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kvm (pid: 7620, veid: 0, threadinfo ffff88100a7a0000, task ffff8810208e0d30)
Stack:
 ffff880028203820 ffffffff8105bda8 0000000100000000 000000820000003c
<d> 000000000000003c 0000000000000046 0000000000000000 0000000000000046
<d> 0100880000000000 ffffffffffffffba ffff88042d286580 0000000000000000
Call Trace:
 <IRQ> 
 [<ffffffff8105bda8>] ? task_rq_lock+0x58/0xa0
 [<ffffffff814c3e51>] tcp_tso_segment+0xf1/0x320
 [<ffffffff81471ce7>] ? __kfree_skb+0x47/0xa0
 [<ffffffff814eb2e1>] inet_gso_segment+0x111/0x2e0
 [<ffffffff8147f037>] skb_mac_gso_segment+0xa7/0x290
 [<ffffffff8147f278>] __skb_gso_segment+0x58/0xc0
 [<ffffffff8147f2f3>] skb_gso_segment+0x13/0x20
 [<ffffffff8147f391>] dev_hard_start_xmit+0x91/0x5f0
 [<ffffffff8149e19a>] sch_direct_xmit+0x16a/0x1d0
 [<ffffffff8147fbd8>] dev_queue_xmit+0x208/0x300
 [<ffffffff8151f780>] ? __br_forward+0x0/0xd0
 [<ffffffff8151f48b>] br_dev_queue_push_xmit+0x7b/0xc0
 [<ffffffff8151f528>] br_forward_finish+0x58/0x60
 [<ffffffff8151f82b>] __br_forward+0xab/0xd0
 [<ffffffff8151f3ee>] deliver_clone+0x3e/0x60
 [<ffffffff8151f780>] ? __br_forward+0x0/0xd0
 [<ffffffff8151f722>] br_flood+0x82/0xe0
 [<ffffffff8151facc>] br_flood_forward+0x1c/0x20
 [<ffffffff81520c60>] br_handle_frame_finish+0x330/0x370
 [<ffffffff81520e4a>] br_handle_frame+0x1aa/0x250
 [<ffffffff8147ffdf>] __netif_receive_skb+0x24f/0x770
 [<ffffffff81480648>] netif_receive_skb+0x58/0x60
 [<ffffffff81480848>] napi_gro_complete+0xc8/0x150
 [<ffffffff81480ad3>] dev_gro_receive+0x203/0x320
 [<ffffffff8152f358>] vlan_gro_common+0x1b8/0x260
 [<ffffffff8152f882>] vlan_gro_receive+0x82/0xa0
 [<ffffffffa00772de>] igb_receive_skb+0x2e/0x50 [igb]
 [<ffffffffa0081cdf>] igb_poll+0x74f/0x1370 [igb]
 [<ffffffff81060b4d>] ? enqueue_task_fair+0xdd/0x1f0
 [<ffffffff81058c96>] ? enqueue_task+0x66/0x80
 [<ffffffff814810b1>] net_rx_action+0x1a1/0x3b0
 [<ffffffff81014d79>] ? read_tsc+0x9/0x20
 [<ffffffff8107d24b>] __do_softirq+0x11b/0x260
 [<ffffffff8100c4cc>] call_softirq+0x1c/0x30
 [<ffffffff81010235>] do_softirq+0x75/0xb0
 [<ffffffff8107d525>] irq_exit+0xc5/0xd0
 [<ffffffff81563f92>] do_IRQ+0x72/0xe0
 [<ffffffff8100bb13>] ret_from_intr+0x0/0x11
 <EOI> 
 [<ffffffff811bd6f0>] ? sys_ioctl+0x0/0x80
 [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
Code: c5 fc ff ff 41 8b 87 d4 00 00 00 49 03 87 d8 00 00 00 48 83 78 18 00 75 1b 48 89 48 18 e9 f8 fe ff ff f0 ff 81 ec 00 00
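From what I can see in the trace, the BUG fires in skb_segment() while the kernel re-segments a GRO-merged frame: the packet comes in through the igb driver and the VLAN GRO path, gets forwarded by the bridge, and blows up when dev_hard_start_xmit() hits the GSO code on the way out. If anyone wants to compare settings, or test with the offloads turned off, something like this should do it (eth0 is just an example; substitute your actual bridge port):

Code:
# Show the current offload settings on the bridge port
ethtool -k eth0
# As a diagnostic only: disable the receive/segmentation offloads
ethtool -K eth0 gro off gso off tso off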

Here is my pveversion -v output:

Code:
proxmox-ve-2.6.32: 3.3-139 (running kernel: 2.6.32-34-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-34-pve: 2.6.32-139
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.3-3
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Thanks in advance for any advice or info you can provide.
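(FWIW, I captured the trace above over netconsole; you can see the module at the front of the loaded-modules list. A minimal setup looks roughly like this; the IPs, MAC, and interface are placeholders:)

Code:
# On the crashing node: stream kernel messages to a log host at 192.168.1.20
modprobe netconsole netconsole=6666@192.168.1.10/eth0,514@192.168.1.20/00:11:22:33:44:55
# On the log host (netcat-traditional syntax): listen for the UDP stream
nc -l -u -p 514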
 

Hi,
I don't know if it helps, but I see you use a Supermicro motherboard. I had one Supermicro mainboard where pve suddenly rebooted until my colleague flashed a new BIOS.

Unfortunately, Supermicro doesn't document which issues are fixed by which BIOS upgrade (though they do write that you should only update the BIOS if your issue is BIOS-related!). Other companies, like Asus, provide much better information. This is the reason why I avoid Supermicro these days.
Nevertheless, you can try a BIOS update...
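If you want to see what you currently run before flashing, dmidecode can read it from the DMI tables (a quick sketch; run as root):

Code:
# Query the BIOS strings from the DMI/SMBIOS tables
dmidecode -s bios-version
dmidecode -s bios-release-date
dmidecode -s baseboard-product-name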

Udo
 
Hi nsw,

Udo is correct.
We have had major problems with the Supermicro X9SCM-F boards (total hypervisor crash when starting a big copy). It turned out that one of the NICs had issues with Linux. This was not a Proxmox issue; we managed to reproduce the error on Debian and CentOS with plain KVM.
Solution: replace the boards with Supermicro Atom C2750 boards; rock solid since the replacement.
The X9SCM-F boards are now used for our OmniOS + napp-it storage boxes: stable without any problem. The illumos driver for that NIC is better than the Linux one.

Regards,

Dirk Adamsky
 
Udo and Dirk,

Thanks for the input. I've checked the BIOS on both and they are up to date. What's weird is that the other identical server, still running 3.2-4, has had no crashing issues at all. It's only the updated boxes that are crashing. I may look at rolling back the kernel and see if that helps at all.

On the Supermicro BIOS issue, I agree. They tell you nothing about the updates or what they fix. We went with Supermicro because we got a good deal on them and they had a good AMD option. We are running the AS-2022G-URF, and they use the H8DGU-F board.
 

Hi,
you can try the 2.6.32-33 kernel (I also had trouble on an AMD system with 2.6.32-28 through -32). The 2.6.32-27 kernel ran stable for me, and so does the 2.6.32-33.
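Something like this should get you onto the older kernel (the package name follows the usual pve-kernel naming from your pveversion output; double-check it exists with apt-cache search pve-kernel first):

Code:
# Install the older kernel image alongside the current one
apt-get update
apt-get install pve-kernel-2.6.32-33-pve
# Reboot and choose the 2.6.32-33 entry in the GRUB menu
reboot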


Udo
 
Reporting back: no luck with the other kernels. Still crashing with the same error on both boxes. I even tried pushing one of the servers up to the pvetest repo and updating, with no effect. Updated the firmware on the interconnecting switch, still no luck. I guess the last option I have is to downgrade/reinstall to 3.2-4, which is running without a single problem. *sigh* Going to miss the NoVNC console. :(
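For reference, "pushing up to pvetest" just meant adding the test repository and dist-upgrading, roughly like this (PVE 3.x sits on Debian wheezy, so that is the suite):

Code:
# Add the pvetest repository (Proxmox VE 3.x / Debian wheezy)
echo "deb http://download.proxmox.com/debian wheezy pvetest" > /etc/apt/sources.list.d/pvetest.list
apt-get update && apt-get dist-upgrade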

Anyway, thanks for all the input, Udo. I'll keep working on this with a test system, and hopefully a newer version down the line will work without crashing daily. If anyone has any other ideas, I'm open to input.
 
