Troubleshooting a Kernel Panic

Damn. It just failed. It only seems to fail when I run the Proxmox backups. I can beat on the CPU, network, memory, and disks all day, but something in there causes it to crash, and only on the weekend. I'm looking at doing a rebuild here, but I want to avoid that if I can. Thoughts?
 
I can even run them manually on the weekends. It only seems to crash when the automated backup from Proxmox starts. I have the backup directory on a Synology mounted via Samba. Could switching to NFS help?
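
For reference, this is roughly what I'd be switching to: a plain NFS mount of the Synology share instead of Samba. The hostname and export path below are just placeholders for whatever the Synology actually exports.

Code:
apt-get install nfs-common          # NFS client bits on Debian/Proxmox
mkdir -p /mnt/backups
# /etc/fstab line (host and export path are placeholders):
#   synology.local:/volume1/proxmox-backups  /mnt/backups  nfs  vers=3,rw,hard  0  0
mount /mnt/backups                  # test the mount before pointing the backup job at it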
 
Still putting my bets on the tg3 driver.

I fought this problem for about 12 months, random panics, until I got fed up and put some money into different NICs. No issues since.

I'll see if I can find some of my old logs, but as I recall, in my case I had a 24-core setup. Each core would eventually lock/hang.
 
That is the confusing part for me. I can run high bandwidth through the NIC at any time, except during the Proxmox automated backup. I did a manual backup of all the VMs on the server yesterday afternoon, and it went flawlessly. When the automated one runs to back up the same VMs, the host dies. I can pretty well predict when it will happen: weekend Proxmox auto backup, server dies.
 
I'll see what I can do to get another NIC in there though. And I'm considering rebuilding the server.
 
Me too; I understand your logic and frustration. It looks like my logs have rotated, but I'm still looking. I have a gut feeling that if you check your syslog (I think, or maybe dmesg) you might find some sort of IRQ issue on each core (going from memory) as it dies.

And as you have discovered, it might go weeks before showing itself again, or show up the next day or hour.

I finally compiled the latest tg3 driver from Broadcom, and it might have done some good, but I had ordered new NICs by then too.
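
In case it's useful, this is roughly what the build looked like as I remember it; the tarball name and version are placeholders for whatever Broadcom ships, so don't take the exact targets as gospel.

Code:
apt-get install build-essential pve-headers-$(uname -r)
tar xzf tg3-X.YYY.tar.gz            # placeholder name; use the tarball from Broadcom's site
cd tg3-X.YYY
make                                # builds tg3.ko against the running kernel's headers
make install                        # drops the module under /lib/modules/$(uname -r)/
depmod -a
rmmod tg3 && modprobe tg3           # reload; do this from the console/iLO, the link drops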
 
Nope, nothing in syslog or any other logs. They just end at the crash and restart at the reboot. I'm going to set up kdump when I can get some reboot time in there, or maybe I'll just not run the automated backup until Proxmox 4, and hope the 4.x kernel makes it happy.
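
For what it's worth, this is the rough kdump setup I have in mind, assuming the stock Debian kdump-tools package; the crashkernel size is just a starting point.

Code:
apt-get install kdump-tools
# enable it in /etc/default/kdump-tools:
#   USE_KDUMP=1
# reserve memory for the crash kernel by adding this to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub:
#   crashkernel=256M
update-grub
reboot
kdump-config show                   # should report it's ready to kdump after the reboot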
 
I'll see what I can do to get another NIC in there though. And I'm considering rebuilding the server.

I did it without rebuilding: pulled all the Broadcom NICs out (including the built-in 4-port card that HP put in) and put in new ones (I used Intel).

Then in /etc/udev/rules.d, delete the file "70-persistent-net.rules", reboot, and let the machine reconfigure for the new NICs.
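
Something like this, from memory:

Code:
rm /etc/udev/rules.d/70-persistent-net.rules
reboot
# udev regenerates the file on boot and assigns fresh ethX names to the new NICs;
# check /etc/network/interfaces afterwards so the bridge ports match the new names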
 
Looking at some of my old posts...

How about "hung_task_timeout_secs"? Does that show up in the logs anywhere?
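
A quick way to check, assuming the usual Debian syslog locations:

Code:
grep -i hung_task_timeout_secs /var/log/syslog* /var/log/kern.log*
dmesg | grep -i hung_task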
 
Ahhh, my syslog was too big even for Pastebin Pro.

This is the kind of noise I was getting until each CPU was stuck and the server died. One thing I remember now is that I still had a console, but could not SSH into it. Note that tg3 is involved:


Code:
Mar 16 20:42:36 proliant02 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [kvm:14275]
Mar 16 20:42:36 proliant02 kernel: Modules linked in: openvswitch ip_set vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vhost_net tun macvtap macvlan nfnetlink_log kvm_amd nfnetlink kvm dlm sctp configfs vzevent dm_round_robin ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding 8021q garp ipv6 fuse dm_multipath snd_pcsp snd_pcm power_meter snd_page_alloc i2c_piix4 acpi_ipmi serio_raw hpwdt snd_timer snd soundcore amd64_edac_mod edac_mce_amd edac_core i2c_core k10temp fam15h_power hpilo shpchp ipmi_si ipmi_msghandler ext3 mbcache jbd sg ata_generic pata_acpi tg3 pata_atiixp ptp pps_core ahci hpsa [last unloaded: scsi_wait_scan]
Mar 16 20:42:36 proliant02 kernel: 
Mar 16 20:42:36 proliant02 kernel: CPU 3 
Mar 16 20:42:36 proliant02 kernel: Modules linked in: openvswitch ip_set vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vhost_net tun macvtap macvlan nfnetlink_log kvm_amd nfnetlink kvm dlm sctp configfs vzevent dm_round_robin ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding 8021q garp ipv6 fuse dm_multipath snd_pcsp snd_pcm power_meter snd_page_alloc i2c_piix4 acpi_ipmi serio_raw hpwdt snd_timer snd soundcore amd64_edac_mod edac_mce_amd edac_core i2c_core k10temp fam15h_power hpilo shpchp ipmi_si ipmi_msghandler ext3 mbcache jbd sg ata_generic pata_acpi tg3 pata_atiixp ptp pps_core ahci hpsa [last unloaded: scsi_wait_scan]
Mar 16 20:42:36 proliant02 kernel: 
Mar 16 20:42:36 proliant02 kernel: 
Mar 16 20:42:36 proliant02 kernel: Pid: 14275, comm: kvm veid: 0 Not tainted 2.6.32-37-pve #1 042stab104 HP ProLiant DL385p Gen8
Mar 16 20:42:36 proliant02 kernel: RIP: 0010:[<ffffffff81562dbe>]  [<ffffffff81562dbe>] _spin_lock+0x1e/0x30
Mar 16 20:42:36 proliant02 kernel: RSP: 0018:ffff88018cdf97b8  EFLAGS: 00000283
Mar 16 20:42:36 proliant02 kernel: RAX: 000000000000d8d0 RBX: ffff88018cdf97b8 RCX: 0000000000000018
Mar 16 20:42:36 proliant02 kernel: RDX: 000000000000d8ce RSI: ffff880201a6d6c0 RDI: ffffffff81e42950
Mar 16 20:42:36 proliant02 kernel: RBP: ffffffff8100bcce R08: ffff880201a6d998 R09: 0000000000000000
Mar 16 20:42:36 proliant02 kernel: R10: ffff88017b6a49c0 R11: 0000000000000000 R12: ffff880c3cc2a200
Mar 16 20:42:36 proliant02 kernel: R13: ffff88018ccb7300 R14: ffffffff810646aa R15: ffff88018cdf9758
Mar 16 20:42:36 proliant02 kernel: FS:  00000000fff9e000(0000) GS:ffff880850440000(0000) knlGS:0000000000000000
Mar 16 20:42:36 proliant02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 16 20:42:36 proliant02 kernel: CR2: 0000000000d8ed18 CR3: 00000001da9e4000 CR4: 00000000000407e0
Mar 16 20:42:36 proliant02 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 16 20:42:36 proliant02 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 16 20:42:36 proliant02 kernel: Process kvm (pid: 14275, veid: 0, threadinfo ffff88018cdf8000, task ffff88043db32f40)
Mar 16 20:42:36 proliant02 kernel: Stack:
Mar 16 20:42:36 proliant02 kernel: ffff88018cdf9808 ffffffff8105035b 00007f5e2717c000 ffff880201a6d6c0
Mar 16 20:42:36 proliant02 kernel: <d> ffff88018ce14f40 ffff880201a6d6c0 00007f5e2717c000 ffff880201a6d998
Mar 16 20:42:36 proliant02 kernel: <d> ffff88017b03dbe0 ffffea002f362f00 ffff88018cdf9838 ffffffff81050756
Mar 16 20:42:36 proliant02 kernel: Call Trace:
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105035b>] ? flush_tlb_others_ipi+0x6b/0x130
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81050756>] ? native_flush_tlb_others+0x76/0x90
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105052e>] ? flush_tlb_page+0x5e/0xb0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81168104>] ? do_wp_page+0x334/0x9a0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8106d762>] ? default_wake_function+0x12/0x20
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810a6196>] ? autoremove_wake_function+0x16/0x40
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116cfbd>] ? handle_pte_fault+0x54d/0x16c0
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e1729>] ? kvm_read_guest+0x59/0xa0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa0500969>] ? paging64_walk_addr+0x1c9/0x590 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e34c>] ? handle_mm_fault+0x21c/0x300
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116c1da>] ? follow_page+0x3ea/0x4b0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e556>] ? __get_user_pages+0x126/0x3d0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e8ba>] ? get_user_pages+0x5a/0x60
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810501f7>] ? get_user_pages_fast+0xa7/0x190
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e2fb3>] ? hva_to_pfn+0x33/0x170 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81562306>] ? down_read+0x16/0x2b
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e48bb>] ? kvm_host_page_size+0xcb/0x1b0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04fc394>] ? mapping_level+0x64/0xf0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa0502f04>] ? tdp_page_fault+0x74/0x150 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04fdee8>] ? kvm_mmu_page_fault+0x28/0xd0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa054c0bf>] ? pf_interception+0x7f/0xe0 [kvm_amd]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa054ff9e>] ? handle_exit+0x1be/0x3c0 [kvm_amd]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04f87e0>] ? kvm_arch_vcpu_ioctl_run+0x3c0/0xf20 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c5b16>] ? wake_futex+0x66/0x80
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c6040>] ? futex_wake+0x70/0x140
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04de8e3>] ? kvm_vcpu_ioctl+0x2e3/0x580 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c8159>] ? do_futex+0x159/0xb60
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105833f>] ? __dequeue_entity+0x2f/0x50
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81009738>] ? __switch_to+0x128/0x2f0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c486a>] ? vfs_ioctl+0x2a/0xa0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81061f12>] ? finish_task_switch+0xc2/0x100
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c4e9e>] ? do_vfs_ioctl+0x7e/0x5a0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c8bed>] ? sys_futex+0x8d/0x190
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c540f>] ? sys_ioctl+0x4f/0x80
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810a4f29>] ? sys_clock_gettime+0x69/0xa0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
Mar 16 20:42:36 proliant02 kernel: Code: 00 00 00 01 74 05 e8 32 c5 d3 ff 5d c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df 5d c3 0f 1f 40 00 55 48 89 
Mar 16 20:42:36 proliant02 kernel: Call Trace:
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105035b>] ? flush_tlb_others_ipi+0x6b/0x130
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81050756>] ? native_flush_tlb_others+0x76/0x90
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105052e>] ? flush_tlb_page+0x5e/0xb0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81168104>] ? do_wp_page+0x334/0x9a0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8106d762>] ? default_wake_function+0x12/0x20
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810a6196>] ? autoremove_wake_function+0x16/0x40
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116cfbd>] ? handle_pte_fault+0x54d/0x16c0
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e1729>] ? kvm_read_guest+0x59/0xa0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa0500969>] ? paging64_walk_addr+0x1c9/0x590 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e34c>] ? handle_mm_fault+0x21c/0x300
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116c1da>] ? follow_page+0x3ea/0x4b0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e556>] ? __get_user_pages+0x126/0x3d0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8116e8ba>] ? get_user_pages+0x5a/0x60
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810501f7>] ? get_user_pages_fast+0xa7/0x190
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e2fb3>] ? hva_to_pfn+0x33/0x170 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81562306>] ? down_read+0x16/0x2b
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04e48bb>] ? kvm_host_page_size+0xcb/0x1b0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04fc394>] ? mapping_level+0x64/0xf0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa0502f04>] ? tdp_page_fault+0x74/0x150 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04fdee8>] ? kvm_mmu_page_fault+0x28/0xd0 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa054c0bf>] ? pf_interception+0x7f/0xe0 [kvm_amd]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa054ff9e>] ? handle_exit+0x1be/0x3c0 [kvm_amd]
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04f87e0>] ? kvm_arch_vcpu_ioctl_run+0x3c0/0xf20 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c5b16>] ? wake_futex+0x66/0x80
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c6040>] ? futex_wake+0x70/0x140
Mar 16 20:42:36 proliant02 kernel: [<ffffffffa04de8e3>] ? kvm_vcpu_ioctl+0x2e3/0x580 [kvm]
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c8159>] ? do_futex+0x159/0xb60
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8105833f>] ? __dequeue_entity+0x2f/0x50
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81009738>] ? __switch_to+0x128/0x2f0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c486a>] ? vfs_ioctl+0x2a/0xa0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff81061f12>] ? finish_task_switch+0xc2/0x100
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c4e9e>] ? do_vfs_ioctl+0x7e/0x5a0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810c8bed>] ? sys_futex+0x8d/0x190
Mar 16 20:42:36 proliant02 kernel: [<ffffffff811c540f>] ? sys_ioctl+0x4f/0x80
Mar 16 20:42:36 proliant02 kernel: [<ffffffff810a4f29>] ? sys_clock_gettime+0x69/0xa0
Mar 16 20:42:36 proliant02 kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
 
You still had an active console? I can't do anything with mine except an iLO reboot. I didn't see anything like the "soft lockup - CPU#3 stuck for 67s" messages. We may have had different problems. Still, I'll switch out the NIC.
 
At first glance it looks like a panic, but it was still sending syslog messages. This would go on all day if I let it.

But in your case it was completely locked up?

It was a tough one to diagnose, and a hard pill to swallow when I told my supervisor I needed to spend about $4k on the cluster while still not knowing 100% that it was the fix. But after all the googling I did about tg3, I felt pretty good about it.

Good luck
 
To make your life easier, always choose Intel NICs. For any motherboard with on-board NICs, I first look at who makes the NIC chip. If it's not Intel, I won't spend any more time on that board.
 
I got it to happen while I was watching.

http://pastebin.com/YLujnFTT



Atop says the disks are not stressed. There is a lot of iowait, though. This is a Proxmox install on mdadm RAID 1 on SSDs, with ZFS added in afterwards.
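
If it helps, something like iostat (from the sysstat package) should show whether that iowait is actually landing on the SSDs or somewhere else while the backup runs:

Code:
apt-get install sysstat
iostat -x 5                         # watch %util and await per device during the backup window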



I think rebuilding to a clean 3.4 on ZFS might fix this.



Thoughts?
 
I got it to happen while I was watching.

I think rebuilding to a clean 3.4 on ZFS might fix this.

Thoughts?


Having not used ZFS in my Proxmox environment, I can only guess these are shares on the *network*.

I can recall twice just SSHing into a Proxmox node and having it crash. The tg3 drivers are just crap with any load in a cluster environment (that's just my opinion).
 
