tg3 timeouts with KVM

May 26, 2013
Recently I've received several Dell R730 servers at Hetzner for an existing project. Unfortunately the built-in NICs don't play well with KVM virtual machines: as soon as I put some traffic (20-50 Mbps) on a VM, the host dumps a backtrace and disables the interface. The problem isn't new and is well described on various mailing lists, but unfortunately I couldn't find any reliable workaround short of installing a different NIC (Intel works great). Am I missing something, or is this problem simply not worth fighting? The host itself works fine without KVM guests: there are no problems with OpenVZ guests using veth interfaces, or with no VMs running at all.
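The workarounds suggested on those mailing lists mostly boil down to disabling the hardware offloads on the tg3 port behind the bridge, roughly like this (eth1/vmbr1 are the names from my setup, and I can't say this is a reliable fix):

Code:
# show which offloads are currently enabled on the tg3 port
ethtool -k eth1

# disable TSO/GSO/GRO/scatter-gather on the physical port behind vmbr1
ethtool -K eth1 tso off gso off gro off sg off

To make this survive a reboot, the same ethtool -K line can be added as a post-up hook for eth1 in /etc/network/interfaces.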

Here's an excerpt from the logs:
Code:
Feb 11 08:53:24 s13 kernel: tap116i0: no IPv6 routers present
Feb 11 08:53:29 s13 kernel: vmbr1: port 2(tap116i0) entering learning state
Feb 11 08:53:44 s13 kernel: vmbr1: topology change detected, sending tcn bpdu
Feb 11 08:53:44 s13 kernel: vmbr1: port 2(tap116i0) entering forwarding state
Feb 11 08:55:24 s13 kernel: ------------[ cut here ]------------
Feb 11 08:55:24 s13 kernel: WARNING: at net/sched/sch_generic.c:267 dev_watchdog+0x28a/0x2a0() (Not tainted)
Feb 11 08:55:24 s13 kernel: Hardware name: PowerEdge R730
Feb 11 08:55:24 s13 kernel: NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Feb 11 08:55:24 s13 kernel: Modules linked in: dlm configfs xt_state ip_set vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vhost_net tun macvtap macvlan nfnetlink_log kvm_intel nfnetlink kvm vzevent nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ipv6 ext2 fuse snd_pcsp iTCO_wdt iTCO_vendor_support snd_pcm snd_page_alloc snd_timer dcdbas snd soundcore lpc_ich mfd_core shpchp wmi power_meter ext4 jbd2 mbcache sg ahci tg3 ptp pps_core megaraid_sas [last unloaded: configfs]
Feb 11 08:55:24 s13 kernel: Pid: 0, comm: swapper veid: 0 Not tainted 2.6.32-34-pve #1
Feb 11 08:55:24 s13 kernel: Call Trace:
Feb 11 08:55:24 s13 kernel: <IRQ> [<ffffffff810733b7>] ? warn_slowpath_common+0x87/0xe0
Feb 11 08:55:24 s13 kernel: [<ffffffff810734c6>] ? warn_slowpath_fmt+0x46/0x50
Feb 11 08:55:24 s13 kernel: [<ffffffff8149e01a>] ? dev_watchdog+0x28a/0x2a0
Feb 11 08:55:24 s13 kernel: [<ffffffff81015319>] ? sched_clock+0x9/0x10
Feb 11 08:55:26 s13 kernel: [<ffffffff8106c3da>] ? scheduler_tick+0xfa/0x240
Feb 11 08:55:26 s13 kernel: [<ffffffff8149dd90>] ? dev_watchdog+0x0/0x2a0
Feb 11 08:55:26 s13 kernel: [<ffffffff81087b76>] ? run_timer_softirq+0x176/0x370
Feb 11 08:55:26 s13 kernel: [<ffffffff8107d24b>] ? __do_softirq+0x11b/0x260
Feb 11 08:55:26 s13 kernel: [<ffffffff8100c4cc>] ? call_softirq+0x1c/0x30
Feb 11 08:55:26 s13 kernel: [<ffffffff81010215>] ? do_softirq+0x75/0xb0
Feb 11 08:55:26 s13 kernel: [<ffffffff8107d525>] ? irq_exit+0xc5/0xd0
Feb 11 08:55:26 s13 kernel: [<ffffffff8156404a>] ? smp_apic_timer_interrupt+0x4a/0x60
Feb 11 08:55:26 s13 kernel: [<ffffffff8100bcd3>] ? apic_timer_interrupt+0x13/0x20
Feb 11 08:55:26 s13 kernel: <EOI> [<ffffffff812e8dcb>] ? intel_idle+0xdb/0x160
Feb 11 08:55:26 s13 kernel: [<ffffffff812e8da9>] ? intel_idle+0xb9/0x160
Feb 11 08:55:26 s13 kernel: [<ffffffff81446994>] ? cpuidle_idle_call+0x94/0x130
Feb 11 08:55:26 s13 kernel: [<ffffffff81009219>] ? cpu_idle+0xa9/0x100
Feb 11 08:55:26 s13 kernel: [<ffffffff81536001>] ? rest_init+0x85/0x94
Feb 11 08:55:26 s13 kernel: [<ffffffff81c33ce1>] ? start_kernel+0x3ff/0x40b
Feb 11 08:55:26 s13 kernel: [<ffffffff81c3333b>] ? x86_64_start_reservations+0x126/0x12a
Feb 11 08:55:26 s13 kernel: [<ffffffff81c3344d>] ? x86_64_start_kernel+0x10e/0x11d
Feb 11 08:55:26 s13 kernel: ---[ end trace 3eb6af1e220fb20d ]---
Feb 11 08:55:26 s13 kernel: Tainting kernel with flag 0x9
Feb 11 08:55:26 s13 kernel: Pid: 0, comm: swapper veid: 0 Not tainted 2.6.32-34-pve #1
Feb 11 08:55:26 s13 kernel: Call Trace:
Feb 11 08:55:26 s13 kernel: <IRQ> [<ffffffff81073269>] ? add_taint+0x69/0x70
Feb 11 08:55:26 s13 kernel: [<ffffffff810733d9>] ? warn_slowpath_common+0xa9/0xe0
Feb 11 08:55:26 s13 kernel: [<ffffffff810734c6>] ? warn_slowpath_fmt+0x46/0x50
Feb 11 08:55:26 s13 kernel: [<ffffffff8149e01a>] ? dev_watchdog+0x28a/0x2a0
Feb 11 08:55:26 s13 kernel: [<ffffffff81015319>] ? sched_clock+0x9/0x10
Feb 11 08:55:26 s13 kernel: [<ffffffff8106c3da>] ? scheduler_tick+0xfa/0x240
Feb 11 08:55:26 s13 kernel: [<ffffffff8149dd90>] ? dev_watchdog+0x0/0x2a0
Feb 11 08:55:26 s13 kernel: [<ffffffff81087b76>] ? run_timer_softirq+0x176/0x370
Feb 11 08:55:26 s13 kernel: [<ffffffff8107d24b>] ? __do_softirq+0x11b/0x260
Feb 11 08:55:26 s13 kernel: [<ffffffff8100c4cc>] ? call_softirq+0x1c/0x30
Feb 11 08:55:26 s13 kernel: [<ffffffff81010215>] ? do_softirq+0x75/0xb0
Feb 11 08:55:26 s13 kernel: [<ffffffff8107d525>] ? irq_exit+0xc5/0xd0
Feb 11 08:55:26 s13 kernel: [<ffffffff8156404a>] ? smp_apic_timer_interrupt+0x4a/0x60
Feb 11 08:55:26 s13 kernel: [<ffffffff8100bcd3>] ? apic_timer_interrupt+0x13/0x20
Feb 11 08:55:26 s13 kernel: <EOI> [<ffffffff812e8dcb>] ? intel_idle+0xdb/0x160
Feb 11 08:55:26 s13 kernel: [<ffffffff812e8da9>] ? intel_idle+0xb9/0x160
Feb 11 08:55:26 s13 kernel: [<ffffffff81446994>] ? cpuidle_idle_call+0x94/0x130
Feb 11 08:55:26 s13 kernel: [<ffffffff81009219>] ? cpu_idle+0xa9/0x100
Feb 11 08:55:26 s13 kernel: [<ffffffff81536001>] ? rest_init+0x85/0x94
Feb 11 08:55:26 s13 kernel: [<ffffffff81c33ce1>] ? start_kernel+0x3ff/0x40b
Feb 11 08:55:26 s13 kernel: [<ffffffff81c3333b>] ? x86_64_start_reservations+0x126/0x12a
Feb 11 08:55:26 s13 kernel: [<ffffffff81c3344d>] ? x86_64_start_kernel+0x10e/0x11d
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: transmit timed out, resetting
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800000
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000010: 0x91b0000c, 0x00000000, 0x91b1000c, 0x00000000
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000020: 0x91b2000c, 0x00000000, 0x00000000, 0x1f5b1028
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000030: 0xfffc0000, 0x00000048, 0x00000000, 0x0000020e
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000040: 0x00000000, 0xe2000000, 0xc8035001, 0x64002008
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0x00000050: 0x818c5803, 0x78000000, 0x0086a005, 0x00000000
[many lines just like these]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0: Host status block [00000005:000000e4:(0000:0723:0000):(0000:00ab)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 0: NAPI info [000000e2:000000e2:(0097:00ab:01ff):0000:(073b:0000:0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 1: Host status block [00000001:0000000b:(0000:0000:0000):(0aff:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 1: NAPI info [000000d9:000000d9:(0000:0000:01ff):0acd:(02cd:02cd:0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 2: Host status block [00000001:00000015:(09e5:0000:0000):(0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 2: NAPI info [000000de:000000de:(0000:0000:01ff):09ae:(01ae:01ae:0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 3: Host status block [00000001:0000003a:(0000:0000:0000):(0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 3: NAPI info [0000001d:0000001d:(0000:0000:01ff):0180:(0180:0180:0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 4: Host status block [00000001:000000c6:(0000:0000:00a1):(0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: 4: NAPI info [0000009d:0000009d:(0000:0000:01ff):0078:(0078:0078:0000:0000)]
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: tg3_stop_block timed out, ofs=1400 enable_bit=2
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: tg3_stop_block timed out, ofs=c00 enable_bit=2
Feb 11 08:55:26 s13 kernel: tg3 0000:01:00.1: eth1: Link is down
Feb 11 08:55:27 s13 kernel: vmbr1: port 1(eth1) entering disabled state
Feb 11 08:55:27 s13 kernel: vmbr1: topology change detected, propagating
Feb 11 08:55:30 s13 kernel: tg3 0000:01:00.1: eth1: Link is up at 1000 Mbps, full duplex
Feb 11 08:55:30 s13 kernel: tg3 0000:01:00.1: eth1: Flow control is off for TX and off for RX
Feb 11 08:55:30 s13 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Feb 11 08:55:30 s13 kernel: vmbr1: topology change detected, propagating
Feb 11 08:55:30 s13 kernel: vmbr1: port 1(eth1) entering forwarding state
 
We had the exact same problem with one node.
It disabled an internal interface, and the public one was being reset frequently.

All the VMs in that cluster use virtio interfaces.
Can you confirm that changing the VMs' interfaces to e1000 fixes the problem?
Are your servers stable now?

This is strange, as we have several servers running Proxmox and this is the first time we've detected this.
 
> Can you confirm that changing the VMs' interfaces to e1000 fixes the problem?
> Are your servers stable now?
Yes, no problems since my last post.

> This is strange, as we have several servers running Proxmox and this is the first time we've detected this.
The problem itself requires very specific conditions. I believe all tg3 cards are affected, but the first time around I ran into this error on only 2 or 3 servers out of 30. Moving a VM to another node, however, moved the error with it. I guess it's somehow related to the nature of the traffic on such a VM.
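As for switching to e1000: on our nodes it was just a matter of changing the NIC model in the VM config and power-cycling the guest. Roughly like this (VM 116 with a made-up MAC on vmbr1 is only an example; double-check the syntax against man qm for your PVE version):

Code:
# current NIC definition of the VM (example output, your MAC will differ)
grep ^net /etc/pve/qemu-server/116.conf
# net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr1

# switch the model from virtio to e1000, keeping the same MAC and bridge
qm set 116 --net0 e1000=DE:AD:BE:EF:00:01,bridge=vmbr1

# power-cycle the VM so the new NIC model is actually used
qm shutdown 116 && qm start 116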
 
