4.15 based test kernel for PVE 5.x available

Menno · Aug 15, 2018

Interestingly enough, everything seemed fine on our new(er) hardware until I started optimizing BIOS settings for our workload. HPE's document on optimizing for low latency recommends disabling a few features, and disabling them caused these kernel panics for me.

The relevant settings are:

Intel Virtualization Technology -> disabled
Intel Hyperthreading Options -> disabled
Intel Turbo Boost Technology -> disabled
Intel VT-d -> disabled

What's odd though is I have to keep them all enabled since pve-kernel 4.15.18 while 4.15.17 was fine with these disabled.

(We use 3 nodes just for storage, with a Proxmox install to simplify the Ceph installation process. Otherwise these settings obviously should be enabled, except perhaps for Hyper-Threading.)

Anyone else running into this issue might want to check these settings. I'll leave them enabled for now.

regards,
Menno

marsian · Aug 16, 2018

Interesting finding, thanks for sharing! Can anyone else with current crash-issues confirm if these changes are a fix for them too?

rotanid · Aug 29, 2018

the network crashes are back for us with kernel pve-kernel-4.15.18-2-pve / 4.15.18-20

Code:

Aug 29 15:19:28 vh6 kernel: [53879.772436] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Aug 29 15:19:28 vh6 kernel: [53879.772436]   TDH                  <0>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   TDT                  <8>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_use          <8>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_clean        <0>
Aug 29 15:19:28 vh6 kernel: [53879.772436] buffer_info[next_to_clean]:
Aug 29 15:19:28 vh6 kernel: [53879.772436]   time_stamp           <100cc5f82>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_watch        <0>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   jiffies              <100cc6408>
Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_watch.status <0>
Aug 29 15:19:28 vh6 kernel: [53879.772436] MAC Status             <80083>
Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY Status             <796d>
Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY 1000BASE-T Status  <7800>
Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY Extended Status    <3000>
Aug 29 15:19:28 vh6 kernel: [53879.772436] PCI Status             <10>

Code:

# ethtool -e enp0s31f6 length 256
Offset        Values
------        ------
0x0000:        90 1b 0e da c4 eb 01 08 ff ff 84 00 8e 00 00 80
0x0010:        ff ff ff ff c3 10 1f 12 34 17 b7 15 00 00 00 00
0x0020:        00 00 00 00 00 80 05 a7 2c 30 00 16 00 00 00 0c
0x0030:        f4 18 02 0a 43 08 13 01 b7 15 ad ba b7 15 b8 15
0x0040:        ad ba b7 15 ad ba b7 15 00 00 80 80 00 4e 86 08
0x0050:        00 00 00 00 07 00 00 20 20 00 00 00 00 0e 00 00
0x0060:        00 01 00 40 0a 01 07 40 ff ff ff ff ff ff ff ff
0x0070:        ff ff ff ff ff ff ff ff ff ff 00 02 ff ff 2f 35
0x0080:        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0090:        00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff
0x00a0:        94 b0 00 08 0a 00 04 90 b0 47 40 24 c2 c1 21 fb
0x00b0:        80 60 1f 00 00 48 10 00 40 60 1f 00 04 d1 11 00
0x00c0:        03 0a 12 00 00 00 1f 00 04 b4 30 00 1c 00 31 00
0x00d0:        06 b4 30 00 09 00 31 00 07 b4 30 00 10 00 31 00
0x00e0:        0a b4 30 00 18 00 31 00 0c b4 30 00 18 00 31 00
0x00f0:        0d b4 30 00 18 00 31 00 01 fd 30 00 2c 9c 31 00

Code:

# ethtool -i enp0s31f6
driver: e1000e
version: 3.4.1.1-NAPI
firmware-version: 0.8-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

t.lamprecht · Aug 31, 2018

Menno said:
What's odd though is I have to keep them all enabled since pve-kernel 4.15.18 while 4.15.17 was fine with these disabled.

That is very strange: with all those virtualization flags disabled, you should never have been able to boot any VM under Proxmox VE (at least unless you manually set 'kvm' to off).
Please never turn them off.

rotanid said:
the network crashes are back for us with kernel pve-kernel-4.15.18-2-pve / 4.15.18-20

Meaning that it's always there, just with a higher or lower chance of triggering it... Or it could even be a HW/chipset problem.

Hmm, your EEPROM dump does not look like you have one of the earlier problematic power features enabled.

Have you tried disabling some offloading features? Often they're the cause of such hangs with specific chipsets:

Code:

ethtool -K enp0s31f6 gso off gro off tso off

(may imply some performance degradation as now the CPU must do this work)
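
A setting changed with `ethtool -K` does not survive a reboot, so it may help to check the current state first and then persist the change. The following is only a sketch under assumptions: the helper names `offload_off_cmd` and `offload_post_up` are made up for illustration, the interface name is the one from the log above, and the `post-up` line assumes the NIC is configured through ifupdown in /etc/network/interfaces. The helpers only print commands; they do not touch the NIC.

```shell
#!/bin/sh
# Sketch only: these helpers print commands, they do not change any NIC settings.

# Print the command that disables GSO/GRO/TSO on one interface.
offload_off_cmd() {
    printf 'ethtool -K %s gso off gro off tso off\n' "$1"
}

# Print a post-up line for the interface's stanza in /etc/network/interfaces,
# so the setting is re-applied whenever the interface is brought up.
offload_post_up() {
    printf '    post-up %s\n' "$(offload_off_cmd "$1")"
}

iface="enp0s31f6"   # interface name taken from the log above
# Lowercase -k only reads the offload settings, uppercase -K changes them.
echo "check current state: ethtool -k $iface"
echo "disable offloads:    $(offload_off_cmd "$iface")"
echo "persist (ifupdown):"
offload_post_up "$iface"
```

Running the printed `ethtool -k` command before and after the change shows whether the offloads were actually switched off.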

rotanid · Aug 31, 2018

We have 5 systems at the moment with this Intel NIC and Proxmox VE.

Meaning that it's always there, just with a higher or lower chance of triggering it...

the system that crashed 2 days ago was running pve-kernel-4.13.16-2-pve until 3 days ago

Have you tried disabling some offloading features?

thanks, will try this!

sergopotap · Sep 7, 2018

My cluster has been working normally for 4 days after setting intremap=off

asocialpenguin.com/2013/12/23/interrupt-remapping-problems-with-intel-5500-5520-cpus/
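
For anyone who wants to try the same workaround: `intremap=off` is a kernel boot parameter that disables interrupt remapping. The sketch below assumes the node boots via GRUB, so the parameter would go into /etc/default/grub followed by `update-grub` and a reboot. The helper `has_kernel_param` is a hypothetical name used here for illustration; it only checks whether a parameter is present on a kernel command line and changes nothing.

```shell
#!/bin/sh
# Assumed GRUB setup (keep whatever options are already there):
#   /etc/default/grub:  GRUB_CMDLINE_LINUX_DEFAULT="quiet intremap=off"
#   then run:           update-grub   and reboot

# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in CMDLINE.
# Without CMDLINE, the running kernel's /proc/cmdline is read.
has_kernel_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with a made-up command line; drop the second argument to check the running kernel.
example="BOOT_IMAGE=/boot/vmlinuz-4.15.18-9-pve ro quiet intremap=off"
if has_kernel_param intremap=off "$example"; then
    echo "intremap=off is set"
else
    echo "intremap=off is not set"
fi
```

After the reboot, calling `has_kernel_param intremap=off` without a second argument confirms whether the running kernel actually picked up the parameter.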

robertb · Sep 10, 2018

Hello,
I also have the same problem with 2x Xeon E5-2680 v2 on a Supermicro X9DRW-iF, latest BIOS and PVE.
Crashes with the same message; additionally, one of the 10G links (X520-DA) has been flapping since.
However, removing the intel-microcode package seems to have stabilized it a bit.

pniebylski · Sep 20, 2018

Hello Proxmox Team.

Has this issue been resolved in the latest kernel update? From what I can read here, only Intel network cards have this problem.

This is a blocking issue for us, as we cannot upgrade our environment until we get a green light that the kernel panics no longer happen when the host machine has Intel NICs installed.

Thank you!

rotanid · Oct 11, 2018

we had the issue again today after upgrading to 4.15.18-26

marsian · Oct 28, 2018

So we've just run into the same issue with a SuperMicro system as well, having Intel i210 NICs installed and Kernel pve-kernel-4.15.18-7-pve: 4.15.18-27 running.

As the hardware is provided by a managed service we're currently asking them to install the most recent NIC firmware, but any ideas from the Proxmox team on how to solve this in addition?

The system had been running rock solid for about 10 months, always with the latest V4.4 installed, but in order to stay within support we decided to upgrade to 5.x.

I'll try to post crashlogs as soon as we can access the OS again...

marsian · Oct 29, 2018

Unfortunately the logs have not been stored, so we currently don't have any further details.

So we'll adjust the log retention and try to have more information available when it crashes the next time...

jehster · Dec 11, 2018

Hi,

We're facing the problem on one node of our cluster:
3xR610 + SolarFlare SFN5322F.

Code:

Kernel Version Linux 4.15.18-9-pve #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
PVE Manager Version pve-manager/5.3-5/97ae681d

[ 2103.747850] ------------[ cut here ]------------
[ 2103.747869] NETDEV WATCHDOG: eth5 (sfc): transmit queue 9 timed out
[ 2103.747899] WARNING: CPU: 14 PID: 0 at net/sched/sch_generic.c:323 dev_watchdog+0x222/0x230
[ 2103.747901] Modules linked in: xt_set ip_set_hash_net nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xt_multiport iptable_filter bonding softdog nfnetlink_log nfnetlink intel_p
werclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc drm_kms_helper drm i2c_algo_bit snd_pcm fb_sys_fops syscopyarea aesni_intel aes_x86_6
sysfillrect crypto_simd snd_timer sysimgblt glue_helper snd cryptd soundcore shpchp joydev input_leds gpio_ich acpi_power_meter ipmi_si dcdbas ipmi_devintf ipmi_msghandler wmi serio_raw intel_cstate lpc_
ch pcspkr i7core_edac ioatdma dca mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm sunrpc ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 hid_generic
[ 2103.747983] usbmouse usbkbd sfc mtd ptp usbhid psmouse pps_core pata_acpi hid megaraid_sas mdio bnx2
[ 2103.747998] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G I 4.15.18-9-pve #1
[ 2103.747999] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
[ 2103.748001] RIP: 0010:dev_watchdog+0x222/0x230
[ 2103.748002] RSP: 0018:ffff908bcf3c3e58 EFLAGS: 00010286
[ 2103.748003] RAX: 0000000000000000 RBX: 0000000000000009 RCX: 0000000000000000
[ 2103.748004] RDX: 0000000000040400 RSI: 00000000000000f6 RDI: 0000000000000300
[ 2103.748005] RBP: ffff908bcf3c3e88 R08: 0000000000000001 R09: 0000000000000400
[ 2103.748006] R10: ffff908bcf3da770 R11: 0000000000000400 R12: 0000000000000040
[ 2103.748007] R13: ffff9083cbd82000 R14: ffff9083cbd82478 R15: ffff9083c6b5cf40
[ 2103.748008] FS: 0000000000000000(0000) GS:ffff908bcf3c0000(0000) knlGS:0000000000000000
[ 2103.748009] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2103.748010] CR2: 00007ff3c5bd8fd0 CR3: 0000000eafe0a005 CR4: 00000000000226e0
[ 2103.748011] Call Trace:
[ 2103.748012] <IRQ>
[ 2103.748016] ? dev_deactivate_queue.constprop.33+0x60/0x60
[ 2103.748019] call_timer_fn+0x32/0x130
[ 2103.748021] run_timer_softirq+0x1dd/0x430
[ 2103.748024] ? tick_sched_handle+0x34/0x60
[ 2103.748026] ? ktime_get+0x43/0xa0
[ 2103.748028] __do_softirq+0x10c/0x2a2
[ 2103.748031] irq_exit+0xb8/0xc0
[ 2103.748033] smp_apic_timer_interrupt+0x79/0x130
[ 2103.748035] apic_timer_interrupt+0x84/0x90
[ 2103.748036] </IRQ>
[ 2103.748038] RIP: 0010:cpuidle_enter_state+0xa5/0x2e0
[ 2103.748039] RSP: 0018:ffffb93586313e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 2103.748041] RAX: ffff908bcf3e28c0 RBX: 0000000000000003 RCX: 000000000000001f
[ 2103.748042] RDX: 000001e9d1247af1 RSI: ffffffd9314b6f1d RDI: 0000000000000000
[ 2103.748043] RBP: ffffb93586313e90 R08: 0000000000000ea6 R09: 0000000000000006
[ 2103.748044] R10: ffffb93586313e28 R11: 0000000000000135 R12: ffff908bcf3ec900
[ 2103.748045] R13: ffffffff9f771cb8 R14: 000001e9d1247af1 R15: ffffffff9f771ca0
[ 2103.748047] ? cpuidle_enter_state+0x97/0x2e0
[ 2103.748049] cpuidle_enter+0x17/0x20
[ 2103.748051] call_cpuidle+0x23/0x40
[ 2103.748053] do_idle+0x19a/0x200
[ 2103.748055] cpu_startup_entry+0x73/0x80
[ 2103.748057] start_secondary+0x1ab/0x200
[ 2103.748060] secondary_startup_64+0xa5/0xb0
[ 2103.748061] Code: 36 00 49 63 4e e8 eb 92 4c 89 ef c6 05 d8 ca d7 00 01 e8 32 1f fd ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 40 78 39 9f e8 4e 71 7f ff <0f> 0b eb c0 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 6
90 55 48
[ 2103.748091] ---[ end trace 9222b51587ffbe79 ]---
[ 2103.748097] sfc 0000:04:00.1 eth5: TX stuck with port_enabled=1: resetting channels
[ 2103.748206] sfc 0000:04:00.1 eth5: resetting (RECOVER_OR_ALL)
[ 2103.825671] sfc 0000:04:00.1 eth5: link down
[ 2103.991537] sfc 0000:04:00.1 eth5: link up at 10000Mbps full-duplex (MTU 1500)
[ 2296.761180] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2296.761223] Tainted: G W I 4.15.18-9-pve #1
[ 2296.761252] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2296.761273] lzop D 0 8382 8357 0x00000000
[ 2296.761275] Call Trace:
[ 2296.761281] __schedule+0x3e0/0x870
[ 2296.761283] ? bit_wait+0x60/0x60
[ 2296.761284] schedule+0x36/0x80
[ 2296.761286] io_schedule+0x16/0x40
[ 2296.761287] bit_wait_io+0x11/0x60
[ 2296.761288] __wait_on_bit+0x5a/0x90
[ 2296.761289] out_of_line_wait_on_bit+0x8e/0xb0
[ 2296.761291] ? bit_waitqueue+0x40/0x40
[ 2296.761305] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2296.761311] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2296.761313] ? radix_tree_lookup_slot+0x22/0x50
[ 2296.761320] nfs_updatepage+0x151/0x910 [nfs]
[ 2296.761325] nfs_write_end+0x129/0x4e0 [nfs]
[ 2296.761327] generic_perform_write+0xff/0x1b0
[ 2296.761333] nfs_file_write+0xd7/0x250 [nfs]
[ 2296.761346] new_sync_write+0xe7/0x140
[ 2296.761348] __vfs_write+0x29/0x40
[ 2296.761349] vfs_write+0xb5/0x1a0
[ 2296.761351] SyS_write+0x55/0xc0
[ 2296.761353] do_syscall_64+0x73/0x130
[ 2296.761355] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2296.761356] RIP: 0033:0x7f5411d99730
[ 2296.761357] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2296.761358] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2296.761359] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2296.761360] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2296.761360] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2296.761361] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 2417.586366] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2417.586392] Tainted: G W I 4.15.18-9-pve #1
[ 2417.586408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2417.586429] lzop D 0 8382 8357 0x00000000
[ 2417.586431] Call Trace:
[ 2417.586437] __schedule+0x3e0/0x870
[ 2417.586439] ? bit_wait+0x60/0x60
[ 2417.586440] schedule+0x36/0x80
[ 2417.586442] io_schedule+0x16/0x40
[ 2417.586443] bit_wait_io+0x11/0x60
[ 2417.586443] __wait_on_bit+0x5a/0x90
[ 2417.586445] out_of_line_wait_on_bit+0x8e/0xb0
[ 2417.586447] ? bit_waitqueue+0x40/0x40
[ 2417.586461] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2417.586467] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2417.586469] ? radix_tree_lookup_slot+0x22/0x50
[ 2417.586475] nfs_updatepage+0x151/0x910 [nfs]
[ 2417.586480] nfs_write_end+0x129/0x4e0 [nfs]
[ 2417.586483] generic_perform_write+0xff/0x1b0
[ 2417.586488] nfs_file_write+0xd7/0x250 [nfs]
[ 2417.586490] new_sync_write+0xe7/0x140
[ 2417.586491] __vfs_write+0x29/0x40
[ 2417.586493] vfs_write+0xb5/0x1a0
[ 2417.586494] SyS_write+0x55/0xc0
[ 2417.586496] do_syscall_64+0x73/0x130
[ 2417.586497] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2417.586499] RIP: 0033:0x7f5411d99730
[ 2417.586499] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2417.586501] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2417.586501] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2417.586502] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2417.586503] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2417.586504] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 2538.411658] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2538.411702] Tainted: G W I 4.15.18-9-pve #1
[ 2538.411734] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2538.411778] lzop D 0 8382 8357 0x00000000
[ 2538.411780] Call Trace:
[ 2538.411787] __schedule+0x3e0/0x870
[ 2538.411789] ? bit_wait+0x60/0x60
[ 2538.411790] schedule+0x36/0x80
[ 2538.411791] io_schedule+0x16/0x40
[ 2538.411792] bit_wait_io+0x11/0x60
[ 2538.411793] __wait_on_bit+0x5a/0x90
[ 2538.411795] out_of_line_wait_on_bit+0x8e/0xb0
[ 2538.411797] ? bit_waitqueue+0x40/0x40
[ 2538.411810] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2538.411816] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2538.411818] ? radix_tree_lookup_slot+0x22/0x50
[ 2538.411824] nfs_updatepage+0x151/0x910 [nfs]
[ 2538.411830] nfs_write_end+0x129/0x4e0 [nfs]
[ 2538.411832] generic_perform_write+0xff/0x1b0
[ 2538.411837] nfs_file_write+0xd7/0x250 [nfs]
[ 2538.411839] new_sync_write+0xe7/0x140
[ 2538.411841] __vfs_write+0x29/0x40
[ 2538.411842] vfs_write+0xb5/0x1a0
[ 2538.411843] SyS_write+0x55/0xc0
[ 2538.411845] do_syscall_64+0x73/0x130
[ 2538.411847] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2538.411848] RIP: 0033:0x7f5411d99730
[ 2538.411849] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2538.411850] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2538.411851] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2538.411851] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2538.411852] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2538.411853] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 2659.236894] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2659.236936] Tainted: G W I 4.15.18-9-pve #1
[ 2659.236968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2659.237003] lzop D 0 8382 8357 0x00000000
[ 2659.237004] Call Trace:
[ 2659.237010] __schedule+0x3e0/0x870
[ 2659.237012] ? bit_wait+0x60/0x60
[ 2659.237013] schedule+0x36/0x80
[ 2659.237015] io_schedule+0x16/0x40
[ 2659.237016] bit_wait_io+0x11/0x60
[ 2659.237017] __wait_on_bit+0x5a/0x90
[ 2659.237018] out_of_line_wait_on_bit+0x8e/0xb0
[ 2659.237020] ? bit_waitqueue+0x40/0x40
[ 2659.237034] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2659.237040] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2659.237043] ? radix_tree_lookup_slot+0x22/0x50
[ 2659.237049] nfs_updatepage+0x151/0x910 [nfs]
[ 2659.237055] nfs_write_end+0x129/0x4e0 [nfs]
[ 2659.237057] generic_perform_write+0xff/0x1b0
[ 2659.237062] nfs_file_write+0xd7/0x250 [nfs]
[ 2659.237064] new_sync_write+0xe7/0x140
[ 2659.237066] __vfs_write+0x29/0x40
[ 2659.237067] vfs_write+0xb5/0x1a0
[ 2659.237068] SyS_write+0x55/0xc0
[ 2659.237070] do_syscall_64+0x73/0x130
[ 2659.237072] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2659.237073] RIP: 0033:0x7f5411d99730
[ 2659.237074] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2659.237075] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2659.237076] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2659.237076] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2659.237077] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2659.237078] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 2671.524157] nfs: server 10.9.80.101 not responding, still trying
[ 2780.062084] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2780.062126] Tainted: G W I 4.15.18-9-pve #1
[ 2780.062160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.062182] lzop D 0 8382 8357 0x00000000
[ 2780.062184] Call Trace:
[ 2780.062190] __schedule+0x3e0/0x870
[ 2780.062192] ? bit_wait+0x60/0x60
[ 2780.062193] schedule+0x36/0x80
[ 2780.062195] io_schedule+0x16/0x40
[ 2780.062196] bit_wait_io+0x11/0x60
[ 2780.062197] __wait_on_bit+0x5a/0x90
[ 2780.062198] out_of_line_wait_on_bit+0x8e/0xb0
[ 2780.062200] ? bit_waitqueue+0x40/0x40
[ 2780.062214] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2780.062221] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2780.062223] ? radix_tree_lookup_slot+0x22/0x50
[ 2780.062229] nfs_updatepage+0x151/0x910 [nfs]
[ 2780.062235] nfs_write_end+0x129/0x4e0 [nfs]
[ 2780.062237] generic_perform_write+0xff/0x1b0
[ 2780.062243] nfs_file_write+0xd7/0x250 [nfs]
[ 2780.062244] new_sync_write+0xe7/0x140
[ 2780.062246] __vfs_write+0x29/0x40
[ 2780.062247] vfs_write+0xb5/0x1a0
[ 2780.062249] SyS_write+0x55/0xc0
[ 2780.062250] do_syscall_64+0x73/0x130
[ 2780.062252] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2780.062253] RIP: 0033:0x7f5411d99730
[ 2780.062254] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2780.062255] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2780.062256] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2780.062257] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2780.062257] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2780.062258] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 2900.887397] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 2900.887438] Tainted: G W I 4.15.18-9-pve #1
[ 2900.887464] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2900.887485] lzop D 0 8382 8357 0x00000000
[ 2900.887487] Call Trace:
[ 2900.887493] __schedule+0x3e0/0x870
[ 2900.887494] ? bit_wait+0x60/0x60
[ 2900.887495] schedule+0x36/0x80
[ 2900.887497] io_schedule+0x16/0x40
[ 2900.887498] bit_wait_io+0x11/0x60
[ 2900.887499] __wait_on_bit+0x5a/0x90
[ 2900.887500] out_of_line_wait_on_bit+0x8e/0xb0
[ 2900.887502] ? bit_waitqueue+0x40/0x40
[ 2900.887516] nfs_wait_on_request+0x46/0x50 [nfs]
[ 2900.887522] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 2900.887524] ? radix_tree_lookup_slot+0x22/0x50
[ 2900.887530] nfs_updatepage+0x151/0x910 [nfs]
[ 2900.887535] nfs_write_end+0x129/0x4e0 [nfs]
[ 2900.887537] generic_perform_write+0xff/0x1b0
[ 2900.887543] nfs_file_write+0xd7/0x250 [nfs]
[ 2900.887545] new_sync_write+0xe7/0x140
[ 2900.887546] __vfs_write+0x29/0x40
[ 2900.887547] vfs_write+0xb5/0x1a0
[ 2900.887549] SyS_write+0x55/0xc0
[ 2900.887551] do_syscall_64+0x73/0x130
[ 2900.887552] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2900.887554] RIP: 0033:0x7f5411d99730
[ 2900.887554] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2900.887556] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 2900.887556] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 2900.887557] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 2900.887558] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 2900.887558] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 3021.712639] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 3021.712682] Tainted: G W I 4.15.18-9-pve #1
[ 3021.712702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.712723] lzop D 0 8382 8357 0x00000000
[ 3021.712728] Call Trace:
[ 3021.712736] __schedule+0x3e0/0x870
[ 3021.712739] ? bit_wait+0x60/0x60
[ 3021.712741] schedule+0x36/0x80
[ 3021.712745] io_schedule+0x16/0x40
[ 3021.712747] bit_wait_io+0x11/0x60
[ 3021.712749] __wait_on_bit+0x5a/0x90
[ 3021.712751] out_of_line_wait_on_bit+0x8e/0xb0
[ 3021.712755] ? bit_waitqueue+0x40/0x40
[ 3021.712774] nfs_wait_on_request+0x46/0x50 [nfs]
[ 3021.712780] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 3021.712782] ? radix_tree_lookup_slot+0x22/0x50
[ 3021.712789] nfs_updatepage+0x151/0x910 [nfs]
[ 3021.712794] nfs_write_end+0x129/0x4e0 [nfs]
[ 3021.712796] generic_perform_write+0xff/0x1b0
[ 3021.712802] nfs_file_write+0xd7/0x250 [nfs]
[ 3021.712804] new_sync_write+0xe7/0x140
[ 3021.712805] __vfs_write+0x29/0x40
[ 3021.712806] vfs_write+0xb5/0x1a0
[ 3021.712808] SyS_write+0x55/0xc0
[ 3021.712809] do_syscall_64+0x73/0x130
[ 3021.712811] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3021.712812] RIP: 0033:0x7f5411d99730
[ 3021.712813] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3021.712814] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 3021.712815] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 3021.712816] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 3021.712817] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 3021.712817] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 3142.537959] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 3142.537993] Tainted: G W I 4.15.18-9-pve #1
[ 3142.538009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3142.538030] lzop D 0 8382 8357 0x00000000
[ 3142.538032] Call Trace:
[ 3142.538038] __schedule+0x3e0/0x870
[ 3142.538040] ? bit_wait+0x60/0x60
[ 3142.538041] schedule+0x36/0x80
[ 3142.538043] io_schedule+0x16/0x40
[ 3142.538044] bit_wait_io+0x11/0x60
[ 3142.538045] __wait_on_bit+0x5a/0x90
[ 3142.538046] out_of_line_wait_on_bit+0x8e/0xb0
[ 3142.538048] ? bit_waitqueue+0x40/0x40
[ 3142.538061] nfs_wait_on_request+0x46/0x50 [nfs]
[ 3142.538067] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 3142.538070] ? radix_tree_lookup_slot+0x22/0x50
[ 3142.538076] nfs_updatepage+0x151/0x910 [nfs]
[ 3142.538082] nfs_write_end+0x129/0x4e0 [nfs]
[ 3142.538084] generic_perform_write+0xff/0x1b0
[ 3142.538089] nfs_file_write+0xd7/0x250 [nfs]
[ 3142.538091] new_sync_write+0xe7/0x140
[ 3142.538093] __vfs_write+0x29/0x40
[ 3142.538094] vfs_write+0xb5/0x1a0
[ 3142.538095] SyS_write+0x55/0xc0
[ 3142.538097] do_syscall_64+0x73/0x130
[ 3142.538098] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3142.538100] RIP: 0033:0x7f5411d99730
[ 3142.538101] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3142.538102] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 3142.538103] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 3142.538103] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 3142.538104] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 3142.538105] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 3200.320822] device tap100i0 entered promiscuous mode
[ 3200.331590] vmbr0: port 3(tap100i0) entered blocking state
[ 3200.331592] vmbr0: port 3(tap100i0) entered disabled state
[ 3200.331699] vmbr0: port 3(tap100i0) entered blocking state
[ 3200.331701] vmbr0: port 3(tap100i0) entered forwarding state
[ 3222.152389] device tap101i0 entered promiscuous mode
[ 3222.161898] vmbr0: port 10(tap101i0) entered blocking state
[ 3222.161900] vmbr0: port 10(tap101i0) entered disabled state
[ 3222.161995] vmbr0: port 10(tap101i0) entered blocking state
[ 3222.161997] vmbr0: port 10(tap101i0) entered forwarding state
[ 3234.692797] nfs: server 10.9.80.101 not responding, still trying
[ 3244.546872] device tap106i0 entered promiscuous mode
[ 3244.555741] vmbr0: port 11(tap106i0) entered blocking state
[ 3244.555743] vmbr0: port 11(tap106i0) entered disabled state
[ 3244.555835] vmbr0: port 11(tap106i0) entered blocking state
[ 3244.555837] vmbr0: port 11(tap106i0) entered forwarding state
[ 3263.363288] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 3263.363310] Tainted: G W I 4.15.18-9-pve #1
[ 3263.363326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3263.363347] lzop D 0 8382 8357 0x00000000
[ 3263.363349] Call Trace:
[ 3263.363355] __schedule+0x3e0/0x870
[ 3263.363357] ? bit_wait+0x60/0x60
[ 3263.363358] schedule+0x36/0x80
[ 3263.363360] io_schedule+0x16/0x40
[ 3263.363361] bit_wait_io+0x11/0x60
[ 3263.363362] __wait_on_bit+0x5a/0x90
[ 3263.363363] out_of_line_wait_on_bit+0x8e/0xb0
[ 3263.363365] ? bit_waitqueue+0x40/0x40
[ 3263.363379] nfs_wait_on_request+0x46/0x50 [nfs]
[ 3263.363386] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 3263.363388] ? radix_tree_lookup_slot+0x22/0x50
[ 3263.363394] nfs_updatepage+0x151/0x910 [nfs]
[ 3263.363399] nfs_write_end+0x129/0x4e0 [nfs]
[ 3263.363401] generic_perform_write+0xff/0x1b0
[ 3263.363407] nfs_file_write+0xd7/0x250 [nfs]
[ 3263.363409] new_sync_write+0xe7/0x140
[ 3263.363411] __vfs_write+0x29/0x40
[ 3263.363412] vfs_write+0xb5/0x1a0
[ 3263.363413] SyS_write+0x55/0xc0
[ 3263.363415] do_syscall_64+0x73/0x130
[ 3263.363417] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3263.363418] RIP: 0033:0x7f5411d99730
[ 3263.363419] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3263.363420] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 3263.363421] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 3263.363421] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 3263.363422] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 3263.363423] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
[ 3331.954972] device tap113i0 entered promiscuous mode
[ 3331.964983] vmbr0: port 12(tap113i0) entered blocking state
[ 3331.964985] vmbr0: port 12(tap113i0) entered disabled state
[ 3331.965074] vmbr0: port 12(tap113i0) entered blocking state
[ 3331.965076] vmbr0: port 12(tap113i0) entered forwarding state
[ 3384.188538] INFO: task lzop:8382 blocked for more than 120 seconds.
[ 3384.188566] Tainted: G W I 4.15.18-9-pve #1
[ 3384.188582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3384.188603] lzop D 0 8382 8357 0x00000000
[ 3384.188605] Call Trace:
[ 3384.188611] __schedule+0x3e0/0x870
[ 3384.188613] ? bit_wait+0x60/0x60
[ 3384.188614] schedule+0x36/0x80
[ 3384.188616] io_schedule+0x16/0x40
[ 3384.188617] bit_wait_io+0x11/0x60
[ 3384.188618] __wait_on_bit+0x5a/0x90
[ 3384.188619] out_of_line_wait_on_bit+0x8e/0xb0
[ 3384.188621] ? bit_waitqueue+0x40/0x40
[ 3384.188639] nfs_wait_on_request+0x46/0x50 [nfs]
[ 3384.188646] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[ 3384.188648] ? radix_tree_lookup_slot+0x22/0x50
[ 3384.188654] nfs_updatepage+0x151/0x910 [nfs]
[ 3384.188659] nfs_write_end+0x129/0x4e0 [nfs]
[ 3384.188661] generic_perform_write+0xff/0x1b0
[ 3384.188667] nfs_file_write+0xd7/0x250 [nfs]
[ 3384.188669] new_sync_write+0xe7/0x140
[ 3384.188670] __vfs_write+0x29/0x40
[ 3384.188672] vfs_write+0xb5/0x1a0
[ 3384.188673] SyS_write+0x55/0xc0
[ 3384.188675] do_syscall_64+0x73/0x130
[ 3384.188677] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3384.188678] RIP: 0033:0x7f5411d99730
[ 3384.188679] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3384.188680] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
[ 3384.188681] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
[ 3384.188681] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
[ 3384.188682] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
[ 3384.188683] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0

All 3 nodes are up to date. No Ceph, no ZFS. Feel free to ask if more information is needed.

have a good day
Jerome

coppola_f · Dec 11, 2018

guys,
we've set up two more nodes...
these two are identical DL680 Gen8 units.

We did a fresh install using the downloaded .iso image
and applied all updates (we have an active subscription on these units!)

Then we joined the units to the cluster and, after some days of "blank" runtime,
moved some VMs to the new nodes (these are the newer nodes, so we moved the most expensive VMs there!)

Everything seemed to run fine for many hours (I think one or two full days at most!),
then users started to report extremely slow responses from the terminal server and really long query response times from the MySQL Linux VM!

After some quick checks,
we moved the VMs back to the older nodes (still running the 4.13.x kernel!)

Then, magically, the issue seemed to be solved: normal response times from both VMs...

We have now rolled the units back to the 4.13.x kernel,
restarted them and, after some testing time, moved the VMs back to the newer nodes!

Now everything is running fine!
Our cluster updates have been locked to the 4.13.x kernel branch!!

Waiting for suggestions or any other information that may come from you all!

best regards,

Francesco

cvhideki · Sep 9, 2019

Hi everyone!
I have a problem with the LAN interface e100 on my pfSense

I too run OPNsense in Proxmox. I have the virtual NICs set as E1000. I also find that during high-traffic scenarios such as torrents, Skype calls, video streaming, or downloading, the WAN gateway will just drop. It then takes a

Code:

ifconfig em1 down
ifconfig em1 up

and my network interface is working again

Code:

ethtool -i eno1
driver: e1000e
version: 3.4.1.1-NAPI
firmware-version: 0.4-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Debian 9 kernel linux

Code:

Linux 4.15.18-20-pve x86_64

pveversion

Code:

pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-20-pve)

t.lamprecht · Sep 10, 2019

Hi. This is quite an old thread, started well over a year ago. Many of the problems reported are not valid anymore. Maybe it's best to start a new thread with your specific issue.

You could try to disable some offloading, but that may reduce general performance a bit:

Code:

ethtool -K eth0 gso off gro off tso off

but could be worth a test

cvhideki · Sep 12, 2019

Didn't help.
After 2 days, I still have the problem.

0xFelix · Sep 28, 2019

The problem I reported in post #84 kind of returned with kernel 4.15.18-21-pve. Booting back into 4.15.18-20-pve makes the network functional again. Reloading the igb module does not help anymore.

https://forum.proxmox.com/threads/4-15-based-test-kernel-for-pve-5-x-available.42097/post-208861

Antonio Blanco · Oct 12, 2019

I can confirm the issue showed up for me today with kernel 5.0.21-2-pve

It appeared on different hardware than before. Hopefully the ethtool command will work around the problem.

0xFelix · Oct 13, 2019

I noticed that disabling runtime PM helps make the NIC functional again.
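
One common way to turn off runtime power management for a single PCI device is its `power/control` file in sysfs, where `on` keeps the device permanently powered and `auto` lets the kernel suspend it. This is a sketch under assumptions, not necessarily the exact method used here: the helper name `runtime_pm_path` is made up, and the PCI address is just the example from the `ethtool -i` output earlier in the thread, so it has to be replaced with the `bus-info` value of the affected NIC. The script only prints what would be done.

```shell
#!/bin/sh
# Sketch only: prints the sysfs path and the command, it does not write anything.

# Build the sysfs runtime-PM control path for a PCI address (e.g. 0000:00:1f.6).
runtime_pm_path() {
    printf '/sys/bus/pci/devices/%s/power/control\n' "$1"
}

dev="0000:00:1f.6"   # example address; use the bus-info value from "ethtool -i IFACE"
path="$(runtime_pm_path "$dev")"

# "on" disables runtime PM for the device, "auto" re-enables it.
echo "current setting:    cat $path"
echo "disable runtime PM: echo on > $path"
```

The value written to sysfs is lost on reboot, so it would have to be re-applied at boot if it turns out to help.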
