4.15 based test kernel for PVE 5.x available

Discussion in 'Proxmox VE: Installation and configuration' started by fabian, Mar 12, 2018.

  1. Menno

    Menno New Member

    Joined:
    Aug 7, 2018
    Messages:
    6
    Likes Received:
    0
    Interestingly enough, everything seemed fine on our new(er) hardware until I started optimizing BIOS settings for our workload. HPE's document on optimizing for low latency recommends disabling a few features, and disabling them caused these kernel panics for me.

    The relevant settings are:
    • Intel Virtualization Technology -> disabled
    • Intel Hyperthreading Options -> disabled
    • Intel Turbo Boost Technology -> disabled
    • Intel VT-d -> disabled
    What's odd, though, is that I have had to keep them all enabled since pve-kernel 4.15.18, while 4.15.17 was fine with them disabled.

    (We use 3 nodes just for storage, with a Proxmox install to simplify the Ceph installation process; otherwise these settings obviously should be enabled, except perhaps for Hyper-Threading.)

    Perhaps someone else running into this issue might want to check these settings. I'll leave them enabled for now.

    regards,
    Menno
     
  2. marsian

    marsian Member
    Proxmox Subscriber

    Joined:
    Sep 27, 2016
    Messages:
    37
    Likes Received:
    3
    Interesting finding, thanks for sharing! Can anyone else with current crash-issues confirm if these changes are a fix for them too?
     
  3. rotanid

    rotanid New Member

    Joined:
    Dec 18, 2012
    Messages:
    11
    Likes Received:
    0
    the network crashes are back for us with kernel pve-kernel-4.15.18-2-pve / 4.15.18-20

    Code:
    Aug 29 15:19:28 vh6 kernel: [53879.772436] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   TDH                  <0>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   TDT                  <8>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_use          <8>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_clean        <0>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] buffer_info[next_to_clean]:
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   time_stamp           <100cc5f82>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_watch        <0>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   jiffies              <100cc6408>
    Aug 29 15:19:28 vh6 kernel: [53879.772436]   next_to_watch.status <0>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] MAC Status             <80083>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY Status             <796d>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY 1000BASE-T Status  <7800>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] PHY Extended Status    <3000>
    Aug 29 15:19:28 vh6 kernel: [53879.772436] PCI Status             <10>
    
    Code:
    # ethtool -e enp0s31f6 length 256
    Offset        Values
    ------        ------
    0x0000:        90 1b 0e da c4 eb 01 08 ff ff 84 00 8e 00 00 80
    0x0010:        ff ff ff ff c3 10 1f 12 34 17 b7 15 00 00 00 00
    0x0020:        00 00 00 00 00 80 05 a7 2c 30 00 16 00 00 00 0c
    0x0030:        f4 18 02 0a 43 08 13 01 b7 15 ad ba b7 15 b8 15
    0x0040:        ad ba b7 15 ad ba b7 15 00 00 80 80 00 4e 86 08
    0x0050:        00 00 00 00 07 00 00 20 20 00 00 00 00 0e 00 00
    0x0060:        00 01 00 40 0a 01 07 40 ff ff ff ff ff ff ff ff
    0x0070:        ff ff ff ff ff ff ff ff ff ff 00 02 ff ff 2f 35
    0x0080:        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0090:        00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff
    0x00a0:        94 b0 00 08 0a 00 04 90 b0 47 40 24 c2 c1 21 fb
    0x00b0:        80 60 1f 00 00 48 10 00 40 60 1f 00 04 d1 11 00
    0x00c0:        03 0a 12 00 00 00 1f 00 04 b4 30 00 1c 00 31 00
    0x00d0:        06 b4 30 00 09 00 31 00 07 b4 30 00 10 00 31 00
    0x00e0:        0a b4 30 00 18 00 31 00 0c b4 30 00 18 00 31 00
    0x00f0:        0d b4 30 00 18 00 31 00 01 fd 30 00 2c 9c 31 00
    
    Code:
    # ethtool -i enp0s31f6
    driver: e1000e
    version: 3.4.1.1-NAPI
    firmware-version: 0.8-4
    expansion-rom-version:
    bus-info: 0000:00:1f.6
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no
    
     
  4. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,137
    Likes Received:
    147
    That is very strange: with all those virtualization flags disabled you should never have been able to boot any VM with us (at least unless you set 'kvm' to off manually).
    Please never turn them off.
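
    One way to see whether the firmware has switched hardware virtualization off is to look at the kernel log. A minimal sketch: the helper name `kvm_disabled_by_bios` is made up for illustration, and the message text it searches for is an assumption based on 4.x kernels, so it may differ on other versions.

    ```shell
    # Succeed if the kernel log text read from stdin contains the message that
    # the kvm module prints when VT-x/AMD-V is disabled in firmware.
    # The exact wording is an assumption taken from 4.x kernels.
    kvm_disabled_by_bios() {
        grep -q -i 'kvm: disabled by bios'
    }

    # Example (run on the host, not executed here):
    #   dmesg | kvm_disabled_by_bios && echo "virtualization is off in the BIOS"
    #   [ -e /dev/kvm ] || echo "/dev/kvm is missing, KVM guests cannot start"
    ```

    If the first check succeeds or `/dev/kvm` is missing, VMs with KVM enabled should not be able to start on that node.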

    Meaning that the problem is always there, just with a higher or lower chance of triggering it... Or it is even a HW/chipset problem.

    Hmm, your EEPROM dump does not look like you have the earlier problematic power feature enabled.

    Have you tried disabling some offloading features? Often they're the cause of such hangs with specific chipsets:
    Code:
    ethtool -K enp0s31f6 gso off gro off tso off
    (may imply some performance degradation as now the CPU must do this work)
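
    To judge whether switching the offloads off actually helps, it can be useful to count the hang reports before and after the change. A minimal sketch: `count_hangs` is a hypothetical helper, and the search string is taken from the e1000e log excerpt earlier in this thread.

    ```shell
    # Count e1000e "Hardware Unit Hang" reports in kernel log text read from stdin.
    # grep -c prints 0 and exits non-zero when nothing matches, so the exit
    # status is forced to success to keep the function usable in strict scripts.
    count_hangs() {
        grep -c 'Detected Hardware Unit Hang' || true
    }

    # Examples (run on the host, not executed here):
    #   dmesg | count_hangs
    #   ethtool -k enp0s31f6 | grep -E 'segmentation-offload|generic-receive-offload'
    ```

    The lowercase `ethtool -k` only shows the current offload state. Settings made with `-K` do not survive a reboot unless they are reapplied, for example from the network configuration.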
     
  5. rotanid

    rotanid New Member

    Joined:
    Dec 18, 2012
    Messages:
    11
    Likes Received:
    0
    we have 5 systems at the moment with this intel NIC and ProxmoxVE.
    the system that crashed 2 days ago was running pve-kernel-4.13.16-2-pve until 3 days ago

    thanks, will try this!
     
  6. sergopotap

    sergopotap New Member

    Joined:
    Jun 28, 2018
    Messages:
    3
    Likes Received:
    0
    My cluster has been working normally for 4 days after setting intremap=off

    asocialpenguin.com/2013/12/23/interrupt-remapping-problems-with-intel-5500-5520-cpus/
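
    For anyone who wants to try this: `intremap=off` is a kernel boot parameter. A minimal sketch, assuming the node boots via GRUB (hosts booted another way need the parameter set elsewhere). `has_kernel_param` is a hypothetical helper to confirm the setting is active after a reboot.

    ```shell
    # Assumed GRUB setup: add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in
    # /etc/default/grub, e.g.  GRUB_CMDLINE_LINUX_DEFAULT="quiet intremap=off"
    # then run `update-grub` and reboot.

    # Succeed if the kernel command line in $2 contains the exact parameter $1.
    has_kernel_param() {
        case " $2 " in
            *" $1 "*) return 0 ;;
            *) return 1 ;;
        esac
    }

    # Example (after the reboot, not executed here):
    #   has_kernel_param intremap=off "$(cat /proc/cmdline)" && echo "intremap=off is active"
    ```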
     
  7. robertb

    robertb New Member

    Joined:
    Apr 4, 2017
    Messages:
    16
    Likes Received:
    0
    Hello,
    I also have the same problem with 2x Xeon E5-2680 v2 on a Supermicro X9DRW-iF, latest BIOS and PVE.
    Crashes with the same message; additionally, one of the 10G links (X520-DA) has been flapping since.
    However, removing the Intel-Microcode package seems to have stabilized it a bit.
     
  8. pniebylski

    pniebylski New Member

    Joined:
    Aug 10, 2018
    Messages:
    1
    Likes Received:
    1
    Hello Proxmox Team.

    Has this issue been resolved in the latest kernel update? From what I can read here, only Intel network cards have this problem.

    This is a blocking issue for us, as we cannot upgrade our environment until we get a green light that the kernel panics no longer happen when host machines have Intel NICs installed.

    Thank you!
     
    robertb likes this.
  9. rotanid

    rotanid New Member

    Joined:
    Dec 18, 2012
    Messages:
    11
    Likes Received:
    0
    we had the issue again today after upgrading to 4.15.18-26
     
  10. marsian

    marsian Member
    Proxmox Subscriber

    Joined:
    Sep 27, 2016
    Messages:
    37
    Likes Received:
    3
    So we've just run into the same issue with a SuperMicro system as well, having Intel i210 NICs installed and Kernel pve-kernel-4.15.18-7-pve: 4.15.18-27 running.

    As the hardware is provided by a managed service we're currently asking them to install the most recent NIC firmware, but any ideas from the Proxmox team on how to solve this in addition?

    The system was running rock solid for about 10 months now with always the latest V4.4 installed, but in order to stay within support we've decided to upgrade to 5.x. :confused:

    I'll try to post crashlogs as soon as we can access the OS again...
     
  11. marsian

    marsian Member
    Proxmox Subscriber

    Joined:
    Sep 27, 2016
    Messages:
    37
    Likes Received:
    3
    Unfortunately the logs were not stored, so we currently don't have any further details :confused:

    So we'll adjust the log retention and try to have more information available when it crashes the next time...
     
  12. jehster

    jehster New Member

    Joined:
    Jan 19, 2016
    Messages:
    8
    Likes Received:
    0
    Hi,

    We're facing the problem on one node of our cluster:
    3xR610 + SolarFlare SFN5322F.

    Code:
    Kernel Version Linux 4.15.18-9-pve #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
    PVE Manager Version pve-manager/5.3-5/97ae681d
    [ 2103.747850] ------------[ cut here ]------------
    [ 2103.747869] NETDEV WATCHDOG: eth5 (sfc): transmit queue 9 timed out
    [ 2103.747899] WARNING: CPU: 14 PID: 0 at net/sched/sch_generic.c:323 dev_watchdog+0x222/0x230
    [ 2103.747901] Modules linked in: xt_set ip_set_hash_net nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xt_multiport iptable_filter bonding softdog nfnetlink_log nfnetlink intel_p
    werclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc drm_kms_helper drm i2c_algo_bit snd_pcm fb_sys_fops syscopyarea aesni_intel aes_x86_6
    sysfillrect crypto_simd snd_timer sysimgblt glue_helper snd cryptd soundcore shpchp joydev input_leds gpio_ich acpi_power_meter ipmi_si dcdbas ipmi_devintf ipmi_msghandler wmi serio_raw intel_cstate lpc_
    ch pcspkr i7core_edac ioatdma dca mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm sunrpc ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 hid_generic
    [ 2103.747983] usbmouse usbkbd sfc mtd ptp usbhid psmouse pps_core pata_acpi hid megaraid_sas mdio bnx2
    [ 2103.747998] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G I 4.15.18-9-pve #1
    [ 2103.747999] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
    [ 2103.748001] RIP: 0010:dev_watchdog+0x222/0x230
    [ 2103.748002] RSP: 0018:ffff908bcf3c3e58 EFLAGS: 00010286
    [ 2103.748003] RAX: 0000000000000000 RBX: 0000000000000009 RCX: 0000000000000000
    [ 2103.748004] RDX: 0000000000040400 RSI: 00000000000000f6 RDI: 0000000000000300
    [ 2103.748005] RBP: ffff908bcf3c3e88 R08: 0000000000000001 R09: 0000000000000400
    [ 2103.748006] R10: ffff908bcf3da770 R11: 0000000000000400 R12: 0000000000000040
    [ 2103.748007] R13: ffff9083cbd82000 R14: ffff9083cbd82478 R15: ffff9083c6b5cf40
    [ 2103.748008] FS: 0000000000000000(0000) GS:ffff908bcf3c0000(0000) knlGS:0000000000000000
    [ 2103.748009] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2103.748010] CR2: 00007ff3c5bd8fd0 CR3: 0000000eafe0a005 CR4: 00000000000226e0
    [ 2103.748011] Call Trace:
    [ 2103.748012] <IRQ>
    [ 2103.748016] ? dev_deactivate_queue.constprop.33+0x60/0x60
    [ 2103.748019] call_timer_fn+0x32/0x130
    [ 2103.748021] run_timer_softirq+0x1dd/0x430
    [ 2103.748024] ? tick_sched_handle+0x34/0x60
    [ 2103.748026] ? ktime_get+0x43/0xa0
    [ 2103.748028] __do_softirq+0x10c/0x2a2
    [ 2103.748031] irq_exit+0xb8/0xc0
    [ 2103.748033] smp_apic_timer_interrupt+0x79/0x130
    [ 2103.748035] apic_timer_interrupt+0x84/0x90
    [ 2103.748036] </IRQ>
    [ 2103.748038] RIP: 0010:cpuidle_enter_state+0xa5/0x2e0
    [ 2103.748039] RSP: 0018:ffffb93586313e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
    [ 2103.748041] RAX: ffff908bcf3e28c0 RBX: 0000000000000003 RCX: 000000000000001f
    [ 2103.748042] RDX: 000001e9d1247af1 RSI: ffffffd9314b6f1d RDI: 0000000000000000
    [ 2103.748043] RBP: ffffb93586313e90 R08: 0000000000000ea6 R09: 0000000000000006
    [ 2103.748044] R10: ffffb93586313e28 R11: 0000000000000135 R12: ffff908bcf3ec900
    [ 2103.748045] R13: ffffffff9f771cb8 R14: 000001e9d1247af1 R15: ffffffff9f771ca0
    [ 2103.748047] ? cpuidle_enter_state+0x97/0x2e0
    [ 2103.748049] cpuidle_enter+0x17/0x20
    [ 2103.748051] call_cpuidle+0x23/0x40
    [ 2103.748053] do_idle+0x19a/0x200
    [ 2103.748055] cpu_startup_entry+0x73/0x80
    [ 2103.748057] start_secondary+0x1ab/0x200
    [ 2103.748060] secondary_startup_64+0xa5/0xb0
    [ 2103.748061] Code: 36 00 49 63 4e e8 eb 92 4c 89 ef c6 05 d8 ca d7 00 01 e8 32 1f fd ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 40 78 39 9f e8 4e 71 7f ff <0f> 0b eb c0 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 6
    90 55 48
    [ 2103.748091] ---[ end trace 9222b51587ffbe79 ]---
    [ 2103.748097] sfc 0000:04:00.1 eth5: TX stuck with port_enabled=1: resetting channels
    [ 2103.748206] sfc 0000:04:00.1 eth5: resetting (RECOVER_OR_ALL)
    [ 2103.825671] sfc 0000:04:00.1 eth5: link down
    [ 2103.991537] sfc 0000:04:00.1 eth5: link up at 10000Mbps full-duplex (MTU 1500)
    [ 2296.761180] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2296.761223] Tainted: G W I 4.15.18-9-pve #1
    [ 2296.761252] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2296.761273] lzop D 0 8382 8357 0x00000000
    [ 2296.761275] Call Trace:
    [ 2296.761281] __schedule+0x3e0/0x870
    [ 2296.761283] ? bit_wait+0x60/0x60
    [ 2296.761284] schedule+0x36/0x80
    [ 2296.761286] io_schedule+0x16/0x40
    [ 2296.761287] bit_wait_io+0x11/0x60
    [ 2296.761288] __wait_on_bit+0x5a/0x90
    [ 2296.761289] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2296.761291] ? bit_waitqueue+0x40/0x40
    [ 2296.761305] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2296.761311] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2296.761313] ? radix_tree_lookup_slot+0x22/0x50
    [ 2296.761320] nfs_updatepage+0x151/0x910 [nfs]
    [ 2296.761325] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2296.761327] generic_perform_write+0xff/0x1b0
    [ 2296.761333] nfs_file_write+0xd7/0x250 [nfs]
    [ 2296.761346] new_sync_write+0xe7/0x140
    [ 2296.761348] __vfs_write+0x29/0x40
    [ 2296.761349] vfs_write+0xb5/0x1a0
    [ 2296.761351] SyS_write+0x55/0xc0
    [ 2296.761353] do_syscall_64+0x73/0x130
    [ 2296.761355] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2296.761356] RIP: 0033:0x7f5411d99730
    [ 2296.761357] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2296.761358] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2296.761359] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2296.761360] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2296.761360] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2296.761361] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 2417.586366] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2417.586392] Tainted: G W I 4.15.18-9-pve #1
    [ 2417.586408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2417.586429] lzop D 0 8382 8357 0x00000000
    [ 2417.586431] Call Trace:
    [ 2417.586437] __schedule+0x3e0/0x870
    [ 2417.586439] ? bit_wait+0x60/0x60
    [ 2417.586440] schedule+0x36/0x80
    [ 2417.586442] io_schedule+0x16/0x40
    [ 2417.586443] bit_wait_io+0x11/0x60
    [ 2417.586443] __wait_on_bit+0x5a/0x90
    [ 2417.586445] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2417.586447] ? bit_waitqueue+0x40/0x40
    [ 2417.586461] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2417.586467] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2417.586469] ? radix_tree_lookup_slot+0x22/0x50
    [ 2417.586475] nfs_updatepage+0x151/0x910 [nfs]
    [ 2417.586480] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2417.586483] generic_perform_write+0xff/0x1b0
    [ 2417.586488] nfs_file_write+0xd7/0x250 [nfs]
    [ 2417.586490] new_sync_write+0xe7/0x140
    [ 2417.586491] __vfs_write+0x29/0x40
    [ 2417.586493] vfs_write+0xb5/0x1a0
    [ 2417.586494] SyS_write+0x55/0xc0
    [ 2417.586496] do_syscall_64+0x73/0x130
    [ 2417.586497] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2417.586499] RIP: 0033:0x7f5411d99730
    [ 2417.586499] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2417.586501] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2417.586501] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2417.586502] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2417.586503] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2417.586504] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 2538.411658] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2538.411702] Tainted: G W I 4.15.18-9-pve #1
    [ 2538.411734] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2538.411778] lzop D 0 8382 8357 0x00000000
    [ 2538.411780] Call Trace:
    [ 2538.411787] __schedule+0x3e0/0x870
    [ 2538.411789] ? bit_wait+0x60/0x60
    [ 2538.411790] schedule+0x36/0x80
    [ 2538.411791] io_schedule+0x16/0x40
    [ 2538.411792] bit_wait_io+0x11/0x60
    [ 2538.411793] __wait_on_bit+0x5a/0x90
    [ 2538.411795] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2538.411797] ? bit_waitqueue+0x40/0x40
    [ 2538.411810] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2538.411816] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2538.411818] ? radix_tree_lookup_slot+0x22/0x50
    [ 2538.411824] nfs_updatepage+0x151/0x910 [nfs]
    [ 2538.411830] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2538.411832] generic_perform_write+0xff/0x1b0
    [ 2538.411837] nfs_file_write+0xd7/0x250 [nfs]
    [ 2538.411839] new_sync_write+0xe7/0x140
    [ 2538.411841] __vfs_write+0x29/0x40
    [ 2538.411842] vfs_write+0xb5/0x1a0
    [ 2538.411843] SyS_write+0x55/0xc0
    [ 2538.411845] do_syscall_64+0x73/0x130
    [ 2538.411847] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2538.411848] RIP: 0033:0x7f5411d99730
    [ 2538.411849] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2538.411850] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2538.411851] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2538.411851] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2538.411852] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2538.411853] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 2659.236894] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2659.236936] Tainted: G W I 4.15.18-9-pve #1
    [ 2659.236968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2659.237003] lzop D 0 8382 8357 0x00000000
    [ 2659.237004] Call Trace:
    [ 2659.237010] __schedule+0x3e0/0x870
    [ 2659.237012] ? bit_wait+0x60/0x60
    [ 2659.237013] schedule+0x36/0x80
    [ 2659.237015] io_schedule+0x16/0x40
    [ 2659.237016] bit_wait_io+0x11/0x60
    [ 2659.237017] __wait_on_bit+0x5a/0x90
    [ 2659.237018] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2659.237020] ? bit_waitqueue+0x40/0x40
    [ 2659.237034] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2659.237040] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2659.237043] ? radix_tree_lookup_slot+0x22/0x50
    [ 2659.237049] nfs_updatepage+0x151/0x910 [nfs]
    [ 2659.237055] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2659.237057] generic_perform_write+0xff/0x1b0
    [ 2659.237062] nfs_file_write+0xd7/0x250 [nfs]
    [ 2659.237064] new_sync_write+0xe7/0x140
    [ 2659.237066] __vfs_write+0x29/0x40
    [ 2659.237067] vfs_write+0xb5/0x1a0
    [ 2659.237068] SyS_write+0x55/0xc0
    [ 2659.237070] do_syscall_64+0x73/0x130
    [ 2659.237072] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2659.237073] RIP: 0033:0x7f5411d99730
    [ 2659.237074] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2659.237075] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2659.237076] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2659.237076] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2659.237077] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2659.237078] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 2671.524157] nfs: server 10.9.80.101 not responding, still trying
    [ 2780.062084] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2780.062126] Tainted: G W I 4.15.18-9-pve #1
    [ 2780.062160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2780.062182] lzop D 0 8382 8357 0x00000000
    [ 2780.062184] Call Trace:
    [ 2780.062190] __schedule+0x3e0/0x870
    [ 2780.062192] ? bit_wait+0x60/0x60
    [ 2780.062193] schedule+0x36/0x80
    [ 2780.062195] io_schedule+0x16/0x40
    [ 2780.062196] bit_wait_io+0x11/0x60
    [ 2780.062197] __wait_on_bit+0x5a/0x90
    [ 2780.062198] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2780.062200] ? bit_waitqueue+0x40/0x40
    [ 2780.062214] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2780.062221] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2780.062223] ? radix_tree_lookup_slot+0x22/0x50
    [ 2780.062229] nfs_updatepage+0x151/0x910 [nfs]
    [ 2780.062235] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2780.062237] generic_perform_write+0xff/0x1b0
    [ 2780.062243] nfs_file_write+0xd7/0x250 [nfs]
    [ 2780.062244] new_sync_write+0xe7/0x140
    [ 2780.062246] __vfs_write+0x29/0x40
    [ 2780.062247] vfs_write+0xb5/0x1a0
    [ 2780.062249] SyS_write+0x55/0xc0
    [ 2780.062250] do_syscall_64+0x73/0x130
    [ 2780.062252] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2780.062253] RIP: 0033:0x7f5411d99730
    [ 2780.062254] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2780.062255] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2780.062256] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2780.062257] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2780.062257] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2780.062258] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 2900.887397] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 2900.887438] Tainted: G W I 4.15.18-9-pve #1
    [ 2900.887464] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2900.887485] lzop D 0 8382 8357 0x00000000
    [ 2900.887487] Call Trace:
    [ 2900.887493] __schedule+0x3e0/0x870
    [ 2900.887494] ? bit_wait+0x60/0x60
    [ 2900.887495] schedule+0x36/0x80
    [ 2900.887497] io_schedule+0x16/0x40
    [ 2900.887498] bit_wait_io+0x11/0x60
    [ 2900.887499] __wait_on_bit+0x5a/0x90
    [ 2900.887500] out_of_line_wait_on_bit+0x8e/0xb0
    [ 2900.887502] ? bit_waitqueue+0x40/0x40
    [ 2900.887516] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 2900.887522] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 2900.887524] ? radix_tree_lookup_slot+0x22/0x50
    [ 2900.887530] nfs_updatepage+0x151/0x910 [nfs]
    [ 2900.887535] nfs_write_end+0x129/0x4e0 [nfs]
    [ 2900.887537] generic_perform_write+0xff/0x1b0
    [ 2900.887543] nfs_file_write+0xd7/0x250 [nfs]
    [ 2900.887545] new_sync_write+0xe7/0x140
    [ 2900.887546] __vfs_write+0x29/0x40
    [ 2900.887547] vfs_write+0xb5/0x1a0
    [ 2900.887549] SyS_write+0x55/0xc0
    [ 2900.887551] do_syscall_64+0x73/0x130
    [ 2900.887552] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 2900.887554] RIP: 0033:0x7f5411d99730
    [ 2900.887554] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 2900.887556] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 2900.887556] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 2900.887557] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 2900.887558] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 2900.887558] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 3021.712639] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 3021.712682] Tainted: G W I 4.15.18-9-pve #1
    [ 3021.712702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 3021.712723] lzop D 0 8382 8357 0x00000000
    [ 3021.712728] Call Trace:
    [ 3021.712736] __schedule+0x3e0/0x870
    [ 3021.712739] ? bit_wait+0x60/0x60
    [ 3021.712741] schedule+0x36/0x80
    [ 3021.712745] io_schedule+0x16/0x40
    [ 3021.712747] bit_wait_io+0x11/0x60
    [ 3021.712749] __wait_on_bit+0x5a/0x90
    [ 3021.712751] out_of_line_wait_on_bit+0x8e/0xb0
    [ 3021.712755] ? bit_waitqueue+0x40/0x40
    [ 3021.712774] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 3021.712780] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 3021.712782] ? radix_tree_lookup_slot+0x22/0x50
    [ 3021.712789] nfs_updatepage+0x151/0x910 [nfs]
    [ 3021.712794] nfs_write_end+0x129/0x4e0 [nfs]
    [ 3021.712796] generic_perform_write+0xff/0x1b0
    [ 3021.712802] nfs_file_write+0xd7/0x250 [nfs]
    [ 3021.712804] new_sync_write+0xe7/0x140
    [ 3021.712805] __vfs_write+0x29/0x40
    [ 3021.712806] vfs_write+0xb5/0x1a0
    [ 3021.712808] SyS_write+0x55/0xc0
    [ 3021.712809] do_syscall_64+0x73/0x130
    [ 3021.712811] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 3021.712812] RIP: 0033:0x7f5411d99730
    [ 3021.712813] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 3021.712814] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 3021.712815] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 3021.712816] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 3021.712817] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 3021.712817] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 3142.537959] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 3142.537993] Tainted: G W I 4.15.18-9-pve #1
    [ 3142.538009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 3142.538030] lzop D 0 8382 8357 0x00000000
    [ 3142.538032] Call Trace:
    [ 3142.538038] __schedule+0x3e0/0x870
    [ 3142.538040] ? bit_wait+0x60/0x60
    [ 3142.538041] schedule+0x36/0x80
    [ 3142.538043] io_schedule+0x16/0x40
    [ 3142.538044] bit_wait_io+0x11/0x60
    [ 3142.538045] __wait_on_bit+0x5a/0x90
    [ 3142.538046] out_of_line_wait_on_bit+0x8e/0xb0
    [ 3142.538048] ? bit_waitqueue+0x40/0x40
    [ 3142.538061] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 3142.538067] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 3142.538070] ? radix_tree_lookup_slot+0x22/0x50
    [ 3142.538076] nfs_updatepage+0x151/0x910 [nfs]
    [ 3142.538082] nfs_write_end+0x129/0x4e0 [nfs]
    [ 3142.538084] generic_perform_write+0xff/0x1b0
    [ 3142.538089] nfs_file_write+0xd7/0x250 [nfs]
    [ 3142.538091] new_sync_write+0xe7/0x140
    [ 3142.538093] __vfs_write+0x29/0x40
    [ 3142.538094] vfs_write+0xb5/0x1a0
    [ 3142.538095] SyS_write+0x55/0xc0
    [ 3142.538097] do_syscall_64+0x73/0x130
    [ 3142.538098] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 3142.538100] RIP: 0033:0x7f5411d99730
    [ 3142.538101] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 3142.538102] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 3142.538103] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 3142.538103] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 3142.538104] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 3142.538105] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 3200.320822] device tap100i0 entered promiscuous mode
    [ 3200.331590] vmbr0: port 3(tap100i0) entered blocking state
    [ 3200.331592] vmbr0: port 3(tap100i0) entered disabled state
    [ 3200.331699] vmbr0: port 3(tap100i0) entered blocking state
    [ 3200.331701] vmbr0: port 3(tap100i0) entered forwarding state
    [ 3222.152389] device tap101i0 entered promiscuous mode
    [ 3222.161898] vmbr0: port 10(tap101i0) entered blocking state
    [ 3222.161900] vmbr0: port 10(tap101i0) entered disabled state
    [ 3222.161995] vmbr0: port 10(tap101i0) entered blocking state
    [ 3222.161997] vmbr0: port 10(tap101i0) entered forwarding state
    [ 3234.692797] nfs: server 10.9.80.101 not responding, still trying
    [ 3244.546872] device tap106i0 entered promiscuous mode
    [ 3244.555741] vmbr0: port 11(tap106i0) entered blocking state
    [ 3244.555743] vmbr0: port 11(tap106i0) entered disabled state
    [ 3244.555835] vmbr0: port 11(tap106i0) entered blocking state
    [ 3244.555837] vmbr0: port 11(tap106i0) entered forwarding state
    [ 3263.363288] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 3263.363310] Tainted: G W I 4.15.18-9-pve #1
    [ 3263.363326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 3263.363347] lzop D 0 8382 8357 0x00000000
    [ 3263.363349] Call Trace:
    [ 3263.363355] __schedule+0x3e0/0x870
    [ 3263.363357] ? bit_wait+0x60/0x60
    [ 3263.363358] schedule+0x36/0x80
    [ 3263.363360] io_schedule+0x16/0x40
    [ 3263.363361] bit_wait_io+0x11/0x60
    [ 3263.363362] __wait_on_bit+0x5a/0x90
    [ 3263.363363] out_of_line_wait_on_bit+0x8e/0xb0
    [ 3263.363365] ? bit_waitqueue+0x40/0x40
    [ 3263.363379] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 3263.363386] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 3263.363388] ? radix_tree_lookup_slot+0x22/0x50
    [ 3263.363394] nfs_updatepage+0x151/0x910 [nfs]
    [ 3263.363399] nfs_write_end+0x129/0x4e0 [nfs]
    [ 3263.363401] generic_perform_write+0xff/0x1b0
    [ 3263.363407] nfs_file_write+0xd7/0x250 [nfs]
    [ 3263.363409] new_sync_write+0xe7/0x140
    [ 3263.363411] __vfs_write+0x29/0x40
    [ 3263.363412] vfs_write+0xb5/0x1a0
    [ 3263.363413] SyS_write+0x55/0xc0
    [ 3263.363415] do_syscall_64+0x73/0x130
    [ 3263.363417] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 3263.363418] RIP: 0033:0x7f5411d99730
    [ 3263.363419] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 3263.363420] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 3263.363421] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 3263.363421] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 3263.363422] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 3263.363423] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0
    [ 3331.954972] device tap113i0 entered promiscuous mode
    [ 3331.964983] vmbr0: port 12(tap113i0) entered blocking state
    [ 3331.964985] vmbr0: port 12(tap113i0) entered disabled state
    [ 3331.965074] vmbr0: port 12(tap113i0) entered blocking state
    [ 3331.965076] vmbr0: port 12(tap113i0) entered forwarding state
    [ 3384.188538] INFO: task lzop:8382 blocked for more than 120 seconds.
    [ 3384.188566] Tainted: G W I 4.15.18-9-pve #1
    [ 3384.188582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 3384.188603] lzop D 0 8382 8357 0x00000000
    [ 3384.188605] Call Trace:
    [ 3384.188611] __schedule+0x3e0/0x870
    [ 3384.188613] ? bit_wait+0x60/0x60
    [ 3384.188614] schedule+0x36/0x80
    [ 3384.188616] io_schedule+0x16/0x40
    [ 3384.188617] bit_wait_io+0x11/0x60
    [ 3384.188618] __wait_on_bit+0x5a/0x90
    [ 3384.188619] out_of_line_wait_on_bit+0x8e/0xb0
    [ 3384.188621] ? bit_waitqueue+0x40/0x40
    [ 3384.188639] nfs_wait_on_request+0x46/0x50 [nfs]
    [ 3384.188646] nfs_lock_and_join_requests+0x121/0x510 [nfs]
    [ 3384.188648] ? radix_tree_lookup_slot+0x22/0x50
    [ 3384.188654] nfs_updatepage+0x151/0x910 [nfs]
    [ 3384.188659] nfs_write_end+0x129/0x4e0 [nfs]
    [ 3384.188661] generic_perform_write+0xff/0x1b0
    [ 3384.188667] nfs_file_write+0xd7/0x250 [nfs]
    [ 3384.188669] new_sync_write+0xe7/0x140
    [ 3384.188670] __vfs_write+0x29/0x40
    [ 3384.188672] vfs_write+0xb5/0x1a0
    [ 3384.188673] SyS_write+0x55/0xc0
    [ 3384.188675] do_syscall_64+0x73/0x130
    [ 3384.188677] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [ 3384.188678] RIP: 0033:0x7f5411d99730
    [ 3384.188679] RSP: 002b:00007ffd36bc2438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 3384.188680] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5411d99730
    [ 3384.188681] RDX: 0000000000000004 RSI: 00007ffd36bc24a0 RDI: 0000000000000001
    [ 3384.188681] RBP: 0000000000000004 R08: fffffffffffffff5 R09: 00007f54123e9000
    [ 3384.188682] R10: 00000000000001f9 R11: 0000000000000246 R12: 00007f5412492698
    [ 3384.188683] R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd36bc24a0

    All 3 nodes are up to date. No Ceph, no ZFS. Feel free to ask if more information is needed.
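
    The log above repeats the same hung-task report many times. A small sketch for condensing such a log into one line per blocked task: `summarize_hung_tasks` is a hypothetical helper, not part of any tool.

    ```shell
    # Read kernel log text on stdin and print "<count> <task>:<pid>" for each
    # distinct task named in "INFO: task ... blocked for more than" lines.
    summarize_hung_tasks() {
        sed -n 's/.*INFO: task \([^ ]*\) blocked for more than.*/\1/p' |
            sort | uniq -c | awk '{print $1, $2}'
    }

    # Example (run on the host, not executed here):
    #   dmesg | summarize_hung_tasks
    ```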

    have a good day
    Jerome
     
  13. coppola_f

    coppola_f Member

    Joined:
    Apr 2, 2012
    Messages:
    55
    Likes Received:
    2
    guys,
    we've set up two more nodes....
    these two are identical DL680 gen8.

    did a fresh install using the downloaded .iso image,
    applied all updates (on these units we have an active subscription!)

    then set the units up as cluster members and, after some "blank" runtime days,
    moved some VMs to the new nodes (these are the newer nodes, so we moved the most expensive VMs there!)

    all seemed to run fine for many hours (I think one or two full days at most!)
    then users started to report extremely slow response from the terminal server and really long query response times from the MySQL Linux VM!

    after some quick checks,
    we moved the VMs back to the older nodes (still running the 4.13.x kernel!)

    then, magically, the issue seemed to be solved: normal response times from both VMs....

    we have now rolled the units back to the 4.13.x kernel,
    then restarted them and, after some testing time, moved the VMs back to the newer nodes!

    now all is running fine!
    our cluster updates have been locked to the 4.13.x kernel branch!!

    waiting for suggestions or any other information that may come from you all!

    best regards,

    Francesco
     