Sirs,
we have a four-node Proxmox cluster running v5.2.
The cluster is in production and all nodes are covered by a Community subscription.
All servers have identical hardware: HP ProLiant DL380 G6, dual processor, 32 GB RAM, and dual-path Fibre Channel connections to an MSA storage array for the VM disk images.
Yesterday we performed some routine maintenance (cable clean-up and re-routing, BIOS upgrades, and various firmware upgrades from HP),
then ran a routine upgrade of the Proxmox OS with:
apt-get update
apt-get dist-upgrade
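For reference, this is how we check the running kernel and the installed kernel packages on each node (standard Debian/Proxmox commands, nothing custom):
uname -r
pveversion -v
dpkg -l | grep pve-kernel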
The reboot itself went fine, but during VM start-up two of the four nodes crashed, leaving the cluster without quorum.
Within five minutes all VMs were stopped and we were unable to operate at all.
After more than an hour of cross-testing, we noticed that the nodes crashed only when memory usage went above roughly 65-70% of total RAM.
We also searched the forum and read about some issues with kernel 4.15.17 (high disk latency?).
To work around the issue, we rebooted the nodes and selected a 4.13.x kernel at the GRUB prompt;
they are currently running on 4.13.13-5 or 4.13.16-3.
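To avoid having to pick the old kernel by hand at every reboot, we are thinking of pinning it as the default GRUB entry, roughly like this (the menu entry name below is only an example and has to match what grub.cfg actually lists on our nodes):
grep -E "submenu|menuentry" /boot/grub/grub.cfg
# then in /etc/default/grub set, for example:
# GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.13.16-3-pve"
update-grub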
We are waiting for instructions on how to fully roll back without disrupting the cluster, or for suggestions on an updated kernel in the 4.15.x family.
We can provide more information on request.
Thanks in advance for your time,
Regards,