4.15.17 kernel panic

Matteo Costantini

Sirs,

we have a 4-node Proxmox cluster running v5.2;

this cluster is currently operating in a production environment and all nodes are covered by a 'community' subscription.

All servers have identical hardware: HP ProLiant DL380 G6, dual processor, 32 GB RAM and a dual-path Fibre Channel connection to an MSA storage array for the VM disk images.

Just yesterday we did some routine maintenance (cable clean-up and re-routing, BIOS upgrades and various firmware upgrades from HP),

then performed a routine upgrade of the Proxmox OS with:

Code:
apt-get update
apt-get dist-upgrade

During reboot there was no issue at all, but during VM start-up two of the four nodes crashed, leaving the cluster out of quorum...

so all VMs were stopped; within less than 5 minutes we were unable to operate at all!

After over an hour of cross-testing, we noticed that the nodes were crashing only when memory usage went above roughly 65-70% of total RAM...

We also searched the forum and read about some issues related to kernel version 4.15.17 (high disk latency?!),

so to work around the issue we restarted the nodes, choosing a 4.13.xx kernel at the GRUB prompt;

we're currently running on 4.13.13-5 or 4.13.16-3.

We're waiting for instructions on how to fully roll back without disrupting the cluster, or for suggestions on a newer update of the 4.15.xx kernel family...
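For now we pick the older kernel manually at boot; a minimal sketch of how the 4.13 entry could be made the GRUB default instead (the menu entry title below is only an example, the exact string must be copied from /boot/grub/grub.cfg on the node):

Code:
# Make GRUB remember an explicitly chosen entry:
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
update-grub

# Pin the 4.13 kernel (entry title is an example, copy it verbatim from grub.cfg):
grub-set-default "Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.13.16-3-pve"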

We're available to provide more info if requested.


thanks in advance for your time,

regards,
 
By any chance, are you using MTU settings different from the 1500 default and igb/Intel NICs?
 
@trystan,

we've checked...
MTU is set to the standard 1500 bytes on all NICs (physical, bridges and bonds);
the NICs are mixed:
4x Broadcom BCM5709 + 2x Intel 82571EB
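For completeness, a quick way to double-check the MTU on every link (assuming the standard iproute2 tools):

Code:
# Print "<iface>: mtu <value>" for every physical NIC, bond and bridge:
ip -o link show | awk '{print $2, $4, $5}'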

Any other suggestions?!?

Waiting...
regards,

Francesco
 
Hi,

please find attached the logs extracted for 18.06;
as you can see, we were working on the FC cables,
changing the scheme from direct attach to full mesh through dual FC switches.

I really don't remember the time of the issue...

Waiting for your suggestions...
regards,
Francesco
 

Attachments

  • syslog.zip
    427.8 KB
Udo,
many thanks for your time,

after converting to the full mesh cabling scheme,
multipath now seems to be working fine!
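For reference, the paths can be verified with the standard multipath-tools commands (device and path names will of course differ):

Code:
# Show every multipath device with its path groups and path states:
multipath -ll

# Reload the multipath maps after recabling, if needed:
multipath -r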

We have Broadcom + e1000 NICs on these nodes (some onboard and some on additional cards);
by the way, we've never had any issue with the NICs prior to this kernel update...

Currently we can't run on this kernel version because we're not able to identify the origin of the issue;
it seems to run fine immediately after a node restart, then after a random amount of time it crashes
(admittedly we haven't tested a scenario where 3 nodes run the old kernel and only 1 node runs the new version!).
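Since the crash only happens after a random amount of time, it may be worth capturing the panic messages over the network with netconsole; a minimal sketch, where the interface name, IP addresses and MAC are placeholders:

Code:
# On the crashing node: stream kernel messages over UDP to a log host
# (replace eth0, the IPs and the target MAC with real values):
modprobe netconsole netconsole=6665@192.0.2.10/eth0,6666@192.0.2.20/00:11:22:33:44:55

# On the log host: capture the messages
# (traditional netcat; with OpenBSD netcat use "nc -u -l 6666")
nc -u -l -p 6666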

We're looking around to see whether anyone else has had similar symptoms with a similar configuration
(we're operating on HP ProLiant DL380 G6).

regards,
Francesco
 
We're running 4.15.18-1-pve on pure Intel servers (two integrated 10GbE NICs and two 10 Gbps SFP+ add-on module NICs).

Everything's working as expected. We're running active/backup bond interfaces using OVS and disable GRO on the VM traffic NICs:

/etc/rc.local
Code:
# Disable generic receive offload (GRO) on the VM traffic NICs:
ethtool -K eth0 gro off;
ethtool -K eth1 gro off;

# Set active-backup bond slave interface priority:
ovs-appctl bond/set-active-slave bond0 eth0;
ovs-appctl bond/set-active-slave bond1 eth3;
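For reference, the resulting offload state can be verified per NIC with ethtool:

Code:
# Confirm GRO is actually disabled on the VM traffic NICs:
ethtool -k eth0 | grep generic-receive-offload
ethtool -k eth1 | grep generic-receive-offload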


/etc/network/interfaces
Code:
auto lo
iface lo inet loopback

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
        ovs_options bond_mode=active-backup tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vlan1
        mtu 9216

allow-vmbr0 vlan1
iface vlan1 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=1
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 198.19.17.66
        netmask 255.255.255.224
        gateway 198.19.17.65
        mtu 1500


allow-vmbr1 bond1
iface bond1 inet manual
        ovs_bridge vmbr1
        ovs_type OVSBond
        ovs_bonds eth2 eth3
        pre-up ( ifconfig eth2 mtu 9216 && ifconfig eth3 mtu 9216 )
        ovs_options bond_mode=active-backup tag=33 vlan_mode=native-untagged
        mtu 9216

auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports bond1 vlan33
        mtu 9216

allow-vmbr1 vlan33
iface vlan33 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=33
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 10.254.1.2
        netmask 255.255.255.0
        mtu 9212


Code:
[admin@kvm5a ~]# grep ixg /var/log/messages
Jul 22 17:20:55 kvm5a kernel: [    1.690361] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 5.1.0-k
Jul 22 17:20:55 kvm5a kernel: [    1.690361] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
Jul 22 17:20:55 kvm5a kernel: [    1.848254] ixgbe 0000:03:00.0: Multiqueue Enabled: Rx Queue count = 40, Tx Queue count = 40 XDP Queue count = 0
Jul 22 17:20:55 kvm5a kernel: [    1.848378] ixgbe 0000:03:00.0: PCI Express bandwidth of 32GT/s available
Jul 22 17:20:55 kvm5a kernel: [    1.848380] ixgbe 0000:03:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
Jul 22 17:20:55 kvm5a kernel: [    1.848461] ixgbe 0000:03:00.0: MAC: 2, PHY: 15, SFP+: 7, PBA No: FFFFFF-0FF
Jul 22 17:20:55 kvm5a kernel: [    1.848462] ixgbe 0000:03:00.0: 00:1e:67:9b:f1:38
Jul 22 17:20:55 kvm5a kernel: [    1.849519] ixgbe 0000:03:00.0: Intel(R) 10 Gigabit Network Connection
Jul 22 17:20:55 kvm5a kernel: [    2.008537] ixgbe 0000:03:00.1: Multiqueue Enabled: Rx Queue count = 40, Tx Queue count = 40 XDP Queue count = 0
Jul 22 17:20:55 kvm5a kernel: [    2.008663] ixgbe 0000:03:00.1: PCI Express bandwidth of 32GT/s available
Jul 22 17:20:55 kvm5a kernel: [    2.008664] ixgbe 0000:03:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
Jul 22 17:20:55 kvm5a kernel: [    2.008748] ixgbe 0000:03:00.1: MAC: 2, PHY: 15, SFP+: 8, PBA No: FFFFFF-0FF
Jul 22 17:20:55 kvm5a kernel: [    2.008751] ixgbe 0000:03:00.1: 00:1e:67:9b:f1:39
Jul 22 17:20:55 kvm5a kernel: [    2.010024] ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
Jul 22 17:20:55 kvm5a kernel: [    2.282123] ixgbe 0000:05:00.0: Multiqueue Enabled: Rx Queue count = 40, Tx Queue count = 40 XDP Queue count = 0
Jul 22 17:20:55 kvm5a kernel: [    2.318165] ixgbe 0000:05:00.0: PCI Express bandwidth of 32GT/s available
Jul 22 17:20:55 kvm5a kernel: [    2.318166] ixgbe 0000:05:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
Jul 22 17:20:55 kvm5a kernel: [    2.342369] ixgbe 0000:05:00.0: MAC: 3, PHY: 0, PBA No: 000000-000
Jul 22 17:20:55 kvm5a kernel: [    2.342373] ixgbe 0000:05:00.0: 00:1e:67:fd:06:bc
Jul 22 17:20:55 kvm5a kernel: [    2.489438] ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
Jul 22 17:20:55 kvm5a kernel: [    2.762077] ixgbe 0000:05:00.1: Multiqueue Enabled: Rx Queue count = 40, Tx Queue count = 40 XDP Queue count = 0
Jul 22 17:20:55 kvm5a kernel: [    2.798106] ixgbe 0000:05:00.1: PCI Express bandwidth of 32GT/s available
Jul 22 17:20:55 kvm5a kernel: [    2.798109] ixgbe 0000:05:00.1: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
Jul 22 17:20:55 kvm5a kernel: [    2.822305] ixgbe 0000:05:00.1: MAC: 3, PHY: 0, PBA No: 000000-000
Jul 22 17:20:55 kvm5a kernel: [    2.822310] ixgbe 0000:05:00.1: 00:1e:67:fd:06:bd
Jul 22 17:20:55 kvm5a kernel: [    2.987177] ixgbe 0000:05:00.1: Intel(R) 10 Gigabit Network Connection
Jul 22 17:20:55 kvm5a kernel: [    2.988139] ixgbe 0000:03:00.1 rename3: renamed from eth1
Jul 22 17:20:55 kvm5a kernel: [    3.080417] ixgbe 0000:05:00.1 eth1: renamed from eth3
Jul 22 17:20:55 kvm5a kernel: [    3.096193] ixgbe 0000:03:00.0 rename2: renamed from eth0
Jul 22 17:20:55 kvm5a kernel: [    3.132160] ixgbe 0000:05:00.0 eth0: renamed from eth2
Jul 22 17:20:55 kvm5a kernel: [    3.188122] ixgbe 0000:03:00.1 eth3: renamed from rename3
Jul 22 17:20:55 kvm5a kernel: [    3.216148] ixgbe 0000:03:00.0 eth2: renamed from rename2
Jul 22 17:20:56 kvm5a kernel: [   32.576738] ixgbe 0000:05:00.0 eth0: changing MTU from 1500 to 9216
Jul 22 17:20:56 kvm5a kernel: [   32.577352] ixgbe 0000:05:00.1 eth1: changing MTU from 1500 to 9216
Jul 22 17:20:56 kvm5a kernel: [   32.791761] ixgbe 0000:05:00.0: registered PHC device on eth0
Jul 22 17:20:57 kvm5a kernel: [   33.079997] ixgbe 0000:05:00.1: registered PHC device on eth1
Jul 22 17:20:57 kvm5a kernel: [   33.522759] ixgbe 0000:03:00.0 eth2: changing MTU from 1500 to 9216
Jul 22 17:20:57 kvm5a kernel: [   33.523353] ixgbe 0000:03:00.1 eth3: changing MTU from 1500 to 9216
Jul 22 17:20:57 kvm5a kernel: [   33.716496] ixgbe 0000:03:00.0: registered PHC device on eth2
Jul 22 17:20:57 kvm5a kernel: [   33.888211] ixgbe 0000:03:00.0 eth2: detected SFP+: 7
Jul 22 17:20:57 kvm5a kernel: [   33.996400] ixgbe 0000:03:00.1: registered PHC device on eth3
Jul 22 17:20:58 kvm5a kernel: [   34.036053] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jul 22 17:20:58 kvm5a kernel: [   34.176006] ixgbe 0000:03:00.1 eth3: detected SFP+: 8
Jul 22 17:20:58 kvm5a kernel: [   34.320169] ixgbe 0000:03:00.1 eth3: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jul 22 17:21:01 kvm5a kernel: [   37.367935] ixgbe 0000:05:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: None
Jul 22 17:21:01 kvm5a kernel: [   37.488771] ixgbe 0000:05:00.0 eth0: NIC Link is Down
Jul 22 17:21:01 kvm5a kernel: [   37.655785] ixgbe 0000:05:00.1 eth1: NIC Link is Up 10 Gbps, Flow Control: None
Jul 22 17:21:01 kvm5a kernel: [   37.817978] ixgbe 0000:05:00.1 eth1: NIC Link is Down
Jul 22 17:21:03 kvm5a kernel: [   39.124334] ixgbe 0000:05:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: None
Jul 22 17:21:03 kvm5a kernel: [   39.412261] ixgbe 0000:05:00.1 eth1: NIC Link is Up 10 Gbps, Flow Control: None


For what it's worth:
Code:
[admin@kvm5a ~]# ethtool -i eth0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800004f8
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
[admin@kvm5a ~]# ethtool -i eth1
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800004f8
expansion-rom-version:
bus-info: 0000:05:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
[admin@kvm5a ~]# ethtool -i eth2
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x8000047d
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
[admin@kvm5a ~]# ethtool -i eth3
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x8000047d
expansion-rom-version:
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
 
Due to our operational needs,
we've rolled back to the previous kernel (4.13);
now everything seems to operate correctly.

We've frozen updates to prevent 4.15 kernels from being installed on the nodes...
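A minimal sketch of how the freeze can be done with apt holds (the package names below are examples; list the ones actually installed first):

Code:
# List the installed kernel packages:
dpkg --list | grep pve-kernel

# Hold the 4.15 packages so apt upgrades skip them (names are examples):
apt-mark hold pve-kernel-4.15 pve-kernel-4.15.17-3-pve

# Verify the holds:
apt-mark showhold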

So far we haven't experienced any system crash under heavy load
(like we did before the rollback!).

Many thanks to you all for your time!

regards,
Francesco
 
