Changing IPoIB link-layer addresses for Infiniband IP addresses cause kernel trap 12 on FreeNAS iSCSI server

getcom

Sep 7, 2019
Hello all,

We are running a three-node Proxmox VE 5.4-13 cluster. Each server has a separate dual-port 40 Gbit Infiniband card for connecting to a FreeNAS iSCSI server (FreeNAS-11.2-U5, the latest version).
For cluster communication we have set up bond0 with two Gigabit cards.
For VM communication we have set up bond1 with two 10 Gbit SFP+ cards.

In general this all works as it should, with one exception:
The FreeNAS iSCSI server sometimes hits a kernel trap 12 and then reboots.

On the FreeNAS side we also have a dual-port Mellanox ConnectX-3 Infiniband card, which is connected to an Infiniband switch (Grid Director 4036).
The three Proxmox cluster nodes are connected to the same switch through their Infiniband cards.

The Infiniband cards on both ends are configured for connected mode with an MTU of 40950 (with the default MTU of 65520 we got lots of connection errors).
We are using a multipath setup with two subnets for IP over Infiniband.
This works, and we get a throughput of ~1-1.1 gigabytes per second on VMs running on each cluster node in parallel.

Sporadically the FreeNAS server hits a kernel trap and then reboots.
This can happen after anything from half an hour to 4 days of uptime.
While the FreeNAS server is rebooting, the VMs do not crash; their I/O stalls until the FreeNAS server is online again.
Nevertheless, we have to fix it.

[Attached image: FreeBSD-kernel-trap_packetSizeProblem.png]

I analyzed the FreeNAS crash dumps in /data/crash/ and found that the last events logged before the kernel crashes look like this:

<118>Fri Sep 6 05:20:07 CEST 2019
<6>arp: 10.20.24.111 moved from 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 to 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 on ib0
<6>arp: 10.20.24.110 moved from 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e3 to 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e4 on ib0
<6>arp: 10.20.25.111 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 on ib1
<6>arp: 10.20.25.110 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e4 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e3 on ib1
<6>arp: 10.20.24.111 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 on ib0
<4>ib0: packet len 12380 (> 2044) too long to send, dropping


The behavior is the same every time. The two 20-octet IPoIB link-layer addresses on all three Proxmox clients change from time to time.
After that, it looks as if the FreeBSD server sometimes uses datagram mode for the new connections and then tries to send a large packet over such a connection, possibly one left over from a previous client request/connection in connected mode.

The root cause seems to be the changing association between IP addresses and link-layer addresses on the Proxmox client side and, secondarily, the datagram mode behavior on the FreeBSD side.
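
To make the pattern in the log excerpt easier to see, here is a minimal Python sketch that groups the "arp: ... moved" messages by port GUID. It assumes that the last 8 octets of the 20-octet IPoIB link-layer address are the port GUID; the function names `port_guid` and `ips_per_guid` are made up for this illustration and are not part of any FreeNAS or Proxmox tool.

```python
import re
from collections import defaultdict

# Matches kernel messages such as:
#   arp: 10.20.24.111 moved from 80:00:...:9f:c1 to 80:00:...:9f:c2 on ib0
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from (?P<old>[0-9a-f:]+) "
    r"to (?P<new>[0-9a-f:]+) on (?P<iface>\w+)"
)


def port_guid(lladdr):
    """Return the last 8 octets of a 20-octet IPoIB address (assumed port GUID)."""
    return ":".join(lladdr.split(":")[-8:])


def ips_per_guid(lines):
    """Map each port GUID to the set of IPs it was associated with in the log."""
    seen = defaultdict(set)
    for line in lines:
        match = ARP_MOVED.search(line)
        if match is None:
            continue  # not an "arp: ... moved" message
        seen[port_guid(match["old"])].add(match["ip"])
        seen[port_guid(match["new"])].add(match["ip"])
    return dict(seen)


def flux_candidates(lines):
    """Return only the GUIDs that were associated with more than one IP."""
    return {guid: ips for guid, ips in ips_per_guid(lines).items() if len(ips) > 1}
```

Fed with the messages from the crash dump, a port that shows up for IPs in both 10.20.24.0/24 and 10.20.25.0/24 would be a hint that one physical port is answering ARP for the address of the other.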

Maybe somebody has an idea what is happening here with the link layer addresses and how to avoid that?

Here are more details of the setup:

On the FreeNAS side I am using a separate subnet for each IB port, and the iSCSI setup has two portals, one for each IP (10.20.24.100/24 & 10.20.25.100/24).

The kernel of FreeNAS:
root@freenas1[/data/crash]# uname -a
FreeBSD freenas1 11.2-STABLE FreeBSD 11.2-STABLE #0 r325575+6aad246318c(HEAD): Mon Jun 24 17:25:47 UTC 2019 root@nemesis:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64


The modules loaded on FreeNAS side:
root@freenas1[/data/crash]# cat /boot/loader.conf.local
mlx4ib_load="YES" # Make sure the Mellanox mlx4 Infiniband kernel module is loaded
ipoib_load="YES" # Make sure the IP over Infiniband kernel module is loaded
kernel="kernel"
module_path="/boot/kernel;/boot/modules;/usr/local/modules"
kern.cam.ctl.ha_id=0

root@freenas1[/data/crash]# kldstat
Id Refs Address Size Name
1 72 0xffffffff80200000 25608a8 kernel
2 1 0xffffffff82762000 100eb0 ispfw.ko
3 1 0xffffffff82863000 f9f8 ipmi.ko
4 2 0xffffffff82873000 2d28 smbus.ko
5 1 0xffffffff82876000 8a10 freenas_sysctl.ko
6 1 0xffffffff8287f000 3aff0 mlx4ib.ko
7 1 0xffffffff828ba000 1a388 ipoib.ko
8 1 0xffffffff82d11000 32e048 vmm.ko
9 1 0xffffffff83040000 a74 nmdm.ko
10 1 0xffffffff83041000 e610 geom_mirror.ko
11 1 0xffffffff83050000 3a3c geom_multipath.ko
12 1 0xffffffff83054000 2ec dtraceall.ko
13 9 0xffffffff83055000 3acf8 dtrace.ko
14 1 0xffffffff83090000 5b8 dtmalloc.ko
15 1 0xffffffff83091000 1898 dtnfscl.ko
16 1 0xffffffff83093000 1d31 fbt.ko
17 1 0xffffffff83095000 53390 fasttrap.ko
18 1 0xffffffff830e9000 bfc sdt.ko
19 1 0xffffffff830ea000 6d80 systrace.ko
20 1 0xffffffff830f1000 6d48 systrace_freebsd32.ko
21 1 0xffffffff830f8000 f9c profile.ko
22 1 0xffffffff830f9000 13ec0 hwpmc.ko
23 1 0xffffffff8310d000 7340 t3_tom.ko
24 2 0xffffffff83115000 ab8 toecore.ko
25 1 0xffffffff83116000 ddac t4_tom.ko

Kernel running on Proxmox:
root@pvecn1:~# uname -a
Linux pvecn1 4.15.18-20-pve #1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200) x86_64 GNU/Linux

The modules loaded on Proxmox side:
root@pvecn1:~# cat /etc/modules-load.d/mellanox.conf
mlx4_core
mlx4_ib
mlx4_en
ib_cm
ib_core
ib_ipoib
ib_iser
ib_umad




The Infiniband network setup, for example on the first Proxmox client:
# Mellanox Infiniband
auto ib0
iface ib0 inet static
address 10.20.24.110
netmask 255.255.255.0
pre-up echo connected > /sys/class/net/$IFACE/mode
#post-up /sbin/ifconfig $IFACE mtu 65520
post-up /sbin/ifconfig $IFACE mtu 40950

# Mellanox Infiniband
auto ib1
iface ib1 inet static
address 10.20.25.110
netmask 255.255.255.0
pre-up echo connected > /sys/class/net/$IFACE/mode
#post-up /sbin/ifconfig $IFACE mtu 65520
post-up /sbin/ifconfig $IFACE mtu 40950
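
The mode and MTU that these stanzas set can be read back from sysfs on a Proxmox node. The following Python sketch does that; the helper names `read_ipoib_state` and `describe` are hypothetical, and the 2044-byte figure is simply the limit reported in the FreeNAS log line above ("> 2044"), used here as the assumed datagram-mode ceiling.

```python
from pathlib import Path

# Limit taken from the FreeNAS message "packet len 12380 (> 2044) too long to send"
DATAGRAM_LIMIT = 2044


def read_ipoib_state(iface, sysfs_root="/sys/class/net"):
    """Return (mode, mtu) of an IPoIB interface as exposed under sysfs."""
    base = Path(sysfs_root) / iface
    mode = (base / "mode").read_text().strip()
    mtu = int((base / "mtu").read_text())
    return mode, mtu


def describe(iface, sysfs_root="/sys/class/net"):
    """Build a one-line summary and flag interfaces that are not in connected mode."""
    mode, mtu = read_ipoib_state(iface, sysfs_root)
    summary = f"{iface}: mode={mode} mtu={mtu}"
    if mode != "connected":
        summary += " (not in connected mode)"
    elif mtu <= DATAGRAM_LIMIT:
        summary += " (connected mode, but MTU not raised)"
    return summary


if __name__ == "__main__":
    for name in ("ib0", "ib1"):
        try:
            print(describe(name))
        except OSError as exc:
            print(f"{name}: cannot read state ({exc})")
```

This only checks the Linux side; it says nothing about which mode the FreeBSD server chooses for a given peer.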


On the Proxmox side I am running a multipath setup.
This is the content of /etc/multipath.conf:
defaults {
polling_interval 2
path_selector "round-robin 0"
path_grouping_policy multibus
uid_attribute ID_SERIAL
rr_min_io_rq 1
rr_weight uniform
failback immediate
no_path_retry queue
user_friendly_names yes
}

...

ifconfig for both Infiniband ports on the FreeNAS server looks like this:

ib0: flags=8043<UP,BROADCAST,RUNNING,MULTICAST> metric 0 mtu 40950
options=80018<VLAN_MTU,VLAN_HWTAGGING,LINKSTATE>
lladdr 80.0.2.8.fe.80.0.0.0.0.0.0.0.2.c9.3.0.3a.ed.41
inet 10.20.24.210 netmask 0xffffff00 broadcast 10.20.24.255
nd6 options=9<PERFORMNUD,IFDISABLED>
ib1: flags=8043<UP,BROADCAST,RUNNING,MULTICAST> metric 0 mtu 40950
options=80018<VLAN_MTU,VLAN_HWTAGGING,LINKSTATE>
lladdr 80.0.2.9.fe.80.0.0.0.0.0.0.0.2.c9.3.0.3a.ed.42
inet 10.20.25.210 netmask 0xffffff00 broadcast 10.20.25.255
nd6 options=9<PERFORMNUD,IFDISABLED>


ifconfig on the first Proxmox client:

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 40950
inet 10.20.24.110 netmask 255.255.255.0 broadcast 10.20.24.255
inet6 fe80::202:c903:9:20e3 prefixlen 64 scopeid 0x20<link>
unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
RX packets 5596912 bytes 10293861835 (9.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3744669 bytes 48471009082 (45.1 GiB)
TX errors 0 dropped 125 overruns 0 carrier 0 collisions 0

ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 40950
inet 10.20.25.110 netmask 255.255.255.0 broadcast 10.20.25.255
inet6 fe80::202:c903:9:20e4 prefixlen 64 scopeid 0x20<link>
unspec 80-00-02-09-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
RX packets 6863837 bytes 8858149718 (8.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6197948 bytes 96516756048 (89.8 GiB)
TX errors 0 dropped 257 overruns 0 carrier 0 collisions 0



Any hints are welcome.

Regards
Ralf
 
Hello all,

I think I have found an answer to the address change problem on the Linux clients:

I found a comment on embeddedlinux.org:
  • A Linux host replies to any ARP solicitation requests that specify a target IP address configured on any of its interfaces, even if the request was received on this host by a different interface. To make Linux behave as if addresses belong to interfaces, administrators can use the ARP_IGNORE feature described later in the section "/proc Options."
  • Hosts can experience the ARP flux problem, in which the wrong interface becomes associated with an L3 address. This problem is described in the text that follows.
Other sources:
http://www.mellanox.com/related-doc...OFED_Release_Notes-1.5.1-1.3.6_for_Oracle.txt

- When multiple vNics are connected to the same network, hosts can experience the "ARP flux" problem, in which the wrong interface becomes associated with an L3 address (FM #87335).

Workaround:

Set the following kernel configuration parameters: include the following lines in /etc/sysctl.conf and reboot the machine:
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2


https://downloads.openfabrics.org/O...release/OFED-1.4-docs/ipoib_release_notes.txt

3. Known Issues
===============

1. If a host has multiple interfaces and
(a) each interface belongs to a different IP subnet,
(b) they all use the same InfiniBand Partition, and
(c) they are connected to the same IB Switch,
then the host violates the IP rule requiring different broadcast domains.
Consequently, the host may build an incorrect ARP table.

The correct setting of a multi-homed IPoIB host is achieved by using a different PKEY for each IP subnet.
If a host has multiple interfaces on the same IP subnet, then to prevent a peer from building an incorrect ARP entry (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This causes the network stack to send ARP replies only on the interface with the IP address specified in the ARP request:

sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1

Or, globally,

sysctl -w net.ipv4.conf.all.arp_ignore=1

On the running kernel of each client I executed the following:
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/ib0/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib1/arp_ignore

I added the corresponding post-up lines to /etc/network/interfaces to make the change permanent.
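
To confirm after a reboot that the values are really in place, a small check like the following Python sketch can be run on each client. It reads the same /proc/sys files that the echo commands write; the `EXPECTED` table and the function name `check_arp_settings` are just illustrative choices for this setup.

```python
from pathlib import Path

# Values set by the echo commands in this post: (interface, parameter) -> value
EXPECTED = {
    ("all", "arp_ignore"): 1,
    ("all", "arp_announce"): 2,
    ("ib0", "arp_ignore"): 1,
    ("ib1", "arp_ignore"): 1,
}


def check_arp_settings(expected=EXPECTED, proc_root="/proc/sys/net/ipv4/conf"):
    """Return (iface, parameter, wanted, actual) for every value that differs."""
    mismatches = []
    for (iface, parameter), wanted in expected.items():
        path = Path(proc_root) / iface / parameter
        try:
            actual = int(path.read_text())
        except (OSError, ValueError):
            actual = None  # file missing or unreadable
        if actual != wanted:
            mismatches.append((iface, parameter, wanted, actual))
    return mismatches


if __name__ == "__main__":
    problems = check_arp_settings()
    if not problems:
        print("all ARP settings match")
    for iface, parameter, wanted, actual in problems:
        print(f"{iface}/{parameter}: expected {wanted}, found {actual}")
```

An empty result means all four values match what was set by hand above.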

Hopefully the kernel trap 12 is gone now.
Previously I received an ARP address change message on the server side every 1 to 15 minutes. These have stopped.
There have been no such ARP messages for 2 hours.

Surprisingly, I could also switch to an MTU of 65520, which previously did not work without lots of connection errors on each client.
On one client there is still something I have to check. I had two events like this:
connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4408262456, last ping 4408263744, now 4408265024
Sep 9 04:53:45 pvecn3 kernel: [453492.780173] connection3:0: detected conn error (1022)
Sep 9 04:53:45 pvecn3 kernel: [453492.780363] scsi_io_completion: 10 callbacks suppressed

Maybe this thread is helpful for others if they run into the same situation.

Regards,
Ralf
 
