[SOLVED] Bringing up vlan aware bridge takes ages

Hi there,

I just set up a cluster with some weird behaviour:

it takes 5+ minutes to bring up the bridge after clicking "Apply Configuration".
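For what it's worth, the same reload can be timed on the console to take the GUI out of the picture (assuming ifupdown2 is installed, which "Apply Configuration" uses under the hood):
Code:
# time how long re-applying /etc/network/interfaces actually takes
time ifreload -a

# verbose run, to see which interface is the slow one
ifreload -a -v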

/etc/network/interfaces:
Code:
auto lo
iface lo inet loopback

iface enp65s0f0 inet manual

iface enp65s0f1 inet manual

iface enp65s0f2 inet manual

iface enp65s0f3 inet manual

auto enp33s0f0
iface enp33s0f0 inet manual

auto enp33s0f1
iface enp33s0f1 inet manual

auto enp34s0f0
iface enp34s0f0 inet manual

auto enp34s0f1
iface enp34s0f1 inet manual

auto bond0
iface bond0 inet static
    address 10.60.104.2/24
    bond-slaves enp33s0f0 enp33s0f1
    bond-miimon 100
    bond-mode active-backup
    bond-primary enp33s0f0
    mtu 9000
#Ceph Cluster Network

auto bond1
iface bond1 inet manual
    bond-slaves enp34s0f0 enp34s0f1
    bond-miimon 100
    bond-mode active-backup
    bond-primary enp34s0f0

auto vmbr0
iface vmbr0 inet static
    address 10.60.101.16/24
    gateway 10.60.101.254
    bridge-ports enp65s0f0
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#Frontends


Syslog shows the following messages during that time:
Code:
Nov 10 17:15:45 kvm01-bt01-b kernel: [  925.902601] vmbr0: the hash_elasticity option has been deprecated and is always 16
Nov 10 17:15:45 kvm01-bt01-b kernel: [  926.245464] device bond1 left promiscuous mode
Nov 10 17:15:45 kvm01-bt01-b kernel: [  926.245466] device enp34s0f0 left promiscuous mode
Nov 10 17:15:45 kvm01-bt01-b kernel: [  926.245526] vmbr1: the hash_elasticity option has been deprecated and is always 16
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065645] kworker/45:1    D    0   527      2 0x80004000
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065653] Workqueue: events switchdev_deferred_process_work
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065654] Call Trace:
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065661]  __schedule+0x2e6/0x6f0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065663]  ? __switch_to_asm+0x34/0x70
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065664]  schedule+0x33/0xa0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065665]  schedule_preempt_disabled+0xe/0x10
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065666]  __mutex_lock.isra.10+0x2c9/0x4c0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065667]  ? __switch_to_asm+0x34/0x70
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065668]  ? __switch_to_asm+0x34/0x70
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065669]  __mutex_lock_slowpath+0x13/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065670]  mutex_lock+0x2c/0x30
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065673]  rtnl_lock+0x15/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065674]  switchdev_deferred_process_work+0xe/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065676]  process_one_work+0x20f/0x3d0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065677]  worker_thread+0x34/0x400
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065680]  kthread+0x120/0x140
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065680]  ? process_one_work+0x3d0/0x3d0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065681]  ? kthread_park+0x90/0x90
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.065682]  ret_from_fork+0x22/0x40
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067773] lldpd           D    0  2129   2103 0x00000320
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067775] Call Trace:
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067778]  __schedule+0x2e6/0x6f0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067779]  ? __switch_to_asm+0x34/0x70
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067781]  schedule+0x33/0xa0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067784]  schedule_preempt_disabled+0xe/0x10
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067786]  __mutex_lock.isra.10+0x2c9/0x4c0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067788]  __mutex_lock_slowpath+0x13/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067789]  mutex_lock+0x2c/0x30
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067790]  rtnl_lock+0x15/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067793]  dev_ioctl+0xb7/0x570
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067794]  sock_do_ioctl+0xa0/0x140
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067796]  sock_ioctl+0x2ca/0x3c0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067797]  ? schedule_hrtimeout_range+0x13/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067801]  ? ep_poll+0x293/0x430
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067803]  do_vfs_ioctl+0xa9/0x640
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067807]  ? __secure_computing+0x3e/0xd0
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067808]  ksys_ioctl+0x67/0x90
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067810]  __x64_sys_ioctl+0x1a/0x20
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067814]  do_syscall_64+0x57/0x190
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067815]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067818] RIP: 0033:0x7f902d4a5427
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067822] Code: Bad RIP value.
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067822] RSP: 002b:00007ffed8b64178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067824] RAX: ffffffffffffffda RBX: 000055fadd93af00 RCX: 00007f902d4a5427
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067825] RDX: 00007ffed8b641a0 RSI: 0000000000008946 RDI: 0000000000000006
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067825] RBP: 00007ffed8b642d0 R08: 000055fadd947e70 R09: 0000000000000000
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067826] R10: 00000000000003b6 R11: 0000000000000246 R12: 000055fadd947e70
Nov 10 17:18:31 kvm01-bt01-b kernel: [ 1092.067827] R13: 00007ffed8b641d4 R14: 00007ffed8b641d0 R15: 0000000000000000

Booting the machine also takes ages, as the network interfaces take such a long time to come up.
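One way to quantify the slow boot (assuming a standard systemd setup) is to check which units dominate the boot time:
Code:
# list the slowest units during boot - if the NICs are the culprit,
# networking.service should be at or near the top
systemd-analyze blame | head -n 10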

This time I also got messages, but from other parts of the kernel (mainly inside the mlx5 driver):
Code:
Nov 10 17:27:32 kvm01-bt01-b kernel: [   22.624255] vmbr1: port 1(bond1) entered blocking state
Nov 10 17:27:32 kvm01-bt01-b kernel: [   22.624257] vmbr1: port 1(bond1) entered disabled state
Nov 10 17:27:34 kvm01-bt01-b kernel: [   23.791100] igb 0000:41:00.0 enp65s0f0: igb: enp65s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494858] kworker/u128:0  D    0     9      2 0x80004000
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494896] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494897] Call Trace:
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494906]  __schedule+0x2e6/0x6f0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494909]  schedule+0x33/0xa0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494912]  schedule_preempt_disabled+0xe/0x10
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494913]  __mutex_lock.isra.10+0x2c9/0x4c0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494918]  __mutex_lock_slowpath+0x13/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494919]  mutex_lock+0x2c/0x30
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494923]  rtnl_lock+0x15/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494926]  register_netdevice_notifier+0x39/0x230
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494930]  ? kvfree+0x33/0x40
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494938]  mlx5_add_netdev_notifier+0x3a/0x60 [mlx5_ib]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494945]  mlx5_ib_stage_common_roce_init+0x57/0x70 [mlx5_ib]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494952]  mlx5_ib_stage_roce_init+0x3f/0x110 [mlx5_ib]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494959]  __mlx5_ib_add+0x2c/0x80 [mlx5_ib]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494966]  mlx5_ib_add+0xde/0x2c0 [mlx5_ib]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.494991]  ? mlx5_nic_vport_update_local_lb+0xcb/0x140 [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495016]  mlx5_add_device+0x57/0xd0 [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495038]  mlx5_add_dev_by_protocol+0x49/0x50 [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495059]  mlx5_do_bond+0x139/0x1d0 [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495080]  mlx5_do_bond_work+0x1f/0x40 [mlx5_core]
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495084]  process_one_work+0x20f/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495085]  worker_thread+0x34/0x400
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495088]  kthread+0x120/0x140
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495090]  ? process_one_work+0x3d0/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495092]  ? kthread_park+0x90/0x90
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495094]  ret_from_fork+0x22/0x40
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495285] kworker/9:2     D    0   718      2 0x80004000
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495291] Workqueue: ipv6_addrconf addrconf_dad_work
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495292] Call Trace:
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495295]  __schedule+0x2e6/0x6f0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495297]  schedule+0x33/0xa0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495299]  schedule_preempt_disabled+0xe/0x10
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495301]  __mutex_lock.isra.10+0x2c9/0x4c0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495303]  ? __switch_to_asm+0x34/0x70
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495304]  ? __switch_to_asm+0x34/0x70
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495305]  ? __switch_to_asm+0x40/0x70
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495307]  __mutex_lock_slowpath+0x13/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495308]  mutex_lock+0x2c/0x30
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495311]  rtnl_lock+0x15/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495312]  addrconf_dad_work+0x3e/0x420
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495314]  ? __schedule+0x2ee/0x6f0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495316]  process_one_work+0x20f/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495318]  worker_thread+0x34/0x400
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495320]  kthread+0x120/0x140
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495322]  ? process_one_work+0x3d0/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495324]  ? kthread_park+0x90/0x90
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495325]  ret_from_fork+0x22/0x40
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495461] kworker/5:2     D    0  2066      2 0x80004000
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495465] Workqueue: events linkwatch_event
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495465] Call Trace:
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495468]  __schedule+0x2e6/0x6f0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495471]  schedule+0x33/0xa0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495473]  schedule_preempt_disabled+0xe/0x10
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495474]  __mutex_lock.isra.10+0x2c9/0x4c0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495476]  ? __switch_to_asm+0x34/0x70
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495477]  ? __switch_to_asm+0x34/0x70
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495479]  __mutex_lock_slowpath+0x13/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495480]  mutex_lock+0x2c/0x30
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495482]  rtnl_lock+0x15/0x20
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495483]  linkwatch_event+0xe/0x30
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495485]  process_one_work+0x20f/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495487]  worker_thread+0x34/0x400
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495489]  kthread+0x120/0x140
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495490]  ? process_one_work+0x3d0/0x3d0
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495492]  ? kthread_park+0x90/0x90
Nov 10 17:31:16 kvm01-bt01-b kernel: [  246.495494]  ret_from_fork+0x22/0x40
Nov 10 17:31:34 kvm01-bt01-b kernel: [  264.127686] vmbr0: port 1(enp65s0f0) entered blocking state
Nov 10 17:31:34 kvm01-bt01-b kernel: [  264.127689] vmbr0: port 1(enp65s0f0) entered forwarding state
Nov 10 17:31:34 kvm01-bt01-b kernel: [  264.127856] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Nov 10 17:31:34 kvm01-bt01-b kernel: [  264.135030] vmbr1: port 1(bond1) entered blocking state
Nov 10 17:31:34 kvm01-bt01-b kernel: [  264.135032] vmbr1: port 1(bond1) entered forwarding state
 
Hi,

what is the exact model of this Mellanox NIC?
And are you running the latest firmware on it?
 
I have tested it here with a Mellanox ConnectX-4 Lx EN OCP 2.0 Type 1 form factor card.[1]
I used your network settings, and everything is working.

A stock Ubuntu 20.04 LTS/Fedora install may not use all functions of this card.
If you are still interested in reproducing this error on Ubuntu/Fedora, we can send you a script where all functions are set manually.

1.) https://www.mellanox.com/products/ethernet-adapters/connectx-4-lx-en
 
Hi Wolfgang,
I'd appreciate it if you could send me the scripts.

I have a bunch of other servers, all with those standard PCIe ConnectX-4 Lx EN cards, running with PVE, but none uses a VLAN-aware bridge - so I never bothered ;)

Thanks for your help :cool:
 
Code:
### Only for Ubuntu:
# remove netplan
sudo apt purge netplan.io

# cleanup
sudo apt autoremove --purge

### End Only for Ubuntu

# make sure NICs are down before enslaving them
sudo ip link set enp33s0f0 down
sudo ip link set enp33s0f1 down
sudo ip link set enp34s0f0 down
sudo ip link set enp34s0f1 down

# create active-backup bond0
sudo ip link add name bond0 type bond
sudo ip link set dev bond0 type bond mode active-backup
sudo ip link set dev enp33s0f0 master bond0
sudo ip link set dev enp33s0f1 master bond0
sudo ip link set dev bond0 up
sudo ip link set dev bond0 mtu 9000

# add IP to bond0
sudo ip address add 10.60.104.2/24 dev bond0

# create active-backup bond1
sudo ip link add name bond1 type bond
sudo ip link set dev bond1 type bond mode active-backup
sudo ip link set dev enp34s0f0 master bond1
sudo ip link set dev enp34s0f1 master bond1
sudo ip link set dev bond1 up

# create bridge
sudo ip link add name vmbr1 type bridge
sudo ip link set dev vmbr1 up
sudo ip link set dev bond1 master vmbr1

# enable vlan awareness (vlan_filtering is what makes the bridge vlan aware)
sudo ip link set dev vmbr1 type bridge vlan_filtering 1
sudo bridge vlan add dev bond1 vid 2-4094

# check if vlans are applied
sudo bridge vlan show
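If the VLANs were applied, the output should look roughly like this (exact layout varies with the iproute2 version):
Code:
port    vlan ids
vmbr1    1 PVID Egress Untagged
bond1    1 PVID Egress Untagged
         2-4094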
 
Hi there,

I just got an experimental firmware release from Mellanox for this specific card - and the "link down" issue is now gone.
The startup of the vmbr with VLANs is still somewhat "shitty", as it still takes about a minute or so to bring up the NIC (before, it was about 3 minutes). Is this behaviour somewhat "normal", i.e. does bringing up a VLAN-aware bridge "usually" take longer than the non-VLAN version?
 
I guess this can be handled by restricting the VLAN range, for example:

Code:
bridge-vids 5-15 103 200-205
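The background: with bridge-vids 2-4094, ifupdown2 programs 4093 VLAN IDs on every bridge port, and on NICs with switchdev offload each VID can mean a round trip to the firmware - which would fit the stuck rtnl_lock/switchdev_deferred_process_work traces above. A rough way to see the per-port cost, assuming the manually built bridge from the script earlier (the exact timing obviously depends on the hardware):
Code:
# time programming the full VLAN range on a single port
time sudo bridge vlan add dev bond1 vid 2-4094

# count how many VIDs actually got programmed
sudo bridge vlan show dev bond1 | wc -l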
 
