[PVE-9-BETA] No vmbr0 communication since upgrade and using NIC-naming tool

jsterr

Renowned Member
Jul 24, 2020
1. We have no IP communication on any VMs and CTs since upgrading to PVE 9 (they can't reach 10.26.0.1)
-- untouched PVE 8.4.5 clusters on the same subnet work fine, including their VMs/CTs

2. The same LAN works fine on the PVE hosts themselves (they can reach the GW 10.26.0.1)
3. I did use the name-pinning tool (see below)
4. I get some FRR errors, despite never having used FRR before
5. I attached the pve-reports of all 3 nodes

I also used and tried the proxmox-network-interface-pinning tool, and it works great. It also includes ports that are currently down (nic1 = created), but the down port still shows up in the web UI under the old NIC's alternative name (Alternative-Name: enxaacf63306908), although it is NOT present in the /etc/network/interfaces file. This might lead to confusion? (See the altname check after the lshw output below.)

Code:
root@PMX7:~# ip a | grep DOWN
2: nic0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN group default qlen 1000
5: enp1s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
12: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue master vmbr0 state DOWN group default qlen 1000

root@PMX7:~# lshw -c network -businfo
Bus info          Device        Class          Description
==========================================================
pci@0000:01:00.0  nic0          network        BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:01:00.1  enp1s0f1np1   network        BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:41:00.0  nic4          network        I350 Gigabit Network Connection
pci@0000:41:00.1  nic5          network        I350 Gigabit Network Connection
pci@0000:e1:00.0  nic2          network        BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
pci@0000:e1:00.1  nic3          network        BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
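
A quick way to see which alternative names are still attached to the renamed ports (plain iproute2; the grep pattern is just for illustration):

Code:
root@PMX7:~# ip link show | grep -E '^[0-9]+:|altname'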


Edit: We ran into an issue; it seems nic1 was not created correctly?

Code:
Removed '/etc/systemd/system/multi-user.target.wants/dnsmasq@zNB.service'.
Created symlink '/etc/systemd/system/multi-user.target.wants/dnsmasq@zNB.service' -> '/usr/lib/systemd/system/dnsmasq@.service'.
bond0 : error: bond0: skipping slave nic1, does not exist

TASK ERROR: command 'ifreload -a' failed: exit code 1

Different node, same test-cluster.

Code:
Removed '/etc/systemd/system/multi-user.target.wants/dnsmasq@zNB.service'.
Created symlink '/etc/systemd/system/multi-user.target.wants/dnsmasq@zNB.service' -> '/usr/lib/systemd/system/dnsmasq@.service'.
nic1 : warning: nic1: interface not recognized - please check interface configuration
TASK ERROR: daemons file does not exist at /usr/share/perl5/PVE/Network/SDN/Frr.pm line 133.
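
To double-check that FRR really isn't installed on these nodes (plain ls/dpkg; nothing from the tooling above):

Code:
ls -l /etc/frr/daemons
dpkg -l frr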
 

What's the content of /usr/local/lib/systemd/network? A listing plus the contents of the files, please.
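
E.g. something like this (standard shell commands, nothing Proxmox-specific assumed):

Code:
ls -l /usr/local/lib/systemd/network/
cat /usr/local/lib/systemd/network/*.link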

FRR seems to be an issue when upgrading from 8 to 9 - I'll look into it more closely.
 
Code:
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic0.link
[Match]
MACAddress=14:23:f2:e6:86:f0
Type=ether

[Link]
Name=nic0
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic1.link
[Match]
MACAddress=14:23:f2:e6:86:f0
Type=ether

[Link]
Name=nic1
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic2.link
[Match]
MACAddress=bc:97:e1:da:92:a0
Type=ether

[Link]
Name=nic2
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic3.link
[Match]
MACAddress=bc:97:e1:da:92:a1
Type=ether

[Link]
Name=nic3
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic4.link
[Match]
MACAddress=bc:ec:a0:ec:e7:6c
Type=ether

[Link]
Name=nic4
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic5.link
[Match]
MACAddress=bc:ec:a0:ec:e7:6d
Type=ether

[Link]
Name=nic5

Those are all the files that are in there.
 
It seems the issue is that nic0 and nic1 have the same MAC address pinned (most likely due to the bond, even though the pinning tool should already account for that). As a workaround, you should be able to edit the 50-pve-nic1.link file and set the proper MAC, 14:23:f2:e6:47:91, then reboot.
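
The corrected file would then look like this (a sketch; MAC taken from the message above):

Code:
root@PMX8:~# cat /usr/local/lib/systemd/network/50-pve-nic1.link
[Match]
MACAddress=14:23:f2:e6:47:91
Type=ether

[Link]
Name=nic1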

The FRR error should be avoidable for now by installing FRR, disabling all daemons in /etc/frr/daemons, and then restarting FRR - I'll provide a proper fix via a patch. It seems the detection of whether FRR is installed failed, so it tries to reload the FRR configuration.
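
Roughly like this (a sketch; the sed pattern assumes the stock daemons file, where all daemon names end in "d", so vtysh_enable is left alone):

Code:
apt install frr
sed -i -E 's/^([a-z0-9]+d)=yes$/\1=no/' /etc/frr/daemons
systemctl restart frr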

Thanks for the report!
 
Regarding the FRR issue: did you at any point have a configured Controller or a Fabric on that node? The FRR configuration should only get applied if either a controller or a fabric exists - I wasn't able to reproduce this on my test cluster.

Which node did you try this on? PMX8? Is this still occurring every time you reload the network configuration? Can you send me a report of the affected node, as well as the output of the following command?

Code:
perl -e 'use PVE::Network::SDN; use Data::Dumper; print Dumper(PVE::Network::SDN::running_config_has_frr());'
 
Thanks, we changed nic1 to the proper MAC, rebooted, and also changed FRR.
This is how my /etc/frr/daemons file looks; should all the lines with "yes" be switched to "no", followed by systemctl restart frr?

Code:
root@PMX7:~# cat /etc/frr/daemons
# This file tells the frr package which daemons to start.
#
# Sample configurations for these daemons can be found in
# /usr/share/doc/frr/examples/.
#
# ATTENTION:
#
# When activating a daemon for the first time, a config file, even if it is
# empty, has to be present *and* be owned by the user and group "frr", else
# the daemon will not be started by /etc/init.d/frr. The permissions should
# be u=rw,g=r,o=.
# When using "vtysh" such a config file is also needed. It should be owned by
# group "frrvty" and set to ug=rw,o= though. Check /etc/pam.d/frr, too.
#
# The watchfrr, zebra and staticd daemons are always started.
#
bgpd=yes
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
pim6d=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=yes
fabricd=no
vrrpd=no
pathd=no

#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
vtysh_enable=yes
zebra_options="  -A 127.0.0.1 -s 90000000"
mgmtd_options="  -A 127.0.0.1"
bgpd_options="   -A 127.0.0.1"
ospfd_options="  -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options="   -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options="  -A 127.0.0.1"
pimd_options="   -A 127.0.0.1"
pim6d_options="  -A ::1"
ldpd_options="   -A 127.0.0.1"
nhrpd_options="  -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options="   -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options="   -A 127.0.0.1"
fabricd_options="-A 127.0.0.1 --dummy_as_loopback"
vrrpd_options="  -A 127.0.0.1"
pathd_options="  -A 127.0.0.1"


#
# If you want to pass a common option to all daemons, you can use the
# "frr_global_options" variable.
#
#frr_global_options=""


# The list of daemons to watch is automatically generated by the init script.
# This variable can be used to pass options to watchfrr that will be passed
# prior to the daemon list.
#
# To make watchfrr create/join the specified netns, add the "--netns"
# option here. It will only have an effect in /etc/frr/<somename>/daemons, and
# you need to start FRR with "/usr/lib/frr/frrinit.sh start <somename>".
#
#watchfrr_options=""


# configuration profile
#
#frr_profile="traditional"
#frr_profile="datacenter"


# This is the maximum number of FD's that will be available.  Upon startup this
# is read by the control files and ulimit is called.  Uncomment and use a
# reasonable value for your setup if you are expecting a large number of peers
# in say BGP.
#
#MAX_FDS=1024

# Uncomment this option if you want to run FRR as a non-root user. Note that
# you should know what you are doing since most of the daemons need root
# to work. This could be useful if you want to run FRR in a container
# for instance.
# FRR_NO_ROOT="yes"

# For any daemon, you can specify a "wrap" command to start instead of starting
# the daemon directly. This will simply be prepended to the daemon invocation.
# These variables have the form daemon_wrap, where 'daemon' is the name of the
# daemon (the same pattern as the daemon_options variables).
#
# Note that when daemons are started, they are told to daemonize with the `-d`
# option. This has several implications. For one, the init script expects that
# when it invokes a daemon, the invocation returns immediately. If you add a
# wrap command here, it must comply with this expectation and daemonize as
# well, or the init script will never return. Furthermore, because daemons are
# themselves daemonized with -d, you must ensure that your wrapper command is
# capable of following child processes after a fork() if you need it to do so.
#
# If your desired wrapper does not support daemonization, you can wrap it with
# a utility program that daemonizes programs, such as 'daemonize'. An example
# of this might look like:
#
# bgpd_wrap="/usr/bin/daemonize /usr/bin/mywrapper"
#
# This is particularly useful for programs which record processes but lack
# daemonization options, such as perf and rr.
#
# If you wish to wrap all daemons in the same way, you may set the "all_wrap"
# variable.
#
#all_wrap=""

I'm 99% sure I never touched FRR on this relatively new cluster. We only use Linux adapters + OVS RSTP for vmbr1 (Ceph full mesh); we do not use FRR for the Ceph full mesh and never tried it on this one.

I attached the output of systemctl status frr for all 3 nodes, after setting all lines to "no" and restarting FRR. Nothing changes. I then reapplied the network config via the web UI and ping works again inside the VMs. I tried reloading all network configs via the web UI; no errors appear regarding FRR or nic1.

The output of your command

Code:
perl -e 'use PVE::Network::SDN; use Data::Dumper; print Dumper(PVE::Network::SDN::running_config_has_frr());'

on all 3 nodes is empty (no output).

I then also rebooted a node to see what the behaviour is; this appeared on PMX7:


Code:
root@PMX7:~# systemctl status frr
● frr.service - FRRouting
     Loaded: loaded (/usr/lib/systemd/system/frr.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-07-22 11:52:30 CEST; 1min 59s ago
 Invocation: 503d210b115044b2bdc26de3fb20ceaa
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
   Main PID: 2377 (watchfrr)
     Status: "FRR Operational"
      Tasks: 8 (limit: 309140)
     Memory: 37M (peak: 54.4M)
        CPU: 577ms
     CGroup: /system.slice/frr.service
             ├─2377 /usr/lib/frr/watchfrr -d -F traditional mgmtd zebra staticd
             ├─2537 /usr/lib/frr/mgmtd -d -F traditional -A 127.0.0.1
             ├─2560 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
             └─2566 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

Jul 22 11:53:12 PMX7 zebra[2560]: libyang Invalid type uint32 empty value. (/frr-interface:lib/interface/state/speed)
Jul 22 11:53:14 PMX7 zebra[2560]: libyang Invalid boolean value "". (/frr-vrf:lib/vrf/state/active)
Jul 22 11:53:14 PMX7 zebra[2560]: libyang Invalid type uint32 empty value. (/frr-vrf:lib/vrf/state/id)
Jul 22 11:53:15 PMX7 zebra[2560]: libyang Invalid boolean value "". (/frr-vrf:lib/vrf/state/active)
Jul 22 11:53:15 PMX7 zebra[2560]: libyang Invalid type uint32 empty value. (/frr-vrf:lib/vrf/state/id)
Jul 22 11:53:15 PMX7 zebra[2560]: libyang Invalid boolean value "". (/frr-vrf:lib/vrf/state/active)
Jul 22 11:53:15 PMX7 zebra[2560]: libyang Invalid type uint32 empty value. (/frr-vrf:lib/vrf/state/id)
Jul 22 11:53:15 PMX7 zebra[2560]: libyang Unsatisfied pattern - "" does not conform to "[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}". (/frr-interface:lib/interface/state/phy-address)
Jul 22 11:53:30 PMX7 zebra[2560]: [WPPMZ-G9797] if_zebra_speed_update: fwbr119i0 old speed: 0 new speed: 10000
Jul 22 11:53:30 PMX7 zebra[2560]: libyang Invalid type uint32 empty value. (/frr-interface:lib/interface/state/speed)

Currently, all VM communication is working again.

Edit 2: reports are already in this thread (see posts above).
 
