Hello
I am running Proxmox pve-manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.4-2-pve) on a fresh install. The host has 4 NICs, and this is my network config:
---------
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto enp4s0f4
iface enp4s0f4 inet manual

auto eno2
iface eno2 inet manual

auto enp4s0f5
iface enp4s0f5 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

auto bond1
iface bond1 inet manual
        bond-slaves enp4s0f4 enp4s0f5
        bond-miimon 100
        bond-mode active-backup
        bond-primary enp4s0f4

auto vmbr0v100
iface vmbr0v100 inet static
        address 10.19.18.2/22
        gateway 10.19.16.1
        bridge-ports vmbr0.100
        bridge-stp off
        bridge-fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

auto vmbr0.100
iface vmbr0.100 inet manual

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
----------------
Hardware
HP C7000 chassis
Blade BL465g8
As you can see from my network config, I have created 2 bonds and a bridge on top of each bond. On top of vmbr0 I also have vmbr0v100, which holds my IP for the management network. So far so good.
I created a couple of VMs to test, and the issue appears when I assign a bridge to a VM to give it network access. I get these scenarios:
1- If I use vmbr1 with any VLAN ID, all works fine: the VM starts with no issues and gets network access.
2- If I use vmbr1v100 and assign a VLAN ID, Proxmox does not allow me to start the VM and complains that I cannot have a VLAN ID there. I DO understand this, so far so good.
3- If I use vmbr0 (the management bridge) with any VLAN ID other than 100, the VM starts fine with no issues.
4- If I use vmbr0 with VLAN ID 100, the issue appears. The VM starts, but I lose network connection to the host. From the console I can see that NICs eno1 and eno2 report no network connection at all, and tcpdump shows no packets coming in. At this point eno1, eno2, bond0, vmbr0, and vmbr0v100 all suddenly report "link down".
After reaching this point, the only way to recover is to:
- Kill the VM (optional, as the next step will do it anyway)
- Reboot the host
- Once the host comes back, the NICs still report link down (via ethtool), but tcpdump is able to capture some packets
- Go to the chassis, edit the profile for those two NICs, remove VLAN 100, replace it with any other random VLAN, and apply the changes
- After this, ethtool reports the link as up
- Go back to the profile on the chassis, remove the VLAN you just added, and put VLAN 100 back
- At this point the sky is blue and all works as expected
My questions/concerns are these:
A- If this configuration (two tagged bridges with the same VLAN ID on top of the same bridge) is for some reason not supported, why not alert the user with a clear error and prevent the whole issue, as in scenario 2?
B- Is there a way to avoid this? Maybe with a different configuration? As you can see, it is really easy to run into scenario #4, and that brings down all management connectivity to the host. This would be a disaster in production.
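For reference on question B, one commonly documented alternative is a VLAN-aware bridge with the management IP on a VLAN sub-interface of that bridge, so that no separately named vmbr0v100 bridge exists. The fragment below is only a sketch and has not been tested on this hardware. It assumes ifupdown2 (the `bridge-vlan-aware` and `bridge-vids` options), assumes the chassis profile still delivers VLAN 100 tagged on eno1/eno2, and reuses the address and gateway from the config above. The bond0 stanza stays unchanged. Whether this also avoids the link-down behaviour of the NICs is an open assumption.

```
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        # Assumption: let the bridge itself filter VLAN tags
        bridge-vlan-aware yes
        bridge-vids 2-4094

# Management IP on VLAN 100 of the bridge, replacing vmbr0v100
auto vmbr0.100
iface vmbr0.100 inet static
        address 10.19.18.2/22
        gateway 10.19.16.1
```

With a layout like this, a VM attached to vmbr0 with VLAN ID 100 would be tagged on its own bridge port rather than through an extra per-VLAN bridge.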
Thanks in advance for reading this. All suggestions are welcome.
Here are some logs. At 15:45:00 we started the VM with scenario #4.
journalctl -f
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: Link is Up
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: Link is Up
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: bond0: (slave eno1): link status definitely up, 10000 Mbps full duplex
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: bond0: (slave eno2): link status definitely up, 10000 Mbps full duplex
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: bond0: (slave eno1): making interface the new active one
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: entered allmulticast mode
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: bond0: active interface up!
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0: port 1(bond0) entered blocking state
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0: port 1(bond0) entered forwarding state
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0v100: port 2(bond0.100) entered blocking state
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0v100: port 2(bond0.100) entered forwarding state
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0v100: port 1(vmbr0.100) entered blocking state
Jun 04 15:44:24 bl465g8-test12-roc1 kernel: vmbr0v100: port 1(vmbr0.100) entered forwarding state
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: Link is Down
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: Link is Down
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: bond0: (slave eno1): link status definitely down, disabling slave
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: bond0: (slave eno2): making interface the new active one
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: left allmulticast mode
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: entered promiscuous mode
Jun 04 15:45:50 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: entered allmulticast mode
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: Link is Down
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: Link is Down
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: bond0: (slave eno2): link status definitely down, disabling slave
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: left promiscuous mode
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1 eno2: left allmulticast mode
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: bond0: now running without any active interface!
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: vmbr0: port 1(bond0) entered disabled state
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: vmbr0v100: port 2(bond0.100) entered disabled state
Jun 04 15:45:51 bl465g8-test12-roc1 kernel: be2net 0000:04:00.1: Disabling VLAN promiscuous mode
Jun 04 15:45:52 bl465g8-test12-roc1 kernel: vmbr0v100: port 1(vmbr0.100) entered disabled state
Jun 04 15:46:04 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0 eno1: left promiscuous mode
Jun 04 15:46:04 bl465g8-test12-roc1 kernel: be2net 0000:04:00.0: Disabling VLAN promiscuous mode