[SOLVED] LXC network failure after PVE9 upgrade

nhand42

Active Member
Jan 17, 2019
I upgraded from PVE8 to PVE9 and everything went smoothly. However, afterwards all of the PCT containers were in the stopped state.

Manually starting a container shows this error:

Code:
# pct start 100
run_buffer: 571 Script exited with status 2
lxc_create_network_priv: 3466 Success - Failed to create network device
lxc_spawn: 1852 Failed to create the network
__lxc_start: 2119 Failed to spawn container "100"
startup for container '100' failed

And debug shows a bit more detail

Code:
INFO     utils - ../src/lxc/utils.c:run_script_argv:587 - Executing script "/usr/share/lxc/lxcnetaddbr" for container "100", config section "net"
DEBUG    utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/lxcnetaddbr 100 net up veth  veth100i0 produced output: RTNETLINK answers: Unknown error 524
DEBUG    utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/lxcnetaddbr 100 net up veth  veth100i0 produced output: can't enslave 'fwpr100p0' to 'vmbr0'
ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 2
ERROR    network - ../src/lxc/network.c:lxc_create_network_priv:3466 - Success - Failed to create network device
ERROR    start - ../src/lxc/start.c:lxc_spawn:1852 - Failed to create the network
DEBUG    network - ../src/lxc/network.c:lxc_delete_network:4221 - Deleted network devices
ERROR    start - ../src/lxc/start.c:__lxc_start:2119 - Failed to spawn container "100"
WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 16 for process 4227
startup for container '100' failed

This is the network status. It's normal for eno2 to be DOWN (it's unplugged).

Code:
# ip l show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:34 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f0
    altname enx645106d81234
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:35 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f1
    altname enx645106d81235
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:34 brd ff:ff:ff:ff:ff:ff

And the network configuration in /etc/network/interfaces

Code:
auto lo
iface lo inet loopback
iface eno1 inet manual
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.55/24
    gateway 192.168.1.1
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
source /etc/network/interfaces.d/*

Tried various things and eventually stumbled on this workaround (not a fix).

Code:
# ifdown vmbr0 ; ifup vmbr0
# pct start 100
#

And now the container is running. But reboot the host and the fault returns. I'm holding off upgrading more hosts to PVE9 because a reboot effectively bricks all PCT until I manually intervene with the workaround.

I have no explanation for how the workaround even works. I compared ifconfig before/after and nothing changes except the index number of vmbr0 (goes from 4 to 22). I'll continue troubleshooting tomorrow.
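Until the root cause is found, one stopgap I'm considering (an untested sketch; it assumes the guests are auto-started by pve-guests.service, and the unit name /etc/systemd/system/recycle-vmbr0.service is just a placeholder) is a oneshot systemd unit that re-cycles vmbr0 before the containers come up at boot:

Code:
[Unit]
Description=Re-cycle vmbr0 before guests start (temporary workaround)
After=networking.service
Before=pve-guests.service

[Service]
Type=oneshot
# Same commands as the manual workaround above
ExecStart=/bin/sh -c 'ifdown vmbr0; ifup vmbr0'

[Install]
WantedBy=multi-user.target

It would be enabled with "systemctl enable recycle-vmbr0.service", but I'd much rather find the actual cause than rely on something like this.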

PS: I searched the forums and found one other thread with a similar error message, but the fix there to reinstall proxmox-kernel did not work for me.
 
It sounds like the issue might be related to missing or misconfigured settings after the upgrade. One common cause of PCT containers failing to start is a mismatch in the underlying configuration files or storage paths.
 
It sounds like the issue might be related to missing or misconfigured settings after the upgrade. One common cause of PCT containers failing to start is a mismatch in the underlying configuration files or storage paths.
The PCT containers do start after "ifdown vmbr0; ifup vmbr0". However, this workaround is not persistent across reboots.

I would not expect the PCT configuration to be changed by "ifdown vmbr0; ifup vmbr0" on the Proxmox host.
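For what it's worth, the container config only references the bridge by name, so there is nothing in it for ifdown/ifup to touch (illustrative, elided output rather than my exact config):

Code:
# pct config 100 | grep ^net
net0: name=eth0,bridge=vmbr0,firewall=1,...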
 
Code:
auto lo
iface lo inet loopback
iface eno1 inet manual
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.55/24
    gateway 192.168.1.1
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
source /etc/network/interfaces.d/*
Hello, how do you think this setup works with your two network cards? Are both connected to the switch? Then you have a problem.
 
Hello, how do you think this setup works with your two network cards? Are both connected to the switch? Then you have a problem.
This has worked fine for several years. I use a bridge so I don't have to remember which port is which on the back; I just plug a single cable into any port and use vmbr0. I already said this worked on PVE8 and works on PVE9 after ifdown/ifup, and I also pointed out that eno2 being down is normal (it's unplugged), so I don't know what clever remark you thought you were making.
 
Wow, so you don't want me to use my German right to free speech?
You made a mistake and you won't fix it.
I hope the Proxmox staff can show you the right way to handle your problem.
 
And the network configuration in /etc/network/interfaces

auto lo
iface lo inet loopback
iface eno1 inet manual
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.55/24
    gateway 192.168.1.1
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
source /etc/network/interfaces.d/*
Tried various things and eventually stumbled on this workaround (not a fix).

# ifdown vmbr0 ; ifup vmbr0
# pct start 100
#
And now the container is running. But reboot the host and the fault returns. I'm holding off upgrading more hosts to PVE9 because a reboot effectively bricks all PCT until I manually intervene with the workaround.

I have no explanation for how the workaround even works. I compared ifconfig before/after and nothing changes except the index number of vmbr0 (goes from 4 to 22). I'll continue troubleshooting tomorrow.
Please check the system log for any related errors / warnings, maybe it's timing related.

Which ifupdown version does your setup use? apt list --installed | grep ifupdown
 
Please check the system log for any related errors / warnings, maybe it's timing related.

Which ifupdown version does your setup use? apt list --installed | grep ifupdown

I've been looking through logs and might have found something useful. After a reboot, dmesg shows eth0 and eth1 being renamed at T+3, vmbr0's ports being added at T+16 and eno1 entering the forwarding state at T+20, then three containers trying to start from T+37 onwards, with all three failing.

# dmesg | egrep '(vmbr|eno)'
[ 3.836182] tg3 0000:03:00.1 eno2: renamed from eth1
[ 3.836354] tg3 0000:03:00.0 eno1: renamed from eth0
[ 16.291891] vmbr0: port 1(eno1) entered blocking state
[ 16.291897] vmbr0: port 1(eno1) entered disabled state
[ 16.291916] tg3 0000:03:00.0 eno1: entered allmulticast mode
[ 16.292132] vmbr0: port 2(eno2) entered blocking state
[ 16.292135] vmbr0: port 2(eno2) entered disabled state
[ 16.292145] tg3 0000:03:00.1 eno2: entered allmulticast mode
[ 16.292177] tg3 0000:03:00.1 eno2: entered promiscuous mode
[ 16.292188] tg3 0000:03:00.0 eno1: entered promiscuous mode
[ 19.975722] tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
[ 19.975732] tg3 0000:03:00.0 eno1: Flow control is off for TX and off for RX
[ 19.975734] tg3 0000:03:00.0 eno1: EEE is disabled
[ 19.975740] vmbr0: port 1(eno1) entered blocking state
[ 19.975745] vmbr0: port 1(eno1) entered forwarding state
[ 23.834078] netpoll: netconsole: interface 'vmbr0'
[ 37.513778] vmbr0: port 3(fwpr100p0) entered blocking state
[ 37.513784] vmbr0: port 3(fwpr100p0) entered disabled state
[ 39.545804] vmbr0: port 3(fwpr116p0) entered blocking state
[ 39.545811] vmbr0: port 3(fwpr116p0) entered disabled state
[ 41.704603] vmbr0: port 3(veth124i0) entered blocking state
[ 41.704610] vmbr0: port 3(veth124i0) entered disabled state

After the reboot the fault with the container has returned, as evidenced by dmesg logs at T+37, T+39 and T+41.

# pct start 100
run_buffer: 571 Script exited with status 2
lxc_create_network_priv: 3466 Success - Failed to create network device
lxc_spawn: 1852 Failed to create the network
__lxc_start: 2119 Failed to spawn container "100"
startup for container '100' failed

Here's the full dmesg from this manual attempt to start the container. Note the interesting message "doesn't support polling".

[ 792.860520] audit: type=1400 audit(1759134701.942:133): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-100_</var/lib/lxc>" pid=8433 comm="apparmor_parser"
[ 793.669678] vmbr0: port 3(fwpr100p0) entered blocking state
[ 793.669686] vmbr0: port 3(fwpr100p0) entered disabled state
[ 793.669708] fwpr100p0: entered allmulticast mode
[ 793.669733] netpoll: (null): fwpr100p0 doesn't support polling, aborting
[ 793.669745] fwpr100p0: left allmulticast mode
[ 793.810805] audit: type=1400 audit(1759134702.892:134): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-100_</var/lib/lxc>" pid=8492 comm="apparmor_parser"

After the ifdown/ifup workaround the container now successfully starts.

# ifdown vmbr0 ; ifup vmbr0
# pct start 100
#

Here's the dmesg from this successful start of the container. The suspicious message about polling is absent; the interface enters the forwarding state and the virtual interfaces come up shortly afterwards.

[ 1018.159669] audit: type=1400 audit(1759134927.241:135): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-100_</var/lib/lxc>" pid=9698 comm="apparmor_parser"
[ 1018.959886] vmbr0: port 3(fwpr100p0) entered blocking state
[ 1018.959893] vmbr0: port 3(fwpr100p0) entered disabled state
[ 1018.959913] fwpr100p0: entered allmulticast mode
[ 1018.959956] fwpr100p0: entered promiscuous mode
[ 1018.960397] vmbr0: port 3(fwpr100p0) entered blocking state
[ 1018.960400] vmbr0: port 3(fwpr100p0) entered forwarding state
[ 1019.012617] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 1019.012633] fwbr100i0: port 1(fwln100i0) entered disabled state
[ 1019.012669] fwln100i0: entered allmulticast mode
[ 1019.012718] fwln100i0: entered promiscuous mode
[ 1019.012775] fwbr100i0: port 1(fwln100i0) entered blocking state
[ 1019.012780] fwbr100i0: port 1(fwln100i0) entered forwarding state
[ 1019.021856] fwbr100i0: port 2(veth100i0) entered blocking state
[ 1019.021861] fwbr100i0: port 2(veth100i0) entered disabled state
[ 1019.021875] veth100i0: entered allmulticast mode
[ 1019.021913] veth100i0: entered promiscuous mode
[ 1019.065378] eth0: renamed from vethnyG4ol

Here's the ifupdown version you asked for.

# apt list --installed | grep ifupdown
ifupdown2/stable,now 3.3.0-1+pmx10 all [installed]
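
To rule out a pure unit-ordering problem, the boot-time ordering could also be inspected (a suggestion rather than something I've run yet; it assumes the containers are auto-started by pve-guests.service):

Code:
# systemd-analyze critical-chain networking.service
# systemd-analyze critical-chain pve-guests.service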
 
Google suggested this polling error might be due to the netconsole service. As an experiment I've disabled netconsole and rebooted.

# systemctl disable netconsole
Synchronizing state of netconsole.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install disable netconsole
Removed '/etc/systemd/system/multi-user.target.wants/netconsole.service'.
# reboot

After the reboot the containers have all started correctly.

# pct list
VMID       Status     Lock         Name
100        running                 aaaa
116        running                 bbbb
124        running                 cccc
#

Your intuition that it was timing-related seems correct. I suspect a race between netconsole and networking.

I've removed the netconsole package from all hosts and that's solved the problem.
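
In case anyone else hits this, these are the places I'd check to see whether netconsole is active and where it gets configured (a sketch; the paths are the usual Debian locations, adjust for your own setup):

Code:
# lsmod | grep netconsole
# systemctl status netconsole
# grep -rs netconsole /proc/cmdline /etc/modules /etc/modules-load.d/ /etc/modprobe.d/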
 
I was able to reproduce this problem as well by loading the netconsole module:
Bash:
root@host:~# pct config 103
...
net0: name=eth0,bridge=vmbr0,firewall=1,..
...
root@host:~# pct start 103
run_buffer: 571 Script exited with status 2
lxc_create_network_priv: 3466 Success - Failed to create network device
lxc_spawn: 1852 Failed to create the network
startup for container '103' failed
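For reference, the module was loaded with its target bound to vmbr0, roughly like this (the addresses and MAC are placeholders; the parameter follows the kernel's netconsole= format of src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac):
Bash:
root@host:~# modprobe netconsole netconsole=6665@192.168.1.2/vmbr0,6666@192.168.1.10/00:11:22:33:44:55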
From what I found, the problem occurs when netconsole uses the same bridge interface as the container.

For example, if the LXC container uses vmbr1 and netconsole on the host uses vmbr0, the issue does not occur:
Bash:
root@host:~# pct config 103
...
net0: name=eth0,bridge=vmbr1,firewall=1,...
...
root@host:~# pct start 103
root@host:~#   
root@host:~# pct list
VMID       Status     Lock         Name               
103        running                 CT103

A new issue [1] has been filed against the Proxmox Bugzilla instance. Please follow it to see the current status.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=6873
 