LXC network failure after PVE9 upgrade

nhand42

I upgraded from PVE8 to PVE9 and the upgrade itself went smoothly. However, afterwards all the containers were left in the stopped state.

Manually starting a container with pct shows this error:

Code:
# pct start 100
run_buffer: 571 Script exited with status 2
lxc_create_network_priv: 3466 Success - Failed to create network device
lxc_spawn: 1852 Failed to create the network
__lxc_start: 2119 Failed to spawn container "100"
startup for container '100' failed

And the debug log shows a bit more detail:

Code:
INFO     utils - ../src/lxc/utils.c:run_script_argv:587 - Executing script "/usr/share/lxc/lxcnetaddbr" for container "100", config section "net"
DEBUG    utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/lxcnetaddbr 100 net up veth  veth100i0 produced output: RTNETLINK answers: Unknown error 524
DEBUG    utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/lxcnetaddbr 100 net up veth  veth100i0 produced output: can't enslave 'fwpr100p0' to 'vmbr0'
ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 2
ERROR    network - ../src/lxc/network.c:lxc_create_network_priv:3466 - Success - Failed to create network device
ERROR    start - ../src/lxc/start.c:lxc_spawn:1852 - Failed to create the network
DEBUG    network - ../src/lxc/network.c:lxc_delete_network:4221 - Deleted network devices
ERROR    start - ../src/lxc/start.c:__lxc_start:2119 - Failed to spawn container "100"
WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 16 for process 4227
startup for container '100' failed
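
For what it's worth, "RTNETLINK answers: Unknown error 524" corresponds to the kernel's ENOTSUPP (operation not supported), which glibc has no message for. So the failing step is the bridge refusing to enslave the firewall port fwpr100p0, not the veth creation itself. For anyone hitting the same thing, these read-only diagnostics (run as root on the host; a sketch for comparing state before and after the workaround, not a fix) should show the bridge and VLAN filtering state:

```
# cat /sys/class/net/vmbr0/bridge/vlan_filtering
# bridge vlan show dev vmbr0
# ip -d link show vmbr0
# dmesg | tail -n 20
```

Comparing this output from a failing boot against the output after the ifdown/ifup workaround might reveal what actually changes.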

This is the network status. It's normal for eno2 to be DOWN (it's unplugged).

Code:
# ip l show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:34 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f0
    altname enx645106d81234
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:35 brd ff:ff:ff:ff:ff:ff
    altname enp3s0f1
    altname enx645106d81235
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 64:51:06:d8:12:34 brd ff:ff:ff:ff:ff:ff

And the network configuration in /etc/network/interfaces:

Code:
auto lo
iface lo inet loopback
iface eno1 inet manual
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.55/24
    gateway 192.168.1.1
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
source /etc/network/interfaces.d/*
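
Since PVE uses ifupdown2 by default, it may also be worth checking whether the state applied at boot actually matches this config. A mismatch at boot that gets corrected by a manual ifdown/ifup would be consistent with the symptoms. A sketch (ifquery is part of ifupdown2):

```
# ifquery --check -a
# ifquery --running vmbr0
```

The first command flags interfaces whose running state differs from the config; the second prints the state ifupdown2 believes vmbr0 is in.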

I tried various things and eventually stumbled on this workaround (not a fix):

Code:
# ifdown vmbr0 ; ifup vmbr0
# pct start 100
#

And now the container is running. But if I reboot the host, the fault returns. I'm holding off upgrading my other hosts to PVE9 because a reboot effectively bricks all containers until I manually intervene with the workaround.
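
Until the root cause is found, the workaround could in principle be automated with a oneshot systemd unit that bounces the bridge before pve-guests.service (the unit that auto-starts guests) runs. This is an untested sketch, and the unit name is made up:

```
# /etc/systemd/system/vmbr0-bounce.service (hypothetical name)
[Unit]
Description=Workaround: bounce vmbr0 before guests autostart
After=networking.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'ifdown vmbr0; ifup vmbr0'

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable vmbr0-bounce.service`. Whether the bounce behaves the same this early in boot as it does interactively is an open question, so I'd rather find the real fix.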

I have no explanation for how the workaround even works. I compared the ifconfig output before and after, and nothing changes except the interface index of vmbr0 (it goes from 4 to 22). I'll continue troubleshooting tomorrow.

PS: I searched the forums and found one other thread with a similar error message, but the fix suggested there, reinstalling the proxmox-kernel package, did not work for me.