All LXCs and VMs lost networking!

lifeboy

Renowned Member
I had a really perplexing situation last night. I had previously upgraded all four nodes running the latest pve to pve-kernel-5.15. Because the naming of the network interfaces changed at some stage, I had to recreate the /etc/network/interfaces file with the new NIC names on each node, one at a time, before restarting it with the new kernel. This worked fine for all 4 nodes.
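(Side note, in case someone hits the same rename: once the new names are known, they can be pinned with a systemd .link file so a later kernel or driver update doesn't rename them again. A minimal sketch; the file name, the MAC address and "lan0" are placeholders, not values from this cluster:)

Code:
# /etc/systemd/network/10-lan0.link
# match the NIC by MAC and give it a stable name
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0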

Then, after all the nodes were up and running again and the Ceph cluster health was at 100% again, I added one additional section to the networking config file.

Code:
auto enp24s0f1
iface enp24s0f1 inet static
    address 172.16.10.1/24
#corosync

auto enp24s0f1:1
iface enp24s0f1:1 inet static
    address 172.16.5.201/24
#ILOM path
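(As an aside: the `:1` suffix is the legacy ifupdown alias notation. ifupdown2 also accepts several address lines in a single stanza, which would make the alias interface unnecessary. A minimal equivalent sketch, assuming ifupdown2's multi-address support:)

Code:
# corosync + ILOM path on one interface; ifupdown2 allows
# multiple "address" lines in one stanza
auto enp24s0f1
iface enp24s0f1 inet static
    address 172.16.10.1/24
    address 172.16.5.201/24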

Then I ran systemctl restart networking, which briefly disconnected the node before it came online again.

I did this on 3 nodes, and when I finally added the same section to the last node (FT1-NodeC), which was the initial node upgraded about 10 days ago, the network just didn't appear to come up again. All VMs and LXCs on the whole cluster became inaccessible. Since I run 2 software firewalls on Proxmox in a CARP failover config, I lost connectivity to the whole cluster. In the end I had to drive to the DC to investigate the cause.

It seems that all the guests lost networking. This must have something to do with the pve cluster filesystem (?), since that is the component that replicates things to all the other nodes. Rebooting a guest didn't fix the problem; shutting it down and starting it again did. That behaviour normally appears when a hardware change is made to a guest's config: to effect, for example, an increase in RAM, the guest has to be shut down and started again. However, the only change here was adding an alias to a NIC, which in theory should not affect any of the other NICs.
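If it happens again, a less disruptive check (my assumption, not something the logs below confirm directly) would be to see whether the guests' tap ports are still enslaved to their bridge, and re-attach them by hand instead of doing a full stop/start:

Code:
# list the ports currently attached to the bridges
bridge link show

# a VM NIC appears as tap<VMID>i<n>; if it has fallen off the
# bridge it can be re-attached manually (hypothetical example
# for VM 100 on vmbr0):
ip link set tap100i0 master vmbr0
ip link set tap100i0 up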

Here are some potentially relevant log entries from FT1-NodeC (the node on which the alias was added and networking was restarted).

Code:
Feb 24 21:24:37 FT1-NodeC networking[1203943]: error: netlink: enp25s0f0np0.100: cannot create vlan enp25s0f0np0.100 100: interface name exceeds max length of 15
Feb 24 21:24:37 FT1-NodeC systemd[1]: Reloading Postfix Mail Transport Agent (instance -).
Feb 24 21:24:37 FT1-NodeC postfix/postfix-script[1204211]: refreshing the Postfix mail system
Feb 24 21:24:37 FT1-NodeC postfix/master[1781]: reload -- version 3.5.6, configuration /etc/postfix
Feb 24 21:24:37 FT1-NodeC systemd[1]: Reloaded Postfix Mail Transport Agent (instance -).
Feb 24 21:24:37 FT1-NodeC systemd[1]: Reloading Postfix Mail Transport Agent.
Feb 24 21:24:37 FT1-NodeC systemd[1]: Reloaded Postfix Mail Transport Agent.
Feb 24 21:24:37 FT1-NodeC systemd-udevd[1203628]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 24 21:24:37 FT1-NodeC networking[1203943]: error: vmbr3: bridge port enp25s0f0np0.100 does not exist
Feb 24 21:24:37 FT1-NodeC zebra[1360]: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
Feb 24 21:24:37 FT1-NodeC zebra[1360]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=56, pid=3081457202
Feb 24 21:24:37 FT1-NodeC zebra[1360]: [P2XBZ-RAFQ5][EC 4043309074] Failed to install Nexthop ID (78) into the kernel
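The first line of that log looks like the real failure: the VLAN name enp25s0f0np0.100 is 16 characters, one more than the kernel's 15-character limit on interface names (IFNAMSIZ is 16 including the terminating NUL), so the VLAN is never created and vmbr3 then complains that its bridge port does not exist. Quick check:

Code:
# kernel interface names may be at most 15 characters long
echo -n 'enp25s0f0np0.100' | wc -c
# -> 16, hence "interface name exceeds max length of 15"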

Code:
Feb 24 21:24:43 FT1-NodeC corosync[1791]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [status] notice: node has quorum
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [status] notice: received sync request (epoch 1/1670/00000009)
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [status] notice: received all states
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [status] notice: all data is up to date
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [status] notice: dfsm_deliver_queue: queue length 3
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:43 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:44 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:44 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:44 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:44 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pve-ha-lrm[2195]: status change active => lost_agent_lock
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pve-ha-crm[1959]: status change master => lost_manager_lock
Feb 24 21:24:45 FT1-NodeC pve-ha-crm[1959]: watchdog closed (disabled)
Feb 24 21:24:45 FT1-NodeC pve-ha-crm[1959]: status change lost_manager_lock => wait_for_quorum
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC pmxcfs[1668]: [dcdb] crit: cpg_send_message failed: 9
Feb 24 21:24:45 FT1-NodeC ceph-osd[2981]: 2022-02-24T21:24:45.706+0200 7fbe7436f700 -1 osd.15 pg_epoch: 365852 pg[23.17s0( v 363899'54475 (363899'47094,363899'54475] local-lis/les=365782/365783 n=8360 ec=215487/215487 lis/c=365782/365782 les/c/f=365783/365783/0 sis=365782) [15,5,10,20,3,14,17,11]p15(0) r=0 lpr=365782 crt=363899'54475 lcod 0'0 mlcod 0'0 active+clean]  scrubber pg(23.17s0) handle_scrub_reserve_grant: received unsolicited reservation grant from osd 17(6) (0x55f86965db80)
Feb 24 21:24:45 FT1-NodeC ceph-osd[2981]: 2022-02-24T21:24:45.706+0200 7fbe7436f700 -1 osd.15 pg_epoch: 365852 pg[23.17s0( v 363899'54475 (363899'47094,363899'54475] local-lis/les=365782/365783 n=8360 ec=215487/215487 lis/c=365782/365782 les/c/f=365783/365783/0 sis=365782) [15,5,10,20,3,14,17,11]p15(0) r=0 lpr=365782 crt=363899'54475 lcod 0'0 mlcod 0'0 active+clean]  scrubber pg(23.17s0) handle_scrub_reserve_grant: received unsolicited reservation grant from osd 11(7) (0x55f9033e14a0)
Feb 24 21:24:45 FT1-NodeC ceph-osd[2981]: 2022-02-24T21:24:45.706+0200 7fbe7436f700 -1 osd.15 pg_epoch: 365852 pg[23.17s0( v 363899'54475 (363899'47094,363899'54475] local-lis/les=365782/365783 n=8360 ec=215487/215487 lis/c=365782/365782 les/c/f=365783/365783/0 sis=365782) [15,5,10,20,3,14,17,11]p15(0) r=0 lpr=365782 crt=363899'54475 lcod 0'0 mlcod 0'0 active+clean]  scrubber pg(23.17s0) handle_scrub_reserve_grant: received unsolicited reservation grant from osd 20(3) (0x55f91ec25080)
Feb 24 21:24:45 FT1-NodeC ceph-osd[2981]: 2022-02-24T21:24:45.706+0200 7fbe7436f700 -1 osd.15 pg_epoch: 365852 pg[23.17s0( v 363899'54475 (363899'47094,363899'54475] local-lis/les=365782/365783 n=8360 ec=215487/215487 lis/c=365782/365782 les/c/f=365783/365783/0 sis=365782) [15,5,10,20,3,14,17,11]p15(0) r=0 lpr=365782 crt=363899'54475 lcod 0'0 mlcod 0'0 active+clean]  scrubber pg(23.17s0) handle_scrub_reserve_grant: received unsolicited reservation grant from osd 5(1) (0x55f88bbe4c60)
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: members: 1/1670, 2/1666, 3/1668, 4/1672
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: starting data syncronisation
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: received sync request (epoch 1/1670/00000009)
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: received all states
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: leader is 1/1670
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: synced members: 1/1670, 2/1666, 4/1672
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: waiting for updates from leader
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: update complete - trying to commit (got 6 inode updates)
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: all data is up to date
Feb 24 21:24:46 FT1-NodeC pmxcfs[1668]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Feb 24 21:24:46 FT1-NodeC pve-ha-crm[1959]: successfully acquired lock 'ha_manager_lock'
Feb 24 21:24:46 FT1-NodeC pve-ha-crm[1959]: watchdog active
Feb 24 21:24:46 FT1-NodeC pve-ha-crm[1959]: status change wait_for_quorum => master
Feb 24 21:24:50 FT1-NodeC pve-ha-lrm[2195]: successfully acquired lock 'ha_agent_FT1-NodeC_lock'
Feb 24 21:24:50 FT1-NodeC pve-ha-lrm[2195]: status change lost_agent_lock => active
Feb 24 21:24:51 FT1-NodeC pvestatd[1896]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007 (No route to host)
Feb 24 21:24:51 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: vmbr0 old speed: 0 new speed: 25000
Feb 24 21:24:51 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: vmbr1 old speed: 0 new speed: 1000
Feb 24 21:24:52 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: enp25s0f0np0.25 old speed: 0 new speed: 25000
Feb 24 21:24:52 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: vmbr2 old speed: 0 new speed: 25000
Feb 24 21:24:52 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: vmbr3 old speed: 0 new speed: 4294967295
Feb 24 21:24:52 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: enp25s0f0np0.35 old speed: 0 new speed: 25000
Feb 24 21:24:52 FT1-NodeC zebra[1360]: [WPPMZ-G9797] if_zebra_speed_update: vmbr4 old speed: 0 new speed: 25000
 
Then I ran systemctl restart networking, which briefly disconnected the node before it came online again.
on a hunch - check if you've installed ifupdown or ifupdown2 - with ifupdown this is (sadly) expected behavior - the bridge gets removed, thus all guests lose the association with it and are not re-added

with ifupdown2 you can run ifreload -a (or click on apply in the GUI) and everything should continue to work as expected
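For reference, a typical ifupdown2 sequence is to validate first and then reload; ifquery and ifreload are standard ifupdown2 tools:

Code:
# compare the running state against /etc/network/interfaces
ifquery --check -a

# apply config changes in place, without tearing down bridges
ifreload -a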
 
on a hunch - check if you've installed ifupdown or ifupdown2 - with ifupdown this is (sadly) expected behavior - the bridge gets removed, thus all guests lose the association with it and are not re-added

with ifupdown2 you can run ifreload -a (or click on apply in the GUI) and everything should continue to work as expected

I have ifupdown2 installed... also, all guests on all nodes went offline

Code:
# dpkg -l 'ifupdown*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=====================================================
rc  ifupdown       0.8.35+pve1  amd64        high level tools to configure network interfaces
ii  ifupdown2      3.1.0-1+pmx3 all          Network Interface Management tool similar to ifupdown
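(The `rc` status on the ifupdown line means "removed, config files remain", so ifupdown2 is indeed the only one active. If desired, the leftover config files could be cleared with:)

Code:
# optional cleanup of the residual ifupdown config files
apt purge ifupdown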
 
