[SOLVED] Help request - PVE8.4 - Networking appears to have died

bunk3m

Feb 22, 2023
In front of my computer in Canada
Newbie alert.
I'm spinning around and trying to debug this and hope you might be able to help me narrow down and solve the issue.

I have a PVE-8 homelab running on a Lenovo tower P510. It has run fine since the first setup using PVE-7. I upgraded to PVE-8 over a year ago but haven't upgraded to PVE-9.

I think 2 days ago I noticed an update and installed it; it added a new kernel, 6.??-17. I rebooted and everything ran well until this evening (about 48 hours later), when I noticed that I couldn't connect to a container running Nextcloud, so I went to look at the PVE dashboard. The dashboard is down, as are all the networks.

Thinking that perhaps this was a problem with the Intel NIC, I did a manual restart by pushing the power button. PVE-8 rebooted and got to the screen that says to log in to the web interface. Unfortunately, since the networking looks dead, I can't get to the interface.

I have 3 NICs in the Proxmox server, but only 1 is used. I edited /etc/network/interfaces and changed it to use one of the other NICs. It didn't give an error, but it also didn't bring up the network. I did the same for the third NIC, but that didn't work either.

I'm stuck and not sure what to try next. Any suggestions are appreciated. Thanks!
 
I can't ping anything from the PVE to the regular home network or internet.

There are a number of similar messages, with the port number increasing through port 2, port 3, port 4 and port 5, and then:
Code:
vmbr0: port 6(tap120i0) entering blocking state
vmbr0: port 6(tap120i0) entered disabled state

The last error I saw before the prompt to log in to the GUI was a Samba (CIFS) error:
Bash:
CIFS: VFS: Error connecting to socket. Aborting operation.
CIFS: VFS: cifs_mount failed w/return code = -113

I have never had such a problem and am not sure where to start in debugging.
thanks
 
All the messages you posted are normal for a system that is not properly connected. Take a look at the state of your networking:
ip a
ip route
cat /etc/network/interfaces


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Thanks @bbgeek17 !!

It looks like the cable went bad. In 25 years I've never had that happen before. I swapped out the cable and now I can ping the gateway and internet.
I'm also back to the Proxmox GUI and can see that all VMs and containers have started.

There is still something not quite right as I can only connect to my pi-hole container. I can't connect to the apps in the VMs (Jellyfin & Nextcloud).

ip a
looks OK
ip route
looks OK
cat /etc/network/interfaces
looks OK also.

I'll continue debugging the networking.
 
I had to restart each of the containers and VMs in order for them to be found on the network.
I'd say this is not surprising at all. Very often when network-dependent services start without network being available they are not able to properly engage. A restart is a prudent step.


 
When I swapped out the cable, I also ended up using a different port on the switch, which made me think the cable was the issue. But when I tested the cable, I found it was still OK. The next step was to test the network card in the Proxmox server, which was also OK. Finally, I found that it was the port on the switch that had died. Since the switch was over 15 years old, I decided it was easier to replace the switch.

While I was debugging this issue, I did some research that I would like to share below. These are the steps I followed.
===

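Proxmox 8.4 Network Troubleshooting Guide (Debian 12)

When your Proxmox server loses network connectivity but still boots to the console login screen, work through these command-line checks in order to isolate and resolve the issue. The commands are shown with sudo. Proxmox VE does not install sudo by default, so leave the prefix off when you are logged in as root.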

Phase 1: Basic Interface Diagnostics

Check Network Interface Status

First, verify whether your network interfaces are up and have IP addresses assigned:
Bash:
ip addr show

Look for your primary interface (typically enpXs0, eno1, eth0, or similar) and the bridge (vmbr0). The physical interface should show state UP. On a standard Proxmox setup the host's IP address is assigned to vmbr0 rather than to the physical interface, so look for the address on the bridge.
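
If several interfaces are involved, the brief output of `ip -br addr show` is easier to scan. The function below is a minimal sketch. It assumes the default Proxmox layout, where the host IP sits on a bridge named vmbr*. `check_brief_addr` is a hypothetical helper name, not a Proxmox tool.

```shell
#!/bin/sh
# Sketch: summarize `ip -br addr show` output read from stdin.
# Prints "DOWN <ifname>" for every interface whose state column is DOWN and
# "NOIP <ifname>" for every vmbr* bridge without an IPv4 address.
# Assumption: the host IP lives on a vmbr* bridge (default Proxmox layout);
# physical ports enslaved to the bridge normally carry no IP and are not flagged.
check_brief_addr() {
    awk '
        $2 == "DOWN" { print "DOWN " $1 }
        $1 ~ /^vmbr/ {
            has_v4 = 0
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\//) has_v4 = 1
            if (!has_v4) print "NOIP " $1
        }
    '
}
```

Usage: `ip -br addr show | check_brief_addr`. No output means nothing is obviously wrong at this layer.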

If the interface shows as DOWN, bring it up manually:
Bash:
sudo ip link set dev <interface_name> up

Verify Link Layer Connectivity

Check if the physical link is established:
Bash:
ethtool <interface_name>
Look for "Link detected: yes" in the output. If it shows "no," check your cable and switch port.
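
A small filter like the one below can reduce the ethtool output to the two lines that matter here. It is a sketch that assumes ethtool's usual `Speed:` and `Link detected:` lines. `link_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `ethtool <iface>` output on stdin, print "link=<...> speed=<...>",
# and return 0 only when the driver reports "Link detected: yes".
link_summary() {
    awk -F': ' '
        /^[ \t]*Speed:/         { speed = $2 }
        /^[ \t]*Link detected:/ { link = $2 }
        END {
            if (link == "") link = "unknown"
            if (speed == "") speed = "unknown"
            print "link=" link " speed=" speed
            exit (link == "yes" ? 0 : 1)
        }
    '
}
```

Usage: `sudo ethtool enp3s0 | link_summary`, replacing enp3s0 with your interface. If it reports link=no with a cable you know is good, suspect the switch port or the NIC. In this thread it turned out to be the switch port.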

Phase 2: IP Configuration Verification

Examine Network Configuration File

Review your network configuration for errors:
Bash:
cat /etc/network/interfaces
Ensure your configuration includes:
  • Correct interface names (verify they match ip addr output)
  • A static IP address and subnet on the vmbr0 stanza (e.g., 192.168.1.10/24)
  • The correct gateway address on the same stanza
  • A vmbr0 stanza whose bridge-ports line names the physical interface
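
A frequent mistake after editing this file is a bridge-ports line that names an interface that does not exist, for example after switching NICs or if an interface was renamed. The sketch below catches that case and assumes the usual `bridge-ports` syntax. `bridge_ports`, `missing_ports` and the `NET_DIR` override are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sketch: print "<bridge> <port>" pairs from an interfaces file read on stdin.
# Also matches the older "bridge_ports" spelling; "none" (no physical port) is skipped.
bridge_ports() {
    awk '
        $1 == "iface" { cur = $2 }
        ($1 == "bridge-ports" || $1 == "bridge_ports") && cur != "" {
            for (i = 2; i <= NF; i++) if ($i != "none") print cur, $i
        }
    '
}

# Report configured bridge ports that have no matching kernel interface.
# $1 = interfaces file; NET_DIR (hypothetical override) defaults to /sys/class/net.
missing_ports() {
    bridge_ports < "$1" | while read -r br port; do
        [ -e "${NET_DIR:-/sys/class/net}/$port" ] || echo "$br: $port not found"
    done
}
```

Usage: `missing_ports /etc/network/interfaces`. No output means every bridge port named in the file exists on this host.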

Check Cluster Configuration (If Applicable)

If this is a cluster node, verify the cluster network settings:
Bash:
cat /etc/corosync/corosync.conf

Ensure the ring0_addr matches your current IP address.
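
To compare the configured address with the host's current one, the sketch below pulls the ring0_addr of a single node out of corosync.conf. It assumes the standard `nodelist { node { ... } }` layout with one key per line. `ring0_addr_for` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print ring0_addr for the node whose "name:" equals $1,
# reading corosync.conf text on stdin.
ring0_addr_for() {
    awk -v want="$1" '
        $1 == "node" && $2 == "{" { in_node = 1; name = ""; addr = ""; next }
        in_node && $1 == "name:"       { name = $2 }
        in_node && $1 == "ring0_addr:" { addr = $2 }
        in_node && $1 == "}" {
            if (name == want) print addr
            in_node = 0
        }
    '
}
```

Usage: `ring0_addr_for "$(hostname)" < /etc/corosync/corosync.conf`, then compare the result with the address shown by `ip -br addr show vmbr0`. This assumes the node name matches the short hostname, which is how Proxmox names nodes.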

Phase 3: Routing and Connectivity Testing

Test Local Stack

Verify the TCP/IP stack is working locally:
Bash:
ping -c 4 127.0.0.1
ping -c 4 <your_server_ip>

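Check Routing Table

Examine if the default route is configured correctly:
Bash:
ip route show
# or
route -n
You should see a default route pointing to your gateway. ip route shows it as "default via <gateway_ip>", and route -n shows it as destination 0.0.0.0. The route command comes from the net-tools package and may not be installed.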
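
The gateway check can be scripted by pulling the gateway straight out of the routing table. This sketch assumes the `default via <gw> dev <iface>` form printed by `ip route show`. `default_gw` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `ip route show` output on stdin and print "<gateway> <device>"
# for the first default route. Returns 1 (and prints nothing) if none exists.
default_gw() {
    awk '
        $1 == "default" {
            gw = ""; dev = ""
            for (i = 2; i < NF; i++) {
                if ($i == "via") gw = $(i + 1)
                if ($i == "dev") dev = $(i + 1)
            }
            if (gw != "") { print gw, dev; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `ip route show | default_gw` prints something like `192.168.1.1 vmbr0`. The first field can be fed to the gateway ping: `ping -c 4 "$(ip route show | default_gw | cut -d' ' -f1)"`.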

Test Gateway Connectivity

Try pinging your default gateway:
Bash:
ping -c 4 <gateway_ip>

If this fails, you may have a subnet mismatch or ARP issue.

Check ARP Table

Verify ARP resolution is working:
Bash:
arp -n
# or, if net-tools is not installed
ip neigh show
If the gateway's entry shows (incomplete) in arp, or INCOMPLETE/FAILED in ip neigh, there's a Layer 2 connectivity problem.
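
Once you know the gateway IP, you can look up just its neighbour entry. This sketch is based on `ip neigh show` output, where the last field is the entry's state (REACHABLE, STALE, DELAY, INCOMPLETE, FAILED, ...). `neigh_state` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the neighbour state for IP address $1 from `ip neigh show`
# output on stdin, or "ABSENT" if there is no entry for that address.
neigh_state() {
    awk -v ip="$1" '
        $1 == ip { print $NF; found = 1; exit }
        END { if (!found) print "ABSENT" }
    '
}
```

Usage: `ip neigh show | neigh_state 192.168.1.1`, using your gateway's address. INCOMPLETE or FAILED is the ip neigh equivalent of arp's (incomplete). ABSENT usually means no resolution has been attempted yet, so ping the gateway once and check again.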

Phase 4: DNS and Name Resolution

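Check DNS Configuration

Verify your DNS settings:
Bash:
cat /etc/resolv.conf

Ensure you have valid nameservers listed. For testing, you can temporarily replace the file's contents. The command below overwrites the file, so keep a backup. On Proxmox the permanent DNS settings live under the node's System > DNS page:
Bash:
sudo cp /etc/resolv.conf /etc/resolv.conf.bak
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf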
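
The sketch below lists the servers the host will actually query so you can check whether each one answers at all. `nameservers` is a hypothetical helper. A ping reply is only a rough proxy, since it shows the server is reachable, not that it answers DNS queries.

```shell
#!/bin/sh
# Sketch: print the address from every "nameserver" line of
# resolv.conf-style text read on stdin (comment lines are ignored).
nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2 }'
}
```

Usage: `nameservers < /etc/resolv.conf | while read -r ns; do ping -c 1 -W 2 "$ns" >/dev/null && echo "$ns replies" || echo "$ns no reply"; done`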

Test DNS Resolution

Try pinging an external host by IP first, then by name:
Bash:
ping -c 4 8.8.8.8
ping -c 4 google.com

Phase 5: Firewall and Security

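Check iptables Rules

Examine firewall rules that might be blocking traffic:
Bash:
sudo iptables -L -v -n
sudo iptables -t nat -L -v -n
Look for chains with a DROP policy, and for DROP or REJECT rules (including drops of packets in the INVALID connection-tracking state) whose packet counters increase while you retry the failing connection.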
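
On a busy host the full listing is long. The sketch below condenses `iptables -L -v -n` output to chains with a DROP policy and to DROP/REJECT rules whose packet counter is non-zero. It assumes the -v column layout, with the packet count in column 1 and the target in column 3. `summarize_drops` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `iptables -L -v -n` output on stdin and report
#   "policy DROP: <chain>"               for chains whose default policy is DROP
#   "hits: <chain> <target> pkts=<n>"    for DROP/REJECT rules that matched packets
summarize_drops() {
    awk '
        /^Chain / {
            chain = $2
            if ($0 ~ /\(policy DROP/) print "policy DROP: " chain
            next
        }
        ($3 == "DROP" || $3 == "REJECT") && $1 != "0" {
            print "hits: " chain " " $3 " pkts=" $1
        }
    '
}
```

Usage: `sudo iptables -L -v -n | summarize_drops`. Run it twice while retrying the failing connection. Counters that keep rising are the rules worth looking at.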

Check Proxmox Firewall Status

If Proxmox firewall is enabled, check its rules:
Bash:
sudo iptables -L PVEFW-FORWARD -v -n
A common issue is RST packets being dropped. If you see ctstate INVALID drops, you may need to adjust firewall settings.
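
To see whether INVALID drops are actually happening, you can total the counters on those rules and check whether the number grows while you reproduce the problem. In this sketch, `-x` makes iptables print exact counters instead of abbreviated ones like 12K. `invalid_drops` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: sum the packet counters of DROP rules matching "ctstate INVALID"
# in `iptables -L <chain> -v -n -x` output read on stdin; prints the total.
invalid_drops() {
    awk '$3 == "DROP" && /ctstate INVALID/ { sum += $1 } END { print sum + 0 }'
}
```

Usage: `sudo iptables -L PVEFW-FORWARD -v -n -x | invalid_drops`, run a few seconds apart while retrying the connection.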

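Temporarily Disable Firewall (For Testing)

If you suspect firewall issues, disable it temporarily:
Bash:
sudo systemctl stop pve-firewall
# or save the current rules, then flush them
sudo iptables-save > iptables.before-flush
sudo iptables -F
sudo iptables -t nat -F

Note that -F removes rules but does not reset chain policies, so a chain whose policy is DROP keeps dropping everything after the flush.

Important: Re-enable the firewall after testing (sudo systemctl start pve-firewall, or sudo iptables-restore < iptables.before-flush if you flushed the rules manually).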

Phase 6: Network Services and Configuration

Restart Networking Service

Apply configuration changes and restart networking:
Bash:
sudo systemctl restart networking
# or use ifreload for live changes
sudo ifreload -a

Check Network Service Status

Verify the networking service is active:
Bash:
sudo systemctl status networking

If it shows failed, check the logs:
Bash:
sudo journalctl -u networking

Check NetworkManager Interference

NetworkManager is not part of a standard Proxmox VE install. If it has been installed, make sure it isn't interfering with Proxmox networking:
Bash:
sudo systemctl status NetworkManager
sudo systemctl stop NetworkManager
sudo systemctl disable NetworkManager

Phase 7: Hardware and Driver Issues

Check Network Device Recognition

Verify your NIC is detected and see which kernel driver is bound to it:
Bash:
lspci | grep -i ethernet
lspci -k | grep -i -A 3 ethernet

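Check Driver Loading

Look for driver-related messages. Generic keywords often miss the relevant lines, so also search for the interface name:
Bash:
dmesg | grep -i network
dmesg | grep -i ethernet
dmesg | grep -i <interface_name>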

Search for firmware loading failures or driver errors.
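
Link changes are worth checking here too. A NIC that keeps logging link down, or never logs link up, points at the cable, switch port or NIC rather than at the configuration. This sketch assumes the driver logs link changes with the words "Link is Up" / "Link is Down". Intel's e1000e and igb drivers do, for example, but other drivers may word it differently. `last_link_event` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: from kernel log text on stdin, print the most recent
# "Link is Up" / "Link is Down" line mentioning interface $1.
last_link_event() {
    grep -E "$1.*Link is (Up|Down)" | tail -n 1
}
```

Usage: `sudo dmesg -T | last_link_event enp3s0` or `journalctl -k -b | last_link_event enp3s0`.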

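Check for Missing Firmware

If dmesg shows firmware errors, make sure the firmware package is installed. On Proxmox VE, firmware for the Proxmox kernel is shipped in the pve-firmware package (linux-firmware is the Ubuntu package name). This step needs working network access:
Bash:
sudo apt update
sudo apt install pve-firmware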

Phase 8: Proxmox-Specific Issues

Check Bridge Configuration

Verify your bridge is properly configured:
Bash:
brctl show

Ensure vmbr0 shows your physical interface as a port.
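
brctl comes from the bridge-utils package. `bridge link show` from iproute2 lists the same ports together with each port's state. The sketch below filters that output down to one bridge and assumes the usual `master <bridge> state <state>` fields. `bridge_members` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `bridge link show` output on stdin and print "<port> <state>"
# for every port whose master is the bridge named in $1.
bridge_members() {
    awk -v br="$1" '{
        name = $2
        sub(/:$/, "", name)      # drop trailing colon
        sub(/@.*/, "", name)     # veth101i0@if2 -> veth101i0
        master = ""; state = ""
        for (i = 3; i < NF; i++) {
            if ($i == "master") master = $(i + 1)
            if ($i == "state")  state  = $(i + 1)
        }
        if (master == br) print name, state
    }'
}
```

Usage: `bridge link show | bridge_members vmbr0`. Expect the physical port to be `forwarding`. `disabled` on the physical port usually means it has no carrier.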

Check Proxmox Services

Check the state of the Proxmox services and restart them if needed:
Bash:
sudo systemctl status pve-cluster pveproxy pvedaemon
sudo systemctl restart pve-cluster
sudo systemctl restart pveproxy
sudo systemctl restart pvedaemon

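Check Hosts File

Ensure /etc/hosts is correctly configured:
Bash:
cat /etc/hosts

The file should map the node's current IP address (the one on vmbr0, not a loopback address such as 127.0.1.1) to its fully qualified hostname and its short hostname.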
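
The sketch below shows what /etc/hosts maps the node name to. `hosts_addrs_for` is a hypothetical helper name. It skips only lines that start with a comment.

```shell
#!/bin/sh
# Sketch: print every address that hosts-file text on stdin maps to the
# name in $1 (matching the FQDN or short name exactly); comment lines are skipped.
hosts_addrs_for() {
    awk -v h="$1" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == h) { print $1; break } }
    '
}
```

Usage: `hosts_addrs_for "$(hostname)" < /etc/hosts`. It should print the node's current IP and nothing else.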

Phase 9: Recovery Steps

If the above steps don't resolve the issue:

Reconfigure Network Interfaces

Edit the interfaces file carefully:
Bash:
sudo nano /etc/network/interfaces
Make necessary corrections, then:
Bash:
sudo ifdown <interface_name> && sudo ifup <interface_name>

Reboot and Check Logs

If issues persist, reboot and monitor boot logs:
Bash:
sudo reboot
# After reboot, check:
sudo journalctl -b | grep -i network

Check for Recent Updates

If the problem started after an update, check what was installed or upgraded and when:
Bash:
cat /var/log/apt/history.log
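
If you suspect a recent kernel update, it can help to see when kernel packages were last installed or upgraded. This sketch scans the apt history format (Start-Date / Install / Upgrade / End-Date blocks) and prints the start time of each transaction that touched a package with "kernel" in its name. On Proxmox VE these packages are typically named proxmox-kernel-*, or pve-kernel-* on older releases. `kernel_updates` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read apt history.log text on stdin and print the Start-Date of every
# transaction whose Install: or Upgrade: line mentions "kernel".
kernel_updates() {
    awk '
        /^Start-Date:/ { date = $2 " " $3; hit = 0 }
        /^(Install|Upgrade):/ && /kernel/ { hit = 1 }
        /^End-Date:/ && hit { print date }
    '
}
```

Usage: `kernel_updates < /var/log/apt/history.log`. Older, rotated logs can be included with `zcat /var/log/apt/history.log.*.gz | kernel_updates`.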


Quick Troubleshooting Checklist

  1. Physical Layer: Cable, switch port, link lights
  2. Interface Status: ip link shows UP?
  3. IP Address: ip addr shows correct IP?
  4. Routing: ip route shows default gateway?
  5. Gateway Ping: Can ping gateway?
  6. DNS: /etc/resolv.conf has valid servers?
  7. Firewall: iptables rules blocking traffic?
  8. Services: networking service running?
  9. Hardware: NIC recognized and driver loaded?
  10. Logs: dmesg or journalctl shows errors?
Follow these steps methodically to isolate whether the issue is physical, configuration-related, or service-related.
 