No vmbr0 - bus error on start networking

scubad

Yesterday morning I lost connectivity to my PVE1 box that has 3 VMs on it. I originally was able to ping my PVE .10 address, but during troubleshooting I rebooted the server and lost all connectivity to the PVE host and any VMs.

The vmbr0 interface no longer shows up in the "ip address" output, but the Ethernet adapter in the interfaces file looks correct. Going through journalctl I found the bus error below in "start networking". Not sure what happened at ~5 am to cause this. It is on UPS power, so not a power outage. I originally thought it was a bad Ethernet port before I dug into the logs. So I swapped the drives into a second, identical Beelink mini-PC. Same results.
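In case it is useful, these are roughly the commands I used to pull those journal entries (the unit name is my best guess from the log and may differ on other setups):

    # errors from the current boot
    journalctl -b -p err
    # log of the ifupdown2 "start networking" unit
    journalctl -b -u networking.service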

I can force the enp171s0 interface up with the ip link command, but there is still no connectivity. I have not tried setting up a static address since I don't think the networking stack is fully functional. I also suspect an issue with the ifupdown2 module.
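If I do try a static address, I expect it would be something along these lines (the addresses below are placeholders for my subnet, not my real config):

    # bring the NIC up and give it a temporary address and default route
    ip link set enp171s0 up
    ip addr add 192.168.1.10/24 dev enp171s0
    ip route add default via 192.168.1.1
    # quick connectivity test against the gateway
    ping -c 3 192.168.1.1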

Attached are screenshots of the pveversion output, /etc/network/interfaces and the ip address output.

Any thoughts on how to fix this?
 

Attachments

  • IMG_4817 Medium.jpeg
  • IMG_4816 Medium.jpeg
  • IMG_4818 Medium.jpeg
Hello scubad! I/O errors can happen for several reasons, including hardware issues. The images have a low quality and are thus hard to read, but as far as I can see, there are also some errors related to the NVMe SSDs (in addition to the networking errors; the NIC probably also sits on PCIe). This could mean that there are some general PCI Express bus errors, which could be related to a motherboard issue. However, you might also want to run memtest86+ to see whether RAM issues are causing unintended side effects.

Also, please provide us with more information about the hardware.
 
So I swapped the drives into a second, identical Beelink mini-PC. Same results.

So you powered down and put the drives into different hardware that is the same model etc.?
 
My hardware is a Beelink SEi14 with a 2TB SSD in it. I bought two boxes at the same time. Now, here is where one of my screwups occurred: I did not configure a QDevice as a third vote.

By swapping the SSDs into the 2nd box, I had the same networking problem where vmbr0 does not show up, and I get the same error.

In the journal image, /dev/nvme1n1 is having I/O errors, with the bus error right in the middle of them. So I am assuming I have some file corruption there.

Here is a better image. I am hoping I can either fix the corruption or replace the file to allow it to boot up, or at least get it to the point where I can grab a couple of LVs off and reinstall the cluster.
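If it comes down to pulling the guest disks off, my rough plan from a live USB would be something like this (the VG/LV names below are placeholders I would confirm with lvs first, and the NAS share would already need to be mounted at the target path):

    # scan for and activate the Proxmox volume group
    vgscan
    vgchange -ay pve
    # list the logical volumes to find the guest disks
    lvs
    # copy one guest disk image off to the mounted NAS share
    dd if=/dev/pve/vm-100-disk-0 of=/mnt/nas/vm-100-disk-0.raw bs=4M status=progress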

My user data is on my Synology NAS, so no real data loss would happen if I just need to reinstall the whole cluster. Then I can set it up as 3 nodes and do some high availability between them.

Wondering if there is an easy button, or if I should just spend the time and rebuild.
 

Attachments

  • IMG_4816 Large.jpeg
By swapping the SSDs into the 2nd box, I had the same networking problem where vmbr0 does not show up, and I get the same error.

In the journal image, /dev/nvme1n1 is having I/O errors, with the bus error right in the middle of them. So I am assuming I have some file corruption there.
At this point it sounds like either data corruption or a more general SSD issue. On the other hand, the fact that bus errors also show up when trying to initialize the network interface makes me wonder whether this might be a motherboard or CPU issue. However, I would personally try the following, one step at a time:
  1. Begin by checking the S.M.A.R.T. values using smartctl -a /dev/nvme1n1. Since you cannot boot into the system, you can do this from a live Linux distribution (booting from a USB stick); see the example commands at the end of this post. Feel free to post the results here.
  2. Otherwise, try to see whether there are any firmware updates available for the SSD, as these sometimes fix such issues.
  3. Otherwise you can try installing the latest BIOS updates.
If none of this helped, we can think about further steps. But, as I said, one step at a time ;)
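For step 1, from the live system it would be something along these lines (the device name is taken from your logs; nvme-cli may need to be installed separately, depending on the live distribution):

    # SMART / health data as reported by smartctl
    smartctl -a /dev/nvme1n1
    # the NVMe-native health log, for comparison
    nvme smart-log /dev/nvme1n1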