NIC Going Offline After Some Time

MotoFlyGuy

New Member
Jun 6, 2024
I recently installed another 32 GB of RAM into my system, and now my NIC keeps going offline. It's strange because my Proxmox host still shows the device when I run ip a and such, but my modem doesn't show my host's NIC as connected at all... Reviewing the system logs shows the messages below just before I lose internet access. I am not sure if it's my NIC or something else, but any help would be greatly appreciated.

Please note that everything was working fine for the last 3 months until I installed that second stick of RAM, which is identical to the first. I am also unsure how to update the NIC's firmware, if that is even possible. I have already run all the apt-get updates for the host. I am running PVE 8.2.4.
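For reference, the driver and firmware versions the card is currently running can be read out like this (the interface name enp1s0f0 is just a placeholder, substitute your own from ip a):

# show driver, driver version and NIC firmware version
ethtool -i enp1s0f0
# confirm which kernel driver is bound to the card (01:00.0 matches the address in the logs below)
lspci -nnk -s 01:00.0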

PC specs:
MB - Gigabyte A520I AC
CPU - AMD Ryzen 5 4500 6-Core
RAM - Patriot Memory Viper Steel DDR4 32GB (1 x 32GB) 3600 MHz Module x2
SSD - TEAMGROUP MP33 512GB SLC Cache 3D NAND TLC NVMe 1.3 PCIe Gen3x4 M.2 2280 Internal Solid State Drive
NIC - 10Gtek 10Gb NIC Network Card, Dual SFP+ Port, with Intel 82599ES Controller, Compare to Intel X520-DA2

Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0xb00ff12c8c0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0xb00ff12cac0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0xb00ff12ccc0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2200ffd128c0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2200ffd12ac0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2200ffd12cc0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2000fedd48c0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2000fedd4ac0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x2000fedd4cc0 flags=0x0030]
 
Since this started right after a hardware change, before we dive into the error itself, it's probably a good idea to do/check the following:
  • Check that all cables and connectors are still properly connected (maybe you bumped something while installing the RAM).
  • Check that the RAM itself is properly seated too.
  • Try running the system for a while without starting any VMs, to rule out a VM causing/triggering the issue.
  • Try running memtest on the memory (it can be found in the Proxmox boot menu as well) to further rule out a faulty stick.
  • Try running with just one stick again (the original).
  • Try running with just the new stick in the place of the old stick (maybe the second RAM slot itself is the issue?).
  • Check the logs from before the RAM install, preferably also from before the last reboot prior to the install reboot, to make sure the messages you showed weren't already there (it's an error worth looking into either way, but if it was already present beforehand, it might not be directly related to the RAM); a journalctl sketch follows at the end of this post.
Also, in general, how long are we talking before it goes offline? Seconds? Minutes? Hours? Days?
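As a rough sketch of that last point, assuming the journal is persistent across reboots, the kernel log from an earlier boot can be pulled up like this (the boot index -1 is just an example, pick the right one from the list):

# list the boots the journal knows about
journalctl --list-boots
# kernel messages from the previous boot, filtered for the NIC driver and the IOMMU
journalctl -k -b -1 | grep -iE 'ixgbe|AMD-Vi'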
 
Thank you for responding! For a bit more context, I have been troubleshooting the hardware for almost a week now. I have reverted the hardware changes, reseated the PCIe card, reseated the riser it connects through, and used a different riser entirely. I have also completely reinstalled Proxmox on the system. All the hardware was purchased at the same time in May, and everything was working fine until I installed that RAM stick. I tried running with just the original RAM stick and with just the new RAM stick. Since all three configurations produce the same issue, I believe the RAM is not at fault here. The only thing I have not tried is running without any VMs. I only have one VM installed at this time, and it's OpenWrt handling the passthrough/routing on my network. The Proxmox server itself is not hanging, as I can still access it through the onboard NIC, so this seems to be the NIC card at fault. It lasts a couple of hours, sometimes only a couple of minutes to half an hour. I was wondering if maybe there is a kernel issue I am not familiar with. Those error messages are from just before the system is no longer reachable via the PCIe NIC.
 
Thought you'd probably already checked all that, but thanks for confirming.

A few more thoughts on what to maybe try (or that you maybe already did):
That PCIe card, are you using passthrough on it to the VM, or just a Proxmox Linux Bridge?
If you're using passthrough, have you tried using a Linux Bridge instead? You might lose a bit of performance, but 90% performance 100% of the time is still better than 100% performance 10% of the time ;)
If it is already a bridge, do you have an IP for Proxmox configured on it too, and is that IP reachable while the VM is not (and the onboard still is)?
Since the card connects through SFP, does disconnecting and reconnecting the SFP module (not the card) itself do anything while it is offline?
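For comparison, a minimal /etc/network/interfaces snippet where the Proxmox host itself has an IP on the bridge would look roughly like this (the interface name and address are made up, adjust to your setup):

auto vmbr1
iface vmbr1 inet static
        address 192.168.1.2/24
        bridge-ports enp1s0f0
        bridge-stp off
        bridge-fd 0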
 
I am utilizing a Linux Bridge 100%. I just got another card today (10Gb NIC Network Card, Dual SFP+ Port, with Intel 82599EN Controller, Compare to Intel X520-DA2) to test whether the other card was truly at fault. It was running fine for roughly 9 hours until it just took a dump... These are the last 3 entries in the system log for the node. I tried connecting directly to the onboard NIC on the motherboard (manually configuring my network settings), but I was unable to reach the server this time...

Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x1100ff4738e0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x1100ff473ac0 flags=0x0030]
Proxmox kernel: ixgbe 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000a address=0x1100ff473cc0 flags=0x0030]

Here are some screenshots for more information:
[screenshots attached]
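For reference, the device behind 0000:01:00.0 in those messages and its IOMMU group can be double-checked like this (standard sysfs paths, nothing board-specific assumed):

# identify the device behind the address from the log
lspci -nn -s 01:00.0
# list IOMMU group membership to see what shares a group with the NIC
find /sys/kernel/iommu_groups/ -type l | sort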
 
Quick note (busy with other things, but I did skim your reply):
Why do you have multiple ports directly on a bridge? That is quite likely to cause, or may already have created, a network loop [1].
It's usually better (except for VERY specific scenarios) to create a Linux bond (the bond type depends on your switch's capabilities [2]; if unsure, I would suggest active-backup while testing, and you can figure out later which mode works best for you) and set that bond as the single port member of a bridge; a rough config sketch follows below.

[1] https://forum.proxmox.com/threads/error-while-adding-multiple-ports-in-bridge.65092/#post-293995
[2] https://pve.proxmox.com/wiki/Network_Configuration#sysadmin_network_bond
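A rough sketch of that bond-plus-bridge layout in /etc/network/interfaces, assuming an active-backup bond (interface names and the address are examples, adjust to your setup):

auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-mode active-backup
        bond-primary enp1s0f0
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.2/24
        gateway 192.168.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0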
 
I actually separated them last night (still having the same issue). I thought that because I set up the Proxmox host without the NIC installed (I needed the slot for a GPU), the management port was only on the onboard NIC lol. I asked Copilot about that string of error messages, and it points to IOMMU issues. Since I am not passing the PCI device through directly, should I just disable that feature in my BIOS? I am starting to suspect an issue with my BIOS configuration. Should Secure Boot be enabled? Thank you again for your help!
 
I at least don't see a reason why disabling that option could hurt more than things already do right now. As for Secure Boot, I doubt that setting is related, as it only affects starting/booting the system, and your system does boot and only runs into issues later on, so I'd say don't mess with that right now.
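If you'd rather not dig through the BIOS, the IOMMU can also be tamed from the kernel command line, which is easy to revert; on a GRUB-booted host that would look roughly like this (iommu=pt keeps it enabled but in passthrough mode, amd_iommu=off turns it off entirely; pick one):

# add the parameter to the existing GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, e.g.
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
# then apply the change and reboot
update-grub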
 
So I reset my BIOS to defaults and only turned SVM back on. I had upgraded my BIOS firmware recently, and I noticed it actually says you should reset the BIOS after upgrading... Super weird, Gigabyte... My MSI board resets automatically when I upgrade its firmware lol. Fingers crossed, but I have been going strong for a day now, and I have tried the usual tasks that seemed to trigger the failure before, e.g. running speedtest.net. I appreciate all your input, and I hope that maybe someone else will read this and it helps them out as well!
 
