Hey!
So I'm really out of luck with this and have to ask for help because I might go crazy.
We recently had to add a couple of USB NICs to our cluster so we could migrate our OPNsense firewall VM, since we don't have managed switches or the budget for more dedicated network hardware, and we've been getting random disconnects on them, more frequently on the node that isn't currently hosting the live firewall VM.
We have a triple public IPv4 WAN setup: one is PPPoE, one is a static IP assigned within a DMZ (not on a USB NIC), and one is a static DHCP lease tied to a MAC address. Each WAN is connected to a small switch so the firewall VM can be migrated between nodes (this part works very well and we've had no issues with it!).
Anyway, the random USB NIC disconnects seem to be most prominent:
- On the "unused" NICs (i.e. the node that isn't currently hosting the OPNsense VM randomly loses one USB NIC or the other, and after a while sometimes both go down).
- When the WAN DHCP lease expires (not confirmed, more of an assumption).
Other things to keep in mind:
- IOMMU is enabled and working on both nodes (checked with a bash script/command found on this forum, roughly like the quick check shown right after this list).
- Both nodes are AMD Ryzen 2600 with the same motherboard model (can't recall the exact model but can add it if necessary).
- The BIOS/UEFI on both nodes was updated about 6 months ago, so it's relatively recent.
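For context, the kind of IOMMU sanity check I mean is roughly the following (generic commands, not the exact one-liner from the forum thread):
Bash:
# IOMMU is active if groups exist under sysfs (count should be > 0)
ls /sys/kernel/iommu_groups/ | wc -l

# On AMD the kernel logs AMD-Vi lines when the IOMMU is initialised
dmesg | grep -i -e 'AMD-Vi' -e 'iommu'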
I'll leave all the data I've gathered so far below.
Threads and solutions I've explored:
- Disabling USB Auto-suspend
- "Fixing" driver load order in modprobe
- https://bugzilla.kernel.org/show_bug.cgi?id=212731
- https://forum.manjaro.org/t/ax88179...to-ethernet-adapter-with-kernel-5-17/109966/3
- https://github.com/FreddyXin/ax88179_178a/issues/6
- https://forum.openmediavault.org/index.php?thread/47260-ax88179-ethernet-adapter-randomly-crashing/
History - USB Auto-suspend disable:
Code:
echo -1 | sudo tee /sys/bus/usb/devices/*/power/autosuspend >/dev/null
echo on | sudo tee /sys/bus/usb/devices/*/power/level >/dev/null
nano /etc/default/grub
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&usbcore.autosuspend=-1 /' /etc/default/grub
update-grub
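(For reference, whether auto-suspend is actually off can be double-checked with something like the commands below; these are the standard sysfs/procfs paths, nothing exotic.)
Bash:
# Kernel-wide default (-1 means autosuspend disabled)
cat /sys/module/usbcore/parameters/autosuspend

# Per-device runtime PM policy; "on" means the device is never auto-suspended
grep . /sys/bus/usb/devices/*/power/control

# Confirm the boot parameter made it onto the running kernel's command line
grep -o 'usbcore.autosuspend=-1' /proc/cmdline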
/etc/network/interfaces snippet (only the relevant parts)
Code:
auto enx7cc2c649a930
iface enx7cc2c649a930 inet manual
post-up /sbin/ethtool -K enx7cc2c649a930 tx off rx off gso off tso off
#TPLINK USB BOTTOM
auto enx7cc2c64b9069
iface enx7cc2c64b9069 inet manual
post-up /sbin/ethtool -K enx7cc2c64b9069 tx off rx off gso off tso off
#TPLINK USB TOP
auto vmbr11
iface vmbr11 inet manual
bridge-ports enx7cc2c64b9069
bridge-stp off
bridge-fd 0
#WAN - TELECOM
auto vmbr12
iface vmbr12 inet manual
bridge-ports enx7cc2c649a930
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#WAN - MOVISTAR
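(Whether those offload settings actually stick after boot can be verified with a read-only ethtool query, e.g.:)
Bash:
# Show the current offload/feature state for both USB NICs
for nic in enx7cc2c649a930 enx7cc2c64b9069; do
    echo "== $nic =="
    ethtool -k "$nic" | grep -E 'rx-checksumming|tx-checksumming|generic-segmentation-offload|tcp-segmentation-offload'
done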
lsusb
Bash:
root@pve02:~# lsusb -tv
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
ID 1d6b:0003 Linux Foundation 3.0 root hub
|__ Port 3: Dev 6, If 0, Class=Vendor Specific Class, Driver=ax88179_178a, 5000M
ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet
|__ Port 4: Dev 5, If 0, Class=Vendor Specific Class, Driver=ax88179_178a, 5000M
ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
ID 1d6b:0002 Linux Foundation 2.0 root hub
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/3p, 10000M
ID 1d6b:0003 Linux Foundation 3.0 root hub
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/9p, 480M
ID 1d6b:0002 Linux Foundation 2.0 root hub
|__ Port 4: Dev 2, If 1, Class=Human Interface Device, Driver=usbhid, 1.5M
ID 04f3:0103 Elan Microelectronics Corp. ActiveJet K-2024 Multimedia Keyboard
|__ Port 4: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
ID 04f3:0103 Elan Microelectronics Corp. ActiveJet K-2024 Multimedia Keyboard
|__ Port 5: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 1.5M
ID 0458:003a KYE Systems Corp. (Mouse Systems) NetScroll+ Mini Traveler / Genius NetScroll 120
journalctl -e
Bash:
Mar 08 19:19:40 pve01 kernel: xhci_hcd 0000:08:00.3: WARN: HC couldn't access mem fast enough for slot 1 ep 2
Mar 08 19:21:22 pve01 pmxcfs[2204]: [dcdb] notice: data verification successful
Mar 08 19:21:23 pve01 smartd[1932]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 50 to 52
Mar 08 19:21:26 pve01 smartd[1932]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
Mar 08 19:21:26 pve01 smartd[1932]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 80 to 81
Mar 08 19:25:01 pve01 CRON[2011272]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mar 08 19:25:01 pve01 CRON[2011273]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 08 19:25:01 pve01 CRON[2011272]: pam_unix(cron:session): session closed for user root
Mar 08 19:28:15 pve01 pmxcfs[2204]: [status] notice: received log
Mar 08 19:28:16 pve01 pmxcfs[2204]: [status] notice: received log
Mar 08 19:28:52 pve01 kernel: xhci_hcd 0000:08:00.3: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069: unregister 'ax88179_178a' usb-0000:08:00.3-4, ASIX AX88179 USB 3.0 Gigabit Ethernet
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069: Failed to read reg index 0x0002: -19
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069: Failed to write reg index 0x0002: -19
Mar 08 19:28:52 pve01 kernel: vmbr11: port 1(enx7cc2c64b9069) entered disabled state
Mar 08 19:28:52 pve01 kernel: device enx7cc2c64b9069 left promiscuous mode
Mar 08 19:28:52 pve01 kernel: vmbr11: port 1(enx7cc2c64b9069) entered disabled state
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069 (unregistered): Failed to write reg index 0x0002: -19
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069 (unregistered): Failed to write reg index 0x0001: -19
Mar 08 19:28:52 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069 (unregistered): Failed to write reg index 0x0002: -19
Mar 08 19:28:53 pve01 kernel: usb 4-4: reset SuperSpeed USB device number 7 using xhci_hcd
Mar 08 19:28:53 pve01 kernel: ax88179_178a 4-4:1.0 eth0: register 'ax88179_178a' at usb-0000:08:00.3-4, ASIX AX88179 USB 3.0 Gigabit Ethernet, 7c:c2:c6:4b:90:69
Mar 08 19:28:53 pve01 kernel: ax88179_178a 4-4:1.0 enx7cc2c64b9069: renamed from eth0
As you can see in the logs, vmbr11 never goes back into the forwarding state after the USB NIC disconnects and reconnects.
The same issue occurs with the other adapter (the one on vmbr12).
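(After one of these resets, the bridge port state can be inspected with iproute2, which should show whether the NIC is still attached to vmbr11 and in which STP state; just a read-only check:)
Bash:
# Is the USB NIC still a member of its bridge, and in which state?
bridge link show dev enx7cc2c64b9069

# Detailed view of everything attached to the bridge
ip -d link show master vmbr11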
pveversion -v
Code:
proxmox-ve: 8.1.0 (running kernel: 6.2.16-19-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
pve-kernel-5.15: 7.4-4
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
pve-kernel-5.4: 6.4-7
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
pve-kernel-6.2.16-5-pve: 6.2.16-6
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx5
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.9
libpve-guest-common-perl: 5.0.5
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.4-1
proxmox-backup-file-restore: 3.0.4-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.9
pve-cluster: 8.0.4
pve-container: 5.0.5
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-3
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-7
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1
In the meantime, if this keeps happening I might resort to writing a small Python watchdog that runs ifreload -a whenever a NIC drops out...
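A rough shell version of that watchdog idea (just polling the operstate of the two USB NICs from the config above and reloading the network config when one of them drops; the 30 s interval is arbitrary, and this is a sketch rather than something I'm running yet) would be something like:
Bash:
#!/bin/bash
# Naive watchdog: if either USB NIC loses link (or vanishes after a USB reset)
# and never comes back on its own, reload the ifupdown2 configuration.
NICS="enx7cc2c649a930 enx7cc2c64b9069"

while sleep 30; do
    for nic in $NICS; do
        # operstate drops to "down" (or the sysfs path disappears) when the adapter resets
        state=$(cat /sys/class/net/$nic/operstate 2>/dev/null || echo missing)
        if [ "$state" != "up" ]; then
            logger -t usbnic-watchdog "$nic operstate=$state, running ifreload -a"
            ifreload -a
            break
        fi
    done
done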
If anyone has ideas or has been struck by this issue, I'd really appreciate some help!
Regards,
Dylan