Network hiccups during backup

parallax

Member
Feb 8, 2024
I use an HP mini PC to run Proxmox (9.1.6) with an Intel e1000 NIC.
I use the "Intel e1000e NIC Offloading Fix" helper script.

I back up to my NAS over NFS.
My NAS performs well and has no issue maxing out 1 Gbps of bandwidth when I test a file transfer.
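For reference, a quick way to sanity-check the raw link throughput is iperf3 (hypothetical hostname; iperf3 must be installed on both ends):

Code:
# on the NAS (server side)
iperf3 -s
# on the Proxmox host (client side)
iperf3 -c nas.example.lan -t 30

A healthy 1 Gbps link should report somewhere around 930-940 Mbit/s.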

When I take a manual snapshot, there is no network issue.

Backup Job:
Mode: Snapshot
Compression: ZSTD (fast and good)

I did some research and applied the following tweaks under Advanced:
Bandwidth Limit: 50 MiB/s
Zstd Threads: default
IO-Workers: 4
Fleecing: on, to a local disk (the same NVMe my VMs are stored on; with fleecing off I still have the issue)
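For reference, the same limits can also be set host-wide in /etc/vzdump.conf instead of per job; a sketch, with option names as I understand them from the vzdump man page and a hypothetical storage name:

Code:
# /etc/vzdump.conf
bwlimit: 51200                          # in KiB/s, i.e. 50 MiB/s
performance: max-workers=4              # IO workers
fleecing: enabled=1,storage=local-lvm   # fleecing target storage

Per-job settings in the GUI override these defaults.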

Any ideas?
 
Good morning, I don't understand what you mean. Please test with real hardware on a real, current desktop PC without an Intel NIC.
 
Fleecing: on to local disk (same NVME as where my VMs are stored, with fleecing off i still have the issue)
So... with fleecing enabled everything is fine? Then leave it turned on and call it a day ;-)

Some cheap consumer disks (SSD and NVMe) easily go into a "too slow"/"stuttering"/"high delay" mode. When such a device is busy supplying data to a backup job, it may not be able to satisfy high IOPS demands from a VM. You've already activated two countermeasures, bandwidth limiting and fleecing.

What specific devices do you use? Enterprise grade? What topology does the pool have? Do you use ZFS at all? Do you produce vzdump-based backups (with a single large file at the destination), or is a PBS instance involved? So many questions... and you gave us so few details :-(

What does the "<node> --> Summary --> IO Pressure Stall" graph show during a problematic backup? (A few single-digit percentages are okay; a high two-digit value is... problematic.)
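The same numbers are also exposed under /proc/pressure, so you can watch them live while a backup runs (PSI is enabled in the PVE kernels, as far as I know):

Code:
watch -n 1 cat /proc/pressure/io

The "avg10" value is the percentage of time tasks stalled on IO over the last 10 seconds; compare an idle period against a running backup.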

To watch the network condition, you may look at the "errors/dropped/missed" counters in a terminal:
Code:
watch ip stats show dev enp1s0
...using your own NIC name, of course.
 
So... with fleecing enabled everything is fine? Then leave it turned on and call it a day ;-)

Some cheap consumer disks (SSD and NVMe) easily go into a "too slow"/"stuttering"/"high delay" mode. When such a device is busy supplying data to a backup job, it may not be able to satisfy high IOPS demands from a VM. You've already activated two countermeasures, bandwidth limiting and fleecing.

What specific devices do you use? Enterprise grade? What topology does the pool have? Do you use ZFS at all? Do you produce vzdump-based backups (with a single large file at the destination), or is a PBS instance involved? So many questions... and you gave us so few details :-(

What does the "<node> --> Summary --> IO Pressure Stall" graph show during a problematic backup? (A few single-digit percentages are okay; a high two-digit value is... problematic.)

To watch the network condition, you may look at the "errors/dropped/missed" counters in a terminal:
Code:
watch ip stats show dev enp1s0
...using your own NIC name, of course.

Fleecing on or off does not make a difference.
There is no PBS instance involved.
Storage is a single NVMe (2 TB KIOXIA XG8 series) with an LVM volume.

The graphs below definitely show the issue; the mouse cursor is at the start of the backup job.
Memory usage is also pretty high: 87.23% (54.24 GiB of 62.18 GiB).

(Screenshot attached: node summary graphs at the start of the backup job.)
 
Have you followed the kernel logs (journalctl -kf) to see if that E1000E "fix" actually worked? Can you describe what you mean by hiccup?
A snapshot does not usually involve the network unless your disk is on network storage. Please share qm config VMID and the backup task log.
 
Have you followed the kernel logs (journalctl -kf) to see if that E1000E "fix" actually worked? Can you describe what you mean by hiccup?
A snapshot does not usually involve the network unless your disk is on network storage. Please share qm config VMID and the backup task log.

journalctl -kf gives the following:
Code:
Mar 24 05:39:15 proxmox4 kernel: tap105i0: left allmulticast mode
Mar 24 05:39:15 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered disabled state
Mar 24 05:39:15 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 24 05:39:15 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered disabled state
Mar 24 05:39:15 proxmox4 kernel: fwln105i0 (unregistering): left allmulticast mode
Mar 24 05:39:15 proxmox4 kernel: fwln105i0 (unregistering): left promiscuous mode
Mar 24 05:39:15 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 24 05:39:15 proxmox4 kernel: fwpr105p0 (unregistering): left allmulticast mode
Mar 24 05:39:15 proxmox4 kernel: fwpr105p0 (unregistering): left promiscuous mode
Mar 24 05:39:15 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered disabled state

I guess this doesn't say whether the fix works or not. How can I trigger this to show up in the log?
By hiccup I mean that there are multiple network disruptions.

I run Gatus (not running on the same host), which alerts me that multiple VMs became unreachable.
I run Home Assistant and Node-RED, which connect to multiple endpoints, and this also results in disconnects.
A manual snapshot does not cause this network outage.
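For what it's worth, whether the offload fix is in effect can be checked directly on the NIC (enp1s0 as in the example above; adjust to your interface name):

Code:
ethtool -k enp1s0 | grep -E 'segmentation|scatter-gather'
journalctl -k | grep -i e1000e

If the helper script worked, TSO/GSO should show as "off", and any e1000e complaints (such as "Detected Hardware Unit Hang") would show up in the second command.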

Code:
root@proxmox4:~# qm config 104
agent: 1
bios: ovmf
boot: order=ide0
cores: 6
cpu: host
ide0: lvm-nvme2TB:vm-104-disk-0,size=150G
machine: pc-q35-8.1
memory: 16384
name: XXX
net0: virtio=BC:24:11:53:17:0A,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=95717d0e-bd5b-4ade-a4a3-56a51add2cd6
sockets: 1
startup: order=10
vmgenid: 6670a37b-eaa3-42ae-b08c-c5ecd6ecc10f


Backup log attached
 

That's just a second's worth of logs. Follow the logs like this, then trigger a backup and see what gets logged.
 
That's just a second's worth of logs. Follow the logs like this, then trigger a backup and see what gets logged.

For what it's worth, not all VMs go down at the same time.
So I don't think it is a general adapter issue, as that would result in the node being completely unreachable?

Anything in the other logs/config I provided?
 
Further back in the kernel logs at the time of the backup, I found the following:

Code:
Mar 21 01:31:10 proxmox4 kernel: perf: interrupt took too long (5537 > 4011), lowering kernel.perf_event_max_sample_rate to 36000

I only saw this on Mar 21.

Another example: the backup ran from 00:00 until 06:37; these are the kernel logs:

Code:
Mar 23 00:07:25 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 23 00:07:25 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 23 01:20:20 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 23 01:20:20 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 23 02:45:51 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 23 02:45:52 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 23 05:00:27 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 23 05:00:27 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 23 05:02:35 proxmox4 kernel: tap105i0: entered promiscuous mode
Mar 23 05:02:36 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered disabled state
Mar 23 05:02:36 proxmox4 kernel: fwpr105p0: entered allmulticast mode
Mar 23 05:02:36 proxmox4 kernel: fwpr105p0: entered promiscuous mode
Mar 23 05:02:36 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered forwarding state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 23 05:02:36 proxmox4 kernel: fwln105i0: entered allmulticast mode
Mar 23 05:02:36 proxmox4 kernel: fwln105i0: entered promiscuous mode
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered forwarding state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered disabled state
Mar 23 05:02:36 proxmox4 kernel: tap105i0: entered allmulticast mode
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered blocking state
Mar 23 05:02:36 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered forwarding state
Mar 23 05:30:08 proxmox4 kernel: tap105i0: left allmulticast mode
Mar 23 05:30:08 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered disabled state
Mar 23 05:30:08 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 23 05:30:08 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered disabled state
Mar 23 05:30:08 proxmox4 kernel: fwln105i0 (unregistering): left allmulticast mode
Mar 23 05:30:08 proxmox4 kernel: fwln105i0 (unregistering): left promiscuous mode
Mar 23 05:30:08 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 23 05:30:08 proxmox4 kernel: fwpr105p0 (unregistering): left allmulticast mode
Mar 23 05:30:08 proxmox4 kernel: fwpr105p0 (unregistering): left promiscuous mode
Mar 23 05:30:08 proxmox4 kernel: vmbr0: port 10(fwpr105p0) entered disabled state
Mar 23 06:19:59 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 23 06:19:59 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
 
Can you share this too?
Bash:
lsblk -o+FSTYPE,LABEL,MODEL,TRAN
lsusb
lsusb -vt
I wonder if this happens via SMB/CIFS too.
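To test that, a parallel CIFS storage entry can be added and the backup job pointed at it for one run; a sketch with hypothetical names and addresses:

Bash:
pvesm add cifs nas-smb --server 192.168.1.10 --share backup \
    --username backupuser --password 'secret' --content backup

If the hiccups disappear over SMB/CIFS, that would point at the NFS client side.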
 
Can you share this too?
Bash:
lsblk -o+FSTYPE,LABEL,MODEL,TRAN
lsusb
lsusb -vt
I wonder if this happens via SMB/CIFS too.

Something very strange I've noticed, and I don't know if it makes sense, is the following:
As soon as I change settings in the backup job (like yesterday, when I disabled all limitations that were in place to check the kernel logs), the next run of the backup job usually runs fine (or at least a lot better) without noticeable issues.
After 1 or 2 runs, the issue starts again.

Code:
root@proxmox4:~# lsblk -o+FSTYPE,LABEL,MODEL,TRAN
NAME                                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS FSTYPE      LABEL MODEL               TRAN
sda                                   8:0    0 279.5G  0 disk                               VK0300GDUQV         sata
├─sda1                                8:1    0  1007K  0 part
├─sda2                                8:2    0     1G  0 part /boot/efi   vfat
└─sda3                                8:3    0 278.5G  0 part             LVM2_member
  ├─pve-swap                        252:0    0     8G  0 lvm  [SWAP]      swap
  ├─pve-root                        252:1    0  79.6G  0 lvm  /           ext4
  ├─pve-data_tmeta                  252:2    0   1.7G  0 lvm
  │ └─pve-data-tpool                252:6    0 171.3G  0 lvm
  │   └─pve-data                    252:7    0 171.3G  1 lvm
  └─pve-data_tdata                  252:3    0 171.3G  0 lvm
    └─pve-data-tpool                252:6    0 171.3G  0 lvm
      └─pve-data                    252:7    0 171.3G  1 lvm
nvme0n1                             259:0    0   1.9T  0 disk             LVM2_member       KXG80ZNV2T04 KIOXIA nvme
├─lvm--nvme2TB-lvm--nvme2TB_tmeta   252:4    0  15.9G  0 lvm
│ └─lvm--nvme2TB-lvm--nvme2TB-tpool 252:8    0   1.8T  0 lvm
│   ├─lvm--nvme2TB-lvm--nvme2TB     252:9    0   1.8T  1 lvm
│   ├─lvm--nvme2TB-vm--108--disk--0 252:10   0   128G  0 lvm
│   ├─lvm--nvme2TB-vm--103--disk--0 252:11   0   300G  0 lvm
│   ├─lvm--nvme2TB-vm--101--disk--0 252:12   0   274G  0 lvm
│   ├─lvm--nvme2TB-vm--102--disk--0 252:13   0    16G  0 lvm
│   ├─lvm--nvme2TB-vm--104--disk--0 252:14   0   150G  0 lvm
│   ├─lvm--nvme2TB-vm--100--disk--1 252:15   0   120G  0 lvm
│   ├─lvm--nvme2TB-vm--100--disk--0 252:16   0     4M  0 lvm
│   ├─lvm--nvme2TB-vm--100--disk--2 252:17   0     4M  0 lvm
│   ├─lvm--nvme2TB-vm--107--disk--0 252:18   0    32G  0 lvm
│   ├─lvm--nvme2TB-vm--105--disk--0 252:19   0  80.1G  0 lvm
│   └─lvm--nvme2TB-vm--106--disk--0 252:20   0    32G  0 lvm
└─lvm--nvme2TB-lvm--nvme2TB_tdata   252:5    0   1.8T  0 lvm
  └─lvm--nvme2TB-lvm--nvme2TB-tpool 252:8    0   1.8T  0 lvm
    ├─lvm--nvme2TB-lvm--nvme2TB     252:9    0   1.8T  1 lvm
    ├─lvm--nvme2TB-vm--108--disk--0 252:10   0   128G  0 lvm
    ├─lvm--nvme2TB-vm--103--disk--0 252:11   0   300G  0 lvm
    ├─lvm--nvme2TB-vm--101--disk--0 252:12   0   274G  0 lvm
    ├─lvm--nvme2TB-vm--102--disk--0 252:13   0    16G  0 lvm
    ├─lvm--nvme2TB-vm--104--disk--0 252:14   0   150G  0 lvm
    ├─lvm--nvme2TB-vm--100--disk--1 252:15   0   120G  0 lvm
    ├─lvm--nvme2TB-vm--100--disk--0 252:16   0     4M  0 lvm
    ├─lvm--nvme2TB-vm--100--disk--2 252:17   0     4M  0 lvm
    ├─lvm--nvme2TB-vm--107--disk--0 252:18   0    32G  0 lvm
    ├─lvm--nvme2TB-vm--105--disk--0 252:19   0  80.1G  0 lvm
    └─lvm--nvme2TB-vm--106--disk--0 252:20   0    32G  0 lvm

root@proxmox4:~# lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 051d:0002 American Power Conversion Uninterruptible Power Supply
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 002 Device 002: ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet
Bus 002 Device 003: ID 18d1:9302 Google Inc.

root@proxmox4:~# lsusb -vt
/:  Bus 001.Port 001: Dev 001, Class=root_hub, Driver=xhci_hcd/16p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 001: Dev 002, If 0, Class=Human Interface Device, Driver=usbfs, 1.5M
        ID 051d:0002 American Power Conversion Uninterruptible Power Supply
/:  Bus 002.Port 001: Dev 001, Class=root_hub, Driver=xhci_hcd/10p, 10000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
    |__ Port 007: Dev 003, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M
        ID 18d1:9302 Google Inc.
    |__ Port 008: Dev 002, If 0, Class=Vendor Specific Class, Driver=ax88179_178a, 5000M
        ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet

The USB devices are not used on the host itself, but are passed through to VMs: an APC UPS, a USB NIC, and a Google Coral.
 
Hmm. I wanted to make sure this issue is not USB related (hence checking the storage transport) because I don't see anything else that sticks out to me. It's possible it's not directly network related. The node's Summary graphs might show something out of the ordinary.
 
Bus 002 Device 002: ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet
So you are using a network connection via USB - this is most likely your issue on the mini PC. Those minuscule chipsets don't handle load very well - and can/do crash USB buses. You may want to check temps too.
 
So you are using a network connection via USB - this is most likely your issue on the mini PC. Those minuscule chipsets don't handle load very well - and can/do crash USB buses. You may want to check temps too.
That is not the case; I have an onboard NIC that I use.
As mentioned: all USB devices are passed to VMs and are not in use on the Proxmox host.
 
OK. But you mention that the VMs become network compromised - so if the USB bus crashed, that would amount to the same.
Does the PVE host also become network compromised at that time?

Of the 64 GB of RAM, how much is left allocated for the PVE host?

If the USB bus crashes, I guess that all VMs and the host would have issues at that moment.
That doesn't seem to be the case; VMs show network unreachability at the moment of backup/snapshot.
I run Gatus to monitor them, and the times of outage are not the same for all VMs.
Also, the host does not become unreachable.

Memory usage is pretty high: 87.23% (54.24 GiB of 62.18 GiB) is in use.
I lowered this a little to check if it makes the situation better.
Thanks for your insights.
 
I moved some VMs to another freshly set up host and could remove the USB network adapter.
With these actions, memory usage dropped to around 50%; kernel logs during backup are below.

But the VM connectivity issues are still there.

Code:
Mar 27 00:05:05 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 27 00:05:05 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 27 00:28:11 proxmox4 kernel: usb 2-7: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 27 00:28:11 proxmox4 kernel: usb 2-7: LPM exit latency is zeroed, disabling LPM.
Mar 27 01:43:12 proxmox4 kernel: tap105i0: entered promiscuous mode
Mar 27 01:43:13 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered disabled state
Mar 27 01:43:13 proxmox4 kernel: fwpr105p0: entered allmulticast mode
Mar 27 01:43:13 proxmox4 kernel: fwpr105p0: entered promiscuous mode
Mar 27 01:43:13 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered forwarding state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 27 01:43:13 proxmox4 kernel: fwln105i0: entered allmulticast mode
Mar 27 01:43:13 proxmox4 kernel: fwln105i0: entered promiscuous mode
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered forwarding state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered disabled state
Mar 27 01:43:13 proxmox4 kernel: tap105i0: entered allmulticast mode
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered blocking state
Mar 27 01:43:13 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered forwarding state
Mar 27 01:45:43 proxmox4 kernel: tap105i0: left allmulticast mode
Mar 27 01:45:43 proxmox4 kernel: fwbr105i0: port 2(tap105i0) entered disabled state
Mar 27 01:45:43 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 27 01:45:43 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered disabled state
Mar 27 01:45:43 proxmox4 kernel: fwln105i0 (unregistering): left allmulticast mode
Mar 27 01:45:43 proxmox4 kernel: fwln105i0 (unregistering): left promiscuous mode
Mar 27 01:45:43 proxmox4 kernel: fwbr105i0: port 1(fwln105i0) entered disabled state
Mar 27 01:45:43 proxmox4 kernel: fwpr105p0 (unregistering): left allmulticast mode
Mar 27 01:45:43 proxmox4 kernel: fwpr105p0 (unregistering): left promiscuous mode
Mar 27 01:45:43 proxmox4 kernel: vmbr0: port 5(fwpr105p0) entered disabled state
Mar 27 01:46:24 proxmox4 kernel: tap107i0: entered promiscuous mode
Mar 27 01:46:24 proxmox4 kernel: vmbr0: port 5(tap107i0) entered blocking state
Mar 27 01:46:24 proxmox4 kernel: vmbr0: port 5(tap107i0) entered disabled state
Mar 27 01:46:24 proxmox4 kernel: tap107i0: entered allmulticast mode
Mar 27 01:46:24 proxmox4 kernel: vmbr0: port 5(tap107i0) entered blocking state
Mar 27 01:46:24 proxmox4 kernel: vmbr0: port 5(tap107i0) entered forwarding state
Mar 27 01:47:47 proxmox4 kernel: tap107i0: left allmulticast mode
Mar 27 01:47:47 proxmox4 kernel: vmbr0: port 5(tap107i0) entered disabled state

I will now set up backups on the new host and test whether I see the issue there too; the hardware is comparable.