Instability when guest uploads to S3 via AWS CLI

telgareith

New Member
Oct 16, 2023
Host: Dell Latitude 5501
CPU: i7-9850H
Memory: 32GB RAM
LAN: I219-LM
Disk:
  • Samsung SSD 860 2TB (SATA) Note: Root Pool, ZFS
  • Samsung SSD PM991 512GB (NVMe), ZFS

VMs:
  1. (Not running) Ubuntu 22.04: 1x socket 4x cores (Host CPU type), VirtIO SCSI disk, 16GB RAM (balloon=1, but min and max both at 16GB), VirtIO network to vmbr0, 250GB disk discard=ON ssd=1 (Ubuntu 22.04 default, ext4?) disk #1, guest tools installed*
  2. Linux Mint 21.2: 1x socket 4x cores (KVM64 type), VirtIO SCSI Single disk, 12GB RAM (balloon=0), VirtIO network to vmbr0, 60GB disk discard=ON ssd=1 iothread=1 (guest formatted ZFS) disk #1, guest tools installed*
* I did not explicitly install VirtIO drivers, if that's separate from the tools.

LXC:
none

containerd (on host):
  • The Lounge (IRC client)

Problem: I ran `aws s3 cp` inside VM #2 (Linux Mint) to upload the contents of an SMB share (850GB in total, with files large enough to saturate gigabit) and *everything* started stuttering horribly:
  • SSH to host: Repeated disconnects, typing 'top' took seconds to appear.
  • SSH to VM #2: Repeated disconnects, 'top' also took seconds.
  • The Lounge (container): repeated ping timeouts on Libera.

I managed to gather this:
Host:
  1. a kvm process was using ~100% of cpu according to `top`
  2. load average from top: 0.18, 0.52, 0.60 (same 20min period)
  3. RAM usage was at ~20GB
  4. iftop reporting 1.10Gb Rx traffic
  5. `zpool iostat 1` showed suspiciously low numbers: a few MB/s read from rpool, double-digit IOPS, and zero writes.

VM #2:
  1. aws process was using 50%-100% of CPU
  2. cinnamon process was using 25% CPU until I killed it. (stated for completeness)
  3. smbd using 15%
  4. load average from top: 4.90, 5.38, 3.72 (after waiting 20min)

Other:
EdgeRouter 4 reported outgoing traffic approaching 400 Mbit/s in fits and spurts. Internet is Frontier FiOS 500/500, with no additional devices between the router and the ONT/demarc.


Where do I start troubleshooting this? Or, if I've already given the answer, what's wrong? I can understand the guest running the aws process crawling, but nothing I see gives a reason for the host to have issues (Proxmox web UI, IRC container, and SSH). With host CPU usage so low compared to the guest's, I suspect something is doing a lot of waiting, but I don't know what. This is reinforced by things taking 10-15 seconds to return to 'normal' in the guest and on the host after killing the aws process.

Edit: I think I found a smoking gun:
Code:
e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
Oct 16 18:44:22 proxlap kernel: [608080.398118]   TDH                  <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   TDT                  <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_use          <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_clean        <9f>
Oct 16 18:44:22 proxlap kernel: [608080.398118] buffer_info[next_to_clean]:
Oct 16 18:44:22 proxlap kernel: [608080.398118]   time_stamp           <1090e8059>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_watch        <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   jiffies              <1090e8ab0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_watch.status <0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] MAC Status             <80083>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Status             <796d>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY 1000BASE-T Status  <3800>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Extended Status    <3000>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PCI Status             <10>
Oct 16 18:44:23 proxlap kernel: [608081.485420] e1000e 0000:00:1f.6 eno2: Reset adapter unexpectedly
Oct 16 18:44:23 proxlap kernel: [608081.576454] vmbr0: port 1(eno2) entered disabled state
Oct 16 18:44:27 proxlap kernel: [608085.252451] e1000e 0000:00:1f.6 eno2: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
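
For anyone reading a dump like the one above, here is a rough sketch of how it could be checked. The helper names are made up for illustration, and it assumes the usual meaning of the fields: TDH/TDT are the transmit descriptor head/tail registers, and time_stamp is the jiffies value recorded when the oldest uncleaned buffer was queued. Ring wrap-around is ignored for simplicity.

```shell
# count_hangs: count e1000e hang reports in kernel log text read from stdin.
count_hangs() {
    grep -c 'Detected Hardware Unit Hang'
}

# pending_descriptors TDH TDT: descriptors queued but not yet sent by the NIC.
# Arguments are the hex values as printed in the dump, without "0x".
pending_descriptors() {
    echo $(( 0x$2 - 0x$1 ))
}

# jiffies_stalled TIME_STAMP JIFFIES: age of the oldest uncleaned buffer in
# jiffies. Converting to seconds needs the kernel's HZ value, not shown here.
jiffies_stalled() {
    echo $(( 0x$2 - 0x$1 ))
}

# Examples with the values from the dump:
#   pending_descriptors a0 f5            # -> 85
#   jiffies_stalled 1090e8059 1090e8ab0  # -> 2647
# On the host, assuming the kernel log is available through journalctl:
#   journalctl -k -b | count_hangs
```

A count that rises while the upload is running would suggest the stutter lines up with the NIC resets rather than with disk or CPU load.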
 
Hi,
This sounds like it could be the issue reported here: https://forum.proxmox.com/threads/e1000-driver-hang.58284
That issue is likely a hardware/firmware bug, and you might need to disable offloading (i.e. TSO/GSO) on the interface, as suggested here: https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-368615
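
As a minimal sketch of what disabling the offloads could look like with `ethtool` (the function names are only for illustration, and `eno2` is assumed to be the affected interface because that is the name in the log):

```shell
# show_offload IFACE: print the current TSO/GSO state of an interface.
show_offload() {
    ethtool -k "$1" | grep -E '^(tcp|generic)-segmentation-offload'
}

# disable_offload IFACE: turn off TSO and GSO (needs root).
# This only changes the running state and is lost on reboot.
disable_offload() {
    ethtool -K "$1" tso off gso off
}

# Example, run as root on the Proxmox host:
#   show_offload eno2
#   disable_offload eno2
#   show_offload eno2
```

If the hangs stop, the setting still has to be made persistent. One common approach is a `post-up` line running the same `ethtool -K` command in the interface's stanza in `/etc/network/interfaces`, but please check the linked post for the exact form that was reported to work.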