Instability when guest uploads to S3 via AWS CLI

telgareith

New Member
Oct 16, 2023
Host: Dell Latitude 5501
CPU: i7-9850H
Memory: 32GB RAM
LAN: I219-LM
Disk:
  • Samsung SSD 860 2TB (SATA) Note: Root Pool, ZFS
  • Samsung SSD PM991 512GB (NVMe), ZFS

VMs:
  1. (Not running) Ubuntu 22.04: 1x socket 4x cores (Host CPU type), VirtIO SCSI disk, 16GB RAM (balloon=1, but min and max both at 16GB), VirtIO network to vmbr0, 250GB disk discard=ON ssd=1 (Ubuntu 22.04 default, ext4?) disk #1, guest tools installed*
  2. Linux Mint 21.2: 1x socket 4x cores (KVM64 type), VirtIO SCSI Single disk, 12GB RAM (balloon=0), VirtIO network to vmbr0, 60GB disk discard=ON ssd=1 iothread=1 (guest formatted ZFS) disk #1, guest tools installed*
* I did not explicitly install VirtIO drivers, if those are separate from the tools.

LXC:
none

containerd (on host):
  • The Lounge (IRC client)

Problem: I ran `aws s3 cp` inside VM #2 (Linux Mint) to upload the contents of an SMB share (850GB in total, with files large enough to saturate gigabit) and *everything* started stuttering horribly:
  • SSH to host: Repeated disconnects, typing 'top' took seconds to appear.
  • SSH to VM #2: Repeated disconnects, 'top' also took seconds.
  • The Lounge (container): repeated ping timeouts on Libera.

I managed to gather the following.
Host:
  1. A kvm process was using ~100% of CPU according to `top`.
  2. Load average from top: 0.18, 0.52, 0.60 (same 20min period).
  3. RAM usage was at ~20GB.
  4. iftop was reporting 1.10Gb Rx traffic.
  5. `zpool iostat 1` showed suspiciously low numbers: a few MB/s read from rpool, double-digit IOPS, and zero writes.

VM #2:
  1. aws process was using 50%-100% of CPU
  2. cinnamon process was using 25% CPU until I killed it. (stated for completeness)
  3. smbd using 15%
  4. load average from top: 4.90, 5.38, 3.72 (after waiting 20min)

Other:
EdgeRouter 4 reported outgoing traffic approaching 400 Mbit/s in fits and starts. Internet is Frontier FiOS 500/500, with no additional devices between the router and the ONT/demarc.


Where do I start troubleshooting this? Or, if I've already given the answer, what's wrong? I can understand the guest running the aws process crawling, but nothing I see explains why the host would have issues (Proxmox web UI, IRC container, and SSH). With such low host CPU usage compared to the guest, I suspect something is doing a lot of waiting, but I don't know what. This is reinforced by things taking 10-15 seconds to return to 'normal' in the guest and on the host after killing the aws process.

Edit: I think I found a smoking gun:
Code:
e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
Oct 16 18:44:22 proxlap kernel: [608080.398118]   TDH                  <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   TDT                  <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_use          <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_clean        <9f>
Oct 16 18:44:22 proxlap kernel: [608080.398118] buffer_info[next_to_clean]:
Oct 16 18:44:22 proxlap kernel: [608080.398118]   time_stamp           <1090e8059>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_watch        <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   jiffies              <1090e8ab0>
Oct 16 18:44:22 proxlap kernel: [608080.398118]   next_to_watch.status <0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] MAC Status             <80083>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Status             <796d>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY 1000BASE-T Status  <3800>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Extended Status    <3000>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PCI Status             <10>
Oct 16 18:44:23 proxlap kernel: [608081.485420] e1000e 0000:00:1f.6 eno2: Reset adapter unexpectedly
Oct 16 18:44:23 proxlap kernel: [608081.576454] vmbr0: port 1(eno2) entered disabled state
Oct 16 18:44:27 proxlap kernel: [608085.252451] e1000e 0000:00:1f.6 eno2: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
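
For anyone reading the dump above, here is a rough bash sketch for decoding a few of the hex fields. The function names are made up for illustration. The 256-entry TX ring and HZ=250 are assumptions that do not appear in the log, so check both against your own system before trusting the numbers.

```shell
#!/usr/bin/env bash
# Decode a few hex fields from an e1000e "Detected Hardware Unit Hang" dump.
# Assumptions (not taken from the log): TX ring of 256 descriptors, HZ=250.

# Descriptors queued between head (TDH) and tail (TDT), allowing for wrap-around.
# Usage: tx_backlog <TDH hex> <TDT hex> [ring size]
tx_backlog() {
    local head=$((16#$1)) tail=$((16#$2)) ring=${3:-256}
    echo $(( (tail - head + ring) % ring ))
}

# Jiffies elapsed between the stuck buffer's time_stamp and the dump.
# Usage: stalled_jiffies <time_stamp hex> <jiffies hex>
stalled_jiffies() {
    echo $(( 16#$2 - 16#$1 ))
}

# Convert jiffies to milliseconds for a given HZ (defaults to the assumed 250).
# Usage: jiffies_to_ms <jiffies> [HZ]
jiffies_to_ms() {
    echo $(( $1 * 1000 / ${2:-250} ))
}
```

With the values from the log, `tx_backlog a0 f5` gives 85 pending descriptors and `stalled_jiffies 1090e8059 1090e8ab0` gives 2647 jiffies. If HZ really is 250, `jiffies_to_ms 2647` works out to 10588 ms.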
 
Hi,
This sounds like it could be the issue reported here: https://forum.proxmox.com/threads/e1000-driver-hang.58284
That issue is likely a hardware/firmware bug, and you might need to disable offloading (i.e. TSO/GSO) on the interface, as suggested here: https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-368615
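
A minimal sketch of what that could look like in bash, assuming the interface is `eno2` as in the log above. `disable_offload`, `count_hangs`, and the `DRY_RUN` variable are hypothetical helpers for illustration; the underlying call is `ethtool -K <iface> tso off gso off`.

```shell
#!/usr/bin/env bash
# Sketch: turn off TSO/GSO on a NIC, then check the kernel log for further hangs.
# DRY_RUN and both function names are illustrative, not part of any tool.

# Usage: disable_offload <iface>
disable_offload() {
    local iface=$1
    local cmd=(ethtool -K "$iface" tso off gso off)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"    # only print what would be run
    else
        "${cmd[@]}"         # needs root; the setting is lost on reboot
    fi
}

# Count "Hardware Unit Hang" reports in kernel log text read from stdin.
count_hangs() {
    grep -c 'Detected Hardware Unit Hang' || true
}

# Example (run as root on the host):
#   DRY_RUN=1 disable_offload eno2
#   disable_offload eno2
#   journalctl -k --since "1 hour ago" | count_hangs
```

Since the ethtool setting does not survive a reboot, it would need to be reapplied at boot if it helps, for example from a `post-up` line for the interface in /etc/network/interfaces. Re-running the upload and checking that `count_hangs` stays at 0 is one way to see whether the change made a difference.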
 
