Host: Dell Latitude 5501
CPU: i7-9850H
Memory: 32GB RAM
LAN: I219-LM
Disk:
- Samsung SSD 860 2TB (SATA) Note: Root Pool, ZFS
- Samsung SSD PM991 512GB (NVMe), ZFS
VMs:
- (Not running) Ubuntu 22.04: 1 socket, 4 cores (Host CPU type), VirtIO SCSI disk, 16GB RAM (balloon=1, but min & max both at 16GB), VirtIO network to vmbr0, 250GB disk #1, discard=on ssd=1 (Ubuntu 22.04 default, ext4?), guest tools installed*
- Linux Mint 21.2: 1 socket, 4 cores (KVM64 type), VirtIO SCSI Single disk, 12GB RAM (balloon=0), VirtIO network to vmbr0, 60GB disk #1, discard=on ssd=1 iothread=1 (guest formatted ZFS), guest tools installed* (rough config-file sketch after this list)
LXC:
none
containerd(on host):
- the Lounge (IRC Client)
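For reference, the Mint VM's settings translate to roughly the following in /etc/pve/qemu-server/ (VMID, storage name, and MAC address are placeholders, not copied from the real config):

Code:
balloon: 0
cores: 4
cpu: kvm64
memory: 12288
net0: virtio=DE:AD:BE:EF:00:02,bridge=vmbr0
scsi0: local-zfs:vm-102-disk-0,discard=on,iothread=1,size=60G,ssd=1
scsihw: virtio-scsi-single
sockets: 1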
Problem: I executed `aws s3 cp` inside VM #2 (Linux Mint) to upload the contents of an SMB share (850GB in total, files large enough to saturate gigabit) and *everything* started stuttering horribly (rough command shown after the list):
- SSH to host: Repeated disconnects; typing 'top' took seconds for the characters to appear.
- SSH to VM #2: Repeated disconnects; 'top' also took seconds to appear.
- The Lounge (container): repeated ping timeouts on Libera.
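The upload itself was roughly of this form (bucket name and mount point are placeholders):

Code:
# SMB share mounted in the guest, recursive upload to S3
aws s3 cp /mnt/share s3://my-bucket/backup/ --recursive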
I managed to gather this:
Host:
- a kvm process was using ~100% of CPU according to `top`
- load average from top: 0.18, 0.52, 0.60 (same 20min period)
- RAM usage was at ~20GB
- iftop reported 1.10Gb Rx traffic
- `zpool iostat 1` showed suspiciously low numbers: a few MB/s read from rpool, double-digit IOPS, and zero writes.
VM #2:
- aws process was using 50%-100% of CPU
- cinnamon process was using 25% CPU until I killed it (stated for completeness)
- smbd was using 15%
- load average from top: 4.90, 5.38, 3.72 (after waiting 20min)
Other:
The EdgeRouter 4 reported outgoing traffic approaching 400Mbit/s in fits and spurts. Internet is Frontier FiOS 500/500, with no additional devices between the router and the ONT/demarc.
Where do I start troubleshooting this? Or, if I've already given the answer above, what's wrong? I can understand the guest running the aws process crawling, but nothing I see explains why the host itself has issues (Proxmox web UI, the IRC container, and ssh). With the host's CPU usage so low compared to the guest's, I suspect something is doing a lot of waiting, but I don't know what. This is reinforced by things taking 10-15 seconds to return to 'normal' in the guest and on the host after killing the aws process.
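For what it's worth, the next round of data I'd gather to pin down where the waiting happens (all standard tools; iostat is from the sysstat package, and the /proc/pressure files need a reasonably recent kernel):

Code:
# per-second summary on the host; a high 'wa' column points at storage waits
vmstat 1
# per-device latency/utilisation (sysstat package)
iostat -x 1
# pressure-stall info: how long tasks were stalled on IO vs CPU
cat /proc/pressure/io /proc/pressure/cpu
# watch the kernel log live for NIC resets or controller errors during the copy
dmesg -w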
Edit: I think I found a smoking gun:
Code:
e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
Oct 16 18:44:22 proxlap kernel: [608080.398118] TDH <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] TDT <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118] next_to_use <f5>
Oct 16 18:44:22 proxlap kernel: [608080.398118] next_to_clean <9f>
Oct 16 18:44:22 proxlap kernel: [608080.398118] buffer_info[next_to_clean]:
Oct 16 18:44:22 proxlap kernel: [608080.398118] time_stamp <1090e8059>
Oct 16 18:44:22 proxlap kernel: [608080.398118] next_to_watch <a0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] jiffies <1090e8ab0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] next_to_watch.status <0>
Oct 16 18:44:22 proxlap kernel: [608080.398118] MAC Status <80083>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Status <796d>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY 1000BASE-T Status <3800>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PHY Extended Status <3000>
Oct 16 18:44:22 proxlap kernel: [608080.398118] PCI Status <10>
Oct 16 18:44:23 proxlap kernel: [608081.485420] e1000e 0000:00:1f.6 eno2: Reset adapter unexpectedly
Oct 16 18:44:23 proxlap kernel: [608081.576454] vmbr0: port 1(eno2) entered disabled state
Oct 16 18:44:27 proxlap kernel: [608085.252451] e1000e 0000:00:1f.6 eno2: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
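If this is the known e1000e/I219 "Detected Hardware Unit Hang" behaviour under sustained transmit load, the mitigation most often suggested is turning off segmentation offload on the physical port (interface name taken from the log above; check current settings with `ethtool -k eno2` first):

Code:
# disable TCP/generic segmentation offload on the bridge uplink (takes effect immediately, not persistent)
ethtool -K eno2 tso off gso off
# persistent variant: add under the vmbr0 stanza in /etc/network/interfaces
#   post-up /sbin/ethtool -K eno2 tso off gso off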