Sudden High I/O Delay -- System becomes unresponsive

Jugrnot
My previously rock-solid system has become nearly unusable and I'm not even remotely sure where to look. I understand my Proxmox version needs upgrading, but unfortunately there are issues preventing me from doing so at this time. Please help?!

Last night around 2300 hours the system became completely unresponsive. After hitting the reset button, I managed to get logged in and found that something had filled up the / (root) partition, which I was able to clear out, and got my CTs/VMs started up. Ever since then, the system continues to exhibit >70% I/O wait and becomes unresponsive. Checking top along with iotop, nothing shows any actual load on the disks, yet the I/O wait remains very high and the system becomes unresponsive for long periods of time. For example, running "pct list" takes minutes to display anything.

Proxmox is bare metal on a ZFS mirror vdev of 2x HGST HUS72302CLAR2000 7200 RPM 2 TB disks. A ZFS scrub hasn't found any issues with the disks, nor has smartctl (as far as I can decipher it, anyway). Disk1 and Disk2 smartctl details.

As another test, I shut down all CTs but one and all VMs but two to see if anything changed, and it did not. The three I left running are necessary for my network/internet functionality and have very little overhead. With nothing running but those three, I still have this:

[screenshot: I/O delay still high with only three guests running]

Once again, checking iotop/htop/top, nothing is accessing the disks. Attempting to run 'pveversion -v' at the console took almost 45 seconds to display any output.

root@pve2 ~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.203-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.195-1-pve: 5.4.195-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+deb10u1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
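
(Side note, since iotop shows no throughput: when I/O wait is pegged but nothing seems to touch the disks, it is worth looking for processes stuck in uninterruptible sleep (D state) and for kernel hung-task warnings. A generic sketch, nothing system-specific assumed:)
Code:
# list processes in uninterruptible sleep (D state) and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# check the kernel log for hung-task warnings
dmesg | grep -i "blocked for more than"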
 
Look at
Code:
zpool iostat -vy 1

I noticed my local-zfs volume was full, so I deleted some old snapshots; running that command while doing so yields:

[screenshot: zpool iostat output]

I/O wait varies from 14% to 70%, still running just one CT and two VMs.
 
If a ZFS pool fills up or runs low on free space, it can cause high system load. The amount of free space you have now isn't much either.

You can also run this and check what specific operations load the pool.
Code:
zpool iostat -q
 
I quite regret only using 2 TB disks for my install. I'm guessing the only realistic way to upgrade the bare-metal install to larger disks would be a fresh install? Given my node is basically down at this moment, maybe that would be my best bet.

My hesitation to update Proxmox is my lack of knowledge of ZFS pools. Specifically, my concern is importing my ZFS storage pool, which holds about 50 TB of data I'd be rather upset to lose, which is why I'm still on PVE 6.4.
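
From what I've read so far, the data pool itself should survive a reinstall and just be re-imported afterwards; a rough sketch of what I think the sequence is, with "tank" standing in for my storage pool's name:
Code:
zpool export tank     # cleanly export the data pool before the reinstall
zpool import          # after the reinstall, list the pools the new system can see
zpool import tank     # import it by name (add -f only if it was not exported cleanly)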
 
Not necessarily. ZFS provides a lot of possibilities for such a situation. You can attach another pair of disks to this pool, if you have the technical capability, and then disconnect the smaller ones.
You can also use the zfs send | zfs receive technique to replace the disks in this pool.

Look for what could be taking up so much space on this pool; maybe move backups, templates or ISO files to another place.
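
A minimal sketch of that send | receive approach - the pool names and the snapshot name here are only placeholders:
Code:
# replicate everything recursively from the old pool to the new one
zfs snapshot -r rpool@migrate
zfs send -R rpool@migrate | zfs receive -F newpool
# for a root pool, the bootloader and mountpoints still need to be set up on the new disks afterwards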
 

Did a bit of googling, and I believe I might have found a solution to this.

Can I literally just add, say, two 8 TB disks to the mirror via 'attach', let the 8 TB disks resilver, then detach both of the 2 TB disks, followed up with 'zpool online -e rpool' to expand out to 8 TB? Will that actually work on a live Proxmox filesystem? Is there any real danger detaching the 2 TB disks while the system is live?

Thanks a lot for your guidance on this, I'm learning a lot more.
 
Yes, you can do that. The pool will evacuate data from the disconnected disks. You have to plan it well and practice a bit, because these disks are bootable: you have to prepare the new disks to be bootable, create the partitions and refresh the system boot.
With this method, if the disks are healthy, everything should work.
So you use
Code:
zpool add poolxxx mirror /dev/disk/by-id/partx /dev/disk/by-id/party
and to disconnect
Code:
zpool remove poolxxx mirror-0
https://openzfs.github.io/openzfs-docs/man/master/8/zpool-remove.8.html
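
For the attach/resilver/detach route asked about above, here is a rough, untested sketch of the usual steps on a PVE root mirror; the device names are placeholders, and each step should be checked against the Proxmox documentation for your version:
Code:
# sdX = an existing rpool disk, sdY = one of the new (larger) disks; use /dev/disk/by-id paths in practice
sgdisk /dev/sdX -R /dev/sdY              # copy the partition layout to the new disk
sgdisk -G /dev/sdY                       # give the copy new random GUIDs
# if the new disk is larger, grow partition 3 now, otherwise the pool cannot expand later
proxmox-boot-tool format /dev/sdY2       # prepare the ESP on the new disk (proxmox-boot-tool is available from PVE 6.4 on)
proxmox-boot-tool init /dev/sdY2         # install the bootloader onto it
zpool attach rpool <old-disk-part3> /dev/sdY3   # mirror onto the new disk, then wait for the resilver
# repeat for the second new disk, then once both resilvers have finished:
zpool detach rpool <old-disk1-part3>
zpool detach rpool <old-disk2-part3>
zpool online -e rpool /dev/sdY3          # expand to the new size (or set autoexpand=on beforehand)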
 
Hello Team

Network issue when HA takes over


I have 3 nodes with a cluster and HA enabled. When a node goes down, the VMs move to another node, but the network is not working and the VM is restarted.
How can I achieve the following: when my node goes down and a VM is moved to another node, the VM should not restart and its network should stay pingable?
For my setup with 3 nodes:
1) Cluster enabled
2) HA enabled
3) Ceph configuration is done
4) Ceph monitor is also configured, but no OSDs are configured, because my VMs are running on a shared drive. Do we still need Ceph OSDs? I don't have much storage; I have a shared drive.

Please advise how my VMs can be moved to another node with the network still working and without a disconnect.
 

Hey all, I wanted to circle back on this and close the loop, as I'm 99% sure I figured out what the actual issue was and seem to have resolved it entirely. Things really unraveled on Saturday and several discoveries were made.


The majority of the containers and virtual machines my box runs don't really have much I/O on the host disks; the few that do have a lot of I/O use a separate ZFS storage pool. Without getting too far into the meat and potatoes of my setup, I discovered that if I killed my Home Assistant virtual machine, the system I/O wait diminished to almost nothing and the system returned to normal functionality. I waited a few hours, started Home Assistant back up, and after about an hour the system once again became completely unresponsive. This led me down a false trail thinking the system disks were failing, when in reality it was something entirely different.

Apparently the high amount of data Home Assistant was writing to the database was the issue.

I plopped in a pair of disks, created a new ZFS mirror vdev, and moved the Home Assistant "disk" off the default Proxmox location on the host hard disks over to the new storage pool. Friggin POOF. Legit, all of my problems were GONE. Yeah, the system still reports "high I/O", but it literally doesn't matter because everything still works!! Knowing this, I've decided to move 100% of my CT and VM disks off the host root disks over to the new ZFS vdev and have noticed some incredible things.
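
For anyone wanting to do the same from the CLI, I believe the rough shape of it on PVE 6.4 is something like this; the pool/storage name "vmpool", the disk IDs and the guest IDs are all placeholders:
Code:
# create the new mirror and register it as a PVE storage
zpool create vmpool mirror /dev/disk/by-id/<new-disk-1> /dev/disk/by-id/<new-disk-2>
pvesm add zfspool vmpool --pool vmpool --content images,rootdir
# move guest disks off rpool (IDs and disk names are examples)
qm move_disk 100 scsi0 vmpool --delete 1
pct move_volume 101 rootfs vmpool --delete 1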

Four days ago, if I restarted the VM that runs Home Assistant, it would take 30 minutes for the web app to even load, much less everything else. Now, with the HA disk on another vdev, after the initial VM boot presents the login prompt, I can immediately access the web app. My Plex server plays things immediately instead of buffering for 20-30 seconds; hell, I even queried what the outside temperature was on December 13th, 2021 in Home Assistant in 10 seconds vs the 5 minutes it previously took.

Nowhere have I read that you should put your VM/CT storage on a disk other than the host disks.. but here we are. If that kind of documentation exists, I haven't read it nor could I find it over the last couple of years fighting this issue. So yeah.. if you spin up a new Proxmox host, set up dedicated physical disks for your VMs/CTs to use. It probably wouldn't be the same problem if your bare metal had SSDs... but I'm not rich.
 
Nowhere have I read you should put your VM/CT storage on another disk other than the host disks

Some strategic decisions are "common sense". The classic approach is to separate the OS/PVE, the VMs' virtual disks and the user data. This is still recommended, but most mini PCs in a home-lab environment just don't have enough SATA ports or M.2 slots for this, and so it is often tolerated to just use the "rpool" for everything. My personal homelab does it for this specific reason, while in my day job I would never tolerate it.

In your case the culprit seems to be the write I/O of Home Assistant, which may (just guessing!) be synchronous - a lot of database applications do this to ensure data integrity. Asynchronous writes are collected in RAM and flushed in batches - that works well even on HDDs. But synchronous writes have to be committed to the on-disk ZIL before they return, so the slowness of rotating rust is actually perceptible, as each "write" has to wait for several (more than one) physical head movements. A good solution is to switch to SSDs, like you did :-)

You created another pool with those two SSDs. That's fine. Additionally, I recommend searching for "Enterprise Class" vs. "Consumer grade" and "Power-Loss Protection"/PLP.

Another approach would have been to add those two SSDs as a "special device" to the existing HDD pool. That has also been discussed here in the forum and I mentioned it here: https://forum.proxmox.com/threads/f...37/#-combining-classic-hdd-plus-fast-ssd-nvme
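
For context, adding a special vdev to an existing pool looks roughly like this; the pool name and device paths are placeholders, the special vdev should itself be mirrored (losing it loses the pool), and only newly written metadata and small blocks land on it:
Code:
# attach a mirrored special vdev for metadata (and optionally small data blocks)
zpool add tank special mirror /dev/disk/by-id/<ssd-1> /dev/disk/by-id/<ssd-2>
# optionally let small data blocks live on the SSDs too, per dataset
zfs set special_small_blocks=16K tank/vmdata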
 
When you zoom out and really think about it, it makes a lot of sense not to have VMs/CTs share the host disks, especially if they're rust; I just never really thought about it until I started experiencing problems. You are correct: the MariaDB add-on I'm using does in fact use synchronous writes, but I had to look that up to confirm, as that exceeds my knowledge of databases. Your explanation of the difference between the two really connects the dots on my system performance issues with Home Assistant on rpool.

I didn't actually switch to SSDs for the VM pool yet; I just used a pair of 8 TB HGST 7200s I had "in stock" as pool spares. I currently have a 2 TB HGST SAS SSD in my "inventory", but part of adding a new pool was to resolve perpetually running out of disk space for the VMs/CTs, so even if I had two, that wouldn't have worked for anything other than maybe the "special device". I still might consider doing that eventually, but I need to resolve the issue of running out of open bays on the chassis before going that route. I haven't cracked that chassis open in a while, but I think I recall the system board *might* have a pair of SATA ports onboard.. I have been on the lookout for a matching 12 or 24 bay Supermicro JBOD chassis the last few months, but haven't found anything that tickled my budget the right way on that front. Failing that, upgrading pairs of 8 TB storage disks as they age with some 18-20 TB ones could free up a fair number of bays.
 