I had originally created this topic: https://forum.proxmox.com/threads/proxmox-server-webgui-and-ssh-no-longer-accessible.156006/, as I had corrupted my single-disk PVE deployment.
I now have a BTRFS RAID1 across two different SATA SSDs for my current PVE installation. I remounted my ZFS pool to it, and moved my Proxmox backup USB HDD to a dedicated mini PC running PBS (it was a CT on my previous PVE deployment). I am able to restore VMs, but the original issue that caused me to corrupt my PVE deployment in the first place is back in full force, and I'm hoping to diagnose it successfully.
First, I realize I'm running very old hardware. Everything with PVE worked GREAT up until about a month ago. I originally repurposed this build from a Hyper-V homelab core server to a PVE server, added a GPU for passthrough to a Plex CT for accelerated transcoding, and was passing through the CPU's integrated GPU to a Win 10 VM for UI acceleration, to run my applications (mostly NinjaTrader and other trading apps) effectively over RDP via VPN sessions.
System Specs:
CPU - i7 4770 (4C, 8T, non-k) + Thermalright aftermarket cooler (keeps it below 65C).
MB - Supermicro X10SLQ (OOBM is Intel vPro; mostly MeshCommander, aside from SSH directly to PVE).
RAM - Mushkin 32GB DDR3 (4x8GB DDR3 1600 @ 1600 CL11)
GPU - Asus Phoenix GTX 1660 Super OC 6GB in a PCIe x4 slot (NVENC encoding for Plex)
Storage:
- SSD 1 - SATA, 250GB
- SSD 2 - SATA, 240GB
- BTRFS RAID1 at 240GB for PVE OS deployment and ISO storage.
HBA: Inspur LSI 9300-8i in IT mode
- 6 x 3TB HGST HDDs in ZFS RAID10 (striped mirrors), connected through the TS430 backplanes.
I keep all of this in a modified Lenovo TS430 case with 2x4 3.5 backplanes.
I don't expect high performance, but whatever happened about a month ago is now causing the high IO Delay on the CPU Usage graph, which matches the IOWait I see in Netdata and the PSI values (io some / io full) in atop. I suspect it has more to do with my ZFS pool than anything else, and this is where I'm hoping for guidance. I've done some research and installed atop, for example. I ran a fresh scrub with no VMs running: no errors. All drives' SMART values are OK.
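For reference, the scrub and SMART checks were along these lines (the pool name `tank` and the device names are assumptions; substitute your own):

```shell
# Scrub the pool and review the results (pool name "tank" is an assumption):
#   zpool scrub tank
#   zpool status -v tank
#   smartctl -H /dev/sda      # quick SMART health verdict, repeat per disk
#
# A scripted check for the healthy case -- here parsed from a sample
# "zpool status" line rather than a live pool:
status_line='errors: No known data errors'
case "$status_line" in
  *'No known data errors'*) echo OK ;;
  *) echo 'CHECK POOL' ;;
esac
```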
It seems like the server grinds to a halt in stages. The VMs stop responding first, then after a while PVE itself does the same, ultimately requiring my out-of-band restart process via MeshCommander or the vPro WebGUI. I'm hoping to get a little more mileage out of this hardware until I can afford to upgrade to something truly better. At this point, I can't even restore my full environment without running into issues. What was surprisingly smooth and fast a few months ago is now only fast if I run nothing.
Even running just my UniFi CT (101) yesterday, after a while IOWait started climbing and climbing, until the same result: gotta reboot the damn hardware.
I've run memtest, CPU tests, etc., and those all come back clean, no errors. Maybe ZFS is too much for this old hardware with recent PVE updates? Maybe there's a bug kicking my ass and I'm just too ignorant at the moment to clearly identify it?
Please help?
Here's some screenshots:
Proxmox dashboard, showing IO Delay staying high. This is after a restore of a 56GB Plex LXC suddenly stalled partway through and I ultimately stopped it. The same thing happened yesterday, and I let it sit for over 2 hours with no progress.
Here's the WebGUI ZFS storage results:
Here's zpool iostat and zpool status results from SSH:
Here's iotop, mostly showing nothing happening:
Here's what atop looks like when IO Wait is climbing; under PSI, io some and io full continually stay in the mid-to-upper 90% range.
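The PSI numbers atop displays come straight from `/proc/pressure/io`, so they can be watched without atop too. A minimal sketch, parsing a sample PSI line rather than the live file (the sample values are made up):

```shell
# On the live host:
#   cat /proc/pressure/io          # "some" and "full" lines with avg10/avg60/avg300
#   zpool iostat -v tank 2         # per-vdev latency/throughput (pool name "tank" assumed)
#
# Extract avg10 from a PSI-formatted line; the same awk works on
# /proc/pressure/io output. avg10 pinned in the 90s = I/O stalls.
psi_line='some avg10=95.21 avg60=93.10 avg300=88.47 total=123456789'
echo "$psi_line" | awk '{ sub(/avg10=/, "", $2); print $2 }'
```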