Host is a 4-core Xeon (no HT) with 64GB RAM, ZFS mirror for local VM storage (2x Intel SSD DC S3520 1.6TB), and an Intel 2-port X540 10Gbase-T NIC plus onboard 2x1G.
I run a TrueNAS VM with HBA passthrough for storing photos, and Nextcloud in its own VM (with storage directly in TrueNAS) to easily synchronize photos from my client PC.
Proxmox VE 8.1.3
Nextcloud in an Ubuntu 22 VM using a virtio network device.
TrueNAS Core in a VM using a virtio network device.
pfSense in a VM using virtio network devices.
My client PC is connected directly, via its 2.5G NIC, to one of the 10G ports on the server NIC, but the link currently negotiates at 1G (I previously applied a server-side change on the Proxmox host to allow 2.5G link speeds, but it appears to have been lost at some point).
Nextcloud and TrueNAS are on the same subnet with direct access to each other, and the storage Nextcloud writes uploads to is a network share, so files end up directly on the TrueNAS HDDs. Traffic between the client PC and the VM network flows through the pfSense router.
The system typically runs smoothly, with host CPU usage around 20-30%, a load average in the 1-1.5 range, and a low level of network traffic. Most of that baseline CPU usage comes from a Windows VM with 2 vCPUs that the Proxmox GUI shows idling at under 30% CPU load.
Now to the problem.
When I synchronize a large batch of photos with no speed limit set in the Nextcloud client, I reach roughly 25-30 MB/s on average, with spikes around 50 MB/s. According to the Proxmox GUI, the Nextcloud VM, currently allocated 3 vCPUs (I know this is high), hits 50-55% CPU usage; the TrueNAS VM with 2 vCPUs peaks around 30-35% (typically idle at 2-4%); and the pfSense VM with 2 vCPUs also hits roughly 30-35% (typically idle at 4-5%). The other VMs show no spikes in CPU usage during this time.
CPU usage of the Proxmox host approaches 100%, and more notably the load average creeps up into the 50s!
Sometimes, when a large batch syncs unthrottled for many minutes, the Proxmox server quickly becomes unstable and unresponsive. It has even suffered automatic restarts, which I suspect are watchdog-related if the behavior matches what I see during partial service interruptions, but I don't have observations from every occurrence, and the logging is often incomplete or stops before the restart.
I truly don't understand how the load average can explode to over 50 like this.
Lowering the Nextcloud VM's "cpuunits" from the default (100) to 25 seemed to help in the brief testing I could do today, though the test window may have been too short to be conclusive. At the next reboot of the Nextcloud VM I will also reduce its vCPUs from 3 to 2; perhaps that will at least help a bit.
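For completeness, this is a minimal sketch of how I apply those two changes from the Proxmox host, using the standard `qm` CLI; the VM ID 105 is a placeholder, not my actual Nextcloud VM ID:

```python
#!/usr/bin/env python3
# Sketch: apply the scheduling changes to the Nextcloud VM via the `qm` CLI.
# Run as root on the Proxmox host. The VM ID below is a placeholder.
import subprocess

NEXTCLOUD_VMID = "105"  # placeholder; substitute the actual VM ID

# Lower the CPU scheduling weight so this guest yields to the other guests
# under contention (Proxmox default with cgroup v2 is 100).
subprocess.run(["qm", "set", NEXTCLOUD_VMID, "--cpuunits", "25"], check=True)

# Drop from 3 vCPUs to 2; the change takes effect after the VM restarts.
subprocess.run(["qm", "set", NEXTCLOUD_VMID, "--cores", "2"], check=True)
```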
Any ideas where this extreme load average comes from, or what I should monitor if I try to reproduce this in a controlled way?
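In case it helps to be concrete, this is the kind of sampling I plan to run on the host during the next controlled test: a minimal Python sketch (my own guess at what is worth recording, not an established tool) that logs the 1-minute load average, the number of processes in runnable (R) and uninterruptible-sleep (D) state, since both count toward the Linux load average, and the CPU/I-O pressure-stall (PSI) averages if the kernel exposes /proc/pressure:

```python
#!/usr/bin/env python3
"""Minimal load-source sampler (a sketch, not a finished tool).

Every SAMPLE_INTERVAL seconds it logs, as one CSV row:
  - the 1-minute load average from /proc/loadavg
  - the number of processes in runnable (R) and uninterruptible-sleep (D)
    state, since both count toward the Linux load average
  - the "some avg10" CPU and I/O pressure values from /proc/pressure (PSI),
    if the kernel exposes them
"""
import glob
import time

SAMPLE_INTERVAL = 5  # seconds between samples; adjust as needed


def count_task_states():
    """Count processes currently in R (runnable) and D (uninterruptible) state."""
    running = blocked = 0
    for stat_path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat_path) as f:
                # /proc/<pid>/stat is "pid (comm) state ..."; split after the
                # closing parenthesis so command names with spaces parse cleanly
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # the process exited between listing and reading
        if state == "R":
            running += 1
        elif state == "D":
            blocked += 1
    return running, blocked


def read_psi_avg10(resource):
    """Return the 'some avg10' PSI value for 'cpu' or 'io', or None if unavailable."""
    try:
        with open(f"/proc/pressure/{resource}") as f:
            first_line = f.readline()  # e.g. "some avg10=1.23 avg60=0.80 ..."
        return float(first_line.split("avg10=")[1].split()[0])
    except (OSError, IndexError, ValueError):
        return None


def main():
    print("time,load1,running,blocked,cpu_psi_avg10,io_psi_avg10")
    while True:
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]
        running, blocked = count_task_states()
        row = (time.strftime("%H:%M:%S"), load1, running, blocked,
               read_psi_avg10("cpu"), read_psi_avg10("io"))
        print(",".join(str(v) for v in row), flush=True)
        time.sleep(SAMPLE_INTERVAL)


if __name__ == "__main__":
    main()
```

My thinking is that comparing the D-state count against the load average during a sync versus at idle should show whether the spike is runnable tasks competing for the 4 cores or tasks blocked on I/O.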
Could there be some networking/NIC-related settings that I need to change, either on the host or inside the VMs?