Host CPU and RAM usage slowly increase while all VMs/LXCs remain stable (only a host reboot resets it)

Mister J.

New Member
Oct 21, 2024
Hi everyone,

Host CPU usage gradually increases over uptime (≈18% → ≈30%), and overall host memory usage increases as well (≈14.5 GB → ≈18 GB), while all VMs and LXCs remain stable.

Only a full host reboot resets the behavior. VM/LXC restarts do not.

I’ve already investigated a number of common causes, so I’ll keep this as factual and concise as possible.

Environment

  • Proxmox VE 8.4.19 x86_64
  • Kernel: 6.8.12-25-pve
  • CPU: Intel i7-3770 (8 threads) @ 3.90 GHz
  • RAM: 32 GB
  • Storage: ZFS (ARC remains minimal and does not grow with uptime)
  • Workload: multiple dedicated game servers (UT2004, UT3, UT4, COD4, BF2142, Xonotic, rFactor, etc.)
  • Total VM usage: ~14.5 GB RAM, ~18% CPU

The actual problem

Immediately after a host reboot:

Host CPU: ~18%
Host RAM: ~14.5 GB

After ~24 hours of uptime:

Host CPU: ~30%
Host RAM: ~18 GB

Important: the workload inside the VMs does not change. There is no observable increase in:
  • guest CPU
  • guest RAM
  • guest load
  • QEMU RSS
  • interrupts
  • IO
  • slab usage
  • SUnreclaim
  • softirq load
  • network load
  • disk load
Only the overall host CPU and memory usage increase over uptime.
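
For anyone who wants to compare numbers, here is a minimal Python sketch of how the host memory figures could be snapshotted and compared between uptime cycles. It is not the tooling used for the observations above. It parses `/proc/meminfo` and reports how much used memory is not explained by the usual categories. The choice of categories in `ACCOUNTED_KEYS` and the helper names are assumptions made for illustration. On a ZFS host the ARC is expected to land in the remainder as well, so its size (from `/proc/spl/kstat/zfs/arcstats`) would need to be subtracted separately.

```python
import os

# Categories treated as "explained" memory (an assumption for this sketch).
# Shmem is already part of Cached; Slab already covers SReclaimable + SUnreclaim.
ACCOUNTED_KEYS = ("Buffers", "Cached", "AnonPages", "Slab", "KernelStack", "PageTables")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most rows)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def unaccounted_kb(info):
    """Used memory that none of ACCOUNTED_KEYS explains (rough estimate, in kB)."""
    accounted = sum(info.get(key, 0) for key in ACCOUNTED_KEYS)
    return info["MemTotal"] - info["MemFree"] - accounted


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        snapshot = parse_meminfo(handle.read())
    for key in ACCOUNTED_KEYS:
        print(f"{key:12s} {snapshot.get(key, 0) / 1048576:8.2f} GiB")
    print(f"{'unaccounted':12s} {unaccounted_kb(snapshot) / 1048576:8.2f} GiB")
```

Running this shortly after a reboot and again after a day or more, then comparing the two outputs, would show whether the extra ~3.5 GB sits in one of the named categories or in the unaccounted remainder.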

What has been observed across multiple uptime cycles

Overall host CPU and memory usage gradually increase over time, while guest and VM-related metrics stay flat:
  • guest CPU / RAM / load remain stable
  • QEMU process RSS remains stable
  • no sustained change in IO or IO wait
  • no visible change in interrupt rates
  • no slab or SUnreclaim growth trend
  • no obvious scheduler / NUMA imbalance changes
  • no ballooning or memory pressure signals
Only a full host reboot resets the CPU and memory drift.
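
To narrow down where the extra host CPU time is accounted, two snapshots of the aggregate `cpu` line in `/proc/stat` can be compared. The following is a minimal sketch with hypothetical helper names, not the method used for the observations above. It relies on the fact that the kernel already counts guest time inside user time, so a rising `system` share, or a growing gap between `user` and `guest`, would point at host-side overhead rather than guest work.

```python
# Field order of the aggregate "cpu" line in /proc/stat.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text):
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(CPU_FIELDS, (int(value) for value in parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Percentage of elapsed CPU time spent in each category between two snapshots."""
    delta = {key: after[key] - before[key] for key in before}
    # guest and guest_nice are already included in user and nice,
    # so they are left out of the total to avoid double counting.
    total = sum(value for key, value in delta.items()
                if key not in ("guest", "guest_nice"))
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {key: 100.0 * value / total for key, value in delta.items()}


if __name__ == "__main__":
    import os
    import time

    if os.path.exists("/proc/stat"):
        with open("/proc/stat") as handle:
            first = parse_cpu_line(handle.read())
        time.sleep(5)  # sampling window; an arbitrary choice for this sketch
        with open("/proc/stat") as handle:
            second = parse_cpu_line(handle.read())
        for name, share in cpu_shares(first, second).items():
            print(f"{name:10s} {share:6.2f} %")
```

Comparing the output from a freshly rebooted host with the output after the drift has set in would show whether the additional ~12% lands in `system`, `softirq`, or the non-guest part of `user`.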

What I’m looking for

I’m not looking for generic troubleshooting steps such as:
  • “run top/htop again”
  • “check interrupts”
  • “check IO”
  • “maybe it’s a runaway process”
  • “maybe it’s a leak”
Those areas have already been investigated repeatedly across multiple uptime cycles.

I’m looking for insights from people who have seen similar host-side CPU and memory drift where:
  • VMs and LXCs remain stable
  • no metrics visibly escalate
  • no runaway process is visible
  • only host CPU and memory usage rise over uptime
  • only a host reboot resets the behavior
  • the host runs a modern kernel (6.8.x)
  • the workloads are timer/tick-heavy
  • the host has a long uptime

Specific questions

  • Are there known regressions in 6.8.x related to:
    • scheduler
    • cpuidle / pstate
    • KVM halt polling
    • NO_HZ / tick handling
    • virtio
    • io_uring
  • Would it make sense to test:
    • an older PVE kernel branch (6.5 / 6.2)
    • a newer kernel (if available)
    • forcing CPU governor to performance
    • disabling io_uring for VM disks
  • Are there known cases where:
    • host CPU and overall memory usage rise over uptime
    • guest-side metrics remain stable
    • only a host reboot resolves the behavior
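
When testing any of the options above, it helps to record the relevant host settings next to each measurement. A minimal sketch, assuming the standard sysfs/procfs locations for these values (files that do not exist on a given host are reported as `None`); the labels in `TUNABLES` and the function name are made up for illustration:

```python
from pathlib import Path

# Label -> path relative to the filesystem root. The labels are hypothetical;
# the paths are the usual locations, but not every host exposes all of them.
TUNABLES = {
    "kernel_release": "proc/sys/kernel/osrelease",
    "kvm_halt_poll_ns": "sys/module/kvm/parameters/halt_poll_ns",
    "cpufreq_driver": "sys/devices/system/cpu/cpu0/cpufreq/scaling_driver",
    "cpufreq_governor": "sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "cpuidle_driver": "sys/devices/system/cpu/cpuidle/current_driver",
}


def read_tunables(root="/"):
    """Read each tunable below `root`; missing or unreadable files map to None."""
    result = {}
    for label, relative_path in TUNABLES.items():
        try:
            result[label] = (Path(root) / relative_path).read_text().strip()
        except OSError:
            result[label] = None
    return result


if __name__ == "__main__":
    for label, value in read_tunables().items():
        print(f"{label:18s} {value}")
```

The `root` parameter exists only so the function can be pointed at a copied directory tree. On the host itself the default is enough.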
Any input is appreciated. I mainly want to know whether this is a known pattern with the current kernel stack, or if I’m hitting an edge case.

Thanks in advance.

PS.
The situation a few hours after a reboot of the host:

1779614928402.png

1779614940437.png

After a few days:

1779615418862.png

1779615444274.png

Directly after a reboot, following a few days of uptime:

1779615311199.png

1779615338394.png
 