High CPU

Feb 17, 2025
Hello

I have an odd issue that started after my server crashed. My current setup is as follows:

256-core EPYC Milan
512 GB memory

I have this split into three VMs (Windows Server guests), and it's running the latest Proxmox version.

My server crashed last week. Not sure what happened, but I had to force shut it down. Once it came back online, one of my VMs runs terribly now. CPU immediately goes to 50-60 percent and stays there every time I start it. When I remote in, it feels very slow. Any tips or tricks to see what could be causing this poor performance from one of my VMs after the crash?
 
Proxmox is mostly Debian, so maybe look under /var/log.

These are just wild guesses, but mine would be that your VM has a memory leak, because of that Proxmox used up all its RAM, the ZFS ARC plus a too-low swappiness setting made it worse, and/or since there is no swap file on ZFS the system crashed.
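If you want to dig into what happened around the crash, something like this might help (just a sketch, assuming ZFS is in use and persistent journaling is enabled; adjust to your setup):

    # Errors from the boot before the crash (needs a persistent journal)
    journalctl -b -1 -p err

    # ZFS ARC size vs. its configured maximum
    awk '/^(size|c_max) / {printf "%s %.0f MiB\n", $1, $3/1024/1024}' /proc/spl/kstat/zfs/arcstats

    # Is there any swap at all, and how much RAM is actually free?
    swapon --show
    free -h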
 
Thanks. I tried setting up a new VM this morning and hit a similar issue. Stuck at 100 percent once it loads the setup files. [screenshot: 1000006407.jpg]
 

Attachment: 1000006406.jpg
Looks like it is related to the Windows Server 2025 ISO. 2019 works fine. Has anyone successfully loaded Server 2025 on a Proxmox VM? Not sure why it would be any different from an older server image.
 
Do you have qemu-guest-tools enabled in the VM config, installed and running in the guest OS?
Do you have the balloon driver installed and running in the guest OS?

The former is needed for QEMU to give accurate memory usage information to PVE with a Windows OS, and the latter for QEMU to be able to reclaim, for the host, memory pages no longer in use by the guest (if needed).
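A quick way to check both from the PVE host is something like this (only a sketch; 100 is a placeholder VM ID). Inside the guest, the "QEMU Guest Agent" service and the balloon service should show as running.

    # Is the agent enabled in the VM config, and how is memory/ballooning set?
    qm config 100 | grep -E 'agent|balloon|memory'

    # If the agent is installed and running in Windows, this returns without error
    qm agent 100 ping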
 
Hi. I do have the tools running (I can see the IPv4 and IPv6 of the guest VM via the Proxmox dashboard). I don't recall installing the balloon driver, but I disabled memory ballooning to see if that was what was causing it, and it didn't make a difference.
 
Stuck at 100 percent once it loads the setup files.
Stuck at what? Loading a Windows setup? RAM usage? CPU usage?

So just to get this right: you created a new bare VM, followed the Windows Server 2025 best practices, and tried to boot into the Windows Server 2025 installation?
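For reference, a bare Server 2025 test VM roughly along the lines of those best practices could be created from the CLI like this (only a sketch, not a recommendation: the VM ID 200, the local-zfs storage, the ISO filenames and the sizes are placeholders for your own values):

    qm create 200 --name ws2025-test \
      --ostype win11 --machine q35 --bios ovmf \
      --efidisk0 local-zfs:1,efitype=4m,pre-enrolled-keys=1 \
      --tpmstate0 local-zfs:1,version=v2.0 \
      --scsihw virtio-scsi-single --scsi0 local-zfs:80,iothread=1 \
      --cores 8 --sockets 1 --memory 16384 \
      --net0 virtio,bridge=vmbr0 \
      --cdrom local:iso/windows-server-2025.iso \
      --ide3 local:iso/virtio-win.iso,media=cdrom \
      --agent enabled=1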
 
I set up a fresh VM and went through the initial installation; after the reboot the CPU usage goes crazy:

[screenshot: 1739900291174.png]

Stuck on this:

[screenshot: 1739900310113.png]

Using these settings:

[screenshot: 1739900362062.png]

Thanks!
 

Attachment: 1739900352830.png
The screenshot you posted seems to indicate that the balloon was off and/or the balloon service wasn't running correctly. That may not be related to the issue; the high memory usage shown just caught my attention.

Using these settings:
Change CPU type to host and try again. Also try with CPU type "x86-64-v2-AES".

Why such a high CPU count for a testing machine?
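If it is easier, the CPU type can also be changed from the CLI, and the balloon state can be checked via the QEMU monitor (only a sketch; 100 is a placeholder VM ID):

    # Change the CPU type (needs a full stop/start of the VM, not just a reboot)
    qm set 100 --cpu host
    # or: qm set 100 --cpu x86-64-v2-AES

    # Open the QEMU monitor, then type "info balloon" to see what the driver reports
    qm monitor 100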
 
Thank you. I am starting to notice some odd problems here. I did change the CPU type per your recommendations - thanks for that. I still had the same issue, BUT once I bumped the cores down to 8, it worked. It seems like it doesn't like the high core count; as soon as I add more cores, it freezes up with high CPU. Quite an odd issue. With 8 cores it works fine. What do you think would cause that odd issue?
 
What do you think would cause that odd issue?
I would (rough example commands after the links below):
  • Try ticking NUMA in the VM CPU settings.
  • Install updated microcode [1].
  • I don't see any mention of your PVE version, running kernel, etc., so another option could be to use an updated kernel or even a previous one (maybe after the crash it booted with a different kernel? Check /var/log/apt/history.log* for when packages were installed/upgraded).
  • Check the NPS setting in the BIOS (NUMA nodes per socket) and try different settings [2].
  • Triple-check that the storage is OK at the low level, at the PVE host level, and inside the VMs.

[1] https://pve.proxmox.com/wiki/Firmware_Updates
[2] https://infohub.delltechnologies.com/fr-fr/p/numa-configuration-settings-on-amd-epyc-2nd-generation/
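The host-side parts of the above translate roughly to the following (only a sketch; 100 is a placeholder VM ID and /dev/nvme0n1 a placeholder device, adjust to your setup):

    # NUMA flag for the VM (then do a full stop/start)
    qm set 100 --numa 1

    # Microcode; on PVE 8 / Debian 12 this needs the non-free-firmware repo [1]
    apt update && apt install amd64-microcode

    # Currently running kernel and the kernels installed on the host
    uname -r
    dpkg -l | grep -E 'proxmox-kernel|pve-kernel'

    # What apt installed/upgraded around the time of the crash
    zgrep -hE 'Start-Date|Install:|Upgrade:' /var/log/apt/history.log*

    # NUMA layout the host actually exposes (reflects the NPS BIOS setting)
    lscpu | grep -i numa

    # Low-level storage checks
    smartctl -a /dev/nvme0n1    # repeat per disk
    zpool status -v             # if the VM disks are on ZFS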
 
Thank you very much for your suggestions.

- Tried enabling NUMA, rebooted the VM, no change
- Installed the latest AMD microcode, rebooted the server, no change
- Running Proxmox version 8.3.3 and kernel version 6.8.12-8-pve

You mentioned checking the storage at a low level. All devices show "Passed" for S.M.A.R.T. under Disks.

Another odd thing is that I have two other VMs on this cluster which are working fine.

[screenshots: 1739925794159.png, 1739925805868.png]
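Since SMART can show "Passed" while the pool or the I/O path still struggles, a couple of extra host-level checks might be worth running as well (only a sketch; the path and grep pattern are examples, adjust to your storage layout):

    # ZFS pool health, scrub/resilver state and per-device error counters
    zpool status -v

    # Rough fsync/seek performance of the storage backing the VMs
    pveperf /var/lib/vz

    # Kernel-level I/O errors since boot
    dmesg -T | grep -iE 'i/o error|nvme|ata.*error'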