[SOLVED] pmxcfs Taking 100% CPU/RAM Utilization - Host CPU Soft Lockup

PseudoResonance · New Member · Jul 3, 2021
I have just migrated from ESXi, where my server and VMs were running for months on end with zero issues. However, since migrating, my host has been extremely unstable, and will hang and lock up after a few hours. I have a large VM with 34GB of RAM and all 24 threads, plus 2 smaller ones each with 1GB. My host server is running Proxmox 6.4-9 and has 48GB of RAM. This should be a total of 36GB of RAM used by VMs alone, leaving 12GB of RAM for Proxmox and whatever it wants to do, yet it always seems to be running out of both RAM and CPU.

I can't find much info about it, but top says pmxcfs is eating up 500% CPU across my 24 threads, and the Proxmox web GUI reports a consistent 50% CPU utilization even when my VMs are shut off. In addition, I notice that over a few hours, my RAM utilization slowly rises from 30GB up to 48GB, then the entire swap fills up and Proxmox grinds to a halt. The Proxmox web GUI also breaks eventually, with all VMs and storage nodes showing as offline and performance graphs ending (attached image). Shutting down/restarting VMs does nothing, and I have no idea what is using the RAM, as it's not listed in top. A few hours ago I was having horrible performance issues in my VMs, so I disabled swap on the host, which greatly improved performance. That shouldn't be a problem, since I've left 12GB of RAM for the host, yet Proxmox still manages to eat up all 12GB and 100% CPU; then I get some kernel soft lockup messages before everything completely freezes and I'm forced to reset the server.
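(For anyone debugging a similar symptom: memory that doesn't show up against any single process in top is often kernel-side. A rough way to check, assuming only standard tools are installed, could look like this:)
Code:
# Top memory consumers by resident size
ps aux --sort=-%mem | head -n 15
# Kernel-side memory: slab caches, shared memory, page tables
grep -E 'Slab|SReclaimable|SUnreclaim|Shmem|PageTables' /proc/meminfo
# Largest slab caches by size
slabtop -o -s c | head -n 20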

I feel like I'm going insane. I've spent hours staring at this and researching, but no matter what I try, Proxmox eventually uses 100% of my CPU and 100% of my RAM/swap until I have to reset the system. In the 10 minutes it took me to write this, I've watched my RAM usage rise from 40GB to 46GB, and it's still climbing ever closer to 48GB... My CPU utilization has also risen to 70% even though the VMs are currently idle. Does anybody have any idea what could be wrong? I could try reinstalling Proxmox, I guess, but this install is brand new already, so I don't know how that would help. I apologize for this post ending up as kind of a rant; I'm not sure how best to present the information, but I can try to provide more if necessary. Thank you for your time.

Some other info I thought might be important: Proxmox is running on a Dell R710 with a Perc H700 that has 3 SATA SSDs as single-drive RAID 0 arrays, one for Proxmox, another for VMs, and a 3rd unused drive. I'm using ext4 everywhere and not ZFS because of concerns with running that on the RAID card. I have no dedicated GPU. I do have 2 unused QLogic QLE2560s in the server though. I can't find a single log about the crashes or a possible cause anywhere either.

EDIT: I forgot to add: on ESXi, CPU utilization was usually around 10% on the host, and RAM usage was also very consistent at around 46GB. I realize RAM usage isn't all that comparable between hypervisors, but CPU usage definitely is, and it shouldn't be jumping from 10% up to 50% or more. I even shut down my experimental VM to try and help out Proxmox, reducing RAM consumption by another 8GB, but it's still performing worse.

Here is the output from "free -m" shortly before I posted this:
Code:
              total        used        free      shared  buff/cache   available
Mem:          48339       46985         216         528        1137         318
Swap:          8191         247        7944
 

Attachment: vm.png (103.7 KB)
I'm not entirely sure, but it seems like it's a specific Java 8 modded Minecraft server that breaks everything. I had it off for a while without issues, but within 2 hours of starting it up, despite the VM sitting at 15% CPU and 75% RAM utilization, the Proxmox host suddenly started misbehaving again. The graphs died, pmxcfs started taking up tons of CPU, and all the RAM and swap got eaten up. I have no idea how it could cause that.
 
Try setting lower limits on the memory and CPU of the VM(s). If you are using ZFS, and a VM does a lot of I/O, the ZFS memory cache (ARC) also grows; try lowering its memory limit.
In general: try to make sure that all maximum usage still leaves some memory and CPU for the Proxmox host. If your disks are limited in IOPS, heavy disk I/O can interfere with the Proxmox host and make its web GUI unusable. This can be managed by using separate drives for Proxmox and the VMs.
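A minimal sketch of what that could look like from the shell (the VM ID and the 4 GiB ARC cap are placeholder values, not taken from this thread):
Code:
# Cap an existing VM at 4 cores and 8 GiB of RAM (VM ID 100 is a placeholder)
qm set 100 --cores 4 --memory 8192
# If ZFS is in use: limit the ARC to 4 GiB (4294967296 bytes), then rebuild the initramfs and reboot
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf
update-initramfs -u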
 
I'm using ext4, so ZFS shouldn't be caching anything, and I checked to make sure it wasn't. My VMs are running off a separate SSD from Proxmox's SSD too, so they shouldn't interfere too much, and overall my VMs have very little disk usage anyway. Proxmox should also have 12GB of RAM available to it. I have no idea why it's using it all up, or what is using it.
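(For anyone else wanting to rule ZFS out the same way, a quick check could look like this; on an ext4-only install nothing should turn up:)
Code:
# No ZFS filesystems should be mounted and the module should not be loaded
findmnt -t zfs
lsmod | grep zfs
zpool list 2>/dev/null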
 
I just tried moving the problematic application to its own LXC container with only 6 cores and 8GB of RAM available to it. It still managed to break Proxmox somehow. RAM utilization on the host just slowly rose until all RAM and swap were completely used up. At around 80% RAM utilization, the web GUI broke, and the pmxcfs processes started rising in CPU usage, up to ~50%. By the time it crashed, pmxcfs was at nearly 100% CPU utilization across all 24 threads... I don't get it.
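(For reference, limits like these can be applied from the CLI roughly as follows; the container ID 101 is a placeholder:)
Code:
# Restrict the container to 6 cores and 8 GiB of RAM (CT ID 101 is a placeholder)
pct set 101 --cores 6 --memory 8192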

Edit: A clean install of Proxmox made no difference either.
 
I've been watching it very closely as it crashes, and IO delay never goes above ~0.5% at most. pmxcfs normally only runs for short intervals, using 0% CPU and 0.1% RAM. I can't prove it, but I'm fairly certain this is a Proxmox issue and has nothing to do with my VMs or hardware, especially given that it was never a problem on ESXi. Does anybody have any idea why this might be? Or should I try updating to a beta or something? I would greatly appreciate it if anybody has some insight as to what is going on. I just can't see any reason for Proxmox itself to use well over a quarter of my system's RAM and 100% CPU.

Edit: I just watched the web GUI break and checked the logs again. I found that /var/log/daemon.log is getting hundreds of error messages per second about too many open files.
Code:
Jul  4 21:49:02 pve1 pvestatd[19434]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19434]: ipcc_send_rec[2] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[19608]: ipcc_send_rec[3] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19434]: ipcc_send_rec[3] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[19407]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[19407]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19407]: status update time (10.897 seconds)
Jul  4 21:49:02 pve1 pvestatd[19407]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[22012]: ipcc_send_rec[2] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23201]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[23201]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23201]: status update time (10.897 seconds)
Jul  4 21:49:02 pve1 pvestatd[23201]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23201]: ipcc_send_rec[2] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[19166]: ipcc_send_rec[2] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23201]: ipcc_send_rec[3] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[19055]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[19055]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19055]: status update time (10.899 seconds)
Jul  4 21:49:02 pve1 pvestatd[19055]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[24362]: ipcc_send_rec[3] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19055]: ipcc_send_rec[2] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[22545]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[22545]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[22545]: status update time (10.896 seconds)
Jul  4 21:49:02 pve1 pvestatd[22545]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[22545]: ipcc_send_rec[2] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[22545]: ipcc_send_rec[3] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[23328]: ipcc_send_rec[3] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19437]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[19437]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[19437]: status update time (10.901 seconds)
Jul  4 21:49:02 pve1 pvestatd[19437]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23329]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[23329]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23329]: status update time (10.897 seconds)
Jul  4 21:49:02 pve1 pvestatd[19936]: ipcc_send_rec[2] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23329]: ipcc_send_rec[1] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23329]: ipcc_send_rec[2] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[23329]: ipcc_send_rec[3] failed: Resource temporarily unavailable
Jul  4 21:49:02 pve1 pvestatd[21840]: ipcc_send_rec[3] failed: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23168]: can't lock file '/var/log/pve/tasks/.active.lock' - got timeout
Jul  4 21:49:02 pve1 pvestatd[23168]: status update error: Too many open files
Jul  4 21:49:02 pve1 pvestatd[23168]: status update time (10.904 seconds)
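(If anyone else hits the same errors, one way to confirm which process is exhausting file descriptors, assuming a standard install, is to compare the open-descriptor count of pmxcfs against its limit:)
Code:
# Number of file descriptors currently open by pmxcfs
ls /proc/$(pidof pmxcfs)/fd | wc -l
# The per-process limit it is running up against
grep "open files" /proc/$(pidof pmxcfs)/limits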
 
I think I have finally solved the issue. I had a Graphite metrics server configured, and for some reason Proxmox wouldn't push data to it. It never logged a clear error about the failed push; instead, it caused hundreds of error messages per second to be written to the logs, eventually crashing my server.

Since removing it and restarting my server, it's been running for nearly 24 hours without any signs of crashing.
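(For anyone else who needs to check for the same thing: external metric servers live in /etc/pve/status.cfg, and a Graphite entry there looks roughly like the sketch below. The exact syntax varies slightly between Proxmox versions, and the name and address are placeholders; removing or commenting out the section disables the push.)
Code:
# /etc/pve/status.cfg - example Graphite entry (placeholder values)
graphite: my-graphite
    server 192.0.2.10
    port 2003
    path proxmox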
 
In the 10 minutes it took me to write this, I've watched my RAM usage rise from 40GB to 46GB, and it's still climbing ever closer to 48GB... My CPU utilization has also risen to 70% even though the VMs are currently idle. Does anybody have any idea what could be wrong?

Was it (as per your title) pmxcfs hogging the RAM and CPU?

I had a Graphite metrics server configured, and for some reason Proxmox wouldn't push data to it. It never logged a clear error about the failed push; instead, it caused hundreds of error messages per second to be written to the logs, eventually crashing my server.

Where was it writing to?
 
