[SOLVED] /proc/stat reports far too much idle time

JNG

Oct 22, 2019
Hello all,

I am struggling with a peculiar issue where I might not understand all the mechanisms: I have a Proxmox server (5.4-5) running only containers. In some of my containers, the idle time in /proc/stat is sky high (up to 15281440583165230629). However, it does not affect all of my containers. Here is an extract of /proc/stat from the host and all its containers (idle time is the fourth number):
Code:
PROXMOX: cpu  249831240 739348 112704491 41524746240 36109750 0 10716115 0 0 0
Lxc 101:cpu  1087462 0 0 7237883161620866592 0 0 0 0 0 0
Lxc 202:cpu  6710018 0 0 10498331248841811323 0 0 0 0 0 0
Lxc 203:cpu  2823807 0 0 2618472444 0 0 0 0 0 0
Lxc 205:cpu  13133112 0 0 5230079033 0 0 0 0 0 0
Lxc 206:cpu  114943641 0 0 3760916293231833407 0 0 0 0 0 0
Lxc 214:cpu  5601276 0 0 5236861199 0 0 0 0 0 0
Lxc 215:cpu  3188382 0 0 753066715665654677 0 0 0 0 0 0
Lxc 216:cpu  7641882 0 0 5234613545 0 0 0 0 0 0
Lxc 218:cpu  7266864 0 0 9277556796665397440 0 0 0 0 0 0
Lxc 221:cpu  729038 0 0 5242362687 0 0 0 0 0 0
Lxc 222:cpu  7587024 0 0 15281440583165230629 0 0 0 0 0 0
Lxc 223:cpu  8118629 0 0 5234631398 0 0 0 0 0 0
Lxc 224:cpu  11540087 0 0 5229515948 0 0 0 0 0 0
Lxc 225:cpu  1513987 0 0 5240446385 0 0 0 0 0 0
Error: container '226' not running!
Lxc 229:cpu  4843023 0 0 5230115101 0 0 0 0 0 0
Lxc 326:cpu  44216243 0 0 5190742122 0 0 0 0 0 0

Restarting a container does not completely reset this counter (it only halves it).

While I understand the cpu counter line on a bare-metal machine, is there a mechanism in LXC or another Proxmox component which fakes, obfuscates or modifies the idle counter?

Normally, I would not care about idle, but my containers run Cassandra, which throws a NumberFormatException because it tries to parse this value into a long. Since the value is too big, Cassandra crashes on boot. It is arguably a bug on Cassandra's side, but this idle value is also strange in itself.
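For reference, here is why that parse fails (a minimal sketch, not Cassandra's actual code): Long.MAX_VALUE is 9223372036854775807, so any /proc/stat field above that makes Long.parseLong throw, even though the value still fits in an unsigned 64-bit counter:

```java
public class IdleParse {
    // Returns whether a /proc/stat field fits in a signed 64-bit long,
    // i.e. whether code like Cassandra's parser would accept it.
    static boolean fitsInSignedLong(String field) {
        try {
            Long.parseLong(field); // throws NumberFormatException above Long.MAX_VALUE
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The idle value reported for container 222 above:
        String idle = "15281440583165230629";
        System.out.println(fitsInSignedLong(idle));            // false
        // parseUnsignedLong accepts it, since it is below 2^64 - 1;
        // the result has the top bit set, so it prints as negative when signed:
        System.out.println(Long.parseUnsignedLong(idle) < 0);  // true
    }
}
```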

Thank you in advance for any leads you may have; I am running short of ideas :)


EDIT: after some additional research, it turns out /proc in a Proxmox LXC container is provided by lxcfs. I am continuing my research in this direction and will keep you informed.
 
hi,

i couldn't reproduce this, can you tell us more about your setup? (number of cts, what's running on them, uptime of host, update frequency etc.)

EDIT: after some additional research, it turns out /proc in a Proxmox LXC container is provided by lxcfs. I am continuing my research in this direction and will keep you informed.

did you find anything else?
 
I have around 10 CTs on this cluster, though I am not sure that is related. They run Java applications, together with Apache, MySQL and a Cassandra. However, it does not seem linked to the container contents, since an "empty" container (without applications) develops the same issue. More surprisingly, a container can start with a reasonable idle value and then suddenly (I have monitored this) jump to an exceptionally large one.
A server reboot does not fix the issue.

We have other Proxmox clusters running, but this is the only one with the issue. I don't know where it comes from, but maybe it is linked to specific hardware. This is our only cluster with AMD Epyc CPUs, although I don't see why that processor would not handle LXC.

From what we have observed, this issue seems deeply embedded in lxcfs. It is actually quite hard for lxcfs to emulate an idle value, since cgroups do not expose one by themselves. I think the lxcfs developers mainly try to keep the value relatively coherent (providing an absolute value is impossible). However, we are on lxcfs 3.0.3, and a major rework has landed since (https://github.com/lxc/lxcfs/commit/8be92dd19ceecb0ab54f36cc69d5f3845ba29844) for lxcfs 3.1.2. Given this complexity, I don't know whether my issue comes from the older version of lxcfs.
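To illustrate why emulating idle is hard (a simplified model, not lxcfs's real code): cgroups expose per-container CPU usage but no idle counter, so idle has to be inferred from elapsed wall-clock time. If the two samples are not taken consistently, the inferred value can go negative, and negative treated as unsigned 64-bit is exactly the kind of huge number seen above:

```java
public class DerivedIdle {
    // Hypothetical derivation: idle = elapsed time across all CPUs minus
    // recorded usage. This is an illustration of the difficulty only.
    static long derivedIdleTicks(long elapsedTicks, int nCpus, long usageTicks) {
        return elapsedTicks * nCpus - usageTicks;
    }

    public static void main(String[] args) {
        // Consistent samples: 1000 ticks elapsed on 4 CPUs, 600 ticks used.
        System.out.println(derivedIdleTicks(1000, 4, 600));   // 3400
        // If usage is sampled slightly later than the clock, usage can
        // exceed elapsed * nCpus, and the "idle" result goes negative --
        // which prints as a huge number when treated as unsigned:
        long idle = derivedIdleTicks(1000, 4, 4005);
        System.out.println(Long.toUnsignedString(idle));      // 18446744073709551611
    }
}
```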

Those are my two cents. I am still trying to figure out where this issue comes from, but I am kind of lost; and since I might be the only one in this situation, maybe it is something else I have overlooked.
 
I have around 10 CTs on this cluster, though I am not sure that is related. They run Java applications, together with Apache, MySQL and a Cassandra. However, it does not seem linked to the container contents, since an "empty" container (without applications) develops the same issue. More surprisingly, a container can start with a reasonable idle value and then suddenly (I have monitored this) jump to an exceptionally large one.
A server reboot does not fix the issue.

We have the exact same issue: roughly half of our containers show an idle time on the order of magnitude of an unsigned 64-bit long (we have even seen one equal to 0xFFFFFFFFFFFFFFFF). So far it does not seem related to the actual load, the services running inside, the physical machine, or the physical CPUs attached to the containers.
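That 0xFFFFFFFFFFFFFFFF value is telling: it is exactly (uint64_t)-1, which is what you get when a difference of two counters comes out one below zero in unsigned 64-bit arithmetic. A tiny demonstration (Java longs wrap exactly like uint64_t):

```java
public class IdleWrap {
    // If a counter is computed as (a - b) in unsigned 64-bit arithmetic
    // and b exceeds a, the result wraps around 2^64.
    static String wrappedDiff(long a, long b) {
        return Long.toUnsignedString(a - b);
    }

    public static void main(String[] args) {
        // Off by one: wraps to 2^64 - 1, i.e. 0xFFFFFFFFFFFFFFFF.
        System.out.println(wrappedDiff(0L, 1L)); // 18446744073709551615
    }
}
```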

We have other Proxmox clusters running, but this is the only one with the issue. I don't know where it comes from, but maybe it is linked to specific hardware. This is our only cluster with AMD Epyc CPUs, although I don't see why that processor would not handle LXC.

Our setup is running on Intel processors and all our machines are similarly affected.

From what we have observed, this issue seems deeply embedded in lxcfs. It is actually quite hard for lxcfs to emulate an idle value, since cgroups do not expose one by themselves. I think the lxcfs developers mainly try to keep the value relatively coherent (providing an absolute value is impossible). However, we are on lxcfs 3.0.3, and a major rework has landed since (https://github.com/lxc/lxcfs/commit/8be92dd19ceecb0ab54f36cc69d5f3845ba29844) for lxcfs 3.1.2. Given this complexity, I don't know whether my issue comes from the older version of lxcfs.

Actually, if you have lxcfs 3.0.3-pve1, it is not built from the 3.0.x branch but from an undefined snapshot of the master branch from around February 2019 (http://download.proxmox.com/debian/...ption/binary-amd64/lxcfs_3.0.3-pve1.changelog). This already includes the commit you are referring to, plus a later one making things even more complex (https://github.com/lxc/lxcfs/commit/056adcefe4c6b345e91b1e8c38be4f02db970d3b). I suspect the function cpuview_proc_stat has a corner case/race condition leading to what we observe.

Interestingly, we discovered the issue because a constant idle value leads some monitoring tools to report 100% CPU usage, htop for example: https://github.com/hishamhm/htop/issues/936
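For context, htop-style tools compute usage from the deltas of the busy and idle counters between two samples; if the idle counter never moves, the idle delta is zero and any activity at all reads as 100%. A simplified sketch of that calculation (not htop's actual code):

```java
public class CpuUsage {
    // htop-style usage: busy time as a fraction of total time
    // elapsed between two /proc/stat samples.
    static double usagePercent(long busy1, long idle1, long busy2, long idle2) {
        long busyDelta = busy2 - busy1;
        long idleDelta = idle2 - idle1;
        return 100.0 * busyDelta / (busyDelta + idleDelta);
    }

    public static void main(String[] args) {
        // Normal case: 10 ticks busy, 90 ticks idle between samples -> 10%.
        System.out.println(usagePercent(100, 1000, 110, 1090)); // 10.0
        // Frozen idle counter: idleDelta == 0, so any activity reads as 100%.
        System.out.println(usagePercent(100, 5000, 110, 5000)); // 100.0
    }
}
```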
 
Hi all!
We manually compiled and deployed version 4.0 of lxcfs on a test server running Proxmox 6 (pve-manager/6.1-5/9bf06119), and so far (after running for a week with around 10 CTs) the issue has not reappeared. I can't guarantee this will fix your issue @JNG, but it might be worth a try.

@oguz do you think it would be possible to update the lxcfs version packaged in Proxmox in a future release? We would rather not repeat this same dirty hack on our production servers ;-)

Thanks a lot and have a nice day!
 
Thank you @ebiii.
I will try it as soon as I can (with the current events in Europe, it might take a while). However, since I have only reproduced the issue on one specific server, I will have to test on our testing server, which is heavily used at the moment.
That said, I second your suggestion to update lxcfs; in my opinion it is always nice to be on more or less the latest versions.
 
@oguz do you think it would be possible to update the lxcfs version packaged in Proxmox in a future release? We would rather not repeat this same dirty hack on our production servers ;-)

we're also testing it, so it'll be up soon but not earlier than next week
 
The problem has been solved by upgrading to PVE 6. I still believe the earlier version of lxcfs had an issue.
However: this is solved! Thank you all for your help.
 
