Hi All,
Coming from 20 years as a Windows professional (with 2 years of hands-on Linux), I'm having trouble fully grasping cgroups. I recently upgraded my server cluster and I'm now encountering OOM kills regularly (a few times per day). I have a single cgroup (just the out-of-the-box config). My new setup has 1 TB of memory in one box and 512 GB in the other. ZFS and the OS do take some, yet I still have plenty of breathing room.
For some reason unknown to me, the OOM killer keeps attacking my containers. Netdata reports 99% RAM usage for the cgroup when it kicks in and starts murdering my containers (Docker and LXC), yet when I check the systems themselves, actual programs never use more than 30% of physical memory. ZFS does cache quite a bit, and the systems buffer their own page cache on top of that, but that should shrink back when memory is actually needed, and it seems to do so.
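For reference, this is roughly how I've been eyeballing the breakdown between program memory and cache (a quick Python sketch, assuming cgroup v2 mounted at /sys/fs/cgroup; the cgroup path is just an example, point it at whatever container you want):

#!/usr/bin/env python3
"""Compare what a cgroup reports as "used" vs. anonymous (program) memory vs. page cache."""
import sys
from pathlib import Path

# e.g. /sys/fs/cgroup/system.slice or a specific docker-<id>.scope / lxc cgroup
cg = Path(sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/system.slice")

def gib(n: int) -> str:
    return f"{n / 2**30:.2f} GiB"

current = int((cg / "memory.current").read_text())
stat = dict(line.split() for line in (cg / "memory.stat").read_text().splitlines())

print(f"memory.current   : {gib(current)}")            # the number usage graphs show
print(f"  anon (programs) : {gib(int(stat['anon']))}")  # actual program memory
print(f"  file (page cache): {gib(int(stat['file']))}") # reclaimable cache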
I'm not sure why this wasn't an issue on my old boxes; they had less RAM and older processors under the same load. I've read up on cgroups and how to use them, but it's such a foreign concept that I'm having trouble deciding where to start troubleshooting. I would assume that with a single cgroup everything would simply be allocated to it. I've also checked the cgroup memory settings: memory.max is "max" and memory.min is 0, yet the OOM killer still comes into play with non-buffered RAM usage at 5-30%, depending on where I place resources in the cluster.
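In case I'm checking the wrong thing, this is how I convinced myself no limit is set anywhere (again a small Python sketch, assuming cgroup v2 at /sys/fs/cgroup):

#!/usr/bin/env python3
"""Walk the cgroup v2 tree and flag any cgroup whose memory.max or memory.min
differs from the defaults ("max" / "0")."""
from pathlib import Path

ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified mount point

for max_file in ROOT.rglob("memory.max"):
    cg = max_file.parent
    mem_max = max_file.read_text().strip()
    min_file = cg / "memory.min"
    mem_min = min_file.read_text().strip() if min_file.exists() else "0"
    if mem_max != "max" or mem_min != "0":
        print(f"{cg}: memory.max={mem_max} memory.min={mem_min}")

It prints nothing on either box, which is why the 99% figure confuses me.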
My systems are all 100% up to date, though I'm running the 6.2 kernel branch. Not sure if that's an issue?
I'd appreciate some guidance on where I should start. Happy to post logs, just need to know what would be helpful.
All the best and thanks in advance!
Keith