Hi,
I analyzed some strange behavior (including not being able to log in) and I think I found the root cause: a shortage of main memory, leading to the OOM killer killing essential tasks such as systemd and sshd, which can lead to arbitrarily bad and unreasonable behavior. Fortunately, a reboot solved it all.
The machine was doing a disk load test (writing pseudo-random data to big files at random offsets and comparing hashes of these files). It has 48 GB RAM. There are two VMs with 16 GB each and one container with 4 GB, so 36 GB in total, leaving 12 GB for PVE+ZFS.
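(Side note: by default the ZFS ARC may grow to roughly half of the RAM, so during the disk load test it could well have eaten into those 12 GB; that it actually did is only my assumption. A rough sketch of how to check it and, if needed, cap it:)
Code:
# current ARC size vs. its configured maximum
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats
# 0 here means "kernel default", i.e. roughly half of the RAM
cat /sys/module/zfs/parameters/zfs_arc_max
# example: cap the ARC at 8 GiB (value in bytes; takes effect after
# "update-initramfs -u" and a reboot if root is on ZFS)
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf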
The "memory usage" graph in "node | summary" in the web GUI shows for "Day maximum" show 48 Gi total (46,86Gi) and has a peak in "RAM usage" 43.58 Gi at 16:00:00, 36.62 GB at 17:00:00 and 38.63 GB at 17:30:00 (no value in between). So no visible out of memory condition here.
But there was.
journalctl contained:
Code:
root@pve:/var/log# journalctl |grep "out of memory"
Jul 21 17:07:54 pve kernel: Memory cgroup out of memory: Killed process 1395069 ((sd-pam)) total-vm:168576kB, anon-rss:2956kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:92kB oom_score_adj:100
Jul 21 17:07:54 pve kernel: Memory cgroup out of memory: Killed process 1395068 (systemd) total-vm:18716kB, anon-rss:1408kB, file-rss:128kB, shmem-rss:0kB, UID:100000 pgtables:76kB oom_score_adj:100
Jul 21 17:07:54 pve kernel: Memory cgroup out of memory: Killed process 1119534 (systemd) total-vm:168744kB, anon-rss:3712kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:92kB oom_score_adj:0
Jul 21 17:07:54 pve kernel: Memory cgroup out of memory: Killed process 1395779 (sshd) total-vm:17956kB, anon-rss:1792kB, file-rss:128kB, shmem-rss:0kB, UID:100000 pgtables:72kB oom_score_adj:0
root@pve:/var/log#
(NB: This is the real, complete output of journalctl | grep "out of memory"; I did not select specific processes or shorten the output.)
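(As I understand it, the kernel picks its victims from a per-process badness score plus the oom_score_adj value visible in the log lines above. A quick sketch to see how candidates currently rank; the process names here are only examples:)
Code:
# print effective OOM score and adjustment for some processes
for p in $(pgrep -d' ' -f 'sshd|systemd'); do
    printf '%6s %-16s score=%-5s adj=%s\n' "$p" "$(cat /proc/$p/comm)" \
        "$(cat /proc/$p/oom_score)" "$(cat /proc/$p/oom_score_adj)"
done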
Selecting systemd, sd-pam and sshd to be killed seems close to the worst possible choice.
Is there a way to improve such behavior?
I know that OOM situations are hard to handle, but could containers perhaps be given a "lower priority" or something similar? In my case, the 4 GB container is just a test container, and I would have preferred if it had been shut down instead. At least a visible error in the web GUI would be good in such a fatal situation (although in my case, of course, I could not even log in).
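For what I have in mind, something like the following might work (a rough sketch, not verified on PVE; the CTID, the paths and whether such settings persist are all assumptions on my part):
Code:
# CTID 100 is just an example for the test container
CTID=100

# make the container's init a preferred OOM victim; killing it effectively
# stops the container (oom_score_adj: -1000 = never kill ... 1000 = kill first)
echo 500 > /proc/"$(lxc-info -n "$CTID" -p -H)"/oom_score_adj

# cgroup v2: on an OOM inside the container's cgroup, kill the whole cgroup
# (i.e. the whole container) instead of picking single processes
echo 1 > /sys/fs/cgroup/lxc/"$CTID"/memory.oom.group
Whether these survive a container restart, or whether PVE offers a supported knob for this, is exactly what I don't know.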