VM Crashing - Out of Memory

soapee01

Hi,

I have a VM that crashes every 10 days or so. I initially attributed the issue to something in Windows Server 2012 R2, so I wrote a small script that checks the VM's status every 30 seconds and, if it has stopped, restarts it and emails me. The check is just:

Code:
qm list | grep "$VM_NAME" | grep stopped
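
For reference, the full watchdog is just that check wrapped in a loop. A minimal sketch (the VM name "winsrv", VM ID 100, and the sample `qm list` output are placeholders; the real script runs against a live Proxmox VE host where `qm` and `mail` are available):

```shell
#!/bin/sh
# Placeholder values; adjust for your host.
VM_NAME="winsrv"
VM_ID=100

# Sample 'qm list' output, so the check can be demonstrated
# without a live Proxmox host.
sample_output='      VMID NAME    STATUS    MEM(MB)    BOOTDISK(GB) PID
       100 winsrv  stopped     32768          100.00   0'

# The same pipeline the script uses, run against the sample output.
if printf '%s\n' "$sample_output" | grep "$VM_NAME" | grep -q stopped; then
    echo "VM $VM_NAME is stopped - would run: qm start $VM_ID (and email)"
fi
```

In the real script this sits inside a `while true; do ...; sleep 30; done` loop against live `qm list` output, with `qm start` and a `mail` call in place of the echo.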

This entry from syslog looks suspect to me:
Code:
Jun 25 07:32:21 pve kernel: [2331259.233957] Out of memory: Killed process 13420 (kvm) total-vm:40537300kB, anon-rss:33601084kB, file-rss:3676kB, shmem-rss:4kB, UID:0 pgtables:69256kB oom_score_adj:0

Here are my settings:
Code:
node PVE: 94.26 GiB RAM

VM100: 32 GiB, balloon=0
VM101: 4.00 GiB / 16.00 GiB, ballooning
VM102: 2.00 GiB / 8.00 GiB, ballooning

Worst-case total: 66.00 GiB



The ZFS ARC size also seems a bit crazy:

Code:
arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Jun 25 08:42:33 2021
Linux 5.4.101-1-pve                                           2.0.3-pve2
Machine: pve (x86_64)                                         2.0.3-pve2

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   100.1 %   47.2 GiB
        Target size (adaptive):                       100.0 %   47.1 GiB
        Min size (hard limit):                          6.2 %    2.9 GiB
        Max size (high water):                           16:1   47.1 GiB
        Most Frequently Used (MFU) cache size:         94.4 %   41.1 GiB
        Most Recently Used (MRU) cache size:            5.6 %    2.4 GiB
        Metadata cache size (hard limit):              75.0 %   35.3 GiB
        Metadata cache size (current):                 17.9 %    6.3 GiB
        Dnode cache size (hard limit):                 10.0 %    3.5 GiB
        Dnode cache size (current):                     1.1 %   38.7 MiB
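
That 47.1 GiB ceiling matches the OpenZFS default on Linux when `zfs_arc_max` is unset: half of host RAM (94.26 GiB on this node). A quick sanity check of that arithmetic:

```shell
# Sanity check: OpenZFS on Linux defaults the ARC max to half of host
# RAM when zfs_arc_max is 0/unset. Host RAM on this node is 94.26 GiB.
awk 'BEGIN { printf "Default ARC max: %.1f GiB\n", 94.26 / 2 }'
```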



zpool status

Code:
zpool status
  pool: rpool
 state: ONLINE

config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5003ea1e0c7-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a2d-part3  ONLINE       0     0     0
            scsi-35000c5003ea21aa1-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f1b2-part3  ONLINE       0     0     0
            scsi-35000c5003ea21ccd-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a60-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a44-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f98f-part3  ONLINE       0     0     0

errors: No known data errors





Have I misconfigured something here that's causing memory exhaustion? I'd appreciate any pointers on how to stop this from happening; I've never seen it on any of my other Proxmox VE servers.





All relevant syslog entries:

Code:
Jun 25 07:32:01 pve systemd[1]: Started Proxmox VE replication runner.
Jun 25 07:32:21 pve kernel: [2331259.233498] zfs invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0
Jun 25 07:32:21 pve kernel: [2331259.233501] CPU: 5 PID: 30400 Comm: zfs Tainted: P           O      5.4.101-1-pve #1
Jun 25 07:32:21 pve kernel: [2331259.233502] Hardware name: Dell Inc. PowerEdge T320/0MK701, BIOS 2.9.0 01/08/2020
Jun 25 07:32:21 pve kernel: [2331259.233503] Call Trace:
Jun 25 07:32:21 pve kernel: [2331259.233510]  dump_stack+0x6d/0x8b
Jun 25 07:32:21 pve kernel: [2331259.233514]  dump_header+0x4f/0x1e1
Jun 25 07:32:21 pve kernel: [2331259.233515]  oom_kill_process.cold.33+0xb/0x10
Jun 25 07:32:21 pve kernel: [2331259.233518]  out_of_memory+0x1ad/0x490
Jun 25 07:32:21 pve kernel: [2331259.233521]  __alloc_pages_slowpath+0xd40/0xe30
Jun 25 07:32:21 pve kernel: [2331259.233523]  __alloc_pages_nodemask+0x2df/0x330
Jun 25 07:32:21 pve kernel: [2331259.233525]  kmalloc_large_node+0x42/0x90
Jun 25 07:32:21 pve kernel: [2331259.233526]  __kmalloc_node+0x267/0x330
Jun 25 07:32:21 pve kernel: [2331259.233528]  ? lru_cache_add_active_or_unevictable+0x39/0xb0
Jun 25 07:32:21 pve kernel: [2331259.233535]  spl_kmem_zalloc+0xd1/0x120 [spl]
Jun 25 07:32:21 pve kernel: [2331259.233606]  zfsdev_ioctl+0x2b/0xe0 [zfs]
Jun 25 07:32:21 pve kernel: [2331259.233608]  do_vfs_ioctl+0xa9/0x640
Jun 25 07:32:21 pve kernel: [2331259.233610]  ? handle_mm_fault+0xc9/0x1f0
Jun 25 07:32:21 pve kernel: [2331259.233611]  ksys_ioctl+0x67/0x90
Jun 25 07:32:21 pve kernel: [2331259.233612]  __x64_sys_ioctl+0x1a/0x20
Jun 25 07:32:21 pve kernel: [2331259.233615]  do_syscall_64+0x57/0x190
Jun 25 07:32:21 pve kernel: [2331259.233618]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 25 07:32:21 pve kernel: [2331259.233619] RIP: 0033:0x7f12097c4427

...snip for char limit...


Jun 25 07:32:21 pve kernel: [2331259.233683] Tasks state (memory values in pages):
Jun 25 07:32:21 pve kernel: [2331259.233683] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 25 07:32:21 pve kernel: [2331259.233692] [   2030]     0  2030    99092    74594   823296        0             0 systemd-journal
Jun 25 07:32:21 pve kernel: [2331259.233694] [   2055]     0  2055     5924      822    65536        0         -1000 systemd-udevd
Jun 25 07:32:21 pve kernel: [2331259.233695] [   2659]   106  2659     1705      413    49152        0             0 rpcbind
Jun 25 07:32:21 pve kernel: [2331259.233697] [   2669]   100  2669    23270      601    81920        0             0 systemd-timesyn
Jun 25 07:32:21 pve kernel: [2331259.233698] [   2704]     0  2704    56455      758    86016        0             0 rsyslogd
Jun 25 07:32:21 pve kernel: [2331259.233700] [   2707]     0  2707      535      145    40960        0         -1000 watchdog-mux
Jun 25 07:32:21 pve kernel: [2331259.233701] [   2708]     0  2708     3151      816    69632        0             0 smartd
Jun 25 07:32:21 pve kernel: [2331259.233702] [   2709]     0  2709     1022      379    45056        0             0 qmeventd
Jun 25 07:32:21 pve kernel: [2331259.233704] [   2711]   104  2711     2319      651    57344        0          -900 dbus-daemon
Jun 25 07:32:21 pve kernel: [2331259.233705] [   2719]     0  2719    41547      675    86016        0             0 zed
Jun 25 07:32:21 pve kernel: [2331259.233706] [   2721]     0  2721     5049     1117    81920        0             0 systemd-logind
Jun 25 07:32:21 pve kernel: [2331259.233707] [   2725]     0  2725   111600      419    98304        0             0 lxcfs
Jun 25 07:32:21 pve kernel: [2331259.233709] [   2730]     0  2730   100037     1691   147456        0             0 udisksd
Jun 25 07:32:21 pve kernel: [2331259.233710] [   2737]     0  2737   170352      340   131072        0             0 pve-lxc-syscall
Jun 25 07:32:21 pve kernel: [2331259.233712] [   2738]     0  2738   990876     3956   598016        0          -900 snapd
Jun 25 07:32:21 pve kernel: [2331259.233713] [   2764]     0  2764     1681      429    49152        0             0 ksmtuned
Jun 25 07:32:21 pve kernel: [2331259.233714] [   2836]     0  2836    58959      792    90112        0             0 polkitd
Jun 25 07:32:21 pve kernel: [2331259.233716] [   2990]     0  2990     1823      290    57344        0             0 lxc-monitord
Jun 25 07:32:21 pve kernel: [2331259.233717] [   3010]     0  3010      568      140    45056        0             0 none
Jun 25 07:32:21 pve kernel: [2331259.233718] [   3014]     0  3014    21785      377    61440        0             0 apcupsd
Jun 25 07:32:21 pve kernel: [2331259.233719] [   3015]     0  3015     3962      734    69632        0         -1000 sshd
Jun 25 07:32:21 pve kernel: [2331259.233720] [   3018]     0  3018     1722       61    53248        0             0 iscsid
Jun 25 07:32:21 pve kernel: [2331259.233722] [   3019]     0  3019     1848     1305    53248        0           -17 iscsid
Jun 25 07:32:21 pve kernel: [2331259.233723] [   3241]     0  3241    10868      686    73728        0             0 master
Jun 25 07:32:21 pve kernel: [2331259.233724] [   3243]   107  3243    10984      798    86016        0             0 qmgr
Jun 25 07:32:21 pve kernel: [2331259.233725] [   3530]     0  3530    59907      818    94208        0             0 lightdm

...snip for char limit...

Jun 25 07:32:21 pve kernel: [2331259.233770] [  41612]    33 41612    17654    13079   172032        0             0 spiceproxy work
Jun 25 07:32:21 pve kernel: [2331259.233772] [   7819]     0  7819    90596    32316   438272        0             0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233773] [  20414]     0 20414    90595    32349   438272        0             0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233774] [  19661]   107 19661    10958     1599    90112        0             0 pickup
Jun 25 07:32:21 pve kernel: [2331259.233776] [  47347]    33 47347    91026    32838   450560        0             0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233778] [  35836]    33 35836    90979    32966   450560        0             0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233779] [  46032]    33 46032    92026    34011   462848        0             0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233781] [  12133]     0 12133    90595    30832   430080        0             0 task UPID:pve:0
Jun 25 07:32:21 pve kernel: [2331259.233782] [  12159]     0 12159    82172    27574   389120        0             0 qm
Jun 25 07:32:21 pve kernel: [2331259.233783] [  33470]     0 33470    90595    32101   438272        0             0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233785] [  24362]     0 24362     1314      188    49152        0             0 sleep
Jun 25 07:32:21 pve kernel: [2331259.233787] [  25449]     0 25449     1314      188    49152        0             0 sleep
Jun 25 07:32:21 pve kernel: [2331259.233788] [  30400]     0 30400     2708      547    61440        0             0 zfs
Jun 25 07:32:21 pve kernel: [2331259.233789] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/qemu.slice/100.scope,task=kvm,pid=13420,uid=0
Jun 25 07:32:21 pve kernel: [2331259.233957] Out of memory: Killed process 13420 (kvm) total-vm:40537300kB, anon-rss:33601084kB, file-rss:3676kB, shmem-rss:4kB, UID:0 pgtables:69256kB oom_score_adj:0
Jun 25 07:32:25 pve kernel: [2331262.921640] oom_reaper: reaped process 13420 (kvm), now anon-rss:0kB, file-rss:100kB, shmem-rss:4kB
Jun 25 07:32:28 pve kernel: [2331265.563718] fwbr100i0: port 2(tap100i0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331265.563932] fwbr100i0: port 2(tap100i0) entered disabled state
Jun 25 07:32:28 pve systemd[1]: 100.scope: Succeeded.
Jun 25 07:32:28 pve qmeventd[2705]: Starting cleanup for 100
Jun 25 07:32:28 pve kernel: [2331266.384099] fwbr100i0: port 1(fwln100i0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331266.384208] vmbr0: port 3(fwpr100p0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331266.384361] device fwln100i0 left promiscuous mode
Jun 25 07:32:28 pve kernel: [2331266.384363] fwbr100i0: port 1(fwln100i0) entered disabled state
Jun 25 07:32:29 pve kernel: [2331266.419502] device fwpr100p0 left promiscuous mode
Jun 25 07:32:29 pve kernel: [2331266.419505] vmbr0: port 3(fwpr100p0) entered disabled state
Jun 25 07:32:29 pve qmeventd[2705]: Finished cleanup for 100
Jun 25 07:32:36 pve qm[31446]: <root@pam> starting task UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam:
Jun 25 07:32:36 pve qm[31448]: start VM 100: UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam:
Jun 25 07:32:37 pve systemd[1]: Started 100.scope.
Jun 25 07:32:37 pve systemd-udevd[31466]: Using default interface naming scheme 'v240'.
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for tap100i0: No such file or directory
Jun 25 07:32:37 pve kernel: [2331275.251419] device tap100i0 entered promiscuous mode
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for fwbr100i0: No such file or directory
Jun 25 07:32:37 pve systemd-udevd[31465]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31465]: Using default interface naming scheme 'v240'.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for fwpr100p0: No such file or directory
Jun 25 07:32:37 pve systemd-udevd[31465]: Could not generate persistent MAC address for fwln100i0: No such file or directory
Jun 25 07:32:37 pve kernel: [2331275.287708] fwbr100i0: port 1(fwln100i0) entered blocking state

...snip for char limit

Jun 25 07:32:37 pve kernel: [2331275.297070] fwbr100i0: port 2(tap100i0) entered forwarding state
Jun 25 07:32:38 pve qm[31446]: <root@pam> end task UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam: OK
Jun 25 07:32:38 pve postfix/pickup[19661]: 116AA5705A: uid=0 from=<root>
Jun 25 07:32:38 pve postfix/cleanup[31524]: 116AA5705A: message-id=<20210625123238.116AA5705A@pve.contoso.local>
Jun 25 07:32:38 pve postfix/qmgr[3243]: 116AA5705A: from=<root@pve.contoso.local>, size=610, nrcpt=1 (queue active)
Jun 25 07:32:38 pve postfix/smtp[31531]: 116AA5705A: to=<support@me.com>, relay=mail.me.com[66.201.25.251]:587, delay=0.9, delays=0.02/0.02/0.67/0.19, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as C51BA121167)
 
By default, ZFS uses up to 50% of host RAM for the ARC, hence the 47 GiB. On top of that, the 66 GiB for your VMs is more than the host can provide, and ballooning can indeed fail if the memory cannot be freed fast enough.
If your 66 GiB for the VMs are mandatory, I would restrict the ZFS ARC to 24 GiB.
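
A minimal sketch of capping the ARC at the suggested 24 GiB (the `/etc/modprobe.d/zfs.conf` option and the sysfs knob are the standard OpenZFS mechanisms; the value must be given in bytes):

```shell
# Cap the ZFS ARC at 24 GiB (zfs_arc_max takes bytes).
ARC_MAX=$((24 * 1024 * 1024 * 1024))
echo "$ARC_MAX"    # 25769803776

# Persist across reboots, then run 'update-initramfs -u' and reboot:
# echo "options zfs zfs_arc_max=$ARC_MAX" > /etc/modprobe.d/zfs.conf

# Or apply immediately on the running system:
# echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```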
 
