Hi,
I have a VM that crashes every 10 days or so. I initially attributed the issue to something in Windows Server 2012 R2. I wrote a little script that checks the VM's status every 30 seconds and restarts it (and emails me) if it has crashed. The check is just this:
Code:
qm list | grep "$VM_NAME" | grep stopped
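The surrounding loop is roughly this (a simplified sketch; the VM name is a placeholder for the real one):
Code:
#!/bin/bash
# Watchdog sketch: restart VM 100 and send a mail if it shows up as stopped.
# Assumes a mail(1) command (bsd-mailx or similar) is installed.
VM_ID=100
VM_NAME="win2012r2"        # placeholder for the real VM name
MAIL_TO="support@me.com"

while true; do
    if qm list | grep "$VM_NAME" | grep -q stopped; then
        qm start "$VM_ID"
        echo "VM $VM_ID ($VM_NAME) was stopped and has been restarted" \
            | mail -s "VM $VM_ID restarted on $(hostname)" "$MAIL_TO"
    fi
    sleep 30
done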
This entry from syslog looks suspect to me:
Code:
Jun 25 07:32:21 pve kernel: [2331259.233957] Out of memory: Killed process 13420 (kvm) total-vm:40537300kB, anon-rss:33601084kB, file-rss:3676kB, shmem-rss:4kB, UID:0 pgtables:69256kB oom_score_adj
Here are my memory settings:
Code:
node PVE: 94.26 GiB RAM
VM 100: 32 GiB, balloon=0
VM 101: 4.00 GiB / 16.00 GiB, ballooning enabled
VM 102: 2.00 GiB / 8.00 GiB, ballooning enabled
Worst case total: 66.00 GiB
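(The per-VM numbers above can be pulled straight from the VM configs with something like:)
Code:
# Show the memory/balloon settings for each VM
for vmid in 100 101 102; do
    echo "== VM $vmid =="
    qm config "$vmid" | grep -Ei 'memory|balloon'
done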
The ZFS ARC size also seems a bit crazy:
Code:
arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report Fri Jun 25 08:42:33 2021
Linux 5.4.101-1-pve 2.0.3-pve2
Machine: pve (x86_64) 2.0.3-pve2
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 100.1 % 47.2 GiB
Target size (adaptive): 100.0 % 47.1 GiB
Min size (hard limit): 6.2 % 2.9 GiB
Max size (high water): 16:1 47.1 GiB
Most Frequently Used (MFU) cache size: 94.4 % 41.1 GiB
Most Recently Used (MRU) cache size: 5.6 % 2.4 GiB
Metadata cache size (hard limit): 75.0 % 35.3 GiB
Metadata cache size (current): 17.9 % 6.3 GiB
Dnode cache size (hard limit): 10.0 % 3.5 GiB
Dnode cache size (current): 1.1 % 38.7 MiB
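I haven't set an explicit ARC limit, so I assume it is sitting at the default of roughly half the host RAM (which matches the 47.1 GiB above). If capping it is the right fix, I guess it would look something like this (16 GiB is only an example value, not something I've tested on this box):
Code:
# Current limit in bytes (0 means "use the built-in default, ~50% of RAM")
cat /sys/module/zfs/parameters/zfs_arc_max

# Example: cap the ARC at 16 GiB persistently (value in bytes)
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u

# Apply the new limit immediately, without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max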
zpool status
Code:
zpool status
  pool: rpool
 state: ONLINE
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5003ea1e0c7-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a2d-part3  ONLINE       0     0     0
            scsi-35000c5003ea21aa1-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f1b2-part3  ONLINE       0     0     0
            scsi-35000c5003ea21ccd-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a60-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a44-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f98f-part3  ONLINE       0     0     0

errors: No known data errors
Have I misconfigured something here that's causing the memory exhaustion? I'd appreciate any pointers on how to stop this from happening; I've never seen it on any of my other Proxmox VE servers.
All relevant syslog entries:
Code:
Jun 25 07:32:01 pve systemd[1]: Started Proxmox VE replication runner.
Jun 25 07:32:21 pve kernel: [2331259.233498] zfs invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0
Jun 25 07:32:21 pve kernel: [2331259.233501] CPU: 5 PID: 30400 Comm: zfs Tainted: P O 5.4.101-1-pve #1
Jun 25 07:32:21 pve kernel: [2331259.233502] Hardware name: Dell Inc. PowerEdge T320/0MK701, BIOS 2.9.0 01/08/2020
Jun 25 07:32:21 pve kernel: [2331259.233503] Call Trace:
Jun 25 07:32:21 pve kernel: [2331259.233510] dump_stack+0x6d/0x8b
Jun 25 07:32:21 pve kernel: [2331259.233514] dump_header+0x4f/0x1e1
Jun 25 07:32:21 pve kernel: [2331259.233515] oom_kill_process.cold.33+0xb/0x10
Jun 25 07:32:21 pve kernel: [2331259.233518] out_of_memory+0x1ad/0x490
Jun 25 07:32:21 pve kernel: [2331259.233521] __alloc_pages_slowpath+0xd40/0xe30
Jun 25 07:32:21 pve kernel: [2331259.233523] __alloc_pages_nodemask+0x2df/0x330
Jun 25 07:32:21 pve kernel: [2331259.233525] kmalloc_large_node+0x42/0x90
Jun 25 07:32:21 pve kernel: [2331259.233526] __kmalloc_node+0x267/0x330
Jun 25 07:32:21 pve kernel: [2331259.233528] ? lru_cache_add_active_or_unevictable+0x39/0xb0
Jun 25 07:32:21 pve kernel: [2331259.233535] spl_kmem_zalloc+0xd1/0x120 [spl]
Jun 25 07:32:21 pve kernel: [2331259.233606] zfsdev_ioctl+0x2b/0xe0 [zfs]
Jun 25 07:32:21 pve kernel: [2331259.233608] do_vfs_ioctl+0xa9/0x640
Jun 25 07:32:21 pve kernel: [2331259.233610] ? handle_mm_fault+0xc9/0x1f0
Jun 25 07:32:21 pve kernel: [2331259.233611] ksys_ioctl+0x67/0x90
Jun 25 07:32:21 pve kernel: [2331259.233612] __x64_sys_ioctl+0x1a/0x20
Jun 25 07:32:21 pve kernel: [2331259.233615] do_syscall_64+0x57/0x190
Jun 25 07:32:21 pve kernel: [2331259.233618] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 25 07:32:21 pve kernel: [2331259.233619] RIP: 0033:0x7f12097c4427
...snip for char limit...
Jun 25 07:32:21 pve kernel: [2331259.233683] Tasks state (memory values in pages):
Jun 25 07:32:21 pve kernel: [2331259.233683] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Jun 25 07:32:21 pve kernel: [2331259.233692] [ 2030] 0 2030 99092 74594 823296 0 0 systemd-journal
Jun 25 07:32:21 pve kernel: [2331259.233694] [ 2055] 0 2055 5924 822 65536 0 -1000 systemd-udevd
Jun 25 07:32:21 pve kernel: [2331259.233695] [ 2659] 106 2659 1705 413 49152 0 0 rpcbind
Jun 25 07:32:21 pve kernel: [2331259.233697] [ 2669] 100 2669 23270 601 81920 0 0 systemd-timesyn
Jun 25 07:32:21 pve kernel: [2331259.233698] [ 2704] 0 2704 56455 758 86016 0 0 rsyslogd
Jun 25 07:32:21 pve kernel: [2331259.233700] [ 2707] 0 2707 535 145 40960 0 -1000 watchdog-mux
Jun 25 07:32:21 pve kernel: [2331259.233701] [ 2708] 0 2708 3151 816 69632 0 0 smartd
Jun 25 07:32:21 pve kernel: [2331259.233702] [ 2709] 0 2709 1022 379 45056 0 0 qmeventd
Jun 25 07:32:21 pve kernel: [2331259.233704] [ 2711] 104 2711 2319 651 57344 0 -900 dbus-daemon
Jun 25 07:32:21 pve kernel: [2331259.233705] [ 2719] 0 2719 41547 675 86016 0 0 zed
Jun 25 07:32:21 pve kernel: [2331259.233706] [ 2721] 0 2721 5049 1117 81920 0 0 systemd-logind
Jun 25 07:32:21 pve kernel: [2331259.233707] [ 2725] 0 2725 111600 419 98304 0 0 lxcfs
Jun 25 07:32:21 pve kernel: [2331259.233709] [ 2730] 0 2730 100037 1691 147456 0 0 udisksd
Jun 25 07:32:21 pve kernel: [2331259.233710] [ 2737] 0 2737 170352 340 131072 0 0 pve-lxc-syscall
Jun 25 07:32:21 pve kernel: [2331259.233712] [ 2738] 0 2738 990876 3956 598016 0 -900 snapd
Jun 25 07:32:21 pve kernel: [2331259.233713] [ 2764] 0 2764 1681 429 49152 0 0 ksmtuned
Jun 25 07:32:21 pve kernel: [2331259.233714] [ 2836] 0 2836 58959 792 90112 0 0 polkitd
Jun 25 07:32:21 pve kernel: [2331259.233716] [ 2990] 0 2990 1823 290 57344 0 0 lxc-monitord
Jun 25 07:32:21 pve kernel: [2331259.233717] [ 3010] 0 3010 568 140 45056 0 0 none
Jun 25 07:32:21 pve kernel: [2331259.233718] [ 3014] 0 3014 21785 377 61440 0 0 apcupsd
Jun 25 07:32:21 pve kernel: [2331259.233719] [ 3015] 0 3015 3962 734 69632 0 -1000 sshd
Jun 25 07:32:21 pve kernel: [2331259.233720] [ 3018] 0 3018 1722 61 53248 0 0 iscsid
Jun 25 07:32:21 pve kernel: [2331259.233722] [ 3019] 0 3019 1848 1305 53248 0 -17 iscsid
Jun 25 07:32:21 pve kernel: [2331259.233723] [ 3241] 0 3241 10868 686 73728 0 0 master
Jun 25 07:32:21 pve kernel: [2331259.233724] [ 3243] 107 3243 10984 798 86016 0 0 qmgr
Jun 25 07:32:21 pve kernel: [2331259.233725] [ 3530] 0 3530 59907 818 94208 0 0 lightdm
...snip for char limit...
Jun 25 07:32:21 pve kernel: [2331259.233770] [ 41612] 33 41612 17654 13079 172032 0 0 spiceproxy work
Jun 25 07:32:21 pve kernel: [2331259.233772] [ 7819] 0 7819 90596 32316 438272 0 0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233773] [ 20414] 0 20414 90595 32349 438272 0 0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233774] [ 19661] 107 19661 10958 1599 90112 0 0 pickup
Jun 25 07:32:21 pve kernel: [2331259.233776] [ 47347] 33 47347 91026 32838 450560 0 0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233778] [ 35836] 33 35836 90979 32966 450560 0 0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233779] [ 46032] 33 46032 92026 34011 462848 0 0 pveproxy worker
Jun 25 07:32:21 pve kernel: [2331259.233781] [ 12133] 0 12133 90595 30832 430080 0 0 task UPID:pve:0
Jun 25 07:32:21 pve kernel: [2331259.233782] [ 12159] 0 12159 82172 27574 389120 0 0 qm
Jun 25 07:32:21 pve kernel: [2331259.233783] [ 33470] 0 33470 90595 32101 438272 0 0 pvedaemon worke
Jun 25 07:32:21 pve kernel: [2331259.233785] [ 24362] 0 24362 1314 188 49152 0 0 sleep
Jun 25 07:32:21 pve kernel: [2331259.233787] [ 25449] 0 25449 1314 188 49152 0 0 sleep
Jun 25 07:32:21 pve kernel: [2331259.233788] [ 30400] 0 30400 2708 547 61440 0 0 zfs
Jun 25 07:32:21 pve kernel: [2331259.233789] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/qemu.slice/100.scope,task=kvm,pid=13420,uid=0
Jun 25 07:32:21 pve kernel: [2331259.233957] Out of memory: Killed process 13420 (kvm) total-vm:40537300kB, anon-rss:33601084kB, file-rss:3676kB, shmem-rss:4kB, UID:0 pgtables:69256kB oom_score_adj:0
Jun 25 07:32:25 pve kernel: [2331262.921640] oom_reaper: reaped process 13420 (kvm), now anon-rss:0kB, file-rss:100kB, shmem-rss:4kB
Jun 25 07:32:28 pve kernel: [2331265.563718] fwbr100i0: port 2(tap100i0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331265.563932] fwbr100i0: port 2(tap100i0) entered disabled state
Jun 25 07:32:28 pve systemd[1]: 100.scope: Succeeded.
Jun 25 07:32:28 pve qmeventd[2705]: Starting cleanup for 100
Jun 25 07:32:28 pve kernel: [2331266.384099] fwbr100i0: port 1(fwln100i0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331266.384208] vmbr0: port 3(fwpr100p0) entered disabled state
Jun 25 07:32:28 pve kernel: [2331266.384361] device fwln100i0 left promiscuous mode
Jun 25 07:32:28 pve kernel: [2331266.384363] fwbr100i0: port 1(fwln100i0) entered disabled state
Jun 25 07:32:29 pve kernel: [2331266.419502] device fwpr100p0 left promiscuous mode
Jun 25 07:32:29 pve kernel: [2331266.419505] vmbr0: port 3(fwpr100p0) entered disabled state
Jun 25 07:32:29 pve qmeventd[2705]: Finished cleanup for 100
Jun 25 07:32:36 pve qm[31446]: <root@pam> starting task UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam:
Jun 25 07:32:36 pve qm[31448]: start VM 100: UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam:
Jun 25 07:32:37 pve systemd[1]: Started 100.scope.
Jun 25 07:32:37 pve systemd-udevd[31466]: Using default interface naming scheme 'v240'.
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for tap100i0: No such file or directory
Jun 25 07:32:37 pve kernel: [2331275.251419] device tap100i0 entered promiscuous mode
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for fwbr100i0: No such file or directory
Jun 25 07:32:37 pve systemd-udevd[31465]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31466]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 25 07:32:37 pve systemd-udevd[31465]: Using default interface naming scheme 'v240'.
Jun 25 07:32:37 pve systemd-udevd[31466]: Could not generate persistent MAC address for fwpr100p0: No such file or directory
Jun 25 07:32:37 pve systemd-udevd[31465]: Could not generate persistent MAC address for fwln100i0: No such file or directory
Jun 25 07:32:37 pve kernel: [2331275.287708] fwbr100i0: port 1(fwln100i0) entered blocking state
...snip for char limit
Jun 25 07:32:37 pve kernel: [2331275.297070] fwbr100i0: port 2(tap100i0) entered forwarding state
Jun 25 07:32:38 pve qm[31446]: <root@pam> end task UPID:pve:00007AD8:0DE53C7C:60D5CCE4:qmstart:100:root@pam: OK
Jun 25 07:32:38 pve postfix/pickup[19661]: 116AA5705A: uid=0 from=<root>
Jun 25 07:32:38 pve postfix/cleanup[31524]: 116AA5705A: message-id=<20210625123238.116AA5705A@pve.contoso.local>
Jun 25 07:32:38 pve postfix/qmgr[3243]: 116AA5705A: from=<root@pve.contoso.local>, size=610, nrcpt=1 (queue active)
Jun 25 07:32:38 pve postfix/smtp[31531]: 116AA5705A: to=<support@me.com>, relay=mail.me.com[66.201.25.251]:587, delay=0.9, delays=0.02/0.02/0.67/0.19, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as C51BA121167)