VM crashes over and over again - potential memory leak in kvm?

Setup:
HPE MicroServer Gen10 Plus with 48GB ECC RAM (Memtest passes)
4x 4TB in ZFS under Proxmox 8.1.10
Latest TrueNAS SCALE inside a VM with 32GB RAM, a 30GB system disk and a 3.2TB data disk
Inside TrueNAS:
1 dataset on ZFS (encrypted)
Exposed to the LAN over SMB

When I copy files over, the VM crashes. The likelihood increases with larger files (starting at around 2GB), but it does not happen every time.

It seems to happen more often with Windows and macOS clients than with Debian-based clients.

What could be wrong?

When the crash happens, the logs look as follows:

Code:
Mar 28 22:27:14 pve pvedaemon[1662]: worker exit
Mar 28 22:27:14 pve pvedaemon[1660]: worker 1662 finished
Mar 28 22:27:14 pve pvedaemon[1660]: starting 1 worker(s)
Mar 28 22:27:14 pve pvedaemon[1660]: worker 1143811 started
Mar 28 22:29:42 pve pveproxy[1138713]: worker exit
Mar 28 22:29:42 pve pveproxy[1678]: worker 1138713 finished
Mar 28 22:29:42 pve pveproxy[1678]: starting 1 worker(s)
Mar 28 22:29:42 pve pveproxy[1678]: worker 1144290 started
Mar 28 22:36:22 pve pvedaemon[1142148]: <root@pam> successful auth for user 'root@pam'
Mar 28 22:41:09 pve pveproxy[1141573]: worker exit
Mar 28 22:41:09 pve pveproxy[1678]: worker 1141573 finished
Mar 28 22:41:09 pve pveproxy[1678]: starting 1 worker(s)
Mar 28 22:41:09 pve pveproxy[1678]: worker 1146432 started
Mar 28 23:02:54 pve smartd[1166]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Mar 28 23:02:55 pve smartd[1166]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 82 to 83
Mar 28 23:04:52 pve kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mar 28 23:04:54 pve kernel: CPU: 1 PID: 836931 Comm: CPU 0/KVM Tainted: P           O       6.5.13-1-pve #1
Mar 28 23:04:54 pve kernel: Hardware name: HPE ProLiant MicroServer Gen10 Plus/ProLiant MicroServer Gen10 Plus, BIOS U48 06/24/2020
Mar 28 23:04:54 pve kernel: Call Trace:
Mar 28 23:04:54 pve kernel:  <TASK>
Mar 28 23:04:54 pve kernel:  dump_stack_lvl+0x48/0x70
Mar 28 23:04:54 pve kernel:  dump_stack+0x10/0x20
Mar 28 23:04:54 pve kernel:  dump_header+0x4f/0x260
Mar 28 23:04:54 pve kernel:  oom_kill_process+0x10d/0x1c0
Mar 28 23:04:54 pve kernel:  out_of_memory+0x270/0x560
Mar 28 23:04:54 pve kernel:  __alloc_pages+0x114f/0x12e0
Mar 28 23:04:54 pve kernel:  alloc_pages+0x90/0x160
Mar 28 23:04:54 pve kernel:  folio_alloc+0x1d/0x60
Mar 28 23:04:54 pve kernel:  filemap_alloc_folio+0xfd/0x110
Mar 28 23:04:54 pve kernel:  __filemap_get_folio+0xd8/0x230
Mar 28 23:04:55 pve kernel:  filemap_fault+0x584/0x9f0
Mar 28 23:04:55 pve kernel:  __do_fault+0x36/0x150
Mar 28 23:04:55 pve kernel:  do_fault+0x266/0x3e0
Mar 28 23:04:55 pve kernel:  __handle_mm_fault+0x6cc/0xc30
Mar 28 23:04:55 pve kernel:  handle_mm_fault+0x164/0x360
Mar 28 23:04:55 pve kernel:  do_user_addr_fault+0x212/0x6a0
Mar 28 23:04:55 pve kernel:  exc_page_fault+0x83/0x1b0
Mar 28 23:04:55 pve kernel:  asm_exc_page_fault+0x27/0x30
Mar 28 23:04:55 pve kernel: RIP: 0033:0x7e45252afc5b
Mar 28 23:04:55 pve kernel: Code: Unable to access opcode bytes at 0x7e45252afc31.
Mar 28 23:04:55 pve kernel: RSP: 002b:00007e452098cf30 EFLAGS: 00010246
Mar 28 23:04:55 pve kernel: RAX: 0000000000000000 RBX: 00005c675d8f5b10 RCX: 00007e45252afc5b
Mar 28 23:04:55 pve kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001b
Mar 28 23:04:55 pve kernel: RBP: 000000000000ae80 R08: 00005c675b0acc90 R09: 0000000000000000
Mar 28 23:04:55 pve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Mar 28 23:04:55 pve kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 28 23:04:55 pve kernel:  </TASK>
Mar 28 23:04:55 pve kernel: Mem-Info:
Mar 28 23:04:55 pve kernel: active_anon:773252 inactive_anon:4694662 isolated_anon:0
 active_file:46 inactive_file:49 isolated_file:0
 unevictable:768 dirty:3 writeback:3
 slab_reclaimable:6376 slab_unreclaimable:724886
 mapped:11799 shmem:12732 pagetables:13226
 sec_pagetables:6357 bounce:0
 kernel_misc_reclaimable:0
 free:53335 free_pcp:195 free_cma:0
Mar 28 23:04:55 pve kernel: Node 0 active_anon:3093944kB inactive_anon:18777712kB active_file:0kB inactive_file:196kB unevictable:3072kB isolated(anon):0kB isolated(file):0kB mapped:47196kB dirty:12kB writeback:12kB shmem:50928kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 9019392kB writeback_tmp:0kB kernel_stack:5424kB pagetables:52904kB sec_pagetables:25428kB all_unreclaimable? no
Mar 28 23:04:55 pve kernel: Node 0 DMA free:11264kB boost:0kB min:24kB low:36kB high:48kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 28 23:04:55 pve kernel: lowmem_reserve[]: 0 1776 39874 39874 39874
Mar 28 23:04:55 pve kernel: Node 0 DMA32 free:155772kB boost:0kB min:3008kB low:4824kB high:6640kB reserved_highatomic:0KB active_anon:60880kB inactive_anon:1652912kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1944276kB managed:1877488kB mlocked:0kB bounce:0kB free_pcp:1184kB local_pcp:0kB free_cma:0kB
Mar 28 23:04:55 pve kernel: lowmem_reserve[]: 0 0 38097 38097 38097
Mar 28 23:04:55 pve kernel: Node 0 Normal free:40280kB boost:0kB min:64548kB low:103560kB high:142572kB reserved_highatomic:0KB active_anon:3039532kB inactive_anon:17118332kB active_file:684kB inactive_file:268kB unevictable:3072kB writepending:24kB present:39829504kB managed:39020580kB mlocked:0kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
Mar 28 23:04:55 pve kernel: lowmem_reserve[]: 0 0 0 0 0
Mar 28 23:04:55 pve kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Mar 28 23:04:55 pve kernel: Node 0 DMA32: 2*4kB (UM) 2*8kB (UM) 17*16kB (UM) 1*32kB (M) 23*64kB (UM) 35*128kB (UM) 33*256kB (UM) 32*512kB (UM) 29*1024kB (UM) 14*2048kB (U) 16*4096kB (UM) = 155016kB
Mar 28 23:04:55 pve kernel: Node 0 Normal: 2766*4kB (UME) 175*8kB (ME) 73*16kB (ME) 657*32kB (UME) 49*64kB (U) 2*128kB (U) 2*256kB (U) 2*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 39584kB
Mar 28 23:04:55 pve kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 28 23:04:55 pve kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 28 23:04:55 pve kernel: 12792 total pagecache pages
Mar 28 23:04:55 pve kernel: 0 pages in swap cache
Mar 28 23:04:55 pve kernel: Free swap  = 0kB
Mar 28 23:04:55 pve kernel: Total swap = 0kB
Mar 28 23:04:55 pve kernel: 10447443 pages RAM
Mar 28 23:04:55 pve kernel: 0 pages HighMem/MovableOnly
Mar 28 23:04:55 pve kernel: 219086 pages reserved
Mar 28 23:04:55 pve kernel: 0 pages hwpoisoned
Mar 28 23:04:55 pve kernel: Tasks state (memory values in pages):
Mar 28 23:04:55 pve kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar 28 23:04:55 pve kernel: [    546]     0   546    12365      288   106496        0          -250 systemd-journal
Mar 28 23:04:55 pve kernel: [    573]     0   573     6737      472    73728        0         -1000 systemd-udevd
Mar 28 23:04:55 pve kernel: [   1145]   103  1145     1969       96    57344        0             0 rpcbind
Mar 28 23:04:55 pve kernel: [   1154]   101  1154     2307      160    57344        0          -900 dbus-daemon
Mar 28 23:04:55 pve kernel: [   1158]     0  1158    38187      128    61440        0         -1000 lxcfs
Mar 28 23:04:55 pve kernel: [   1161]     0  1161     1766       83    49152        0             0 ksmtuned
Mar 28 23:04:55 pve kernel: [   1164]     0  1164    69539      128    81920        0             0 pve-lxc-syscall
Mar 28 23:04:55 pve kernel: [   1166]     0  1166     3018      512    61440        0             0 smartd
Mar 28 23:04:55 pve kernel: [   1167]     0  1167     4289      288    73728        0             0 systemd-logind
Mar 28 23:04:55 pve kernel: [   1168]     0  1168     1327       64    57344        0             0 qmeventd
Mar 28 23:04:55 pve kernel: [   1169]     0  1169      583       64    40960        0         -1000 watchdog-mux
Mar 28 23:04:55 pve kernel: [   1171]     0  1171    60164      416    98304        0             0 zed
Mar 28 23:04:55 pve kernel: [   1403]     0  1403     1256      128    49152        0             0 lxc-monitord
Mar 28 23:04:55 pve kernel: [   1422]     0  1422     1468       64    49152        0             0 agetty
Mar 28 23:04:55 pve kernel: [   1463]     0  1463     3853      352    77824        0         -1000 sshd
Mar 28 23:04:55 pve kernel: [   1481]   100  1481     4715      170    69632        0             0 chronyd
Mar 28 23:04:55 pve kernel: [   1483]   100  1483     2633      210    65536        0             0 chronyd
Mar 28 23:04:55 pve kernel: [   1605]     0  1605    10663      165    73728        0             0 master
Mar 28 23:04:55 pve kernel: [   1607]   104  1607    10775      224    73728        0             0 qmgr
Mar 28 23:04:55 pve kernel: [   1609]     0  1609   181841      359   172032        0             0 rrdcached
Mar 28 23:04:55 pve kernel: [   1617]     0  1617   205475    14882   405504        0             0 pmxcfs
Mar 28 23:04:55 pve kernel: [   1623]     0  1623     1652       96    53248        0             0 cron
Mar 28 23:04:55 pve kernel: [   1631]     0  1631    39329    24215   303104        0             0 pve-firewall
Mar 28 23:04:55 pve kernel: [   1634]     0  1634    38044    22884   294912        0             0 pvestatd
Mar 28 23:04:55 pve kernel: [   1636]     0  1636      615       32    45056        0             0 bpfilter_umh
Mar 28 23:04:55 pve kernel: [   1660]     0  1660    58325    33546   405504        0             0 pvedaemon
Mar 28 23:04:55 pve kernel: [   1677]     0  1677    54844    27594   356352        0             0 pve-ha-crm
Mar 28 23:04:55 pve kernel: [   1678]    33  1678    58652    33918   458752        0             0 pveproxy
Mar 28 23:04:55 pve kernel: [   1684]    33  1684    20204    12640   192512        0             0 spiceproxy
Mar 28 23:04:55 pve kernel: [   1686]     0  1686    54685    27434   376832        0             0 pve-ha-lrm
Mar 28 23:04:55 pve kernel: [   1805]     0  1805    53630    27734   364544        0             0 pvescheduler
Mar 28 23:04:55 pve kernel: [ 836879]     0 836879  1249927   623936  5791744        0             0 kvm
Mar 28 23:04:55 pve kernel: [ 895097]     0 895097    19796       96    61440        0             0 pvefw-logger
Mar 28 23:04:55 pve kernel: [ 895102]    33 895102    20266    12669   176128        0             0 spiceproxy work
Mar 28 23:04:55 pve kernel: [1128213]     0 1128213  8506345  4932965 41132032        0             0 kvm
Mar 28 23:04:55 pve kernel: [1141159]     0 1141159    60436    33931   421888        0             0 pvedaemon worke
Mar 28 23:04:55 pve kernel: [1142148]     0 1142148    60466    33995   425984        0             0 pvedaemon worke
Mar 28 23:04:55 pve kernel: [1142797]    33 1142797    60883    34467   438272        0             0 pveproxy worker
Mar 28 23:04:55 pve kernel: [1143811]     0 1143811    60436    33867   421888        0             0 pvedaemon worke
Mar 28 23:04:55 pve kernel: [1144290]    33 1144290    60871    34435   438272        0             0 pveproxy worker
Mar 28 23:04:55 pve kernel: [1145678]   104 1145678    10764      224    77824        0             0 pickup
Mar 28 23:04:55 pve kernel: [1146432]    33 1146432    60670    34115   434176        0             0 pveproxy worker
Mar 28 23:04:55 pve kernel: [1150823]     0 1150823     1366       64    45056        0             0 sleep
Mar 28 23:04:55 pve kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=qemu.slice,mems_allowed=0,global_oom,task_memcg=/qemu.slice/100.scope,task=kvm,pid=1128213,uid=0
Mar 28 23:04:55 pve kernel: Out of memory: Killed process 1128213 (kvm) total-vm:34025380kB, anon-rss:19731476kB, file-rss:384kB, shmem-rss:0kB, UID:0 pgtables:40168kB oom_score_adj:0
Mar 28 23:04:55 pve systemd[1]: 100.scope: A process of this unit has been killed by the OOM killer.
Mar 28 23:04:55 pve systemd[1]: 100.scope: Failed with result 'oom-kill'.
Mar 28 23:04:55 pve systemd[1]: 100.scope: Consumed 52min 42.828s CPU time.
Mar 28 23:04:55 pve kernel: oom_reaper: reaped process 1128213 (kvm), now anon-rss:0kB, file-rss:256kB, shmem-rss:0kB
Mar 28 23:05:20 pve kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Mar 28 23:05:20 pve kernel:  zd16: p1 p2 p3
Mar 28 23:05:20 pve kernel: tap100i0 (unregistering): left allmulticast mode
Mar 28 23:05:20 pve kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Mar 28 23:05:23 pve kernel:  zd0: p1 p2
Mar 28 23:05:23 pve qmeventd[1151017]: Starting cleanup for 100
Mar 28 23:05:23 pve kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Mar 28 23:05:23 pve kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Mar 28 23:05:23 pve kernel: fwln100i0 (unregistering): left allmulticast mode
Mar 28 23:05:23 pve kernel: fwln100i0 (unregistering): left promiscuous mode
Mar 28 23:05:23 pve kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Mar 28 23:05:23 pve kernel: fwpr100p0 (unregistering): left allmulticast mode
Mar 28 23:05:23 pve kernel: fwpr100p0 (unregistering): left promiscuous mode
Mar 28 23:05:23 pve kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Mar 28 23:05:23 pve qmeventd[1151017]: Finished cleanup for 100
Mar 28 23:17:02 pve CRON[1153171]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mar 28 23:17:02 pve CRON[1153172]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 28 23:17:02 pve CRON[1153171]: pam_unix(cron:session): session closed for user root
Mar 28 23:32:54 pve smartd[1166]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 60
Mar 28 23:32:54 pve smartd[1166]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 40
Mar 29 00:00:15 pve systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Mar 29 00:00:15 pve systemd[1]: Starting logrotate.service - Rotate log files...
Mar 29 00:00:16 pve systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Mar 29 00:00:16 pve systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Mar 29 00:00:16 pve systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Mar 29 00:00:17 pve pveproxy[1160946]: send HUP to 1678
Mar 29 00:00:17 pve pveproxy[1678]: received signal HUP
Mar 29 00:00:17 pve pveproxy[1678]: server closing
Mar 29 00:00:17 pve pveproxy[1678]: server shutdown (restart)
Mar 29 00:00:17 pve systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Mar 29 00:00:17 pve systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Mar 29 00:00:18 pve spiceproxy[1160950]: send HUP to 1684
Mar 29 00:00:18 pve spiceproxy[1684]: received signal HUP
Mar 29 00:00:18 pve spiceproxy[1684]: server closing
Mar 29 00:00:18 pve spiceproxy[1684]: server shutdown (restart)
Mar 29 00:00:18 pve systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Mar 29 00:00:18 pve systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Mar 29 00:00:18 pve pvefw-logger[895097]: received terminate request (signal)
Mar 29 00:00:18 pve pvefw-logger[895097]: stopping pvefw logger
Mar 29 00:00:18 pve spiceproxy[1684]: restarting server
Mar 29 00:00:18 pve spiceproxy[1684]: starting 1 worker(s)
Mar 29 00:00:18 pve spiceproxy[1684]: worker 1160959 started
Mar 29 00:00:18 pve systemd[1]: pvefw-logger.service: Deactivated successfully.
Mar 29 00:00:18 pve systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Mar 29 00:00:18 pve systemd[1]: pvefw-logger.service: Consumed 4.588s CPU time.
Mar 29 00:00:18 pve systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Mar 29 00:00:18 pve pvefw-logger[1160961]: starting pvefw logger
 
Hi,


Mar 28 23:04:52 pve kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
As you said, the TrueNAS VM might be allocated too much RAM close to the host's total capacity. It's crucial to ensure that there's enough free memory on the host to accommodate all running VMs' needs.

Could you also share a screenshot of `htop` from the Proxmox VE side? Maybe the issue is related to the ZFS ARC.
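
If a screenshot is hard to catch, you can also pull the same numbers from the Proxmox shell. A rough check (assuming the TrueNAS VM is VMID 100, as in the log above) could look like this:

Code:
# How much RAM the VM is allowed to use (VMID 100 taken from the OOM log)
qm config 100 | grep -Ei 'memory|balloon'

# Overall memory situation on the host
free -h

# Current and maximum ZFS ARC size on the host
awk '/^size|^c_max/ {printf "%s: %.1f GiB\n", $1, $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats

arc_summary (from zfsutils-linux) shows the same ARC information in a friendlier format. If the VM's RAM plus the ARC maximum plus the host's own overhead come close to 48GB, the OOM kill is expected behaviour rather than a leak.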
 
If you use "stacked ZFS", I recommend disabling caching for one of the ZFS datasets/pools depending on the level you want, e.g. set primarycache=metadata on the zvol that is passed to the TrueNAS VM. You will not want double caching if you'r already in an OOM situation.
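
For illustration, something along these lines on the Proxmox host; the zvol name used below is only a placeholder, so list the volumes first and substitute the real one:

Code:
# List the zvols backing the VM disks (exact names depend on your pool layout)
zfs list -t volume

# Check the current cache settings on the zvol passed to TrueNAS (placeholder name)
zfs get primarycache,secondarycache rpool/data/vm-100-disk-1

# Cache only metadata on the host side and leave data caching to TrueNAS inside the guest
zfs set primarycache=metadata rpool/data/vm-100-disk-1

The change takes effect immediately and can be reverted with zfs set primarycache=all if it hurts performance.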
 
It's hard to catch an `htop` screenshot at just the right point in time, as I don't know when the next crash will happen.
 
