Proxmox reboots after "Purging GPU memory"

OK, so I didn't have a chance to create that file, but I have disabled swap on this node. My VMs use 30 GB; the host has 64 GB. Let's see how it goes.
 
Please let us know how it goes. I have not yet updated to the latest software.
 
Today I experienced a crash after "Purging GPU memory" messages, on a Lenovo M720q USFF.


Code:
2025-05-17T06:17:09.661788+02:00 smol kernel: [4893793.316289] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
2025-05-17T06:17:09.661789+02:00 smol kernel: [4893793.316916] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
2025-05-17T06:17:09.661789+02:00 smol kernel: [4893793.318953] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
2025-05-17T06:17:09.661789+02:00 smol kernel: [4893793.319016] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
2025-05-17T06:17:09.661791+02:00 smol kernel: [4893793.322473] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
2025-05-17T06:17:09.661792+02:00 smol kernel: [4893793.322531] Purging GPU memory, 0 pages freed, 0 pages still pinned, 1 pages left available.
[... long run of NUL bytes (^@) here, written to the log during the unclean shutdown, truncated ...]
2025-05-17T06:18:51.835180+02:00 smol systemd-modules-load[359]: Inserted module 'vhost_net'
2025-05-17T06:18:51.835273+02:00 smol dmeventd[370]: dmeventd ready for processing.
2025-05-17T06:18:51.835284+02:00 smol dmeventd[370]: Monitoring thin pool pve-data-tpool.
2025-05-17T06:18:51.835287+02:00 smol systemd[1]: Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
2025-05-17T06:18:51.835291+02:00 smol systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
2025-05-17T06:18:51.835294+02:00 smol systemd[1]: Starting systemd-udevd.service - Rule-based Manager for Device Events and Files...
2025-05-17T06:18:51.835298+02:00 smol systemd[1]: Finished lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
2025-05-17T06:18:51.836188+02:00 smol kernel: [    0.000000] Linux version 6.8.12-10-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) ()
2025-05-17T06:18:51.836194+02:00 smol kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet
2025-05-17T06:18:51.836194+02:00 smol kernel: [    0.000000] KERNEL supported cpus:
2025-05-17T06:18:51.836195+02:00 smol kernel: [    0.000000]   Intel GenuineIntel
2025-05-17T06:18:51.836195+02:00 smol kernel: [    0.000000]   AMD AuthenticAMD


Code:
HOST: smol | KERNEL: 6.8.12-10-pve
CPU: Intel(R) Core(TM) i5-8500T CPU @ 2.10GHz
BOARD: LENOVO 3136
BIOS: M1UKT47A (08/14/2019)
GPU: controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]
RAM: 31Gi total
CMDLINE: BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet

I'll try to see if there's a BIOS update.
 
This just indicates your system is out of memory. The GPU is mentioned because you have an iGPU that shares memory with main RAM, so the kernel tries to purge that before killing other jobs (eventually it takes down your entire system). For example, if you have a 32 GB system with 2 GB assigned to GPU memory, you really have only 30 GB of memory available. Check that you don't overcommit, and leave sufficient memory for disk caches (e.g., if you use ZFS, allocate at least 2-4 GB per TB).
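If you want to check how much headroom a host actually has, comparing total RAM, the current ARC size and ceiling, and the memory assigned to VMs usually tells the story. This is a minimal sketch assuming the standard OpenZFS paths and plain "memory: <MiB>" lines in the Proxmox VM configs (it ignores ballooning minimums, so treat the sum as a rough upper bound):

Bash:
# Total and used physical memory
free -h

# Current ZFS ARC size and the configured ceiling (0 = built-in default)
awk '/^size / {printf "ARC size: %.2f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats
cat /sys/module/zfs/parameters/zfs_arc_max

# Rough sum of RAM assigned to all VMs, in MiB (ignores ballooning)
grep -h '^memory:' /etc/pve/qemu-server/*.conf | awk '{sum+=$2} END {print sum " MiB assigned to VMs"}'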
 
Hey,
I have the problem again with kernel Linux 6.8.12-11-pve (2025-05-22T09:39Z) on a 64 GB machine with 4 VMs that together use 32 GB of RAM. With ballooning it would never use the full RAM.
There are also 2 TB of SSD in ZFS.
Tonight I got the error and a reboot.

Any idea what I can do additionally? Maybe the ZFS cache caused the problem? The RAM was full at the time.

Thanks in advance.
 
Hello,
As I understand it, ZFS should always free up unused cache in favor of programs, VMs, and basically everything else. But for some reason it is not working like that on my server:


Bash:
root@tiny:~# awk '/^c / {printf "ARC target size: %.2f GiB\n", $3/1024/1024/1024}
/^size / {printf "ARC current size: %.2f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats
ARC target size: 13.28 GiB
ARC current size: 13.29 GiB
root@tiny:~# free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        29Gi       708Mi        39Mi       1.5Gi       1.7Gi
Swap:          7.6Gi       7.6Gi        48Ki
root@tiny:~# cat /proc/meminfo | grep -E '^(MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree)'
MemTotal:       32703804 kB
MemFree:          737020 kB
Buffers:          532144 kB
Cached:           281544 kB
SwapTotal:       8003580 kB
SwapFree:             52 kB
root@tiny:~# swapon --show
NAME      TYPE      SIZE USED PRIO
/dev/dm-0 partition 7.6G 7.6G   -2
root@tiny:~# zfs list -t volume
NAME                                           USED  AVAIL  REFER  MOUNTPOINT
SSDVMSTORE/vm-102-disk-0                      2.03T   819G  1.98T  -
SSDVMSTORE/vm-102-disk-1                      5.12G   763G  5.12G  -
SSDVMSTORE/vm-109-disk-0                       219G   763G   217G  -
SSDVMSTORE/vm-109-state-before-DB-change      4.65G   763G  4.65G  -
SSDVMSTORE/vm-117-disk-0                      8.61G   763G  8.61G  -
SSDVMSTORE/vm-117-disk-1                       274G   763G   274G  -
SSDVMSTORE/vm-119-disk-0                       215G   763G   168G  -
SSDVMSTORE/vm-119-state-after_cloudflare      2.17G   763G  2.17G  -
SSDVMSTORE/vm-119-state-before-upgrade-nginx  4.52G   763G  4.52G  -
SSDVMSTORE/vm-119-state-before_geoip          2.17G   763G  2.17G  -
SSDVMSTORE/vm-119-state-before_nginx_upgrade  4.29G   763G  4.29G  -
SSDVMSTORE/vm-119-state-fresh                 4.17G   763G  4.17G  -
SSDVMSTORE/vm-119-state-live_now              2.12G   763G  2.12G  -
SSDVMSTORE/vm-119-state-snapshot              2.16G   763G  2.16G  -
SSDVMSTORE/vm-119-state-yes                   5.67G   763G  5.67G  -
root@tiny:~#
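One thing worth checking in a situation like this is the ARC floor: the ARC will not shrink below zfs_arc_min regardless of memory pressure, and reclaim can lose the race against the OOM killer. A minimal sketch for comparing the configured limits against what the ARC is actually doing (these are the standard OpenZFS module parameters; 0 means the compiled-in default):

Bash:
# Configured ARC floor and ceiling in bytes (0 = built-in default)
cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max

# Live ARC target (c), hard limits (c_min/c_max), and current size, in bytes
grep -E '^(c|c_min|c_max|size) ' /proc/spl/kstat/zfs/arcstats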
 
@backpulver
Check the ARC cache with arc_summary and look at the line "ARC size (current):".

On older installs the ARC cache limit was set to 50% of RAM; on new installs it is set to 10%.

@tunhube I'm not sure the ARC cache is freed up at all; from what I've seen on this forum, your best bet is to limit the ARC cache or add more RAM. Maybe freeing the ARC cache is too slow, so the OOM killer steps in first and kills some VM to prevent an OOM event.
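For reference, the usual way to cap the ARC persistently is via a module option; a minimal sketch that caps it at 8 GiB (the value is only an example, size it to your workload):

Code:
# /etc/modprobe.d/zfs.conf -- cap the ARC at 8 GiB (8 * 1024^3 bytes)
options zfs zfs_arc_max=8589934592

Afterwards run update-initramfs -u -k all and reboot so the option is applied early in boot.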
 

Here are the values from one node that is starting to spam the error in the system log right now:
32 GB RAM installed

Code:
ARC size (current):                                    31.3 %    4.6 GiB
        Target size (adaptive):                        32.3 %    4.7 GiB
        Min size (hard limit):                          6.2 %  932.6 MiB
        Max size (high water):                           16:1   14.6 GiB
        Anonymous data size:                            0.1 %    3.5 MiB
        Anonymous metadata size:                      < 0.1 %  780.0 KiB
        MFU data target:                               32.2 %    1.4 GiB
        MFU data size:                                 34.5 %    1.5 GiB
        MFU ghost data size:                                   964.3 MiB
        MFU metadata target:                           19.8 %  896.8 MiB
        MFU metadata size:                              5.0 %  227.5 MiB
        MFU ghost metadata size:                               146.7 MiB
        MRU data target:                               29.8 %    1.3 GiB
        MRU data size:                                 45.6 %    2.0 GiB
        MRU ghost data size:                                   972.4 MiB
        MRU metadata target:                           18.2 %  824.0 MiB
        MRU metadata size:                             14.8 %  669.7 MiB
        MRU ghost metadata size:                               431.5 MiB
        Uncached data size:                             0.0 %    0 Bytes
        Uncached metadata size:                         0.0 %    0 Bytes
        Bonus size:                                   < 0.1 %   97.8 KiB
        Dnode cache target:                            10.0 %    1.5 GiB
        Dnode cache size:                               0.2 %    2.4 MiB
        Dbuf size:                                      0.1 %    3.1 MiB
        Header size:                                    2.9 %  136.0 MiB
        L2 header size:                                 0.0 %    0 Bytes
        ABD chunk waste size:                         < 0.1 %   12.5 KiB

I rebooted the non-important VMs and the error was gone.

Here are the values of the machine that rebooted tonight:
64 GB RAM installed

Code:
ARC size (current):                                    23.6 %    7.4 GiB
        Target size (adaptive):                        23.9 %    7.5 GiB
        Min size (hard limit):                          6.2 %    2.0 GiB
        Max size (high water):                           16:1   31.3 GiB
        Anonymous data size:                          < 0.1 %    1.7 MiB
        Anonymous metadata size:                      < 0.1 %    1.0 MiB
        MFU data target:                               37.5 %    2.7 GiB
        MFU data size:                                  8.4 %  617.9 MiB
        MFU ghost data size:                                     0 Bytes
        MFU metadata target:                           12.5 %  924.2 MiB
        MFU metadata size:                              1.6 %  117.9 MiB
        MFU ghost metadata size:                                 0 Bytes
        MRU data target:                               37.5 %    2.7 GiB
        MRU data size:                                 87.6 %    6.3 GiB
        MRU ghost data size:                                     0 Bytes
        MRU metadata target:                           12.5 %  924.2 MiB
        MRU metadata size:                              2.4 %  179.3 MiB
        MRU ghost metadata size:                                 0 Bytes
        Uncached data size:                             0.0 %    0 Bytes
        Uncached metadata size:                         0.0 %    0 Bytes
        Bonus size:                                   < 0.1 %   75.3 KiB
        Dnode cache target:                            10.0 %    3.1 GiB
        Dnode cache size:                               0.1 %    2.2 MiB
        Dbuf size:                                      0.1 %    4.9 MiB
        Header size:                                    2.2 %  166.3 MiB
        L2 header size:                                 0.0 %    0 Bytes
        ABD chunk waste size:                         < 0.1 %   13.0 KiB
 
The first system could use up to 16 GB (~50%) of RAM for the ARC cache, and the second system up to 32 GB (~50%); utilization is currently at 31% and 23%.
That could become a problem once utilization reaches the maximum, so you should really limit the ZFS ARC cache.
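If you want to try a lower cap without rebooting, the limit can also be changed at runtime; a sketch with 8 GiB as an example value (a runtime change does not survive a reboot, so also set it in /etc/modprobe.d/zfs.conf as described above):

Code:
# Apply an 8 GiB ARC ceiling immediately; the ARC shrinks gradually if it is currently above it
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max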
 
Thank you for the link and explanation. I hope this solves the problem.