Having a major problem and need some help.
Clean install, no outstanding updates, with the no-subscription repositories configured. At this point no VMs or containers are defined; I'm running the task on the host.

Server: i7-12700, 128 GiB RAM. OS on a 1.863 TiB ZFS RAID-1 pool (Samsung 990 PRO). Second pool: 2 x (4 x 12.73 TiB RAID-Z2, Seagate IronWolf) plus a 250 GB NVMe RAID-1 special mirror (Samsung 970 EVO Plus, no small blocks).

I copied approx 20 TiB via rsync over SSH from an identically configured machine with no problem (that machine differs only in that its OS is on the 250 GB drives, its special mirror is on the 1.863 TiB drives with a 16K small-block limit, and it runs Proxmox 9.0.6).

The problem: copying (cp -r) a 309 GiB directory between two datasets in the second pool. The task runs for about 5-10 seconds and then the server starts streaming OOM kills on multiple processes, eventually killing the web UI and SSH sessions. I've tried the copy logged in directly on the console (without logging into the UI), via the web UI, and over an SSH session, and the result is the same. The only other user processes running were htop in a second SSH session and/or a while-true loop logging arcstats (run on console tty2).

Things I've tried so far:
1. Set zfs_arc_min/zfs_arc_max to 2/16 GiB (see the sketch after this list); the max is ignored.
2. No ZFS options set at all: zfs_arc_min/zfs_arc_max both report zero, htop shows the max as 125 GiB, and /proc/spl/kstat/zfs/arcstats shows min/max as 3.92/124.55 GiB.
3. Set /sys/module/zfs/parameters/zfs_arc_evict_batch_limit to 16384; no effect on the outcome.
4. Ran memtest-pro (single pass) on the RAM: passed.
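For reference, (1) was done along these lines (a sketch from memory, assuming the standard modprobe.d route; the byte values correspond to 2 GiB and 16 GiB):

```bash
# /etc/modprobe.d/zfs.conf -- ARC limits in bytes (2 GiB min, 16 GiB max)
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_max=17179869184
```

followed by update-initramfs -u -k all and a reboot. The runtime equivalents (used for the evict-batch-limit test in (3)) were:

```bash
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
echo 16384 > /sys/module/zfs/parameters/zfs_arc_evict_batch_limit
```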
Running a while-true loop that logs /proc/spl/kstat/zfs/arcstats every 0.1 seconds during a test shows a maximum active ARC size of approx 65.2 GiB.
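The logging loop was essentially this (a sketch; the log path and the fields grepped out are illustrative):

```bash
# Log ARC size/target stats 10x per second with a timestamp
while true; do
    date +%s.%N >> /root/arcstats.log
    grep -E '^(size|c|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats >> /root/arcstats.log
    sleep 0.1
done
```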
The behavior is 100% consistent: the reported ARC size hits 60-65 GiB and the OOM kills start.
For context: I was running Proxmox 8.3 with only one active VM (running Borg backup) on machine A. I installed 9.0.6 on machine B after running a full surface write (fio) and an extended SMART test on the rust drives, with no errors reported, then copied the data from machine A to B. I then decided the OS was better on the larger NVMe drives, so I ran the same drive tests on machine A (one drive reported 8 bad-block replacements before starting, but again no errors were reported), installed Proxmox 9 on machine A (which after updates reported 9.0.10), and copied the data back using 8 concurrent rsync-over-SSH threads. No OOM kill errors were noted during that copy, but I can't be totally sure whether machine A was on 9.0.8 or 9.0.10 during the rsync copies.
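The rsync copies were along these lines (a sketch only; the hostname, paths, and exact flags here are placeholders, with one rsync per top-level directory to get the 8 concurrent threads):

```bash
# One of 8 concurrent rsync-over-SSH transfers, each handling a
# different top-level directory (hostname and paths are placeholders)
rsync -aHAX --info=progress2 root@machine-a:/tank/data/dir1/ /tank/data/dir1/
```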
I've not tried the directory copy on 9.0.6, as that machine holds the only good copy of the data.
I've captured the dmesg log (attached). Other than the zfs_arc_evict_batch_limit setting, I've not found anything useful via forum search, direct web search, or ChatGPT. Basically I have no idea what to do next or what other diagnostic info would help resolve the issue. As it stands, Proxmox 9.0.10 is effectively unusable for me.
Any help appreciated.