The Kernel Crash: A Bug in ZFS

ZSasha

Member
Mar 1, 2023
Hi everyone,

I just wanted to see if anyone has come across this situation, and what I should do.
I have Proxmox VE 8.0 installed on my Intel NUC i3-1220P with 64 GB of RAM, with around 4 VMs and 5 CTs running.
It is (was?) part of a cluster and everything seemed to be just fine, until...

During the last week it crashed twice, and both times it was the ZFS component that caused it.
I am not a Linux guru, so analysing crash dumps is not something I can do, but I copy-pasted the dmesg output into Google and it told me:

Short version: your kernel crashed inside the ZFS module while a write-back worker was flushing ZFS data. It faulted on an invalid pointer in the ARC eviction path, then the worker thread died noisily. You were also under heavy memory pressure (swap basically exhausted), which likely helped trigger the ARC eviction path that crashed.

I attached the relevant piece of dmesg output if someone wants to have a look.
The crash also corrupted a replicated image. Luckily, I have backups.

I will keep the server in its current state in case you want me to do some debugging/troubleshooting steps.

Cheers.
 


You are running out of RAM
Code:
Out of memory: Killed process 1925 (kvm)
Maybe you should reduce your VMs' memory and check / limit the ZFS ARC cache:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_limit_memory_usage
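
For illustration, here is a minimal sketch of how the byte values for such a limit could be worked out. The helper name `zfs_conf_lines` and the 8 GiB figure are made up for this example and are not a recommendation for this particular host; `zfs_arc_max` and `zfs_arc_min` are the ZFS module parameters, and both are given in bytes.

```python
GIB = 1024 ** 3  # zfs_arc_max and zfs_arc_min are specified in bytes


def zfs_conf_lines(max_gib, min_gib=None):
    """Build 'options zfs ...' lines for /etc/modprobe.d/zfs.conf.

    max_gib / min_gib are example limits in GiB (hypothetical helper).
    """
    lines = []
    if min_gib is not None:
        # The minimum has to stay below the maximum for the limit to apply.
        if min_gib >= max_gib:
            raise ValueError("zfs_arc_min must be lower than zfs_arc_max")
        lines.append(f"options zfs zfs_arc_min={int(min_gib * GIB)}")
    lines.append(f"options zfs zfs_arc_max={int(max_gib * GIB)}")
    return lines


if __name__ == "__main__":
    # 8 GiB is an example figure only.
    print("\n".join(zfs_conf_lines(8)))
```

The printed line would go into /etc/modprobe.d/zfs.conf. The same byte value can also be written to /sys/module/zfs/parameters/zfs_arc_max to change the limit at runtime. If the root file system is on ZFS, the linked guide says to refresh the initramfs (update-initramfs -u -k all) and reboot for the permanent setting to take effect.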
Thank you MarkusKo.
Since it was talking about cache: I always thought that a cache just uses whatever RAM it can get, but that it would be the first thing to get kicked out of RAM if the memory is needed for other things.
Looks like I was wrong, and the system prefers to serve a "cache", even at the cost of killing actual apps.
 
Looks like I was wrong, and the system prefers to serve a "cache", even at the cost of killing actual apps.

Well..., ZFS ARC does shrink if required. But it does so slowly. (Edit: ...and if it is allowed to by "zfs_arc_min" being lower than _max)

Too slowly to keep up if one VM (or any other process) requests too much memory at once.

RAM is the one resource you cannot drastically over-commit without being punished, sooner or later.
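
To see how much room the ARC currently has to shrink, one option is to compare its current size with its configured bounds. Below is a rough sketch that reads /proc/spl/kstat/zfs/arcstats, assuming the usual three-column kstat layout (name, type, data) and the `size`, `c_min` and `c_max` fields; the function names are invented for this example.

```python
def parse_arcstats(text):
    """Parse kstat text (name / type / data columns) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three columns with a numeric value;
        # this skips the two header lines at the top of the file.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_summary(stats):
    """Return current ARC size and its min/max targets in GiB."""
    gib = 1024 ** 3
    keys = ("size", "c_min", "c_max")
    return {k: round(stats[k] / gib, 2) for k in keys if k in stats}


if __name__ == "__main__":
    try:
        with open("/proc/spl/kstat/zfs/arcstats") as f:
            print(arc_summary(parse_arcstats(f.read())))
    except FileNotFoundError:
        print("arcstats not found; is the zfs module loaded?")
```

If `size` sits close to `c_max` while the VMs are also near their configured memory, there is little headroom left for a sudden allocation.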
 