High number of major page faults on PVE and PBS nodes running ZFS

Zeash · May 7, 2024

Hi,

Recently I set up a monitoring stack (Prometheus, Grafana) in my virtualized K8s cluster and installed node exporters on many VMs/nodes, including my PVE and PBS nodes. Not long after, I started getting alerts about major page faults (hundreds over half a minute, sometimes every half hour, sometimes every couple of hours) on one of my PVE nodes and on my primary PBS node. On closer inspection, they occur throughout the day, but they consistently peak (about 1000 major page faults in ~15 minutes) during scheduled backups to the aforementioned PBS node (pool-based backup, about 10 VMs/LXCs, snapshot mode).

Here are links to snapshots for those nodes in Grafana during one of those backups:
- PVE node
- PBS node

Software versions:
- PVE 8.1.10 (zfs-2.2.3-pve1)
- PBS 3.1-4 (zfs-2.2.2-pve1)

Regarding hardware, the PVE node is built from brand new PC components on a Ryzen 7000 platform, notably without ECC RAM, while the PBS node is a very old laptop (also without ECC RAM).

I'm not sure exactly where I should look past this point to figure out the cause/solution, so I'd appreciate your help.

BobhWasatch · May 7, 2024

Page faults aren't an error; they are part of the normal operation of the Linux kernel.

Minor page faults mean that the page is present in RAM but not mapped into the current process. Things like shared library loading cause this when the library is already resident but not yet mapped into this particular process. Major faults mean the page is not in RAM and has to be read from disk: from the swap file, or from an executable file in the case of code pages. Starting a new process will generally cause some number of both kinds of page fault.
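If you want to see both counters outside of Grafana, here is a minimal sketch that splits the kernel's cumulative totals into minor and major faults. It assumes the `pgfault` line in `/proc/vmstat` counts all faults and `pgmajfault` counts only the major ones, so the minor count is the difference. The function names are made up for illustration.

```python
def parse_fault_counters(vmstat_text):
    """Return (minor, major) fault counts from /proc/vmstat-style text.

    Assumption: 'pgfault' is the total of all page faults and
    'pgmajfault' is the subset that required disk I/O.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    total = counters["pgfault"]
    major = counters["pgmajfault"]
    return total - major, major


def read_fault_counters(path="/proc/vmstat"):
    """Read the live counters (cumulative since boot) on a Linux host."""
    with open(path) as handle:
        return parse_fault_counters(handle.read())
```

Calling `read_fault_counters()` twice, some seconds apart, and subtracting the results gives the number of faults in that interval, since the values only ever count up from boot.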

A lot of major faults (ETA: absent any activity like starting processes) are indicative of RAM pressure. It is not too surprising that backups tend to cause them, because the OS needs RAM for buffering at that time and the backup also rapidly spins up new processes.
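One way to check for RAM pressure directly is the kernel's pressure stall information. This is only a sketch, and it assumes your kernel exposes `/proc/pressure/memory` in the usual `some ...` / `full ...` two-line format. The function name is hypothetical.

```python
def parse_memory_pressure(psi_text):
    """Parse /proc/pressure/memory-style text into nested dicts.

    Each line looks like:
      some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    The avg* values are percentages of time stalled on memory;
    total is cumulative stall time in microseconds.
    """
    result = {}
    for line in psi_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        values = {}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            values[key] = float(value)
        result[fields[0]] = values
    return result
```

Feed it the contents of `/proc/pressure/memory` during a backup. If the `avg10` and `avg60` numbers stay near zero while the major faults climb, the faults are probably not a sign of the box struggling for memory.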

1000 in 15 minutes is really not all that many. If there isn't any noticeable performance impact I wouldn't worry.
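To put that number in perspective, here is the arithmetic as a tiny helper (the function name is just for illustration):

```python
def faults_per_second(fault_count, minutes):
    """Average fault rate over a window given in minutes."""
    return fault_count / (minutes * 60)


# The peak reported above: about 1000 major faults in ~15 minutes.
peak_rate = faults_per_second(1000, 15)
print(f"{peak_rate:.2f} major faults per second")
```

That works out to a little over one major fault per second on average during the backup window.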
