Hi,
Recently I set up a monitoring stack (Prometheus, Grafana) in my virtualized K8s cluster and deployed node exporters on many VMs/nodes, including my PVE and PBS nodes. Not long after, I started getting alerts about major page faults (hundreds within half a minute, sometimes every half hour, sometimes every couple of hours) on one of my PVE nodes and on my primary PBS node. On closer inspection, they occur throughout the day, but they consistently peak (about 1000 major page faults in ~15 minutes) during scheduled backups to that PBS node (pool-based backup, about 10 VMs/LXCs, snapshot mode).
Here are links to snapshots for those nodes in Grafana during one of those backups:
- PVE node
- PBS node
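In case anyone wants to cross-check the exporter's numbers directly on a host, here is a minimal sketch of how the same counter could be sampled by hand. It assumes a Linux host where `/proc/vmstat` contains a cumulative `pgmajfault` line (which, as far as I understand, is what node exporter reports as `node_vmstat_pgmajfault`). The function names and the 30-second interval are my own choices for illustration, not anything taken from PVE or PBS.

```python
import time


def read_pgmajfault(vmstat_text: str) -> int:
    """Return the cumulative major page fault count from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "pgmajfault":
            return int(parts[1])
    raise ValueError("pgmajfault counter not found")


def fault_rate(before: int, after: int, seconds: float) -> float:
    """Return major page faults per second between two counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (after - before) / seconds


def sample(path: str = "/proc/vmstat") -> int:
    """Read the current counter value from the running kernel."""
    with open(path) as f:
        return read_pgmajfault(f.read())


if __name__ == "__main__":
    interval = 30  # seconds; hypothetical value, roughly matching the alert window
    first = sample()
    time.sleep(interval)
    second = sample()
    rate = fault_rate(first, second, interval)
    print(f"{second - first} major page faults in {interval}s ({rate:.2f}/s)")
```

Running this on the PVE or PBS host while a backup job is active should show whether the counter really climbs at the rate Grafana reports.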
Software versions:
- PVE 8.1.10 (zfs-2.2.3-pve1)
- PBS 3.1-4 (zfs-2.2.2-pve1)
Regarding hardware, the PVE node is built from brand-new PC components on a Ryzen 7000 platform, notably without ECC RAM, while the PBS node is a very old laptop (also without ECC RAM).
I'm not sure exactly where I should look past this point to figure out the cause/solution, so I'd appreciate your help.