Alright, to clarify: I can totally appreciate the scepticism around what I'm saying here, and I agree it doesn't match expectations. I run a cluster where two of the three nodes exhibit this behaviour and one doesn't. All three nodes run backup tasks daily, yet only two show the problem, and I can't yet explain why, which is part of why I'm posting here in the first place.
As for monitoring insights, here is the biggest node that is exhibiting the issue:
https://imgur.com/a/wQSOU7I
I want to add that all the VMs on that host have a static amount of RAM assigned; there is no ballooning going on. I have not modified the Proxmox install to run anything beyond snmpd on each node (for monitoring, obviously), and snmpd also runs on the node that does not exhibit this behaviour, so that isn't the difference.
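(If anyone wants to double-check the ballooning claim, this is easy to verify on the host with the standard qm tooling; a rough sketch, assuming all guests on the node are QEMU VMs:

for id in $(qm list | awk 'NR>1 {print $1}'); do
    echo "== VM $id =="
    qm config "$id" | grep -E '^(memory|balloon):'   # configured RAM and balloon setting per VM
done
)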
The graph shows total RAM usage; I was unable to separate out caching. Swapping naturally starts once RAM is full, so I didn't include the swap graph (I only have swap usage in bytes, not swap in/out rates). It would be redundant anyway, since we can safely assume swapping occurs whenever RAM sits at 100% for an extended period, which the image shows lasting several days.
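(For anyone curious, the split between application memory, cache, and free can be checked directly on the host, even though my SNMP graph doesn't break it out; something like:

free -h
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo   # raw kernel counters
)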
As you can see, each day the backup runs, RAM usage jumps up by a good chunk, and none of it is freed by the next day. So the next backup task runs and usage jumps again, with nothing ever being released.
The big drop on the right side of the image is from me manually telling the system to flush caches and swap. I have never observed either being released automatically.
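(For anyone unfamiliar, the manual flush I'm referring to is roughly the standard sequence below, run as root; I'm not presenting it as a fix, which is exactly the problem:

sync                                  # write dirty pages out first
echo 3 > /proc/sys/vm/drop_caches     # drop page cache, dentries and inodes
swapoff -a && swapon -a               # force swapped pages back into RAM, then re-enable swap
)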
As I mentioned earlier, up until now the only reliable workaround I have found is to periodically flush the cache and swap manually. However, that method draws a lot of criticism online, and I've been looking for a better approach. I'd rather fix the root cause, but the backups appear to be that cause, and I'm not sure how to adjust them to avoid this behaviour.
Right now that cluster node runs 12 VMs, with ~52GB of RAM allocated to them out of the 128GB installed in the physical host.
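(The ~52GB figure is just the sum of the configured memory across those VMs; it can be tallied on the host with something like the line below, assuming the usual qm list column layout where the fourth column is memory in MB:

qm list | awk 'NR>1 {sum += $4} END {printf "%.1f GB allocated\n", sum/1024}'
)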
Also, ignore the date filter at the top of the image; it didn't adjust as I zoomed in, so it isn't accurate. The days along the bottom of the image are correct.