SSD wearout and rrdcached/pmxcfs commit interval

wildstray

Hello, a few weeks ago I installed Proxmox VE 7.3 (just updated to 7.4) on an HP ProDesk 600 G3 mini PC for my homelab (I'm testing PVE because in the near future I intend to propose it to a business I collaborate with, obviously on serious hardware).
For the homelab I installed Debian 11 and then PVE on top. The mini PC has an SSD (HP branded; I don't know whether it is enterprise or consumer grade, and I fear the latter...) with root and swap (two plain GPT partitions) plus LVM with some LVs that I use for backups, ISOs and other things, and an NVMe with LVM-thin for VMs and LXC containers.
Now, I have reduced the OS's disk writes to as few as possible. The last two processes that still write constantly to disk (the root partition) are:
  • rrdcached: updates to the RRDs in /var/lib/rrdcached/db
  • pmxcfs: writes to /var/lib/pve-cluster/config.db (SQLite 3 with WAL)
Is there some way to avoid commits every few seconds?

I noticed that /etc/pve is somehow generated by pmxcfs from that config.db, from a single table. I'm just curious about the internals, and I also have a question: I noticed that every single thread of pmxcfs has config.db open... so does every thread (the main process, cfs_loop (???), and one per VM or container?) commit to config.db every few seconds?

Thank you in advance... for a homelab it would be too expensive to replace the SSD every few months or years...

Code:
pmxcfs 817 root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 818 cfs_loop root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 818 cfs_loop root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 818 cfs_loop root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 818 cfs_loop root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1359 server root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1359 server root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 1359 server root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 1359 server root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1360 pmxcfs root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1360 pmxcfs root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 1360 pmxcfs root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 1360 pmxcfs root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1361 pmxcfs root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1361 pmxcfs root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 1361 pmxcfs root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 1361 pmxcfs root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1428 pmxcfs root mem REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
pmxcfs 817 1428 pmxcfs root 4u REG 253,0 36864 1836156 /var/lib/pve-cluster/config.db
pmxcfs 817 1428 pmxcfs root 5u REG 253,0 4124152 1833414 /var/lib/pve-cluster/config.db-wal
pmxcfs 817 1428 pmxcfs root 6u REG 253,0 32768 1833422 /var/lib/pve-cluster/config.db-shm
 
Your observations are good, better than most in this recurring topic:
https://forum.proxmox.com/threads/pmxcfs-writing-to-disk-all-the-time.35828/

Is there some way to avoid commits every few seconds?

At some cost to durability and a higher risk of corruption (this could be mitigated with frequent dumps), you would need e.g. to set PRAGMA journal_mode = MEMORY in pmxcfs/database.c:
https://github.com/proxmox/pve-clus...4f9a41fa4322f6ba61/src/pmxcfs/database.c#L115
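To make that concrete, here is a minimal sketch of the kind of change meant. This is not the actual pmxcfs source; the function name open_config_db and the error handling are invented for illustration, the only assumption being that the real code opens config.db with SQLite and issues a journal_mode pragma somewhere around the linked line:

Code:
#include <sqlite3.h>
#include <stdio.h>

/* Illustrative sketch only -- not the real pmxcfs database.c. */
static int open_config_db(const char *path, sqlite3 **db)
{
    if (sqlite3_open(path, db) != SQLITE_OK)
        return -1;

    /* stock behaviour: on-disk write-ahead log, flushed on every commit */
    /* sqlite3_exec(*db, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL); */

    /* the modification discussed above: keep the journal in RAM only;
     * far fewer flushes, but a crash mid-transaction can corrupt the DB */
    if (sqlite3_exec(*db, "PRAGMA journal_mode = MEMORY;", NULL, NULL, NULL) != SQLITE_OK) {
        fprintf(stderr, "journal_mode pragma failed: %s\n", sqlite3_errmsg(*db));
        sqlite3_close(*db);
        *db = NULL;
        return -1;
    }
    return 0;
}

Note that a change like this means rebuilding pve-cluster yourself and accepting the durability trade-off, which is why the frequent dumps mentioned above are a sensible companion.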

A dirtier (sort of) solution is to use some gymnastics, but it has caveats (keep a backup!):
https://github.com/isasmendiagus/pmxcfs-ram
 
If anyone is using a solid state drive that can't withstand the writes from syslog/journal, rrd and PMXCFS, they should simply replace the drive with a proper one. IMHO even worrying about rrd or PMXCFS when you have syslog/journald writing one or two orders of magnitude more bytes to disk is pointless. Even a decent consumer SSD will give you terabytes of endurance. For me this is clearly an X/Y problem: the issue isn't that those processes write too much but that the drive isn't fit for the task.
 
even worrying about rrd or PMXCFS when you have syslog/journald writing one or two orders of magnitude more bytes to disk is pointless

It would be, except there's some very non-intuitive amplification going on with pmxcfs specifically, where you write n blocks and get at least n^2 blocks flushed:

https://forum.proxmox.com/threads/etc-pve-500k-600m-amplification.154074/#post-701246

Even a decent consumer SSD will give you terabytes of endurance. For me this is clearly an X/Y problem: the issue isn't that those processes write too much but that the drive isn't fit for the task.

I understand that with a PLP drive you may think it won't even hit your NAND*, but getting 1000x+ amplified writes is a design problem.

I simply backtracked from whoever was having issues with this in real life, to see more of that workload rather than my synthetic one.

EDIT: *Do note that PLP does not really save you that much TBW (and many current non-PLP SSDs are even rated around 1 PB of endurance per 1 TB of capacity); this is because "only a small amount of the DRAM is actually used to buffer user data" [1].

[1] https://www.micron.com/content/dam/.../ssd-power-loss-protection-white-paper-lo.pdf
 
Just to quantify the difference (for a rather favourable case):

Code:
dd if=/dev/random of=/etc/pve/dd.out bs=128k count=4

This causes the following actual writes (with the default journal_mode = WAL):

Code:
1260627 be/4 1000000       0.00 B     37.68 M  0.00 %  0.47 % ./pmxcfs
1260626 be/4 1000000       0.00 B     38.05 M  0.00 %  0.42 % ./pmxcfs

0.5M in -> ~76M flushed (roughly 152x)

At some cost to durability and a higher risk of corruption (this could be mitigated with frequent dumps), you would need e.g. to set PRAGMA journal_mode = MEMORY in pmxcfs/database.c:
https://github.com/proxmox/pve-clus...4f9a41fa4322f6ba61/src/pmxcfs/database.c#L115

With that change, the same dd run causes:

Code:
1259225 be/4 1000000       0.00 B   1824.00 K  0.00 %  0.09 % ./pmxcfs
1259226 be/4 1000000       0.00 B    792.00 K  0.00 %  0.02 % ./pmxcfs

0.5M in -> ~2.6M flushed (roughly 5x)



I would humbly submit that 5x amplification is much better than 152x.



This is on top of ext4, so multiply as per your ZFS setup accordingly.

Now, even if you do not care about SSD TBW, you might still care about the throughput (extrapolate for block sizes lower than 128K):

WAL (default)
Code:
copied, 0.249173 s, 2.1 MB/s

MEMORY
Code:
copied, 0.0141982 s, 36.9 MB/s
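
If anyone wants to reproduce that trend without touching pmxcfs at all, below is a rough standalone sketch: it writes the same 4 x 128K of payload into a throwaway SQLite database under each journal mode and compares the process' write_bytes from /proc/self/io (Linux-only). The file names, table name and figures it prints are just for this toy and won't match the pmxcfs numbers above:

Code:
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

/* per-process bytes actually sent to the storage layer (Linux-only) */
static long long write_bytes(void)
{
    FILE *f = fopen("/proc/self/io", "r");
    char line[128];
    long long val = -1;
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "write_bytes: %lld", &val) == 1)
            break;
    fclose(f);
    return val;
}

static void run(const char *dbfile, const char *mode)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;
    char pragma[64];
    static char blob[128 * 1024];          /* one 128K "config file" */
    long long before, after;

    memset(blob, 'x', sizeof(blob));
    remove(dbfile);

    sqlite3_open(dbfile, &db);
    snprintf(pragma, sizeof(pragma), "PRAGMA journal_mode = %s;", mode);
    sqlite3_exec(db, pragma, NULL, NULL, NULL);
    sqlite3_exec(db, "CREATE TABLE tree (inode INTEGER PRIMARY KEY, data BLOB);",
                 NULL, NULL, NULL);

    before = write_bytes();
    for (int i = 0; i < 4; i++) {          /* 4 x 128K, mirroring the dd test */
        sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO tree VALUES (?, ?);",
                           -1, &stmt, NULL);
        sqlite3_bind_int(stmt, 1, i);
        sqlite3_bind_blob(stmt, 2, blob, sizeof(blob), SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);                     /* flush/checkpoint before sampling */
    after = write_bytes();

    printf("%-7s journal: %lld KiB written for 512 KiB of payload\n",
           mode, (after - before) / 1024);
}

int main(void)
{
    run("/tmp/amplify-wal.db", "WAL");
    run("/tmp/amplify-mem.db", "MEMORY");
    return 0;
}

Build with gcc -o amplify amplify.c -lsqlite3 and run it on the filesystem you care about; it only shows the trend between the two journal modes, not pmxcfs' exact behaviour.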
 
Seems like PMXCFS can be improved to suffer less write amplification, at least with synthetic benchmarks. Still, we should now measure how much it really writes to disk in normal use, if possible, because it doesn't seem that much to me. We should check the sources to find out whether PVE processes write to the files in /etc/pve, or whether they write to the SQLite database and the PMXCFS service creates the files in /etc/pve, because writes to SQLite may not suffer from the same write amplification as writing to the files and having the pve-cluster service dump their content to the database.

Besides production systems with enterprise drives and all that, I do have quite a few staging machines, labs and so on using consumer hardware; the drives are mostly WD Blue SN570 1TB (they were very cheap). They are at just 3% wearout after more than a year running at least 12x5, with thousands of VM backups/restores, quite a few reinstalls, etc., in a ZFS mirror for both the OS and VMs/LXC. Most VMs are virtual Proxmox clusters with Ceph, too, for testing configs, upgrades, etc. I've also had some very cheap no-brand drives and have had all kinds of problems: dead drives within a month, bad performance, becoming inaccessible for a few minutes after copying a few GB...
 
Seems like PMXCFS can be improved to suffer less write amplification, at least with synthetic benchmarks. Still, we should now measure how much it really writes to disk in normal use, if possible, because it doesn't seem that much to me.

I have literally just updated the thread, right now:
https://forum.proxmox.com/threads/etc-pve-pmxcfs-amplification-inefficiencies.154074/#post-703765

We should check the sources to find out whether PVE processes write to the files in /etc/pve, or whether they write to the SQLite database and the PMXCFS service creates the files in /etc/pve, because writes to SQLite may not suffer from the same write amplification as writing to the files and having the pve-cluster service dump their content to the database.

I go from the backend, "back to front"; whether something constantly writes does not interest me as much, e.g. the dotfiles are even memory-only, so they never write anything. I needed to check this because otherwise it's impossible to look for the culprit in terms of volumes, i.e. seeing 10G daily (as some do) on an almost idle system definitely cannot be raw input.

Besides production systems with enterprise drives and all that, I do have quite a few staging machines, labs and so on using consumer hardware; the drives are mostly WD Blue SN570 1TB (they were very cheap). They are at just 3% wearout after more than a year running at least 12x5, with thousands of VM backups/restores, quite a few reinstalls, etc., in a ZFS mirror for both the OS and VMs/LXC. Most VMs are virtual Proxmox clusters with Ceph, too, for testing configs, upgrades, etc. I've also had some very cheap no-brand drives and have had all kinds of problems: dead drives within a month, bad performance, becoming inaccessible for a few minutes after copying a few GB...

I understand the anecdotal evidence for some cases is "just fine"; also, in the synthetic testing (which I have to do to isolate the behaviour, it's not meant to create artificial load), it writes basically nothing when nothing touches the FUSE mount, but when something does, it starts to get interesting depending on the nature of those writes.

Some time ago, I tried to benchmark similar PLP vs non-PLP drives, as close to each other as possible, so I took, I believe, the Kingston KC600 and DC600M. This had nothing to do with pmxcfs or ZFS; I just wanted to see how much less hits the NAND on the PLP drive. The result was... I cannot really tell, because the PLP drive's SMART data were reported very differently, obviously the controller was different. So while I theoretically believe in PLP saving some actual writes, I really see the value in the, well, power loss protection, not in saving TBW, at least not necessarily.
 
I just realised, as lots of people starting these threads might not be aware: the config.db is there just for durability; it's basically implementing hard state. It is NEVER read from during the normal operation of the node. The file is read at machine start and is then only written to; all the writes hit RAM first and that's where they are read from. That is, there are in-memory structures that hold the config files (which is great), and they have nothing to do with the SQLite layer that sits on top (see my other thread if interested).
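
For anyone who reads code more easily than prose, here is a toy model of that pattern. It is not pmxcfs source, just a sketch of "RAM is authoritative, SQLite is write-only hard state that gets read exactly once at startup"; the table layout, struct and limits are invented:

Code:
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

#define MAX_FILES 64

struct memfile { char name[64]; char data[256]; };
static struct memfile tree[MAX_FILES];   /* authoritative copy, lives in RAM */
static int nfiles;
static sqlite3 *db;                      /* durability only */

static void load_at_startup(void)        /* the ONLY read of the database */
{
    sqlite3_stmt *s;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS tree (name TEXT PRIMARY KEY, data TEXT);",
                 NULL, NULL, NULL);
    sqlite3_prepare_v2(db, "SELECT name, data FROM tree;", -1, &s, NULL);
    while (sqlite3_step(s) == SQLITE_ROW && nfiles < MAX_FILES) {
        snprintf(tree[nfiles].name, sizeof(tree[nfiles].name), "%s",
                 (const char *)sqlite3_column_text(s, 0));
        snprintf(tree[nfiles].data, sizeof(tree[nfiles].data), "%s",
                 (const char *)sqlite3_column_text(s, 1));
        nfiles++;
    }
    sqlite3_finalize(s);
}

static void write_file(const char *name, const char *data)
{
    int i;
    sqlite3_stmt *s;

    /* 1) update RAM first -- this is what all later reads will see */
    for (i = 0; i < nfiles && strcmp(tree[i].name, name) != 0; i++)
        ;
    if (i == MAX_FILES)
        return;                          /* toy limit reached */
    if (i == nfiles)
        nfiles++;
    snprintf(tree[i].name, sizeof(tree[i].name), "%s", name);
    snprintf(tree[i].data, sizeof(tree[i].data), "%s", data);

    /* 2) then persist for crash recovery -- the DB is never consulted again */
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO tree VALUES (?, ?);", -1, &s, NULL);
    sqlite3_bind_text(s, 1, name, -1, SQLITE_STATIC);
    sqlite3_bind_text(s, 2, data, -1, SQLITE_STATIC);
    sqlite3_step(s);
    sqlite3_finalize(s);
}

static const char *read_file(const char *name)   /* served from RAM only */
{
    for (int i = 0; i < nfiles; i++)
        if (strcmp(tree[i].name, name) == 0)
            return tree[i].data;
    return NULL;
}

int main(void)
{
    sqlite3_open("/tmp/hardstate.db", &db);
    load_at_startup();
    write_file("qemu-server/100.conf", "memory: 2048\n");
    printf("%s", read_file("qemu-server/100.conf"));
    sqlite3_close(db);
    return 0;
}

So the SQLite overhead discussed earlier in the thread is purely the price of durability on the write path; it buys nothing for reads.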
 
