/etc/pve - pmxcfs - amplification & inefficiencies

esi_y

I have been experimenting with this for a while now (see the notes under the post as well). To avoid any false positives from the rest of the stack, I used a minimal setup:

Installed:
  • Debian 12 stable netinst;
  • kernel 6.1.0-25-amd64;
  • ext4;
  • no swap;
  • tasksel standard ssh-server.

A standalone pmxcfs [1] was launched - this is not a cluster:
  • no CPG (corosync) that could be posting phantom transactions;
  • no other services;
  • even journald is barely ticking over.
It properly created its backend in /var/lib/pve-cluster/config.db* and mounted its FUSE filesystem with the rudimentary structure.

While running iotop, I wrote into /etc/pve only twice - 5K and 500K. The second write took over 1.5 s, and this is on an NVMe Gen3 SSD.

Code:
root@guinea:/etc/pve# time dd if=/dev/random bs=512 count=10 of=/etc/pve/dd.out
10+0 records in
10+0 records out
5120 bytes (5.1 kB, 5.0 KiB) copied, 0.00205373 s, 2.5 MB/s


real    0m0.010s
user    0m0.002s
sys    0m0.001s


root@guinea:/etc/pve# time dd if=/dev/random bs=512 count=1000 of=/etc/pve/dd.out
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.55845 s, 329 kB/s


real    0m1.562s
user    0m0.008s
sys    0m0.013s

This is what iotop was showing:

Code:
root@guinea:~# iotop -bao

Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
Total DISK READ:         0.00 B/s | Total DISK WRITE:       674.73 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       3.97 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         3.97 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      11.90 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    624.00 K  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:        86.20 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      86.82 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B     41.88 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B     45.69 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:       401.75 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     398.56 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    259.89 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    231.52 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:       117.91 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     121.10 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    321.38 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    288.67 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    321.38 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    288.67 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B      4.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    321.38 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    288.67 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:        11.90 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      19.84 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B     16.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    321.38 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    288.67 M  0.00 %  0.00 % pmxcfs
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    202 be/3 root          0.00 B     16.00 K  0.00 %  0.00 % [jbd2/nvme0n1p2-8]
 534502 be/4 root          0.00 B    321.38 M  0.00 %  0.00 % pmxcfs
 534503 be/4 root          0.00 B    288.67 M  0.00 %  0.00 % pmxcfs

I then repeated the same test but checked SMART values before:

Code:
Data Units Written:                 85,358,591 [43.7 TB]
Host Write Commands:                1,865,050,181

... and after:

Code:
Data Units Written:                 85,359,841 [43.7 TB]
Host Write Commands:                1,865,053,472

Yes, that is a net delta of 1,250 data units [2], so ~640M written onto the SSD - there is literally nothing else writing on that system.
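
For clarity, the conversion per the definition quoted in [2] - a quick sanity check, not part of the measurement:

Code:
# "Data Units Written" is reported in thousands of 512-byte units, per [2].
before, after = 85_358_591, 85_359_841
delta_units = after - before                        # 1,250
print(f"{delta_units * 1000 * 512 / 1e6:.0f} MB")   # ~640 MB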

I understand there are typically not meant to be 5K, let alone 500K, file writes, but ...

... at 5K ... >100x write amplification

... at 500K ... >1000x write amplification




pmxcfs tested


I originally did this with pmxcfs built from source - gcc version 12.2.0 (Debian 12.2.0-14):
https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=src/pmxcfs/Makefile

But the behaviour is - of course - identical (save for the symbol table) when using the "official" binary from pve-cluster 8.0.7 - both were tested.



bs 512 / 4k / 128k


With bs = 4K, the same 500K write amounts to 80M+ written.

With bs = 128K, i.e. 4 writes of 128K each, about 5M ends up written - the amplification is still 10x (see the model sketch further below).






[1] https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)

[2] Manufacturer SMART specs on this field:

Data Units Written: Contains the number of 512 byte data units the host has written to the controller;
this value does not include metadata. This value is reported in thousands (i.e., a value of 1 corresponds to
1000 units of 512 bytes written) and is rounded up. When the LBA size is a value other than 512 bytes, the
controller shall convert the amount of data written to 512 byte units.
For the NVM command set, logical blocks written as part of Write operations shall be included in this
value. Write Uncorrectable commands shall not impact this value.
 
full backend write on any change on all nodes


Any tiny edit, even a mere touch on a file to update its mtime, causes a full re-write into the backend on all nodes (DB + filesystem, amplified).
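
If you want to watch this yourself, here is a minimal sketch of one way to do it - it only measures local device writes; the device name and the target file are assumptions to adjust for your test box (do not run this on production, and note that any other disk activity will pollute the number):

Code:
#!/usr/bin/env python3
# Sketch: count how many sectors the whole disk writes while we merely bump
# the mtime of one file under the pmxcfs mount. Device and path are assumed.
import os, time

DEV = "nvme0n1"            # assumed whole-disk device name
PATH = "/etc/pve/dd.out"   # assumed existing file under /etc/pve

def sectors_written(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                return int(parts[9])   # 7th field after the name: sectors written
    raise SystemExit(f"device {dev} not found")

before = sectors_written(DEV)
os.utime(PATH)             # a mere "touch" - no data changes at all
time.sleep(6)              # let pmxcfs/SQLite commit and checkpoint
after = sectors_written(DEV)
print(f"{(after - before) * 512} bytes hit {DEV} for a single mtime update")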



backend database design

The issue is that for every single block (of n total) written:

  1. The blob is always written in whole after each block, so if a file is appended in n blocks, the consecutive writes result in n(n + 1)/2 blocks written into the database alone (overwriting the same data over and over again):
    • for n = 10 ... 1 block ... 2 blocks ... 3 ... 10 blocks at a time ... n(n + 1)/2 = 55
  2. There are always 2 DB transactions on the file ("inode") entry, i.e. SQLite table row UPDATEs into the backend:
    • one for the file blob itself; and
    • the other for the "root" version row - updating the version's mtime for each block write.
The version counter literally increases n + 1 times, while 2n transactions hit the back-end and n(n + 1)/2 blocks are written.

This is on a DB with a write-ahead log, so checkpointing needs to flush all of it once again from the WAL into the base file.
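
To make the arithmetic concrete, here is a minimal model of that pattern - a sketch based purely on the description above, not pmxcfs code; it ignores the size of the version-row updates, the WAL, checkpointing and the ext4 journal, which all come on top:

Code:
# Sketch only: the whole blob is re-written after every FUSE write of `bs`
# bytes, plus a version-row update per write.
def model(file_size, bs):
    n = -(-file_size // bs)          # number of FUSE writes (ceiling division)
    # at step k the file holds min(k*bs, file_size) bytes and is written whole
    blob_bytes = sum(min(k * bs, file_size) for k in range(1, n + 1))
    return n, 2 * n, blob_bytes      # writes, DB transactions, blob bytes

FILE = 500 * 1024                    # the 500K file from the dd tests above
for bs in (512, 4 * 1024, 128 * 1024):
    n, txn, blob = model(FILE, bs)
    print(f"bs={bs:>6}: {n:>4} writes, {txn:>4} transactions, "
          f"~{blob / 2**20:>5.1f} MiB of blob re-writes (~{blob / FILE:.0f}x), "
          f"before WAL, checkpointing and the fs journal")

For bs=512 this alone gives ~244 MiB; roughly doubled by checkpointing from the WAL into the base, it lands in the same ballpark as the ~640M observed via SMART above.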



arbitrary fs limits, ZFS & CPG (corosync) behaviour

The current limits for pmxcfs are 1M per file and 128M for the whole filesystem; these have increased over time and, while conservative, are - given the above - already enough to make the issue palpable.

When /var/lib/pve-cluster is stored on ZFS (a popular install choice), the above is further amplified by the nature of that filesystem.

When there is a high number of nodes in the cluster, especially with HA, the number of transactions hitting every node's back-end DB increases substantially as well.

All of these effects add on top of each other.
 
to be continued


I will just reserve the rest of this post for suggesting remedies for the above, but that's for another day.

If you have a single-node, non-production deployment and would be interested in testing an alternative pmxcfs, please drop me a message.

For now, think twice before storing anything custom in /etc/pve, be very mindful not to have any runaway process constantly writing there, and keep monitoring SMART on your host SSDs.
 
When /var/lib/pve-cluster is stored on ZFS (a popular install choice), the above is further amplified by the nature of that filesystem.
That's why datacenter/enterprise disks are routinely recommended for ZFS.
 
That's why datacenter/enterprise disks are routinely recommended for ZFS.

Thank you for the comment. I do not have any issue with that; just note that with, say, 7x write amplification from ZFS and 10x from pmxcfs, one is suddenly at 70x (that is for the 128K bs that e.g. cp uses). I am realistic that for lots of homelab installs it is simply more practical not to use a CoW filesystem, after all.

But this is not just a storage (ZFS) issue; it has a bearing on the CPG, i.e. on cluster health - if back-end DB writes become slow on even one of the nodes, quorum will suffer.

I will add it here later on.

EDIT: See also below comment #12.
 
pmxcfs is not a filesystem used to store arbitrary data. Instead, it is used to store the PVE configuration in a transactional database. Each write is atomic and synced to the disk! So please use other filesystems if you want to store large files. pmxcfs is not meant for that, and was never advertised for that.
 
pmxcfs is not a filesystem used to store arbitrary data. Instead, it is used to store the PVE configuration in a transactional database. Each write is atomic and synced to the disk! So please use other filesystems if you want to store large files. pmxcfs is not meant for that, and was never advertised for that.

Thank you for your comment. I do not want to be misunderstood here. I do not use pmxcfs for storing anything, but there are times when the PVE stack itself writing into it appears to produce quite a lot of back-end writes.

I have absolutely nothing against the architecture - I actually like how it uses corosync for the CPG - but the filesystem is essentially in-memory first (I am aware you know that full well, as you must have written it; this is for anyone else reading).

What I do not think scales well into the future is how the current back-end plugs into it.

I think there are two good reasons to improve that - there actually are people who managed to hit even the current limit with PVE-related use [1].

And then, if I remember correctly, the last time the limit was increased it went from 500K per file to 1M - so I assume there was a reason for that, and even in the best-case scenario (cp with bs=128K) the amplification goes from 4 blocks => 4 x 5 / 2 = 10 to 8 blocks => 8 x 9 / 2 = 36 blocks written.

Lastly, while lots of the limitations are documented, the low-bs write pitfalls are not - at least I did not find them.

EDIT: There are quite a few threads on the max-users limit alone (e.g. [2] [3] as well); I can imagine university labs running into it, and generally things only keep growing into the future, all of the time.

[1] https://forum.proxmox.com/threads/memdb_max_file_size-on-pmxcfs-is-too-small-1-mib.136392/
[2] https://forum.proxmox.com/threads/proxmox-ldap-sync-hard-limit-on-number-of-users.91320/
[3] https://forum.proxmox.com/threads/a..._max_file_size-restriction-too-narrow.142806/
 
Again, pmxcfs is not a general purpose file system, and we have bumped the limits where sensible in the past.

I suggest that you use a real cluster file system if you run into limits, for example cephfs, ...

Plus, we're always interested in improving things. As soon as I have an idea to do something better, I'll implement it. Patches are also welcome.
 
Again, pmxcfs is not a general purpose file system, and we have bumped the limits where sensible in the past.

I do not dispute that; I did not mean to suggest anyone try dd bs=512 on their /etc/pve, it was simply to quantify the effects (which potentially multiply further due to other factors).

For me the topic is quite a bit wider, but I will take it to pve-devel then (probably with a proper strace and reasoning). I think a lot has changed since the original limits were set, e.g. the large number of HA services having to write there. Also, there are two separate topics: "a large file write" and "lots of files written concurrently".

I understand it might be the most conservative part of PVE to be touching, but, well ... thanks for the initial reply and cheers for now!
 
pmxcfs uses sync writes (because all nodes need to be synced in case of power failure), so if you want to avoid write amplification (as a 512-byte write will rewrite a full SSD NAND page), you should really use an SSD or NVMe drive with PLP/supercapacitors.

Like https://www.kingston.com/fr/ssd/dc1000b-data-center-boot-ssd for example - it costs less than 100€ for 200 GB.
 
pmxcfs uses sync writes (because all nodes need to be synced in case of power failure), so if you want to avoid write amplification (as a 512-byte write will rewrite a full SSD NAND page), you should really use an SSD or NVMe drive with PLP/supercapacitors.

Thanks for the comment. I understand I opened myself up to this by starting this on the forum instead of pve-devel, and I know your advice is well meant, but this thread is actually not about that part. It is definitely relevant to use a PLP SSD, e.g. for ZFS, to begin with; however, on pmxcfs alone:

1) In terms of how much is written (see above - writing ~n^2 blocks instead of the n that changed is undesirable), this can be optimised; the implementation simply did not expect large files to begin with, it is a matter of implementation;

2) In terms of how long it takes to sync, this is a problem, especially if unnecessary syncs occur, as it literally slows down the cluster - i.e. more nodes could safely be in the cluster if this were redesigned;

3) The nodes do not need to be "synced across" in case of power loss; the one with the most recent version will "sync over" its state upon recovery;

4) It is atomicity, not per-node durability, that is key - arguably, it is better to have on-disk state that is 5 seconds stale but clean than a corrupt one needing manual intervention;

5) The current mode of operation does not protect perfectly against power loss anyway, not even with a PLP drive - due to frequent checkpointing, but also as a matter of design;

6) True power loss in a professional setup is very rare - as in all members of the cluster going lights-out at the very same time. Meanwhile, the stability of that closed process group is important: you do not want slow members choking, getting dropped and re-joining in a frenzy, combined with a compounded watchdog-reboot risk for the rest in HA scenarios.

*) You could actually stream-replicate the database even outside of the cluster, which I would recommend for extra resiliency - better than any backup (see the sketch below).

None of this is related to the PLP vs non-PLP topic.
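
On the (*) point, purely as an illustration (my own sketch, not what I would propose as the actual mechanism): a consistent online copy of the backend DB can be taken from a live node with SQLite's backup API, e.g.:

Code:
#!/usr/bin/env python3
# Sketch only - illustrates that the pmxcfs backend DB can be copied
# out-of-band while pve-cluster keeps running, using SQLite's online
# backup API. Destination path is a placeholder; run as root on a test node.
import sqlite3

SRC = "/var/lib/pve-cluster/config.db"   # standard pmxcfs backend path
DST = "/root/config.db.copy"             # placeholder destination

src = sqlite3.connect(SRC)
dst = sqlite3.connect(DST)
src.backup(dst)                          # consistent, WAL-aware snapshot
dst.close()
src.close()
print(f"copied {SRC} -> {DST}")

Proper streaming replication would of course use a dedicated tool rather than a one-shot copy, but even this gives a clean, consistent off-node copy of the cluster state.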

NB: I have now started a PoC - not yet an alternative backend, but a different approach to it:
https://forum.proxmox.com/threads/pmxcfs-any-other-backend-than-sqlite.153454/#post-702797

EDIT: I have now also added a specific comment regarding the PLP topic to a different (related) thread:
https://forum.proxmox.com/threads/s...he-pmxcfs-commit-interval.124638/#post-702752
 
The toll of SQLite as a backend - execution cost (bdb_backend_write as a % of instructions of every single file update) and relative heap share (sqlite3Malloc in the stacked chart) - see the attached charts.
 

Attachments

  • fuse_fs_write_buf.png
  • pmxcfs-heap.png
