pmxcfs - any other backend than SQLite?

esi_y

Has there been any attempt to use some other backend? The docs mention BDB, but times have moved on and there is e.g. LMDB now. If not, why not?

EDIT: I really just want to know if someone has attempted e.g. a key-value store DB before.
 
Why do you need to know this? What purpose does it serve?

You really do not like my questions, do you? :) Even purely technical ones.

Because the current implementation ends up with relatively frequent corruption and is much less performant than e.g. LMDB would be. If this was tried (e.g. BDB in the past) and abandoned for some reason, it would be nice to know those reasons before one starts experimenting with it.
 
You really do not like my questions, do you? :) Even purely technical ones.
No, because most of them amount to second-guessing or passive-aggressive criticism.

Because the current implementation ends up with relatively frequent corruption and is much less performant than e.g. LMDB would be. If this was tried (e.g. BDB in the past) and abandoned for some reason, it would be nice to know those reasons before one starts experimenting with it.
Everybody's a developer.
 
For anyone searching the same, quite a bit is explained [1]:

Distributed Configuration Database (DCDB)
===========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit for the file size.

* Backend Database

Each node stores the state using a backend database. That database
need to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but its hard to implement atomic update and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file. And there
is a defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.

** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. Then a
'snapshot' is a simple copy of the state in RAM. Although all data is
in RAM, a copy is written to the disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operations fails the
saved state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store data on disk.

[1] https://github.com/proxmox/pve-cluster/blob/master/src/README
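
The table layout quoted from the README above can be sketched roughly as follows. This is only an illustration: the column names come from the README, while the table name `tree`, the column types, and the `ConfigStore` class are assumptions made for this sketch and not taken from the pmxcfs source.

```python
import sqlite3

# Column names follow the quoted README; the table name and the column
# types are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT    NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB
)
"""


class ConfigStore:
    """Hypothetical wrapper: one global version, bumped on every change."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(SCHEMA)
        self.version = 0  # global version counter

    def create(self, parent, name, data, writer=0):
        new_version = self.version + 1
        # "with self.db" commits on success and rolls back on error,
        # so each modification is one atomic transaction.
        with self.db:
            self.db.execute(
                "INSERT INTO tree (inode, parent, name, writer, version, size, value) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                # The new entry reuses the new global version as its inode.
                (new_version, parent, name, writer, new_version, len(data), data),
            )
        self.version = new_version
        return new_version

    def update(self, inode, data, writer=0):
        new_version = self.version + 1
        with self.db:
            self.db.execute(
                "UPDATE tree SET writer = ?, version = ?, size = ?, value = ? "
                "WHERE inode = ?",
                (writer, new_version, len(data), data, inode),
            )
        self.version = new_version
        return new_version
```

Calling `create()` twice on a fresh store hands out inodes 1 and 2, and a following `update()` moves the global version to 3 without changing the inode of the updated entry.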
 
I'm interested in a "** RAM/File Based Backend" too :)

I was actually a bit surprised it was not the natural first choice, as the current implementation basically holds everything in RAM anyway (at most 128MB today), so all it takes is to dump it onto persistent storage once in a while. Currently, filesystem operations in the cluster are basically mirrored into the backend DB, constantly wearing down the persistent medium. This is straightforward, as atomic updates should in theory prevent corruption, except when that does not quite work out ... and you end up with DB corruption discovered on the next reboot.
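
A minimal sketch of that "keep it in RAM, dump it once in a while" idea, assuming the state is just a mapping of paths to file contents. The names `RamState` and `dump_atomically` and the JSON on-disk format are hypothetical, and the directory fsync assumes a POSIX system. This is not how pmxcfs works today.

```python
import copy
import json
import os
import tempfile


class RamState:
    """Hypothetical in-memory state: path -> file content."""

    def __init__(self):
        self.files = {}
        self.version = 0

    def write(self, path, content):
        self.files[path] = content
        self.version += 1

    def snapshot(self):
        # A snapshot is simply a copy of the state held in RAM.
        return {"version": self.version, "files": copy.deepcopy(self.files)}


def dump_atomically(snapshot, target):
    """Write the snapshot to a temp file, fsync it, then rename over target."""
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename is atomic on POSIX: readers see the old or the new file.
        os.replace(tmp, target)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Sync the directory so the rename itself survives a power loss (POSIX).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because the dump replaces the previous file in a single rename, a crash in the middle leaves the previous dump intact, at the cost of losing whatever changed since the last dump.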

It's a bit like ZFS ZIL on a SLOG, where you keep writing there but ideally never need to read from it. But on occasion you do, and if it turns out not to work, all those supposedly ACID properties of your DB backend go out of the window. The current SQLite backend uses write-ahead logging instead of a traditional rollback journal. That's fine, and it is needed to allow for concurrency, but it also needs checkpointing (incorporating the log into the base) - now imagine e.g. a power loss (or equivalent issue) in the middle of the checkpoint operation.
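
To make the WAL and checkpoint relationship concrete, here is a small sketch using Python's sqlite3 module on a throwaway database. The `kv` table and the `wal_demo` function are made up for the illustration. It only shows that committed changes first land in the `-wal` file and are folded into the main file by a checkpoint; it does not reproduce a power loss or any pmxcfs behaviour.

```python
import os
import sqlite3
import tempfile


def wal_demo(db_path):
    con = sqlite3.connect(db_path)
    # Switch from the default rollback journal to write-ahead logging.
    mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    with con:  # commits on exit
        con.execute("INSERT OR REPLACE INTO kv VALUES ('a', '1')")
    # The committed change now sits in the side file "<db>-wal".
    wal_before = os.path.getsize(db_path + "-wal")
    # Checkpoint: copy the WAL content into the main file, then truncate
    # the WAL. The first result column is 1 if the checkpoint was blocked.
    busy, _log, _ckpt = con.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    wal_after = os.path.getsize(db_path + "-wal")
    con.close()
    return mode, wal_before, busy, wal_after


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    print(wal_demo(path))
```

The checkpoint is the step that rewrites pages of the main database file, which is the window the post above is worried about.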

So yeah, I have been looking for other options that do not have the (in)consistency problem and also happen to avoid excessive writes.
 
Another step in the direction of forking PVE ;)

Not really, it's about doing something smarter than crazy 3rd party gymnastics [1], which can't really work all that well when the WAL that is currently being checkpointed gets flushed. It is also about avoiding double RAM usage and (I would like to say) possibly removing arbitrary limits on the size of that database, while adding the ability to keep snapshots going several hours back and to take live backups in an instant. All that from a minor implementation change, 15 years after its inception.

[1] https://github.com/isasmendiagus/pmxcfs-ram
 
Why do you need to know this? What purpose does it serve? Are you bored or something?

@Maximiliano I have only now noticed, and am a bit surprised, that a staff member gives a like to a reaction like this, especially since it came days after the explanation of exactly why the question had been asked. I was indeed considering, as @LnxBil suggested above, taking further questions to pve-devel, but if I am going to get similar reactions from staff there, you might as well accompany the like with some explanation, e.g. "we have been asked in-house not to touch pmxcfs under any circumstances and do not even want to discuss any possible bugs or improvements". In that case, I would not proceed to spam the list. If I have misunderstood in any way, please let me know. Thank you.
 
So this was one of those things that people do not know that they do not know, but the answer essentially was this:
https://forum.proxmox.com/threads/etc-pve-500k-600m-amplification.154074/#post-701246

@waltar @ucholak Thanks for the likes! So I am still experimenting with this, but in the meantime, the low hanging fruit is apparently ditching the WAL altogether.

https://forum.proxmox.com/threads/s...he-pmxcfs-commit-interval.124638/#post-702765

Yes, it has certain implications, but e.g. having a systemd timer with sqlite3 config.db .dump > config.dump is something that has been sorely needed out of the box even with the current implementation that can and does get corrupt anyhow.
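
For illustration, the dump step of such a timer job could look roughly like this in Python, as an equivalent of the sqlite3 .dump command above. The function names and file paths are placeholders, the systemd timer unit itself is not shown, and this is a sketch rather than a tested backup procedure for a live pmxcfs database.

```python
import os
import sqlite3


def dump_database(db_path, dump_path):
    """Write a SQL text dump of db_path, replacing dump_path atomically."""
    con = sqlite3.connect(db_path)
    tmp = dump_path + ".tmp"
    try:
        with open(tmp, "w", encoding="utf-8") as out:
            # iterdump() yields the same kind of SQL text as ".dump".
            for statement in con.iterdump():
                out.write(statement + "\n")
            out.flush()
            os.fsync(out.fileno())
    finally:
        con.close()
    # Only replace the previous dump once the new one is complete.
    os.replace(tmp, dump_path)


def restore_database(dump_path, db_path):
    """Rebuild a fresh database file from a SQL text dump."""
    con = sqlite3.connect(db_path)
    try:
        with open(dump_path, encoding="utf-8") as f:
            con.executescript(f.read())
    finally:
        con.close()
```

A text dump has the advantage that it can still be inspected and partially recovered by hand. For a binary copy of a database that is in use, `sqlite3.Connection.backup()` would be an alternative.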

As a matter of fact, the WAL really did nothing for the writers.

For anyone running a single node, this is already worth a try, although it does not (obviously) remove the n^2 blocks written (just yet).

EDIT: If anyone is interested, let me know!

At least it would avoid fiddling with everything in all sorts of error-prone ways [1][2].

[1] https://www.reddit.com/r/Proxmox/comments/ncg2xo/minimizing_ssd_wear_through_pve_configuration/
[2] https://github.com/isasmendiagus/pmxcfs-ram
 