pmxcfs - any other backend than SQLite?

esi_y

Renowned Member

Aug 26, 2024

#1

Has there been any attempt to use some other backend? The docs mention BDB, but times have moved on, so what about e.g. LMDB? If not, why not?

EDIT: I really just want to know whether anyone has attempted e.g. a key-value store DB before.

Last edited: Aug 27, 2024

BobhWasatch

Famous Member

Proxmox Subscriber

Aug 26, 2024

#2

Why do you need to know this? What purpose does it serve? Are you bored or something?

Reactions: Maximiliano and Neobin

esi_y

Renowned Member

Aug 26, 2024

#3

BobhWasatch said:
Why do you need to know this? What purpose does it serve?

You really do not like my questions, do you?

Even purely technical ones.

Because the current implementation ends up with relatively frequent corruption and is much less performant than e.g. LMDB would be. If an alternative was tried (e.g. BDB in the past) and abandoned for some reason, it would be nice to know those reasons before one starts experimenting with it.

BobhWasatch

Famous Member

Proxmox Subscriber

Aug 26, 2024

#4

esi_y said:
You really do not like my questions, do you? Even purely technical ones.

No, because most of them amount to second-guessing or passive-aggressive criticism.

esi_y said:
Because the current implementation ends up with relatively frequent corruption and is much less performant than e.g. LMDB would be. If an alternative was tried (e.g. BDB in the past) and abandoned for some reason, it would be nice to know those reasons before one starts experimenting with it.

Everybody's a developer.

Last edited: Aug 26, 2024

Reactions: Neobin

LnxBil

Distinguished Member

Aug 26, 2024

#5

esi_y said:
You really do not like my questions, do you? Even purely technical ones.

Maybe pve-devel is more suited for most of your questions.

esi_y

Renowned Member

Aug 30, 2024

#6

For anyone searching the same, quite a bit is explained [1]:

Distributed Configuration Database (DCDB)
===========================================

We want to implement a simple way to distribute small configuration
files among the cluster on top of corosync CPG.

The set of all configuration files defines the 'state'. That state is
stored persistently on all members using a backend
database. Configuration files are usually quite small, and we can even
set a limit for the file size.

* Backend Database

Each node stores the state using a backend database. That database
need to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but its hard to implement atomic update and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation.

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file. And there
is a defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.

** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. Then a
'snapshot' is a simple copy of the state in RAM. Although all data is
in RAM, a copy is written to the disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operations fails the
saved state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store data on disk.

[1] https://github.com/proxmox/pve-cluster/blob/master/src/README
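
To make the table design quoted above a bit more concrete, here is a minimal sketch in Python. It only follows the column list and the "version doubles as inode" rule from the README. The table name `tree`, the column types and all function names are assumptions made for illustration, not taken from the pmxcfs source, and deriving the global version from MAX(version) is a simplification.

```python
import sqlite3

# Hypothetical schema: column names follow the README, types are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tree (
    inode   INTEGER PRIMARY KEY,
    parent  INTEGER NOT NULL,
    name    TEXT    NOT NULL,
    writer  INTEGER NOT NULL,
    version INTEGER NOT NULL,
    size    INTEGER NOT NULL,
    value   BLOB
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def current_version(conn):
    """Simplification: treat the highest stored version as the global one."""
    return conn.execute("SELECT COALESCE(MAX(version), 0) FROM tree").fetchone()[0]


def create_entry(conn, parent, name, writer, value):
    """Create an entry; the bumped global version is also used as its inode."""
    with conn:  # one transaction: commit on success, rollback on exception
        version = current_version(conn) + 1
        conn.execute(
            "INSERT INTO tree (inode, parent, name, writer, version, size, value) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (version, parent, name, writer, version, len(value), value),
        )
    return version


def update_entry(conn, inode, writer, value):
    """Modify an entry; every modification increments the global version."""
    with conn:
        version = current_version(conn) + 1
        conn.execute(
            "UPDATE tree SET writer = ?, version = ?, size = ?, value = ? "
            "WHERE inode = ?",
            (writer, version, len(value), value, inode),
        )
    return version
```

Each change runs inside a single transaction, which is the atomic-update property the README asks of the backend.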

waltar

Active Member

Aug 30, 2024

#7

I'm interested in a "** RAM/File Based Backend" as well.

esi_y

Renowned Member

Aug 30, 2024

#8

waltar said:
I'm interested in a "** RAM/File Based Backend" as well.

I was actually a bit surprised it was not the natural first choice, as the current implementation basically holds everything in RAM (that's max 128MB today), so all it takes is to dump it onto persistent storage once in a while. Currently the filesystem operations in the cluster are basically mirrored into the backend DB, constantly hammering the persistent medium. This is straightforward, since atomic updates should in theory avoid corruption, except when it does not quite work out and you end up with DB corruption discovered on the next reboot.
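
As a rough illustration of "hold the state in RAM, dump it once in a while", here is a minimal Python sketch. It assumes the state is a plain dict of path to file contents and serializes it as JSON; both the representation and the function names are hypothetical and have nothing to do with how pmxcfs stores its data. The point is only the write pattern: write a temporary file, fsync it, then rename it over the old copy, so a crash leaves either the old or the new snapshot on disk.

```python
import json
import os
import tempfile


def dump_state(state, path):
    """Write the in-RAM state to `path` via temp file + fsync + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        os.replace(tmp, path)  # atomic replacement on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Note: for full durability the containing directory would also
    # need an fsync; that is left out to keep the sketch short.


def load_state(path):
    """Load the last snapshot, or start empty if none exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

How often to call something like `dump_state` is the actual trade-off: fewer dumps mean fewer writes, but more changes lost if the node dies between two dumps.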

It's a bit like a ZFS ZIL on a SLOG: you keep writing to it, but ideally never need to read from it. On the occasion that you do and find out it is not working, all those supposedly ACID properties of your DB backend go out of the window. The current SQLite backend uses write-ahead logging (WAL) instead of a traditional rollback journal. That is fine, and it is needed to allow for concurrency, but it also needs checkpointing (incorporating the log into the main database file). Now imagine e.g. a power loss (or equivalent issue) in the middle of a checkpoint operation.

So yeah, I have been looking for other options that do not have the (in)consistency problem and also happen to avoid excessive writes.
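
For anyone who has not looked at WAL mode before, the log and the checkpoint step can be observed with a few lines of Python on a throwaway database. This is only a sketch of generic SQLite behaviour; the table `kv` and the function name are made up, and it says nothing about how pmxcfs itself configures or schedules checkpoints.

```python
import os
import sqlite3
import tempfile


def wal_demo(db_path):
    """Write one row in WAL mode, then fold the log back into the main file."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Disable automatic checkpoints so the log is only merged on request.
    conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    with conn:
        conn.execute("INSERT OR REPLACE INTO kv VALUES ('a', '1')")
    # Committed changes now live in the separate "-wal" file.
    wal_before = os.path.getsize(db_path + "-wal")
    # Checkpoint: copy the logged pages into the main file, truncate the log.
    busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
    wal_after = os.path.getsize(db_path + "-wal")
    conn.close()
    return mode, wal_before, busy, wal_after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(wal_demo(os.path.join(d, "demo.db")))
```

The interesting window is between the commit and the checkpoint: until the checkpoint has finished, the main database file alone does not contain the latest state.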

Last edited: Aug 30, 2024

Reactions: waltar

LnxBil

Distinguished Member

Sep 2, 2024

#9

esi_y said:
So yeah, I have been looking for other options that do not have the (in)consistency problem and also happen to avoid excessive writes.

Another step in the direction of forking PVE

esi_y

Renowned Member

Sep 2, 2024

#10

LnxBil said:
Another step in the direction of forking PVE

Not really. It's about doing something smarter than crazy 3rd party gymnastics [1], which can't really work all that well when a WAL that is in the middle of being checkpointed gets flushed. It would also avoid double RAM usage and (I would like to say) possibly remove arbitrary limits on the size of that database, while adding the ability to keep snapshots going several hours back and to take live backups in an instant. All that from a minor implementation change, 15 years after its inception.

[1] https://github.com/isasmendiagus/pmxcfs-ram

esi_y

Renowned Member

Sep 2, 2024

#11

BobhWasatch said:
Why do you need to know this? What purpose does it serve? Are you bored or something?

@Maximiliano I have only now noticed this and am a bit surprised that a staff member would like a reaction like this, especially as the like came days after the explanation of exactly why the question had been asked. I was indeed considering, as suggested by @LnxBil above, taking further questions to pve-devel, but if I am going to get similar reactions from staff there, you might as well accompany the like with some explanation, e.g. "we have been asked in-house not to touch pmxcfs under any circumstances and do not even want to discuss any possible bugs or improvements". In that case, I would not proceed to spam the list. If I have misunderstood in any way, please let me know. Thank you.

esi_y

Renowned Member

Sep 8, 2024

#12

BobhWasatch said:
Why do you need to know this? What purpose does it serve?

So this was one of those things that people do not know that they do not know, but the answer essentially was this:
https://forum.proxmox.com/threads/etc-pve-500k-600m-amplification.154074/#post-701246

Reactions: ucholak and waltar

esi_y

Renowned Member

Sep 13, 2024

#13

esi_y said:
So this was one of those things that people do not know that they do not know, but the answer essentially was this:
https://forum.proxmox.com/threads/etc-pve-500k-600m-amplification.154074/#post-701246

@waltar @ucholak Thanks for the likes! I am still experimenting with this, but in the meantime, the low-hanging fruit is apparently ditching the WAL altogether.

https://forum.proxmox.com/threads/s...he-pmxcfs-commit-interval.124638/#post-702765

Yes, it has certain implications, but e.g. having a systemd timer that runs sqlite3 config.db .dump > config.dump is something that has been sorely needed out of the box even with the current implementation, which can and does get corrupted anyhow.
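
If the sqlite3 command line tool is not available, roughly the same dump can be produced with Python's standard library. The sketch below is an assumption-laden illustration, not a tested backup tool for pmxcfs: the function names and paths are made up, and opening a live WAL-mode database read-only depends on its -wal and -shm files being accessible. It writes the dump to a temporary file first and renames it into place, so an interrupted run does not clobber the previous dump.

```python
import os
import pathlib
import sqlite3
import tempfile


def dump_sql(db_path, out_path):
    """Write a textual SQL dump of `db_path` to `out_path`, replacing it atomically."""
    # Open read-only via a URI so the dump itself cannot modify the database.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    out_dir = os.path.dirname(os.path.abspath(out_path))
    fd, tmp = tempfile.mkstemp(dir=out_dir, prefix=".dump-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for statement in conn.iterdump():
                f.write(statement + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, out_path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    finally:
        conn.close()


def restore_sql(dump_path, db_path):
    """Rebuild a fresh database file from a dump written by dump_sql."""
    conn = sqlite3.connect(db_path)
    with open(dump_path, encoding="utf-8") as f:
        conn.executescript(f.read())
    conn.close()
```

Restoring into a new file with `restore_sql` and comparing a few rows is a cheap way to check that a dump is actually usable before it is needed.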

As a matter of fact, the WAL really did nothing for the writers.

For anyone running a single node, this is already worth a try, although it does not (obviously) remove the n^2 blocks written (just yet).

EDIT: If anyone is interested, let me know!

At least it would avoid fiddling with everything in all sorts of error-prone ways [1][2].

[1] https://www.reddit.com/r/Proxmox/comments/ncg2xo/minimizing_ssd_wear_through_pve_configuration/
[2] https://github.com/isasmendiagus/pmxcfs-ram

Last edited: Sep 13, 2024
