2 Node Cluster with the least amount of "clusterization" - how?

For larger clusters (or, really, ANY cluster) this is probably a very bad idea. Cluster information may be updated from other nodes. Since your in-RAM database is updated UNIDIRECTIONALLY (meaning, written down but not read up), you end up with conflicting DBs. You'll have out-of-sync nodes that are unaware of their condition.

My logic was that a large cluster typically has all nodes on UPS to begin with, so having all of them suffer a power loss at the same time is virtually impossible; any shutdown would therefore properly flush the content onto disk. And if a single node or a few nodes experience hardware failure, nothing happens to the quorum.

But I am not sure what you mean by "written down but not read up" - it's an in-memory virtual filesystem on all nodes at all times; the difference between using and not using the tool is how often it's flushed onto the local drive.
 
But I am not sure what you mean by "written down but not read up" - it's an in-memory virtual filesystem on all nodes at all times; the difference between using and not using the tool is how often it's flushed onto the local drive.
In a cluster environment, /etc/pve is already kept in RAM. Adding an additional ramdisk layer on top of that which can only write changes down to the lower layer, but cannot read up from it, can wreak havoc on the cluster. This has nothing to do with power. This "solution" is only potentially useful for a single non-clustered node.

see https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs) to understand how this works.
 
In a cluster environment, /etc/pve is already kept in RAM. Adding an additional ramdisk layer on top of that which can only write changes down to the lower layer, but cannot read up from it, can wreak havoc on the cluster. This has nothing to do with power. This "solution" is only potentially useful for a single non-clustered node.

see https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs) to understand how this works.

I believe I do know how it works, which is why I am surprised you consider it to be a problem. The config.db backing the pmxcfs is written to in order to persist the changes within a cluster. The DB file is written using PRAGMA journal_mode=WAL. In whichever state it is left, it will be fine in an otherwise healthy cluster, because a node starting up with some stale config.db will be promptly refreshed from the rest of the cluster holding the more recent state. Which is why I mentioned - assuming people run clusters on UPS - that if just a node or two experience a sudden power loss, even if they were using the tool (which by default flushes once an hour*), they'll be exposed to virtually no risk.
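
Easy to check on a node, by the way - the journal mode is recorded in the DB file itself (assuming the standard config.db location):

Bash:
sqlite3 /var/lib/pve-cluster/config.db "PRAGMA journal_mode;"   # prints: wal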

It's the equivalent of the pmxcfs buffering the commits for an hour*, which could just as well have been supported natively. Not to mention - within this thread and the OP's question - a non-PLP SSD might even report everything as safely in NAND while that is not the case.

Now if you believe I am wrong in some specific part, and that in particular brings up the risks, please point it out to me like to a 5-year-old; I'll happily acknowledge where I was wrong.

For a local (non-clustered) pmxcfs it's a non-topic; I do not think there are any substantial writes going on.

EDIT: The tool simply mounts the ramdisk over where the config.db would be on the drive, so reading back happens from the most recent local copy - it just happens to be in RAM (roughly as sketched below). The flushing is configurable; it's really only there for power-loss events. On a shutdown it would flush correctly. It's a workaround for a missing feature.

EDIT2: Corrected myself above - it flushes onto the drive once an hour by default (configurable).
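
Conceptually the tool boils down to something like this (a simplified sketch with a hypothetical persistent path, not the actual script):

Bash:
# simplified concept: a tmpfs layer where config.db normally lives
systemctl stop pve-cluster
mount -t tmpfs -o size=64M tmpfs /var/lib/pve-cluster          # RAM-backed directory
cp -a /var/lib/pve-cluster.persistent/. /var/lib/pve-cluster/  # seed from the last flushed copy
systemctl start pve-cluster
# ...then flush RAM back to the persistent copy periodically and on shutdown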
 
except it neither copies in a consistent fashion, nor is it crash safe while persisting, so it does in fact have a small chance of either corrupting your (on-disk) DB, or losing it entirely. yes, you have to be unlucky, but the window of opportunity opens once an hour, so if your system runs long enough you might just hit that jackpot.
 
except it neither copies in a consistent fashion

The tool itself? I am not sure what you mean by "consistent" - it copies whatever is there at any given point. What's the problem when I copy a DB file at any point, especially a WAL-journalled one?

, nor is it crash safe while persisting

I didn't check that - it's not my tool - but again, if you have multiple nodes, what's the problem? Especially since just yesterday we discussed that you'd rather not have the corosync service even restart on a live node.

, so it does in fact have a small chance of either corrupting your (on-disk) DB

I do not know how an ACID DB file could get corrupted - by definition?

, or losing it entirely.

When there are <number of nodes> copies of it around at any given time?

yes, you have to be unlucky, but the window of opportunity opens once an hour, so if your system runs long enough you might just hit that jackpot.

Is there any reason why PVE does not allow tweaking how often it's flushing onto the drive at all?

NB The tool is not mine.
 
nor is it crash safe while persisting

So [1]:

Bash:
function persist_data () {
    #Write data stored in RAM to disk
    rm "$VARLIBDIR_PERSISTENT_PATH"/*
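    # NOTE: a crash between this rm and the cp below leaves NO on-disk copy at all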
    cp -r "$VARLIBDIR_RAM_PATH"/* "$VARLIBDIR_PERSISTENT_PATH"
}
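
For the record, making that crash-safe would take a few lines - stage into a temporary directory and swap via rename (a sketch reusing the tool's variable names, not a tested patch):

Bash:
function persist_data_safe () {
    # Stage the copy, then swap it in via rename, so a crash at any
    # point leaves either the previous or the new on-disk copy intact.
    local tmp="${VARLIBDIR_PERSISTENT_PATH}.tmp"
    rm -rf "$tmp"
    cp -r "$VARLIBDIR_RAM_PATH" "$tmp"
    sync                                              # ensure the staged copy is on disk
    rm -rf "${VARLIBDIR_PERSISTENT_PATH}.old"
    mv "$VARLIBDIR_PERSISTENT_PATH" "${VARLIBDIR_PERSISTENT_PATH}.old"
    mv "$tmp" "$VARLIBDIR_PERSISTENT_PATH"
}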

This reminds me of exactly what a Perl guru at Proxmox did in code that has been shipping for 10+ years as part of PVE [2].

Perl:
    unlink $ssh_system_known_hosts;
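    # (a crash right here leaves no known_hosts at all until the symlink below is created)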
    symlink $ssh_cluster_known_hosts, $ssh_system_known_hosts;

    warn "can't create symlink for ssh known hosts '$ssh_system_known_hosts' -> '$ssh_cluster_known_hosts'\n"
    if ! -l $ssh_system_known_hosts;

Yeah, I can make snide remarks too.

[1] https://github.com/isasmendiagus/pm...bc23e7181df8322fe78/pmxcfs-ram.sh#L89C1-L94C1
[2] https://github.com/proxmox/pve-clus...2a9b72a288771d8/src/PVE/Cluster/Setup.pm#L327
 
copying the DB files while the DB is running means you can end up with the following sequence of events

- copy of DB
- merging of WAL into DB (if the checkpoint completed, subsequent WAL writes will now happen at the start of the WAL file)
- copy of WAL file

your copy of DB and WAL don't match -> you either lost writes, or corrupted your ("copy" of the) DB
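
if you absolutely need a copy of a live sqlite DB, use sqlite's own backup mechanisms instead of cp - a sketch, assuming the standard config.db location:

Bash:
# consistent snapshot via sqlite's backup API (handles the WAL correctly)
sqlite3 /var/lib/pve-cluster/config.db ".backup '/root/config.db.bak'"
# or, with sqlite >= 3.27, as a single transaction:
sqlite3 /var/lib/pve-cluster/config.db "VACUUM INTO '/root/config.db.bak2'"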

Is there any reason why PVE does not allow tweaking how often it's flushing onto the drive at all?

because rolling back to an hour ago might mean not being able to re-join the cluster upon reboot. and if you have a power outage (or a cluster-wide fence event in case of a network outage with HA), you might lose all your changes of the past hour, which can potentially cause a lot of trouble. and because it's only a problem if you use hardware that is not recommended (consumer SSDs).

TL;DR please don't use such tools unless you absolutely don't care about your systems/guests/data. and if you do, please include a prominent disclaimer in any posts you make here to avoid us wasting time on the resulting issues.
 
copying the DB files while the DB is running means you can end up with the following sequence of events

- copy of DB
- merging of WAL into DB (if the checkpoint completed, subsequent WAL writes will now happen at the start of the WAL file)
- copy of WAL file

your copy of DB and WAL don't match -> you either lost writes, or corrupted your ("copy" of the) DB

I might be a bit slow today, but I indeed might have lost writes and that's a problem (on one node that will be restarted to rejoin the cluster)?

The WAL is doing checkpointing, so... I am not sure I get how I might end up with a corrupted DB? I might need to disregard the WAL. What am I missing?

because rolling back to an hour ago might mean not being able to re-join the cluster upon reboot.

I'd like to know this in a deterministic fashion. The "not being able" happens under what circumstances?

and if you have a power outage (or a cluster-wide fence event in case of a network outage with HA), you might lose all your changes of the past hour, which can potentially cause a lot of trouble.

My whole point in this thread, also in respect to the OP, was that this is for clusters which are run on UPS.

and because it's only a problem if you use hardware that is not recommended (consumer SSDs).

But this is a perfectly valid recommendation for your customers getting paid support - you do not wish to support hardware that increases support load. Why are you telling this to someone with 2 consumer-grade computers to begin with, the OP in this case? And to the majority of the people on this forum? Because they do not help test the only setup that you want to support? Should he now resolder his RAM to ECC and get a Xeon?

TL;DR please don't use such tools unless you absolutely don't care about your systems/guests/data.

Scare-mongering. Have backups - anyone, anytime, to begin with.

EDIT: If durability is of such concern and only PLP SSDs are good enough for PVE, why is it that pmxcfs happens to run with synchronous=NORMAL, as I just noticed?

and if you do, please include a prominent disclaimer in any posts you make here to avoid us wasting time on the resulting issues.

So my point above was very valid.
 
If anyone is still following this or later finds this thread, for the record I can rely on [1]; just excerpts (emphasis mine):

3. Failure to sync

In order to guarantee that database files are always consistent, SQLite will occasionally ask the operating system to flush all pending writes to persistent storage then wait for that flush to complete. This is accomplished using the fsync() system call under unix and FlushFileBuffers() under Windows. We call this flush of pending writes a "sync".

Actually, if one is only concerned with atomic and consistent writes and is willing to forego durable writes, the sync operation does not need to wait until the content is completely stored on persistent media. Instead, the sync operation can be thought of as an I/O barrier. As long as all writes that occur before the sync are completed before any write that happens after the sync, no database corruption will occur. If sync is operating as an I/O barrier and not as a true sync, then a power failure or system crash might cause one or more previously committed transactions to roll back (in violation of the "durable" property of "ACID") but the database will at least continue to be consistent, and that is what most people care about.

I believe the premise at the beginning here is obsolete now, but not the conclusions.

3.1. Disk drives that do not honor sync requests

Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.

USB flash memory sticks seem to be especially pernicious liars regarding sync requests. One can easily see this by committing a large transaction to an SQLite database on a USB memory stick. The COMMIT command will return relatively quickly, indicating that the memory stick has told the operating system and the operating system has told SQLite that all content is safely in persistent storage, and yet the LED on the end of the memory stick will continue flashing for several more seconds. Pulling out the memory stick while the LED is still flashing will frequently result in database corruption.

Note that SQLite must believe whatever the operating system and hardware tell it about the status of sync requests. There is no way for SQLite to detect that either is lying and that writes might be occurring out-of-order. However, SQLite in WAL mode is far more forgiving of out-of-order writes than in the default rollback journal modes. In WAL mode, the only time that a failed sync operation can cause database corruption is during a checkpoint operation. A sync failure during a COMMIT might result in loss of durability but not in a corrupt database file. Hence, one line of defense against database corruption due to failed sync operations is to use SQLite in WAL mode and to checkpoint as infrequently as possible.
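
For what it's worth, the auto-checkpoint threshold sqlite uses is tunable - per connection though, so pmxcfs itself would have to set it; a sketch:

Bash:
# default: a checkpoint is attempted every 1000 WAL pages
sqlite3 some.db "PRAGMA wal_autocheckpoint;"        # query the current threshold
sqlite3 some.db "PRAGMA wal_autocheckpoint=10000;"  # checkpoint less often (this connection only)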

[1] https://www.sqlite.org/howtocorrupt.html
 
the problem is that by doing the copying on a live DB, you are not taking the precautions that sqlite itself takes. you basically cross the sync and do a partial rollback that sqlite doesn't expect, and if that happens at the wrong point during a checkpoint (those happen automatically), you end up with a corrupt DB.

see https://www.sqlite.org/howtocorrupt.html#_backup_or_restore_while_a_transaction_is_active (and 1.4)

I know that, but I quoted those parts specifically because the whole section 1 talks about the major principles when it comes to corruption, which in a DB context means ACID. Section 3 then gets to state that if one does not care about the D, then it is alright specifically with WAL - or did I read that wrong? NB With synchronous=NORMAL you also do not worry THAT much about durability, do you?

And let's not forget the hourly flushing is there "just in case", because the idea is that it's flushed on shutdown. And on all nodes, which spreads the risk very thin.
 
hey

i tried to read up on clusters, the quorum, the votes, qdevices etc but i still don't know how and if i should create a 2 node cluster
my pve1 (amd based) is my "main server" (funny to call a mini pc that) that is tasked to host all the little vms i need in my home.
i plan on getting a second mini pc (pve2, intel based) to be primarily a host for plex/jellyfin, but probably within a proxmox environment

ideally i want to achieve that i can manage both pve1 and pve2 over a single gui and be able to (manually) move a vm from pve1 to pve2 and vice versa

i don't need a shared storage, HA or probably many other features that come with a cluster

will creating a 2 node cluster open up more issues than i plan to actually achieve?
can i create a 2 node cluster without backing up/deleting/restoring every vm?
do i have to keep the quorum in mind or could this be somehow negated for my tuned-down needs? can i simply use the synology quorum server of my synology nas or should i re-purpose a raspberry pi?
what happens if i ever needed to change one of the cluster-nodes? as long as the q-device and pve1 are online, pve2 could be dropped from the cluster and a new node could join? would i then have to repeat the backup/delete/restore step for the remaining node?
is there anything else that i should take into consideration or that is often missed by noobs?

thank you very much for your time and experience
Hi

I have a similar story with two nodes and am interested as well. Both servers are physical; their physical interfaces are converted to a bridged network connection, and this local network goes through a NAT router via which I connect to the web UI. Each of them has one more physical interface which I want to use for the cluster via a crossover cable (network>HOST___host<network).
How do I specify the necessary (crossover cable) interface, or will the cluster itself connect on it? I want quorum to occur over the crossover cable.
 
Section 3 then gets to state that if one does not care about the D, then it is alright specifically with WAL - or did I read that wrong?

no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.
 
no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.

During a checkpoint, what is the worst that happens? I get a DB file and a WAL file which might not be taken from the same point in time. Is that it? Given how the WAL file format works, it should not really matter: either I get the WAL, and by the time I get the DB it has gone through a checkpoint, so my WAL is stale; or I get the DB, and by the time I get the WAL it has been reset because a checkpoint happened. "Corrupt" as in I lost durability is a non-topic, as we are already talking up-to-1-hour-stale data. Corrupt in the case of a normal rollback journal I can imagine, but what do I get corrupted with WAL journalling when I do not care about durability? I suppose you mean corrupt as in inconsistent?
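
If needed, I can test exactly that sequence on a scratch node - copy the DB, force a checkpoint in between, then copy the WAL - and inspect the result (a sketch; paths assume the standard config.db location):

Bash:
DB=/var/lib/pve-cluster/config.db
cp "$DB" /tmp/copy.db                              # step 1: copy of DB
sqlite3 "$DB" "PRAGMA wal_checkpoint(TRUNCATE);"   # step 2: force the checkpoint in between
cp "$DB-wal" /tmp/copy.db-wal 2>/dev/null || true  # step 3: copy of the (now reset) WAL
sqlite3 /tmp/copy.db "PRAGMA integrity_check;"     # inspect the resulting copy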
 
One more point (emphasis mine):

no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.

You sure about that first one? Because then how dare you have anything other than synchronous=FULL?
 
yes, if you use broken hardware things might be broken (you notice a pattern - please use proper hardware!). copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..). that was my point. you can choose not to believe me (or sqlite upstream) and pretend it's all fine and dandy, but (at the risk of sounding like a broken record) - you get to keep the pieces (and the blame).
 
yes, if you use broken hardware things might be broken (you notice a pattern - please use proper hardware!). copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..). that was my point. you can choose not to believe me (or sqlite upstream) and pretend it's all fine and dandy, but (at the risk of sounding like a broken record) - you get to keep the pieces (and the blame).

@fabian My point in this whole thread (from non-PLP or low-TBW SSDs through someone else's RAM-disk buffer for offsetting all that) was not to argue, even though it might look so. TL;DR below.

1. I believe (as the OP's post documents) lots of the people on this forum are hobbyists and they want to use what makes the most sense. If one starts with non-server hardware, its RAM is already non-ECC, so we would need to tell everyone not to use ZFS, etc., etc.

2. I also believe that for your support tickets, where paying customers are running this in production, recommending all those things makes sense.

3. As all these people on the forum help you test (for free - they get the product for free in return), it comes across as elitist to tell them they should not use this or that, unless you prefer those people not be on the forum, which shrinks your tester base.

Regarding the topic at hand (config.db), I do not care who is correct; I am at this moment of the opinion that because it's WAL-journalled, and I know I do not care about durability (which I believe you do not care about that much either), it should not end up violating any of the remaining ACI[D] principles. As you have not mentioned why exactly, all I can do is literally test-run it with a mismatching WAL, a missing WAL, etc. and see where my understanding of sqlite might be lacking.

TL;DR In either case, as people also like to make backups of running nodes (without using a snapshot), it might be great to add e.g. a virtual /etc/pve/backup/configdb-timestamp/ path where the sqlite backup API dumps the DB in a proper way at a defined interval, so as to facilitate better backups, disaster recovery, etc. The person who made the RAM-disk tool apparently wanted a drop-in solution which is not a patch. And somehow it is very clear to me you will never pull a patch that makes the PRAGMA synchronous value available to hobbyists in a config file, alongside a self-defined value for how often to flush.
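
Until then, a cron job using the backup API would be a stopgap - a sketch with a hypothetical script path and target directory:

Bash:
#!/bin/sh
# hypothetical /etc/cron.hourly/configdb-backup: consistent dump via sqlite's backup API
mkdir -p /var/backups/configdb
sqlite3 /var/lib/pve-cluster/config.db ".backup '/var/backups/configdb/config-$(date +%s).db'"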
 
copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..)

For the record, I still do not see this statement (and the previous "might") as accurate under the given conditions, because of the very fact of how WAL checkpointing [1] works (emphasis mine):

4.3. Checkpoint Algorithm

On a checkpoint, the WAL is first flushed to persistent storage using the xSync method of the VFS. Then valid content of the WAL is transferred into the database file. Finally, the database is flushed to persistent storage using another xSync method call. The xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.

A checkpoint need not run to completion. It might be that some readers are still using older transactions with data that is contained in the database file. In that case, transferring content for newer transactions from the WAL file into the database would delete the content out from under readers still using the older transactions. To avoid that, checkpoints only run to completion if all readers are using the last transaction in the WAL.

[1] http://www.sqlite.org/draft/fileformat2.html#walformat
 
Hi

I have a similar story with two nodes and am interested as well. Both servers are physical; their physical interfaces are converted to a bridged network connection, and this local network goes through a NAT router via which I connect to the web UI. Each of them has one more physical interface which I want to use for the cluster via a crossover cable (network>HOST___host<network).
How do I specify the necessary (crossover cable) interface, or will the cluster itself connect on it? I want quorum to occur over the crossover cable.
I think you had better open a new thread, but in your case it sounds like you can read this one from the beginning (whether you prefer a qdevice or one node having 2 votes, etc., so it does not fall apart). For the networking, this is normal Debian configuration; once the interfaces have an IP, you create the cluster on that network - see 5.4.3 Adding Nodes with Separated Cluster Network here:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_join_node_to_cluster
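
Roughly like this, with example addresses on the crossover link (a sketch, adjust to your interfaces):

Bash:
# on the first node (crossover interface configured as e.g. 10.10.10.1):
pvecm create CLUSTERNAME --link0 10.10.10.1
# on the second node (crossover interface as 10.10.10.2), join via the first node's link address:
pvecm add 10.10.10.1 --link0 10.10.10.2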
 
