2 Node Cluster with the least amount of "clusterization" - how?

For larger clusters (or, really, ANY cluster) this is probably a very bad idea. Cluster information may be updated from other nodes. Since your in-RAM database is updated UNIDIRECTIONALLY (meaning, written down but not read up), you end up with conflicting DBs. You'll have out-of-sync nodes that are unaware of their condition.

My logic was that a large cluster typically has all nodes on UPS to begin with, so having all of them suffer a power loss at the same time is virtually impossible; any shutdown would therefore properly flush the content onto disk. And if a single node or a few nodes experience hardware failure, nothing happens to the quorum.

But I am not sure what you mean by "written down but not read up" - it's an in-memory virtual filesystem on all nodes at all times; the difference between using and not using the tool is how often it's flushed onto the local drive.
 
But I am not sure what you mean by "written down but not read up" - it's an in-memory virtual filesystem on all nodes at all times; the difference between using and not using the tool is how often it's flushed onto the local drive.
In a cluster environment, /etc/pve is already kept in RAM. Adding an additional ramdisk layer on top of that which can only write changes down to the lower layer, but cannot read up from it, can wreak havoc on the cluster. This has nothing to do with power. This "solution" is only potentially useful for a single non-clustered node.

see https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs) to understand how this works.
 
In a cluster environment, /etc/pve is already kept in RAM. Adding an additional ramdisk layer on top of that which can only write changes down to the lower layer, but cannot read up from it, can wreak havoc on the cluster. This has nothing to do with power. This "solution" is only potentially useful for a single non-clustered node.

see https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs) to understand how this works.

I believe I do know how it works, which is why I am surprised you consider it to be a problem. The config.db backing the pmxcfs is written to in order to persist the changes within a cluster. The DB file is written using PRAGMA journal_mode=WAL. In whichever state it is left, it will be fine in an otherwise healthy cluster, because a node starting up with some stale config.db will be promptly refreshed from the rest of the cluster holding the more recent state. Which is why I mentioned - assuming people run clusters on UPS - that if just a node or two experience a sudden power loss, even if they were using the tool (which by default flushes once an hour*), they'll be exposed to virtually no risk.
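
Easy to check on a node, by the way - the journal mode is recorded in the DB file itself (assuming the standard config.db location):

Bash:
sqlite3 /var/lib/pve-cluster/config.db "PRAGMA journal_mode;"   # prints: wal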

It's the equivalent of the pmxcfs buffering the commits for an hour*, which could just as well have been supported natively. Not to mention - within this thread and the OP's question - a non-PLP SSD might even report everything as safely in NAND while that is not the case.

Now if you believe I am wrong in some specific part, and that in particular brings up the risks, please point it out to me like to a 5-year-old; I'll happily acknowledge where I was wrong.

For a local (non-clustered) pmxcfs it's a non-topic; I do not think there are any substantial writes going on.

EDIT: The tool simply mounts the ramdisk over where the config.db would be on the drive, so reading back happens from the most recent local copy - it just happens to be in RAM (roughly as sketched below). The flushing is configurable; it's really only there for power-loss events. On a shutdown it would flush correctly. It's a workaround for a missing feature.

EDIT2: Corrected myself above - it flushes onto the drive once an hour by default (configurable).
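
Conceptually the tool boils down to something like this (a simplified sketch with a hypothetical persistent path, not the actual script):

Bash:
# simplified concept: a tmpfs layer where config.db normally lives
systemctl stop pve-cluster
mount -t tmpfs -o size=64M tmpfs /var/lib/pve-cluster          # RAM-backed directory
cp -a /var/lib/pve-cluster.persistent/. /var/lib/pve-cluster/  # seed from the last flushed copy
systemctl start pve-cluster
# ...then flush RAM back to the persistent copy periodically and on shutdown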
 
except it neither copies in a consistent fashion, nor is it crash safe while persisting, so it does in fact have a small chance of either corrupting your (on-disk) DB, or losing it entirely. yes, you have to be unlucky, but the window of opportunity opens once an hour, so if your system runs long enough you might just hit that jackpot.
 
except it neither copies in a consistent fashion

The tool itself? I am not sure what you mean by "consistent" - it copies whatever is there at any given point. What's the problem when I copy a DB file at any point, especially a WAL-journalled one?

, nor is it crash safe while persisting

I didn't check that - it's not my tool - but again, if you have multiple nodes, what's the problem? Especially since just yesterday we discussed that you'd rather not have the corosync service even restart on a live node.

, so it does in fact have a small chance of either corrupting your (on-disk) DB

I do not know how an ACID DB file could get corrupted - by definition?

, or losing it entirely.

When there are <number of nodes> copies of it around at any given time?

yes, you have to be unlucky, but the window of opportunity opens once an hour, so if your system runs long enough you might just hit that jackpot.

Is there any reason why PVE does not allow tweaking how often it's flushing onto the drive at all?

NB The tool is not mine.
 
nor is it crash safe while persisting

So [1]:

Bash:
function persist_data () {
    #Write data stored in RAM to disk
    rm "$VARLIBDIR_PERSISTENT_PATH"/*
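    # NOTE: a crash between this rm and the cp below leaves NO on-disk copy at all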
    cp -r "$VARLIBDIR_RAM_PATH"/* "$VARLIBDIR_PERSISTENT_PATH"
}
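
For the record, making that crash-safe would take a few lines - stage into a temporary directory and swap via rename (a sketch reusing the tool's variable names, not a tested patch):

Bash:
function persist_data_safe () {
    # Stage the copy, then swap it in via rename, so a crash at any
    # point leaves either the previous or the new on-disk copy intact.
    local tmp="${VARLIBDIR_PERSISTENT_PATH}.tmp"
    rm -rf "$tmp"
    cp -r "$VARLIBDIR_RAM_PATH" "$tmp"
    sync                                              # ensure the staged copy is on disk
    rm -rf "${VARLIBDIR_PERSISTENT_PATH}.old"
    mv "$VARLIBDIR_PERSISTENT_PATH" "${VARLIBDIR_PERSISTENT_PATH}.old"
    mv "$tmp" "$VARLIBDIR_PERSISTENT_PATH"
}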

This reminds me of exactly what a Perl guru at Proxmox did in code that has been shipping for 10+ years as part of PVE [2].

Perl:
    unlink $ssh_system_known_hosts;
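    # (a crash right here leaves no known_hosts at all until the symlink below is created)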
    symlink $ssh_cluster_known_hosts, $ssh_system_known_hosts;

    warn "can't create symlink for ssh known hosts '$ssh_system_known_hosts' -> '$ssh_cluster_known_hosts'\n"
    if ! -l $ssh_system_known_hosts;

Yeah, I can make snide remarks too.

[1] https://github.com/isasmendiagus/pm...bc23e7181df8322fe78/pmxcfs-ram.sh#L89C1-L94C1
[2] https://github.com/proxmox/pve-clus...2a9b72a288771d8/src/PVE/Cluster/Setup.pm#L327
 
copying the DB files while the DB is running means you can end up with the following sequence of events

- copy of DB
- merging of WAL into DB (if the checkpoint completed, subsequent WAL writes will now happen at the start of the WAL file)
- copy of WAL file

your copy of DB and WAL don't match -> you either lost writes, or corrupted your ("copy" of the) DB
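
if you absolutely need a copy of a live sqlite DB, use sqlite's own backup mechanisms instead of cp - a sketch, assuming the standard config.db location:

Bash:
# consistent snapshot via sqlite's backup API (handles the WAL correctly)
sqlite3 /var/lib/pve-cluster/config.db ".backup '/root/config.db.bak'"
# or, with sqlite >= 3.27, as a single transaction:
sqlite3 /var/lib/pve-cluster/config.db "VACUUM INTO '/root/config.db.bak2'"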

Is there any reason why PVE does not allow tweaking how often it's flushing onto the drive at all?

because rolling back to an hour ago might mean not being able to re-join the cluster upon reboot. and if you have a power outage (or a cluster-wide fence event in case of a network outage with HA), you might lose all your changes of the past hour, which can potentially cause a lot of trouble. and because it's only a problem if you use hardware that is not recommended (consumer SSDs).

TL;DR please don't use such tools unless you absolutely don't care about your systems/guests/data. and if you do, please include a prominent disclaimer in any posts you make here to avoid us wasting time on the resulting issues.
 
copying the DB files while the DB is running means you can end up with the following sequence of events

- copy of DB
- merging of WAL into DB (if the checkpoint completed, subsequent WAL writes will now happen at the start of the WAL file)
- copy of WAL file

your copy of DB and WAL don't match -> you either lost writes, or corrupted your ("copy" of the) DB

I might be a bit slow today, but I indeed might have lost writes and that's a problem (on one node that will be restarted to rejoin the cluster)?

The WAL is doing checkpointing, so... I am not sure I get how I might end up with a corrupted DB? I might need to disregard the WAL. What am I missing?

because rolling back to an hour ago might mean not being able to re-join the cluster upon reboot.

I'd like to know this in a deterministic fashion. The "not being able" happens under what circumstances?

and if you have a power outage (or a cluster-wide fence event in case of a network outage with HA), you might lose all your changes of the past hour, which can potentially cause a lot of trouble.

My whole point in this thread, also in respect to the OP, was that this is for clusters which are run on UPS.

and because it's only a problem if you use hardware that is not recommended (consumer SSDs).

But this is a perfectly valid recommendation for your customers getting paid support - you do not wish to support hardware that increases support load. Why are you telling this to someone with 2 consumer-grade computers to begin with, the OP in this case? And to the majority of the people on this forum? Because they do not help test the only setup that you want to support? Should he now resolder his RAM to ECC and get a Xeon?

TL;DR please don't use such tools unless you absolutely don't care about your systems/guests/data.

Scare-mongering. Have backups - anyone, anytime, to begin with.

EDIT: If durability is of such concern and only PLP SSDs are good enough for PVE, why is it that pmxcfs happens to run with synchronous=NORMAL, as I just noticed?

and if you do, please include a prominent disclaimer in any posts you make here to avoid us wasting time on the resulting issues.

So my point above was very valid.
 
If anyone is still following this or later finds this thread, for the record I can rely on [1]; just excerpts (emphasis mine):

3. Failure to sync

In order to guarantee that database files are always consistent, SQLite will occasionally ask the operating system to flush all pending writes to persistent storage then wait for that flush to complete. This is accomplished using the fsync() system call under unix and FlushFileBuffers() under Windows. We call this flush of pending writes a "sync".

Actually, if one is only concerned with atomic and consistent writes and is willing to forego durable writes, the sync operation does not need to wait until the content is completely stored on persistent media. Instead, the sync operation can be thought of as an I/O barrier. As long as all writes that occur before the sync are completed before any write that happens after the sync, no database corruption will occur. If sync is operating as an I/O barrier and not as a true sync, then a power failure or system crash might cause one or more previously committed transactions to roll back (in violation of the "durable" property of "ACID") but the database will at least continue to be consistent, and that is what most people care about.

I believe the premise at the beginning here is obsolete now, but not the conclusions.

3.1. Disk drives that do not honor sync requests

Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.

USB flash memory sticks seem to be especially pernicious liars regarding sync requests. One can easily see this by committing a large transaction to an SQLite database on a USB memory stick. The COMMIT command will return relatively quickly, indicating that the memory stick has told the operating system and the operating system has told SQLite that all content is safely in persistent storage, and yet the LED on the end of the memory stick will continue flashing for several more seconds. Pulling out the memory stick while the LED is still flashing will frequently result in database corruption.

Note that SQLite must believe whatever the operating system and hardware tell it about the status of sync requests. There is no way for SQLite to detect that either is lying and that writes might be occurring out-of-order. However, SQLite in WAL mode is far more forgiving of out-of-order writes than in the default rollback journal modes. In WAL mode, the only time that a failed sync operation can cause database corruption is during a checkpoint operation. A sync failure during a COMMIT might result in loss of durability but not in a corrupt database file. Hence, one line of defense against database corruption due to failed sync operations is to use SQLite in WAL mode and to checkpoint as infrequently as possible.
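
For what it's worth, the auto-checkpoint threshold sqlite uses is tunable - per connection though, so pmxcfs itself would have to set it; a sketch:

Bash:
# default: a checkpoint is attempted every 1000 WAL pages
sqlite3 some.db "PRAGMA wal_autocheckpoint;"        # query the current threshold
sqlite3 some.db "PRAGMA wal_autocheckpoint=10000;"  # checkpoint less often (this connection only)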

[1] https://www.sqlite.org/howtocorrupt.html
 
the problem is that by doing the copying on a live DB, you are not taking the precautions that sqlite itself takes. you basically cross the sync and do a partial rollback that sqlite doesn't expect, and if that happens at the wrong point during a checkpoint (those happen automatically), you end up with a corrupt DB.

see https://www.sqlite.org/howtocorrupt.html#_backup_or_restore_while_a_transaction_is_active (and 1.4)

I know that, but I quoted those parts specifically because the whole section 1 talks about the major principles when it comes to corruption, which in a DB context means ACID. Section 3 then gets to state that if one does not care about the D, then it is alright specifically with WAL - or did I read that wrong? NB With synchronous=NORMAL you also do not worry THAT much about durability, do you?

And let's not forget the hourly flushing is there "just in case", because the idea is that it's flushed on shutdown. And on all nodes, which spreads the risk very thin.
 
hey

i tried to read up on clusters, the quorum, the votes, qdevices etc but i still don't know how and if i should create a 2 node cluster
my pve1 (amd based) is my "main server" (funny to call a mini pc that) that is tasked to host all the little vms i need in my home.
i plan on getting a second mini pc (pve2, intel based) to be primarily a host for plex/jellyfin, but probably within a proxmox environment

ideally i want to achieve that i can manage both pve1 and pve2 over a single gui and be able to (manually) move a vm from pve1 to pve2 and vice versa

i don't need a shared storage, HA or probably many other features that come with a cluster

will creating a 2 node cluster open up more issues than i plan to actually achieve?
can i create a 2 node cluster without backing up/deleting/restoring every vm?
do i have to keep the quorum in mind or could this be somehow negated for my tuned-down needs? can i simply use the synology quorum server of my synology nas or should i re-purpose a raspberry pi?
what happens if i ever needed to change one of the cluster-nodes? as long as the q-device and pve1 are online, pve2 could be dropped from the cluster and a new node could join? would i then have to repeat the backup/delete/restore step for the remaining node?
is there anything else that i should take into consideration or that is often missed by noobs?

thank you very much for your time and experience
Hi

I have a similar story with two nodes and am interested as well. Both servers are physical; their physical interfaces are converted to a bridged network connection, and this local network goes through a NAT router via which I connect to the web UI. Each of them has one more physical interface which I want to use for the cluster via a crossover cable (network>HOST___host<network).
How do I specify the necessary (crossover cable) interface, or will the cluster itself connect on it? I want quorum to occur over the crossover cable.
 
Section 3 then gets to state that if one does not care about the D, then it is alright specifically with WAL - or did I read that wrong?

no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.
 
no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.

During a checkpoint, what is the worst that happens? I get a DB file and a WAL file which might not be taken from the same point in time. Is that it? Given how the WAL file format works, it should not really matter: either I get the WAL, and by the time I get the DB it has gone through a checkpoint, so my WAL is stale; or I get the DB, and by the time I get the WAL it has been reset because a checkpoint happened. "Corrupt" as in I lost durability is a non-topic, as we are already talking up-to-1-hour-stale data. Corrupt in the case of a normal rollback journal I can imagine, but what do I get corrupted with WAL journalling when I do not care about durability? I suppose you mean corrupt as in inconsistent?
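
If needed, I can test exactly that sequence on a scratch node - copy the DB, force a checkpoint in between, then copy the WAL - and inspect the result (a sketch; paths assume the standard config.db location):

Bash:
DB=/var/lib/pve-cluster/config.db
cp "$DB" /tmp/copy.db                              # step 1: copy of DB
sqlite3 "$DB" "PRAGMA wal_checkpoint(TRUNCATE);"   # step 2: force the checkpoint in between
cp "$DB-wal" /tmp/copy.db-wal 2>/dev/null || true  # step 3: copy of the (now reset) WAL
sqlite3 /tmp/copy.db "PRAGMA integrity_check;"     # inspect the resulting copy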
 
One more point (emphasis mine):

no, it specifically says that
- "fake" sync/out-of-order writes during COMMIT -> loss of D
- .. during CHECKPOINT -> corrupt DB

both are a problem for PVE, although the first *might* be recoverable in a cluster.

You sure about that first one? Because then how dare you have anything other than synchronous=FULL?
 
yes, if you use broken hardware things might be broken (you notice a pattern - please use proper hardware!). copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..). that was my point. you can choose not to believe me (or sqlite upstream) and pretend it's all fine and dandy, but (at the risk of sounding like a broken record) - you get to keep the pieces (and the blame).
 
yes, if you use broken hardware things might be broken (you notice a pattern - please use proper hardware!). copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..). that was my point. you can choose not to believe me (or sqlite upstream) and pretend it's all fine and dandy, but (at the risk of sounding like a broken record) - you get to keep the pieces (and the blame).

@fabian My point in this whole thread (from non-PLP or low-TBW SSDs through someone else's RAM-disk buffer for offsetting all that) was not to argue, even though it might look so. TL;DR below.

1. I believe (as the OP's post documents) lots of the people on this forum are hobbyists and they want to use what makes the most sense. If one starts with non-server hardware, its RAM is already non-ECC, so we would need to tell everyone not to use ZFS, etc., etc.

2. I also believe that for your support tickets, where paying customers are running this in production, recommending all those things makes sense.

3. As all these people on the forum help you test (for free - they get the product for free in return), it comes across as elitist to tell them they should not use this or that, unless you prefer those people not be on the forum, which shrinks your tester base.

Regarding the topic at hand (config.db), I do not care who is correct; I am at this moment of the opinion that because it's WAL-journalled, and I know I do not care about durability (which I believe you do not care about that much either), it should not end up violating any of the remaining ACI[D] principles. As you have not mentioned why exactly, all I can do is literally test-run it with a mismatching WAL, a missing WAL, etc. and see where my understanding of sqlite might be lacking.

TL;DR In either case, as people also like to make backups of running nodes (without using a snapshot), it might be great to add e.g. a virtual /etc/pve/backup/configdb-timestamp/ path where the sqlite backup API dumps the DB in a proper way at a defined interval, so as to facilitate better backups, disaster recovery, etc. The person who made the RAM-disk tool apparently wanted a drop-in solution which is not a patch. And somehow it is very clear to me you will never pull a patch that makes the PRAGMA synchronous value available to hobbyists in a config file, alongside a self-defined value for how often to flush.
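
Until then, a cron job using the backup API would be a stopgap - a sketch with a hypothetical script path and target directory:

Bash:
#!/bin/sh
# hypothetical /etc/cron.hourly/configdb-backup: consistent dump via sqlite's backup API
mkdir -p /var/backups/configdb
sqlite3 /var/lib/pve-cluster/config.db ".backup '/var/backups/configdb/config-$(date +%s).db'"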
 
copying a live DB from outside mimics the behaviour of such broken hardware (potentially with additional failure causes such as partial/inconsistent reads, ..)

For the record, I still do not see this statement (and the previous "might") as accurate under the given conditions, because of the very fact of how WAL checkpointing [1] works (emphasis mine):

4.3. Checkpoint Algorithm

On a checkpoint, the WAL is first flushed to persistent storage using the xSync method of the VFS. Then valid content of the WAL is transferred into the database file. Finally, the database is flushed to persistent storage using another xSync method call. The xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.

A checkpoint need not run to completion. It might be that some readers are still using older transactions with data that is contained in the database file. In that case, transferring content for newer transactions from the WAL file into the database would delete the content out from under readers still using the older transactions. To avoid that, checkpoints only run to completion if all readers are using the last transaction in the WAL.

[1] http://www.sqlite.org/draft/fileformat2.html#walformat
 
Hi

I have a similar story with two nodes and am interested as well. Both servers are physical; their physical interfaces are converted to a bridged network connection, and this local network goes through a NAT router via which I connect to the web UI. Each of them has one more physical interface which I want to use for the cluster via a crossover cable (network>HOST___host<network).
How do I specify the necessary (crossover cable) interface, or will the cluster itself connect on it? I want quorum to occur over the crossover cable.
I think you had better open a new thread, but in your case it sounds like you can read this one from the beginning (whether you prefer a qdevice or one node having 2 votes, etc., so it does not fall apart). For the networking, this is normal Debian configuration; once the interfaces have an IP, you create the cluster on that network - see 5.4.3 Adding Nodes with Separated Cluster Network here:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_join_node_to_cluster
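
Roughly like this, with example addresses on the crossover link (a sketch, adjust to your interfaces):

Bash:
# on the first node (crossover interface configured as e.g. 10.10.10.1):
pvecm create CLUSTERNAME --link0 10.10.10.1
# on the second node (crossover interface as 10.10.10.2), join via the first node's link address:
pvecm add 10.10.10.1 --link0 10.10.10.2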
 
