MDRAID & O_DIRECT

esi_y

Renowned Member
Nov 29, 2023
I moved this away from the "Proxmox 4.4 virtio_scsi regression" thread to avoid necroposting; this was originally a reply to a post from @t.lamprecht.

NB: I am not soliciting any reaction; I just wanted to give anyone who wants to correct me the opportunity to do so. The below is my understanding.

Yeah, and as Wolfgang wrote, one can trigger a modification of the buffer after the write was issued,

The way I understood this whole historical situation:
1. there was unexplained corruption happening for some user somewhere on mdraid;
2. a hypothesis was made as to why this was happening;
3. a test case was written (and then passed on to Stanislav to post to the kernel BZ?);
4. displeased with the inaction, the hypothesis (2) was added to the thread;
5. displeased with the continuing inaction, another solution was chosen.

I do not care so much about (5), as that is your choice for your own solution, but the reasoning is completely wrong:

6. If I now shipped this test case (3) to Hannes, I would be a source of a good laugh, as I would be reproducing something that is literally expected to behave the way it does;
7. I cannot reproduce the hypothetical (2) which the test case (3) was supposed to demonstrate;
8. I can choose a filesystem that ignores my O_DIRECT (as you did), but;
9. I can also choose not to use O_DIRECT.

It is also noteworthy that QEMU does not even default to O_DIRECT.
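To make that concrete, here is a minimal sketch (the file name testfile is just a placeholder): with plain buffered I/O the kernel copies the user buffer into the page cache during the write() call itself, so modifying the buffer afterwards, e.g. from another thread, cannot change data that was already submitted.

C:
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    static char buf[4096];
    memset(buf, 'A', sizeof(buf));

    int fd = open("testfile", O_CREAT | O_RDWR, 0644); // note: no O_DIRECT
    if (fd < 0)
        return 1;

    // the kernel copies buf into the page cache right here, during the syscall
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        return 1;

    memset(buf, 'B', sizeof(buf)); // too late: the kernel already holds its own copy
    fsync(fd);                     // what reaches the disks is the 'A' snapshot
    close(fd);
    return 0;
}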

so a storage technology should be resilient to those

You just killed off (with your list) every single filesystem except the CoW ones.

In any case, the fact that there is a test case that can make this happen is enough for it to be a problem; the kernel cannot tell whether a user-space behavior comes from a made-up test case or from an (also made-up) "real" program.

The so-called test case is not relevant (6). It is a bit like saying that a test case exists showing that mdraid level 1 cannot fix silent corruption.

The "re-use just swapped out memory" is one vector Wolfgang singled out

But I cannot reproduce it.

but there can be others.

Which I have no reproducer for.

And sure O_DIRECT is a PITA interface

So instead of not using it you *need* a filesystem that ... ignores it.

as mentioned by Wolfgang/Fabian, that's why md-RAID is just not an option for Proxmox VE

You all reference each other, but historically only Dietmar had a point with "too difficult for users to recover from failures".

There are reasons not to use mdraid, for sure, but those are well known; it is not this one, blown out of proportion.

this thread was opened due to a user running into problems that were very real world for them.

To this day, we do not know what caused it. I know how it feels; I had it happen with an unnamed filesystem too.
 
Your points and their order are quite off again. But conveying information to you, without wasting tons of time on your nit-picking of some only very tangentially related details and overly dense interpretations of what is written here or in some other post/doc/..., is exceptionally hard and nerve-wracking for all Proxmox staff, so neither I nor any other staff member will play your "disassembly with line-by-line quoting, with some new borderline-fantasy connections and context mixed in" game here. FWIW, you can always hire a consultant or enterprise support to get help; there is no need to misuse the Proxmox Community Support Forum for this.

Just one thing for others who might stumble upon this weird thread, which solicits a reply through its mere existence:
The so-called test case is not relevant (6). It is a bit like saying that a test case exists showing that mdraid level 1 cannot fix silent corruption.
This is getting pathetic, but one last time: mdraid 1 is a RAID mirror, or at least advertised as such; allowing it to get out of sync through an unprivileged user-space write pattern is a pretty bad bug, as that breaks the one thing RAID 1 mirrors are there for.
MD RAID is also bad in UX; that and some design/code smells were the original reasons we do not support it. The above just cemented it, and it is used as an argument because it is a specific result of the code/design smell and as such easier for users to accept as a reason, but even if it were magically fixed, we would not start supporting mdraid.
Besides that, the reproducer still "works wonders", as recently confirmed in the kernel BZ. Also, ZFS O_DIRECT support is designed to be safe against this issue, so it is possible to do so. Any RAID basically just needs a memcopy, or another more efficient approach to make the buffer/page "sticky", like ZFS seems to use, in some specific situations, to avoid a recycled buffer being observed differently by each RAID mirror copy thread. Some devs of software-RAID-like solutions do not care about this, as performance is more important to them; independent of their arguments or the validity of doing so, it is a fact that this can be triggered by any user that can do O_DIRECT writes and cause the mirror pairs to no longer be exact copies of each other. If you are personally fine with that, or your use case is vetted to not trigger it, then great, use it – but do not downplay such issues to others who might not be able to make the same choice; you just cannot know if their use case is affected.
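To make the "memcopy to keep the buffer sticky" idea above concrete, a minimal illustrative sketch; this is not md or ZFS code, and mirror_write() as well as the per-leg image files are made up for the example. The caller's buffer is snapshotted once into a private bounce buffer, and every mirror leg is then written from that same snapshot, so a caller that keeps modifying its buffer can no longer make the legs diverge.

C:
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int mirror_write(const int *leg_fds, size_t nlegs,
                        const void *user_buf, size_t len, off_t off) {
    void *bounce = NULL;
    if (posix_memalign(&bounce, 4096, len)) // aligned copy, usable for O_DIRECT legs too
        return -1;
    memcpy(bounce, user_buf, len);          // the one extra copy that buys consistency

    int rc = 0;
    for (size_t i = 0; i < nlegs; ++i) {
        // every leg sees exactly the same bytes, whatever the caller does meanwhile
        if (pwrite(leg_fds[i], bounce, len, off) != (ssize_t)len) {
            rc = -1;
            break;
        }
    }
    free(bounce);
    return rc;
}

int main(void) {
    int legs[2] = {
        open("leg0.img", O_CREAT | O_RDWR, 0644),
        open("leg1.img", O_CREAT | O_RDWR, 0644),
    };
    if (legs[0] < 0 || legs[1] < 0)
        return 1;

    char data[4096];
    memset(data, 'X', sizeof(data));

    int rc = mirror_write(legs, 2, data, sizeof(data), 0);
    close(legs[0]);
    close(legs[1]);
    return rc ? 1 : 0;
}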

Side note: BTRFS can be used such that this is also a problem; our implementation chooses defaults such that it won't be a problem, though (ref). Albeit Roland found an edge case with local-storage migration from a non-btrfs to a btrfs-backed storage where this might need to be explicitly added. Anyhow, that is not a blocker, as it can be worked around, but it was one reason why BTRFS support was introduced as a tech preview in PVE.

Oh, it follows the code of the reproducer, which is just two threads and a shared buffer, a pattern which is used in tons of programs.

C:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>

#include <stdbool.h>
#include <stdatomic.h>
#include <threads.h>

_Noreturn void bailerr(int errnum, const char *fmt, ...) {
    if (errnum) {
        char errbuf[128] = {};
        const char *errmsg = strerror_r(errnum, errbuf, sizeof(errbuf));
        errbuf[127] = 0; // whatever
        fprintf(stderr, "\x1b[1;31merror:\x1b[0;2m (%s)\x1b[0m ", errmsg);
    } else {
        fprintf(stderr, "\x1b[1;31merror:\x1b[0m ");
    }

    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    exit(1);
}

#define bail(...) bailerr(errno, __VA_ARGS__)

#define autoclose __attribute__((cleanup(do_close)))
void do_close(int *fd) {
    if (*fd >= 0) {
        (void)close(*fd);
        *fd = -EBADF;
    }
}

// O_DIRECT requires the buffer address to be aligned (typically to the logical block size); 4096 should be suitable for most cases:
static char _Alignas(4096) BUFFER[4096] = {0};

static atomic_bool DONE = false;

static int scribble_scrabble(void *arg) {
    (void)arg;

    (void)write(1, "Running\n", sizeof("Running\n")-1);
    while (!atomic_load_explicit(&DONE, memory_order_acquire)) {
        // keep mutating the shared buffer for as long as the main thread is writing it out
        for (size_t i = 0; i != sizeof(BUFFER); ++i)
            BUFFER[i] += i+42;
    }

    return 0;
}

int main(int argc, char **argv) {
    if (argc != 2)
        bail("usage: %s <testfile>\nWARNING: Might break some RAIDs.\n", argv[0]);

    int autoclose fd = open(argv[1], O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0)
        bail("failed to open file '%s'\n", argv[1]);

    thrd_t thread;
    if (thrd_success != thrd_create(&thread, scribble_scrabble, NULL))
        bailerr(0, "failed to create scribbling thread\n");

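    // Main thread: stream O_DIRECT writes from the very buffer the scribbler thread
    // keeps modifying; each RAID mirror leg may read that buffer at a different moment.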
    for (size_t times = 0; times != 2; ++times) {
        for (size_t i = 0; i != 4096*32; ++i) {
            if (pwrite(fd, BUFFER, sizeof(BUFFER), i * sizeof(BUFFER)) != sizeof(BUFFER)) {
                bail("write error\n");
            }
        }
    }

    atomic_store_explicit(&DONE, true, memory_order_relaxed);

    int res;
    thrd_join(thread, &res);

    printf("Your RAID might be broken now.\nYou're welcome.\n");
    return 0;
}

Compile it with e.g. clang -static -std=c17 -o break-raid-odirect break-raid-odirect.c to create a static executable that can be copied around, e.g. to a test system/VM/..., or just gcc break-raid-odirect.c for the classic dynamically linked a.out.
That breaks a host mdraid just fine here when triggered from inside the guest.
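For checking the result, one option is md's built-in consistency check; a minimal sketch, assuming the array under test is /dev/md0 and it is run as root: it triggers a check via sync_action and reads mismatch_cnt afterwards, which stays 0 on a consistent mirror.

C:
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/block/md0/md/sync_action", "w");
    if (!f) { perror("sync_action"); return 1; }
    fputs("check\n", f); // start a read-and-compare pass over all mirror legs
    fclose(f);

    // ...wait until the check finishes (sync_action reads back "idle" again)...

    char line[64] = {0};
    f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
    if (!f) { perror("mismatch_cnt"); return 1; }
    if (fgets(line, sizeof(line), f))
        printf("mismatch_cnt: %s", line); // non-zero: the mirror legs are no longer identical
    fclose(f);
    return 0;
}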
 
mdraid 1 is a RAID mirror, or at least advertised as such; allowing it to get out of sync through an unprivileged user-space write pattern is a pretty bad bug, as that breaks the one thing RAID 1 mirrors are there for.

You cannot call something a bug in one storage solution when you yourself cater for it in another (BTRFS):
https://bugzilla.proxmox.com/show_bug.cgi?id=5320

MD RAID is also bad in UX; that and some design/code smells were the original reasons we do not support it. The above just cemented it, and it is used as an argument because it is a specific result of the code/design smell and as such easier for users to accept as a reason, but even if it were magically fixed, we would not start supporting mdraid.

I believe this is what should have been communicated, as the reproducer is indeed "invalid". This is not just me saying so; I actually picked up on this only because you stated that O_DIRECT should be catered for by the filesystem, yet your pick, ZFS, effectively does so by ignoring it.

You could also reply to the gentleman here, if you can formulate it logically, because I cannot grasp it, especially since BTRFS would also be "broken" under your view in that case:
https://marc.info/?l=linux-raid&m=172854310516409&w=2


Any RAID basically just needs a memcopy, or another more efficient approach to make the buffer/page "sticky", like ZFS seems to use, in some specific situations, to avoid a recycled buffer being observed differently by each RAID mirror copy thread. Some devs of software-RAID-like solutions do not care about this, as performance is more important to them, independent of their arguments or the validity of doing

This is your opinion.

so, it is a fact that this can be triggered by any user that can do O_DIRECT writes and cause the mirror pairs to no longer be exact copies of each other.

Correct, but in the case of QEMU it is PVE that defaults to cache=none; well, except for BTRFS, which got an excuse...

If you are personally fine with that, or your use case is vetted to not trigger it, then great, use it – but do not downplay such issues to others who might not be able to make the same choice; you just cannot know if their use case is affected.

My take on this is that it is PVE's responsibility to cater for the correct caching mode. I do not write this to annoy anyone; you already do exactly this for BTRFS.

Side note: BTRFS can be used such that this is also a problem; our implementation chooses defaults such that it won't be a problem, though (ref). Albeit Roland found an edge case with local-storage migration from a non-btrfs to a btrfs-backed storage where this might need to be explicitly added. Anyhow, that is not a blocker, as it can be worked around, but it was one reason why BTRFS support was introduced as a tech preview in PVE.

I interleave, so I will basically just summarize that for myself from the above: "this not being a problem with btrfs" and "mdadm being a problem with the same" are logically exclusive statements. But I take the part that you had other (unspecified) reasons to make your choices.

Oh, it follows the code of the reproducer, which is just two threads and a shared buffer, a pattern which is used in tons of programs.

Yes, see the linked comment that you are familiar with. For me, I want the freedom to use O_DIRECT when I know what I am doing. I know there are people who would call this approach a no-no, but you can always prevent your users from doing this by setting another caching mode with QEMU, which you literally do for BTRFS.

I have no problem with any of the choices; it is just that your current wiki basically makes everyone feel like MDRAID is some horrible technology, when every other one is the same (except when the choice is ignored, as was the case with ZFS until recently).
 
I just want to add that I am not calling anyone (or their reasoning) pathetic or anything just because I disagree with them. If my answers appear combative, it is basically because I strongly disagree on the points. I do not say anyone is anything because they hold their own views. What I do not like is how MDRAID is presented in the current wiki (meanwhile BTRFS gets a pass).
 
Yes, see the linked comment that you are familiar with. For me, I want the freedom to use O_DIRECT when I know what I am doing. I know there are people who would call this approach a no-no, but you can always prevent your users from doing this by setting another caching mode with QEMU, which you literally do for BTRFS.

Also, this is one reason why I never had anything to add to the kernel BZ, even as Roland might get some answer. I intentionally do not add anything, so as not to introduce any bias, but I can guess the answer.

https://bugzilla.kernel.org/show_bug.cgi?id=99171
 
Note that BTRFS is only affected in some modes of operation (CoW), and there "only" for the metadata, and btrfs has file-level checksums and built-in scrubbing, so it is much easier to notice and repair. Still far from ideal, but IMO not really the same level, and, as already stated, part of the reason why its support is still considered a technology preview inside PVE, so I am not really seeing us treating this all that differently.
ZFS effectively does so by ignoring it.
Which is a valid solution: either support it and do it right, or don't and ignore it, using the slightly slower but safe path; for storage the bar is simply higher (and again, btrfs isn't over it in our opinion, but nearer). FWIW, most storage would do well with ignoring O_DIRECT, it is a stupid interface.

If mdadm had better tooling and UX and ways to repair this, then it might have been up for consideration back in the day, and then we also would have catered to its quirks; but it didn't have them, and while FWICT it got a bit better w.r.t. UX, it still seems to be a bit lacking. But that is beside the point: we now have an integrated storage system that has good UX, provides good performance in general compared to the feature set and safety it provides, and handles such problems much better; so there isn't a need for another local storage solution that is as well integrated natively into PVE as ZFS is – which doesn't mean that one cannot use others.

just your current Wiki basically makes everyone feel like MDRAID is some horrible technology,
Where does it state that? We only mention it in the Software RAID article, and there we do not even mention the O_DIRECT behavior. All other points stand, and unknowing users should be made aware of them. We even note how one can still set up a PVE system with MD-RAID for those who are fine with the drawbacks of doing so, which can be a legitimate choice, but our support does not want to handle that anymore for less experienced users. It's not like we started to think MD-RAID is not a good choice for the general user base out of a mood.
 
Note that BTRFS is only affected in some modes of operation (CoW), and there "only" for the metadata, and btrfs has file-level checksums and built-in scrubbing, so it is much easier to notice and repair. Still far from ideal, but IMO not really the same level, and, as already stated, part of the reason why its support is still considered a technology preview inside PVE, so I am not really seeing us treating this all that differently.

I do not follow BTRFS closely enough to start arguing even this :). Last time I checked, it was documented to be "normal" for it to throw checksum warnings into the logs due to this. Of course, I cannot know whether they will leave the status quo or move more in the direction of ZFS.

Which is a valid solution

Or a workaround, depending on how you see it. In the specific case of PVE, you have the QEMU caching mode in your hands.

FWIW, most storage would do well with ignoring O_DIRECT, it is a stupid interface.

:D I cannot say it is a great interface, and it is abused by lots of people, but I just do not think it is the job of the kernel or the storage to prevent this.

If mdadm had better tooling and UX and ways to repair this then it might have been up for consideration back in the day

There will always be some repair issues with storage, case in point (the post can be speed-read, the title says it all): basically there is not that much great tooling for ZFS either. I wish I could advise the OP there anything better, but that is about my limit with ZFS (and a reason for distrust). I am under no illusion that PVE can make some magic GUI tool to recover *that*, but these things happen, and a wiki on ZFS recovery would be sorely needed.

compared to the feature set and safety it provides and handles such problems much better;

See above. But I am *not* asking for MDRAID; I never was.

so there isn't a need for another local storage solution that is as well integrated natively into PVE as ZFS is – which doesn't mean that one cannot use others.

Thank you for stating this.


Where does it state that? We only mention it in the Software RAID article, and there we do not even mention the O_DIRECT behavior. All other points stand, and unknowing users should be made aware of them,

I disagree with this part: now that you have not offered MDRAID for quite some time, there is no reason to even mention it, and if you do, there is no need to try to gather every fact that may go wrong (because, strictly speaking, you have no way of knowing what one will run on top of the RAID in terms of integrity, etc.). I cannot force you to change your wiki, but if MDRAID were some SUSE commercial-only technology, you would have a hard time defending that wiki in litigation.

and we even note how one can still set up a PVE system with MD-RAID for those who are fine with the drawbacks of doing so, which can be a legitimate choice, but our support does not want to handle that anymore for less experienced users,

The "For non-production/unsupported setups, where you still want to use mdraid (please don't!)" is uncalled for, at the least, though. The bitrot part is also not comparing apples with apples.

it's not like we started to think MD-RAID is not a good choice for the general user base out of a mood.

I get that, but on the forum the result is often that people are advised to throw away what they have battle-tested in favour of something undergoing rapid development (ZFS), and BTW I personally have no idea how you even cherry-pick from that (no, I do not want to flame, it is just one more unknown unknown in PVE as a user).

Thanks for decent replies, thank you for your time!
 
there's no reason to even mention it
We got semi-frequent requests for integrating MD-RAID, so there is a reason to state that it's unsupported by Proxmox VE, as that avoids a few requests upfront and allows answering the others more quickly. And tbh, I find it a bit out of line for you to argue this way, after all you lament the lack of some documentation or rationale all the way. E.g., if one created an MD-RAID and ran into problems, and someone here pointed out that it is an unsupported setup for Proxmox VE, not having that stated anywhere the user could have found it upfront would IMO be far from ideal.
The "For non-production/unsupported setups, where you still want to use mdraid (please don't!)" is uncalled for at the least though.
The article tries to "scare" inexperienced users due to the years of work spent in support dealing with such things. That said, I reworded it a bit to try to be more nuanced, while now actually mentioning the drawbacks if O_DIRECT is used and pointing at the cache mode, as knowing this might save somebody's data, which is nice to do even for tech we do not want to support.

but on the forum the result often is that people are suggested to throw away what they got battle tested
Yes, because Proxmox VE does not support that technology (let's ignore the reasons), and so users either need to ask on channels that target the technology they use or distil the issue down to another reproducer that does not involve that technology.
If one ignores that and by luck somebody from the community still helps them then great, but I would not find it surprising that others tell them that it's not really supported by Proxmox VE and that there are other options that are.
in favour of something undergoing rapid development (ZFS)
The core of ZFS is pretty stable and their releases undergo a lot of QA on which we build up with even more QA. Cherry-picks are done only for targeted issues and after checking what else changed in the surrounding area to ensure nothing gets missed. They also get either taken from existing issues/PRs or get proposed as such by us to upstream; from that discussion one can also ensure there's no negative feedback from devs that are even more experienced with the code base.
 
We got semi-frequent requests for integrating MD-RAID, so there is a reason to state that it's unsupported by Proxmox VE

Fair enough, the scaring part ... *

I find it a bit out of line for you to argue this way, after all you lament lack of some documentation or rationale all the way.

I started the entire BZ because I believed there was a real case of this mess happening on swapping, but that is not the case. That's why "the change".

E.g., if one created an MD-RAID and ran into problems, and someone here pointed out that it is an unsupported setup for Proxmox VE, not having that stated anywhere the user could have found it upfront would IMO be far from ideal.
Just a nitpick, but the users who would most likely be bitten are not using the Debian installer.

The article tries to "scare" inexperienced users due to the years of work spent in support dealing with such things

* ... here, was a bit questionable though.

while now actually mentioning the drawbacks if O_DIRECT is used, pointing at the cache mode as knowing this might save somebody's data, which is nice to do even for tech we do not want to support.

EXCELLENT!

If one ignores that and by luck somebody from the community still helps them then great

I would. :)

The core of ZFS is pretty stable and their releases undergo a lot of QA on which we build up with even more QA. Cherry-picks are done only for targeted issues and after checking what else changed in the surrounding area to ensure nothing gets missed. They also get either taken from existing issues/PRs or get proposed as such by us to upstream; from that discussion one can also ensure there's no negative feedback from devs that are even more experienced with the code base.

I understand what you are saying, but basically I would not be comfortable taking e.g. a live Ubuntu to recover a pool from PVE, as in the above-quoted case. And PVE does not allow for that either (to my knowledge).

PS: A ZFS troubleshooting wiki would really be something to have for PVE.
 
real case of the mess happening on swapping
FWIW, all that is needed for that to happen is still there; nothing was really proven or disproven in that regard.
I.e., MD-RAID not coping with a buffer modified after the write was issued, and swap out+in being able to lead to such buffers, both seem likely from the code and design; triggering the latter is just much harder to reproduce, which is why a simpler reproducer was chosen. But yes, that only proves the first part; no idea if anything exists for the second, swap-related part. And for us it does not matter in any way, because we do not support MD-RAID; that is why we will not spend any time investigating this, and that is also why our forum is completely the wrong place to further discuss it, so please refrain from doing so. Not sure if the MD-RAID devs will care, as they will probably put the fault on the memory-management code, and good luck with that. Anyhow, again very explicitly: we don't care, so wrong tree to bark up.
 
FWIW, all that is needed for that to happen is still there; nothing was really proven or disproven in that regard.
I.e., MD-RAID not coping with a buffer modified after the write was issued, and swap out+in being able to lead to such buffers, both seem likely from the code and design; triggering the latter is just much harder to reproduce, which is why a simpler reproducer was chosen. But yes, that only proves the first part; no idea if anything exists for the second, swap-related part.

At least so that everyone gets some laugh out of this: I was swapping in and out for hours with artificially low RAM and a huge swap, with 32 threads, hours and hours on a high-TBW NVMe SSD on a mini PC set aside for this. The result was inconclusive: the MDRAID was just fine, the SSD was just fine, but the NUC never turned on again. Thanks to this, I managed to RMA it 7 days short of the warranty period. :D

BTW someone mentioned that stable pages should prevent this, but I am not going to become a kernel VM-handling expert for this either, so I am not arguing.

And for us it does not matter in any way, because we do not support MD-RAID; that is why we will not spend any time investigating this, and that is also why our forum is completely the wrong place to further discuss it, so please refrain from doing so. Not sure if the MD-RAID devs will care, as they will probably put the fault on the memory-management code, and good luck with that. Anyhow, again very explicitly: we don't care, so wrong tree to bark up.

I am fine with the wiki (OK, maybe the "MD-RAID makes repairing corruption a lot harder" part is speculative), but anyhow, a recovery wiki on ZFS would be great if it comes out of this.
 
