Some updates on this:
One of our developers spent some time investigating this and found that it can be triggered by combining mdraid (a software RAID technology without real integrity checking, which we recommend against, as it can easily result in broken arrays when direct IO is used) with fault injection.
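As a side note: if you're unsure whether a host uses mdraid at all, you can check the kernel's software RAID status, for example with
cat /proc/mdstat
which lists any active md arrays (and shows none if mdraid is not in use).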
Some more background follows; skip to the end if you're only interested in the kernel version that fixes this.
The investigation led to a problematic patch that was introduced into the upstream 6.6 Linux kernel, namely
81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine"). We
worked with upstream to fix this edge case, which resulted in a patch that was backported into our 6.8 kernel about a week ago (for completeness' sake: a first revision of that fix from a week earlier still had some other problems that we found).
While the original problem is in the common block layer, we could only reproduce it with mdraid, so mdraid seems to at least strongly amplify it. It might also be the cause of other, rarer or more setup-specific issues with similar symptoms.
Because of this, we did not notice the issue early on in our own production loads, as we focus on testing recommended setups, and we also did not receive a report through Enterprise Support, where we could have looked into it much more quickly.
Since this is a data race issue, running different kernels, especially for relatively short periods of time, is not a reliable indicator that those kernels were unaffected. While we can't rule out that some other change in newer kernels has increased the likelihood of triggering this problem, it's more realistic that the issue stayed under our and others' radar because mdraid is not used that often anymore, given better and safer options like ZFS or btrfs that provide real data integrity checking. That more Proxmox VE users are affected may have to do with the fact that some hosting providers use mdraid in their default Proxmox VE server templates, even though that goes against our recommendation for the best-supported setup.
Disabling native command queuing, for example through the
libata.force=noncq
kernel option, seems to side-step this issue by preventing the race from occurring, at least that's our current understanding.
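For those who want to try this workaround, this is roughly how the option can be set on Proxmox VE (a sketch, the exact steps depend on the bootloader in use): on hosts booting via GRUB, append libata.force=noncq to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run
update-grub
On hosts booting via systemd-boot (e.g. ZFS on root with UEFI), append it to the single line in /etc/kernel/cmdline and run
proxmox-boot-tool refresh
After a reboot, the active command line can be verified with
cat /proc/cmdline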
Anyhow, the kernel that includes the patch is
proxmox-kernel-6.8
in version
6.8.8-1
(the jump from 6.8.4 to 6.8.8 has nothing to do with this fix though; it was backported separately by us).
This package was uploaded to the no-subscription repository a few moments ago.
It definitively fixes our reproducer, which showed symptoms similar to those mentioned in this thread. While that's naturally not a 100% guarantee, we hope that the issue you're facing is also addressed, ideally including cases where mdraid is not used.
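If you want to pull in the fix and verify it, something along these lines should work, assuming the usual apt-based update path:
apt update
apt full-upgrade
After rebooting into the new kernel, the running version can be checked with
uname -r
which should then report a 6.8.8-1 based build (e.g. 6.8.8-1-pve).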
We'd appreciate feedback, but please try to avoid mixing other issues into this thread; one can be affected by more than one problem at the same time, after all.