My use case is LXC containers on btrfs (by the way, I should publish the patches updated for the latest kernel:
https://forum.proxmox.com/threads/btrfs-experimental-storage-pve-4-4-13-testing-issues-patch.33896/)
For certain virtual machines, I get uncorrectable checksum errors:
Code:
scrub device /dev/sda9 (id 1) done
Scrub started: Wed Aug 28 03:36:48 2019
Status: finished
Duration: 0:00:12
Total to scrub: 6.53GiB
Rate: 438.85MiB/s
Error summary: csum=1
Corrected: 0
Uncorrectable: 1
Unverified: 0
ERROR: there are uncorrectable errors
Aug 28 03:37:00 izabela kernel: BTRFS warning (device sda9): checksum error at logical 44597768192 on dev /dev/sda9, physical 6381367296, root 308, inode 257, offset 18116374528, length 4096, links 1 (path: data.raw)
Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): bdev /dev/sda9 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): unable to fixup (regular) error at logical 44597768192 on dev /dev/sda9
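For reference, the scrub above and the logical address from the kernel log can be investigated with the btrfs tools. The mount point below is an assumption, not my actual layout; adjust it to where the storage is mounted:

```shell
# Hypothetical mount point of the affected btrfs storage; adjust to your setup.
MNT=/var/lib/pve/btrfs
if command -v btrfs >/dev/null 2>&1 && [ -d "$MNT" ]; then
    btrfs scrub start -B "$MNT"   # run the scrub in the foreground
    btrfs device stats "$MNT"     # per-device wr/rd/flush/corrupt/gen counters
    # Resolve the logical address from the kernel log back to a file path:
    btrfs inspect-internal logical-resolve 44597768192 "$MNT"
else
    echo "btrfs tools or mount point not available on this machine"
fi
```

In my case the logical-resolve step is redundant, since the scrub warning already names the file (data.raw), but it helps when the log line lacks a path.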
There’s no drive corruption (all smartctl long tests pass). The very same virtual machine, running on zfs or ext4 storage backends, behaves fine.
Googling, I’ve found
https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs.
“Direct IO writes to Btrfs files can result in checksum warnings. This can happen with other filesystems, but most don't have checksums, so a mismatch between (updated) data and (out-of-date) checksum cannot arise.”
Here’s a link
https://lwn.net/Articles/442355/ that describes the issue:
“When a process writes to a file-backed page in memory (through either a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.
Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.
There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error.”
So, the second write is the culprit.
In the thread
Re: Qemu disk images on BTRFS suffer checksum errors,
https://www.spinics.net/lists/linux-btrfs/msg25940.html, there are some mitigation solutions:
1) First,
“doing nodatacow for that particular image which will disable checksumming for just that file”,
https://btrfs.wiki.kernel.org/index...chattr_.2BC.29_but_still_have_checksumming.3F.
This really works, because there’s no checksum calculation involved, but there’s a high risk of data corruption (it happened to me after a year and a half of smooth running with nodatacow; fortunately, I restored from daily Proxmox backups).
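For the record, the per-file nodatacow is applied with chattr +C, which only takes effect on an empty file, so it has to be set before the image is filled with data. The path below is just an example, not my actual storage layout:

```shell
# Example path only; on Proxmox the image lives under the storage directory.
IMG=/tmp/nocow-demo/data.raw
mkdir -p "$(dirname "$IMG")"
touch "$IMG"                # the file must be empty when +C is set
# chattr +C errors out on filesystems without CoW; tolerate that here.
chattr +C "$IMG" 2>/dev/null || echo "NOCOW attribute not supported here"
lsattr "$IMG" 2>/dev/null   # on btrfs, the 'C' flag should now be listed
```

After that, copy the existing image data into the new file (e.g. with cp or qemu-img convert) so the content inherits the attribute.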
The btrfs wiki
https://btrfs.wiki.kernel.org doesn’t have detailed warnings, only a general one:
“Basically, nodatacow bypasses the very mechanisms that are meant to provide consistency in the filesystem.” But
https://wiki.debian.org/Btrfs does:
“Please read earlier warning about using nodatacow. Applications that support integrity checks and/or self-healing, can somewhat mitigate the risk of nodatacow, but please note that nodatacow files are not protected by raid1 profile's second copy in the event that a disk fails.
At present, nodatacow implies nodatasum; this means that anything with the nodatacow attribute does not receive the benefits of btrfs' checksum protection and self-healing (for raid levels greater >= 1); disabling CoW (Copy on Write) means that the a VM disk image will not be consistent if the host crashes or loses power. Nodatacow also carries the following additional danger on multidisk systems: because nodatasum is disabled there is no way to verify which disk in a two disk raid1 profile volume contains the correct data. After a crash there is a roughly 50% probability that the bad copy will be read on each request. Consequently, it is almost always preferable to disable COW in the application, and nodatacow should only be used for disposable data.“
2) Setting cache=writethrough or cache=writeback, which makes QEMU use buffered I/O instead of direct I/O.
Proxmox itself recommends this for ZFS:
“If you get the warning that the filesystem do not supporting O_DIRECT, set the disk cache type of your VM from none to writeback.” https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#QEMU_disk_cache_mode
I also tested with the safer writethrough, and there were no more checksum errors.
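For completeness, the cache mode can also be switched from the command line on the Proxmox host (the VM id and the storage/volume spec below are placeholders; keep your existing volume and only change the cache option, which is the same setting exposed in the GUI disk options):

```shell
VMID=100                      # placeholder VM id
if command -v qm >/dev/null 2>&1; then
    # Storage/volume spec is a placeholder for your existing disk entry;
    # writethrough is the safer of the two buffered modes.
    qm set "$VMID" --scsi0 local:vm-100-disk-0.raw,cache=writethrough
else
    echo "qm not found; run this on the Proxmox host"
fi
```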
3) Unfortunately, none of these solutions works for LXC containers. So I’ve thought of a third solution, not listed here:
https://www.spinics.net/lists/linux-btrfs/msg25940.html
“Flush runs outside of your LXC container, since your LXC container doesn't have its own kernel. LXC containers exist as a construct around cgroups, which is a feature of the Linux kernel that allows better limitations and isolation of process groups, but not its own kernel or flush daemon.” https://serverfault.com/questions/5...xc-container-writing-large-files-to-di/516088
So, for LXC, I should look for a solution at the kernel level.
From the article https://lwn.net/Articles/442355/, “implementing stable pages to prevent modifying a page that is under writeback” could be that solution.
I’ve found many related articles that indirectly confirm this approach, so I’m fairly settled on it. Please let me know if you see other solutions.
Trying to set the stable_pages_required parameter as in
https://forum.proxmox.com/threads/how-to-setup-kernel-stable-pages-parameter.57403/#post-264404 failed.
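What can still be done is reading the flag: each block device exposes stable_pages_required under its backing device info in sysfs. On the kernels discussed below the attribute is read-only, so this only shows whether the device or filesystem already enforces stable pages:

```shell
# List the stable-pages flag for every visible block device.
found=0
for f in /sys/block/*/bdi/stable_pages_required; do
    [ -r "$f" ] || continue
    found=1
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
[ "$found" -eq 1 ] || echo "no stable_pages_required attributes visible here"
```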
Another relevant article,
https://lwn.net/Articles/528031/, clarifies the kernel developers’ policy: at first they allowed the sysadmin to change this setting, but later they removed that ability.
“Much of the discussion around this patch set has focused on just how that flag gets set. One possibility is that the driver for the low-level storage device will turn on stable pages; that can happen, for example, when hardware data integrity features are in use. Filesystem code could also enable stable pages if, for example, it is compressing data transparently as that data is written to disk. Thus far, things work fine: if either the storage device or the filesystem implementation requests stable pages, they will be enforced; otherwise things will run in the faster mode.
The real question is whether the system administrator should be able to change this setting. Initial versions of the patch gave complete control over stable pages to the user by way of a sysfs attribute, but a number of developers complained about that option. Neil Brown pointed out that, if the flag could change at any time, he could never rely on it within the MD RAID code; stable pages that could disappear without warning at any time might as well not exist at all. So there was little disagreement that users should never be able to turn off the stable-pages flag. That left the question of whether they should be able to enable the feature, even if neither the hardware nor the filesystem needs it, presumably because it would make them feel safer somehow. Darrick had left that capability in, saying:
“I dislike the idea that if a program is dirtying pages that are being written out, then I don't really know whether the disk will write the before or after version. If the power goes out before the inevitable second write, how do you know which version you get? Sure would be nice if I could force on stable writes if I'm feeling paranoid.” Once again, the prevailing opinion seemed to be that there is no actual value provided to the user in that case, so there is no point in making the flag user-settable in either direction. As a result, subsequent updates from Darrick took that feature out.”