How to set up kernel stable pages parameter

EuroDomenii

My use case is the one described in https://lwn.net/Articles/442355/: containers on storage with checksum integrity, where I can’t use the QEMU workaround cache=writethrough.

Setting parameters such as max_ratio or read_ahead_kb directly (see https://github.com/torvalds/linux/blob/7d311cdab663f/Documentation/ABI/testing/sysfs-class-bdi#L52) works, either by editing the file (vim /sys/devices/virtual/bdi/8:0/max_ratio) or, for persistence across reboots, with sysfsutils (https://packages.debian.org/stretch/sysfsutils):

cat /etc/sysfs.conf
Code:
devices/virtual/bdi/8\:0/max_ratio = 90
devices/virtual/bdi/8\:0/read_ahead_kb = 256

Unfortunately, /sys/devices/virtual/bdi/8:0/stable_pages_required is read-only. Even after changing the permissions as root, writing to it from vim fails with "E514: write error".

Thanks!
 
sysfs, normally mounted at /sys, is not an ordinary filesystem. It is a virtual filesystem provided by the kernel, and the permissions there usually just reflect whether a file only displays some state or is a switch that can actively be modified.

And as even the ABI docs you linked state, "stable_pages_required" is read-only. It only shows true (1) if the underlying device requires stable pages, e.g. because it has DIF (Data Integrity Field) functionality.
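
For reference, the current value can be read for every backing device in one go. This is only a minimal sketch: the function name show_stable_pages and its optional directory argument are made up for illustration, and it only reads the attribute, it cannot change it.

```shell
# Print "<bdi-name> <value>" for every backing device that exposes
# stable_pages_required. The directory argument is a hypothetical
# convenience so the helper can be pointed at another tree; by default
# it scans the sysfs path used earlier in this thread.
show_stable_pages() (
    base=${1:-/sys/devices/virtual/bdi}
    for f in "$base"/*/stable_pages_required; do
        [ -r "$f" ] || continue    # glob matched nothing, or file unreadable
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
)
```

Running show_stable_pages without an argument scans /sys/devices/virtual/bdi. A line such as "8:0 0" would mean that backing device 8:0 does not currently require stable pages.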

Can you please specify your actual use case a bit better?
 
My use case is LXC containers on btrfs (by the way, I should publish the patches, updated for the latest kernel: https://forum.proxmox.com/threads/btrfs-experimental-storage-pve-4-4-13-testing-issues-patch.33896/).


For certain virtual machines, I get uncorrectable checksum errors:
Code:
scrub device /dev/sda9 (id 1) done
Scrub started:    Wed Aug 28 03:36:48 2019
Status:           finished
Duration:         0:00:12
Total to scrub:   6.53GiB
Rate:             438.85MiB/s
Error summary:    csum=1
Corrected:      0
Uncorrectable:  1
Unverified:     0
ERROR: there are uncorrectable errors
Aug 28 03:37:00 izabela kernel: BTRFS warning (device sda9): checksum error at logical 44597768192 on dev /dev/sda9, physical 6381367296, root 308, inode 257, offset 18116374528, length 4096, links 1 (path: data.raw)
Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): bdev /dev/sda9 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): unable to fixup (regular) error at logical 44597768192 on dev /dev/sda9
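
To catch this condition from a script, the scrub summary can be parsed. A minimal sketch, assuming the summary format shown above (the output of other btrfs-progs versions may differ); the helper name scrub_uncorrectable is hypothetical.

```shell
# Read a scrub summary on stdin and print the number of uncorrectable
# errors. Prints 0 when no "Uncorrectable:" line is present, which is the
# case when the summary says "no errors found".
scrub_uncorrectable() {
    awk -F: '
        /^[[:space:]]*Uncorrectable:/ {
            gsub(/[[:space:]]/, "", $2)   # strip padding around the count
            n = $2
        }
        END { print n + 0 }
    '
}
```

It could be used as `btrfs scrub status /dev/sda9 | scrub_uncorrectable`, for example to send a notification whenever the printed value is greater than zero.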

The drives themselves are not corrupted (all long smartctl tests pass). The very same virtual machine runs without errors on ZFS or ext4 storage backends.


Googling, I’ve found https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs. “Direct IO writes to Btrfs files can result in checksum warnings. This can happen with other filesystems, but most don't have checksums, so a mismatch between (updated) data and (out-of-date) checksum cannot arise.”

Here’s a link https://lwn.net/Articles/442355/ that describes the issue: “When a process writes to a file-backed page in memory (through either a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.

Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.

There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error.”

So, the second write is the culprit.

The mailing list post "Re: Qemu disk images on BTRFS suffer checksum errors" (https://www.spinics.net/lists/linux-btrfs/msg25940.html) suggests some mitigations:

1) First, “doing nodatacow for that particular image which will disable checksumming for just that file”,

https://btrfs.wiki.kernel.org/index...chattr_.2BC.29_but_still_have_checksumming.3F.

This really works, because no checksum calculation is involved, but there’s a high risk of data corruption (it happened to me after a year and a half of smooth running with nodatacow; fortunately, I could restore from the daily Proxmox backups).

The btrfs wiki (https://btrfs.wiki.kernel.org) doesn’t have detailed warnings, only a general one: “Basically, nodatacow bypasses the very mechanisms that are meant to provide consistency in the filesystem.” But https://wiki.debian.org/Btrfs does: “Please read earlier warning about using nodatacow. Applications that support integrity checks and/or self-healing, can somewhat mitigate the risk of nodatacow, but please note that nodatacow files are not protected by raid1 profile's second copy in the event that a disk fails.
At present, nodatacow implies nodatasum; this means that anything with the nodatacow attribute does not receive the benefits of btrfs' checksum protection and self-healing (for raid levels greater >= 1); disabling CoW (Copy on Write) means that the a VM disk image will not be consistent if the host crashes or loses power. Nodatacow also carries the following additional danger on multidisk systems: because nodatasum is disabled there is no way to verify which disk in a two disk raid1 profile volume contains the correct data. After a crash there is a roughly 50% probability that the bad copy will be read on each request. Consequently, it is almost always preferable to disable COW in the application, and nodatacow should only be used for disposable data.“


2) Second, cache=writethrough or cache=writeback, which makes QEMU use buffered I/O instead of direct I/O.

Proxmox itself recommends this for ZFS: “If you get the warning that the filesystem do not supporting O_DIRECT, set the disk cache type of your VM from none to writeback.” https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#QEMU_disk_cache_mode

I also tested the safer writethrough mode, and the checksum errors no longer appeared.


3) Unfortunately, none of these solutions work for LXC containers. So I’ve thought of a third solution, which is not listed in https://www.spinics.net/lists/linux-btrfs/msg25940.html.

“Flush runs outside of your LXC container, since your LXC container doesn't have its own kernel. LXC containers exist as a construct around cgroups, which is a feature of the Linux kernel that allows better limitations and isolation of process groups, but not its own kernel or flush daemon.” https://serverfault.com/questions/5...xc-container-writing-large-files-to-di/516088

So, for LXC, I should look for a solution at the kernel level. According to the article https://lwn.net/Articles/442355/, “implementing stable pages to prevent modifying a page that is under writeback” could be that solution.

I’ve found many related articles that indirectly confirm this approach, so it is the one I’m pursuing for now. Please let me know if you see other solutions.

Trying to set the stable_pages_required parameter as described in https://forum.proxmox.com/threads/how-to-setup-kernel-stable-pages-parameter.57403/#post-264404 failed.

Another relevant article, https://lwn.net/Articles/528031/, clarifies the kernel developers' policy. Initially, they allowed the system administrator to change this setting; later on, they removed that feature.

“Much of the discussion around this patch set has focused on just how that flag gets set. One possibility is that the driver for the low-level storage device will turn on stable pages; that can happen, for example, when hardware data integrity features are in use. Filesystem code could also enable stable pages if, for example, it is compressing data transparently as that data is written to disk. Thus far, things work fine: if either the storage device or the filesystem implementation requests stable pages, they will be enforced; otherwise things will run in the faster mode.

The real question is whether the system administrator should be able to change this setting. Initial versions of the patch gave complete control over stable pages to the user by way of a sysfs attribute, but a number of developers complained about that option. Neil Brown pointed out that, if the flag could change at any time, he could never rely on it within the MD RAID code; stable pages that could disappear without warning at any time might as well not exist at all. So there was little disagreement that users should never be able to turn off the stable-pages flag. That left the question of whether they should be able to enable the feature, even if neither the hardware nor the filesystem needs it, presumably because it would make them feel safer somehow. Darrick had left that capability in, saying:

“I dislike the idea that if a program is dirtying pages that are being written out, then I don't really know whether the disk will write the before or after version. If the power goes out before the inevitable second write, how do you know which version you get? Sure would be nice if I could force on stable writes if I'm feeling paranoid.” Once again, the prevailing opinion seemed to be that there is no actual value provided to the user in that case, so there is no point in making the flag user-settable in either direction. As a result, subsequent updates from Darrick took that feature out.”
 
At this stage, I should move the discussion to the btrfs mailing list, raising the following questions. From the Proxmox point of view, I would appreciate an opinion on question 3: compiling my own Proxmox kernel.


1) Is there a setting at the btrfs level that I’m missing (maybe a mount option) to trigger the flag BDI_CAP_STABLE_WRITES / stable_pages_required? (See the patch https://gitlab.freedesktop.org/panfrost/linux/commit/7d311cdab663f4f7ab3a4c0d5d484234406f8268.)


Searching https://github.com/torvalds/linux/search?q=BDI_CAP_STABLE_WRITES&unscoped_q=BDI_CAP_STABLE_WRITES, I found that the Ceph rbd driver sets the flag unless the NOCRC option is given:

Code:
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
        q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;

See https://github.com/torvalds/linux/blob/366a4e38b8d0d3e8c7673ab5c1b5e76bbfbc0085/drivers/block/rbd.c.

2) I was thinking of compiling btrfs from source, as described in https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories#Official_repositories.

But searching https://github.com/torvalds/linux/tree/master/fs/btrfs for BDI_CAP_STABLE_WRITES, stable_pages_required or similar names turned up nothing.

Nevertheless, “Btrfs has implemented stable pages internally for some time, so no changes were required there”. https://lwn.net/Articles/442355/
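
The source search from question 2 can also be repeated on a local kernel checkout. A minimal sketch: the helper name find_symbol and the checkout path in the usage note are hypothetical, and the default symbol is the flag name used by the kernels discussed in this thread.

```shell
# List the C source and header files below <tree>/<subdir> that mention a
# symbol. Prints nothing when the symbol does not occur there.
find_symbol() (
    tree=$1
    subdir=$2
    symbol=${3:-BDI_CAP_STABLE_WRITES}
    # grep exits non-zero when nothing matches; that is not an error here.
    find "$tree/$subdir" -type f -name '*.[ch]' \
        -exec grep -lF -- "$symbol" {} + || true
)
```

For example, `find_symbol ~/src/linux fs/btrfs` and `find_symbol ~/src/linux drivers/block` would compare the btrfs code with the block drivers, where the rbd snippet quoted above comes from.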


3) As a last resort, I could compile the Proxmox kernel from source and change the sysfs attribute from RO to RW, as Darrick's initial patches did.

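See https://gitlab.freedesktop.org/panfrost/linux/commit/7d311cdab663f4f7ab3a4c0d5d484234406f8268: change __ATTR_RO(stable_pages_required) to __ATTR_RW(stable_pages_required) (adapted to the latest kernel, of course). Since __ATTR_RW references both a _show and a _store handler, this also means adding a stable_pages_required_store() function.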

Afterward, I should be able to write to /sys/devices/virtual/bdi/btrfs-1/stable_pages_required and make the change persistent via https://packages.debian.org/stretch/sysfsutils, as I did with the max_ratio and read_ahead_kb values.

Even though btrfs implements its own bdi, this should work. Or should the setting be done elsewhere?
 
So, I'd first go to the BTRFS mailing list and ask for their opinion. Maybe there's even a switch which allows this; if not, it'd be good to know whether anything is planned to improve this, or at least whether work to do so would be accepted.

Once that is discussed, we will know more here. That said, I'm not unwilling to add a patch to make the sysfs switch RW, although I'd prefer to have it set at boot only, e.g. as a kernel command line parameter, to avoid the issues Neil Brown mentions. Or at least as a "can be enabled but not disabled (until next reboot)" switch in sysfs.

If you post on the btrfs mailing list, it'd be great if you could post a link to that thread from the archives here.
 
Due to a silly mistake of mine (not sending the email to the btrfs mailing list, linux-btrfs@vger.kernel.org, in plain text format), it was rejected several times. Over the next days I sent the very same content as plain text from a different address, but it wasn’t published to https://www.spinics.net/lists/linux-btrfs/, probably because it was marked as spam.

As a last resort, I'll attach the full logs here and send the btrfs mailing list only a reference to this forum post, as brand-new content.

Afterward, following Thomas's advice, I’ll post a link to that thread from the archives.
 
Finally, here is the debug information for a KVM VM, as requested by https://btrfs.wiki.kernel.org/index...ion_to_provide_when_asking_a_support_question

There are two btrfs filesystems (only the one on /dev/sda9 and /dev/sdb9 holds the VM causing errors).


Code:
root@izabela:~# btrfs fi show
Label: none  uuid: f57bb914-7fc1-4fd6-b10e-904955050bff
        Total devices 2 FS bytes used 5.19GiB
        devid    1 size 9.41GiB used 6.53GiB path /dev/sda9
        devid    2 size 9.41GiB used 6.53GiB path /dev/sdb9

Label: none  uuid: 90ca9442-1508-4a1d-a496-7b7cd1443d5d
        Total devices 2 FS bytes used 44.42GiB
        devid    1 size 135.04GiB used 48.03GiB path /dev/sda8
        devid    2 size 135.04GiB used 48.03GiB path /dev/sdb8

root@izabela:~# btrfs version
btrfs-progs v5.2.1


root@izabela:~# btrfs fi df /bt
Data, RAID1: total=6.00GiB, used=5.19GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=512.00MiB, used=6.77MiB
GlobalReserve, single: total=16.00MiB, used=0.00B


root@izabela:~# btrfs fi show /bt
Label: none  uuid: f57bb914-7fc1-4fd6-b10e-904955050bff
        Total devices 2 FS bytes used 5.19GiB
        devid    1 size 9.41GiB used 6.53GiB path /dev/sda9
        devid    2 size 9.41GiB used 6.53GiB path /dev/sdb9

Relevant scrub error

Code:
scrub device /dev/sda9 (id 1) done
Scrub started:    Wed Aug 28 03:36:48 2019
Status:           finished
Duration:         0:00:12
Total to scrub:   6.53GiB
Rate:             438.85MiB/s
Error summary:    csum=1
 Corrected:      0
 Uncorrectable:  1
 Unverified:     0
ERROR: there are uncorrectable errors

Aug 28 03:37:00 izabela kernel: BTRFS warning (device sda9): checksum error at logical 44597768192 on dev /dev/sda9, physical 6381367296, root 308, inode 257, offset 18116374528, length 4096, links 1 (path: data.raw)

Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): bdev /dev/sda9 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0

Aug 28 03:37:00 izabela kernel: BTRFS error (device sda9): unable to fixup (regular) error at logical 44597768192 on dev /dev/sda9

root@izabela:~# btrfs su li -a /bt | grep 308

ID 308 gen 11747 top level 5 path images/282/vm-282-disk-1
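
A plain grep for 308 would also match other fields, for example a generation number or an ID such as 3080. A minimal sketch of an exact match on the ID column, assuming the "ID <n> gen ..." line format shown above; the helper name subvol_by_id is hypothetical.

```shell
# Filter `btrfs subvolume list` output (read from stdin) down to the line
# whose ID column equals the given subvolume ID exactly.
subvol_by_id() {
    awk -v id="$1" '$1 == "ID" && $2 == id'
}
```

Used as `btrfs su li -a /bt | subvol_by_id 308`, it prints only the subvolume that the scrub error's "root 308" refers to.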

After enabling cache=writethrough for the KVM VM, the uncorrectable errors are gone, but scrub fixed other errors:

Code:
root@izabela:~# sh  /root/iulian/scripts/btrfs_scrub_all.sh

scrub device /dev/sda8 (id 1) done
Scrub started:    Wed Aug 28 04:58:39 2019
Status:           finished
Duration:         0:01:44
Total to scrub:   47.03GiB
Rate:             434.31MiB/s
Error summary:    no errors found
Aug 28 05:00:03 izabela kernel: BTRFS info (device sda9): relocating block group 52578746368 flags data|raid1
Aug 28 05:00:04 izabela kernel: BTRFS info (device sda9): found 137 extents
Aug 28 05:00:05 izabela kernel: BTRFS info (device sda9): found 137 extents
Aug 28 05:00:05 izabela kernel: BTRFS info (device sda9): qgroup scan completed (inconsistency flag cleared)

scrub device /dev/sda9 (id 1) done
Scrub started:    Wed Aug 28 05:00:23 2019
Status:           finished
Duration:         0:00:12
Total to scrub:   6.53GiB
Rate:             440.75MiB/s
Error summary:    no errors found


The full dmesg log is attached (the filesystem on /dev/sda8 and /dev/sdb8 is healthy). Please note that I run a daily balance script from cron and a scrub every 4 hours.


I'm also attaching the output of btrfs inspect-internal from before and after the issue.

Code:
btrfs ins dump-super -fFa /dev/sda9
btrfs ins dump-super -fFa /dev/sdb9
btrfs ins dump-tree -t chunk /dev/sda9
btrfs ins dump-tree -t chunk /dev/sdb9

Thanks!
 

Attachments

  • btrfs inspect after error.txt (31.4 KB)
  • btrfs inspect before error.txt (32.4 KB)
  • dmesg.txt (84 KB)
