important information on btrfs getting lost in wiki & wrong vm disk defaults with btrfs storage

RolandK

Famous Member
Mar 5, 2019
via google search for btrfs and o_direct i found this wiki page

https://pve.proxmox.com/wiki/Special:WhatLinksHere/Storage:_BTRFS

that mentions the following important information

"BTRFS will honor the O_DIRECT flag when opening files, meaning VMs should not use cache mode none, otherwise there will be checksum errors."

this indeed is a known problem with btrfs, and not resolved.

https://bugzilla.redhat.com/show_bug.cgi?id=1914433

furthermore, when using compression, O_DIRECT renders compression useless ( https://marc.info/?l=linux-btrfs&m=171053186915054&w=2 )

i can confirm this problem, because i just experienced such errors and started searching for the reason.

two questions:

1. why does this important page/information seem to get lost in the wiki ?

2. why does a virtual disk of a VM created on top of btrfs still default to cache=none ?
 
meanwhile, i have found that there are quirks in action:

https://forum.proxmox.com/threads/virtual-disk-default-no-cache-settings-weirdness.143430/

https://pve.proxmox.com/wiki/Storage:_BTRFS
"BTRFS will honor the O_DIRECT flag when opening files, meaning VMs should not use cache mode none, otherwise there will be checksum errors."

I think this is not easy to understand as written.

When that page's contents are being relinked/transformed/moved, i would recommend something like this as a replacement:

"BTRFS supports direct I/O, meaning files can be opened with the O_DIRECT flag, which bypasses the Linux page cache. Unfortunately, there are issues with virtual machines using this setting ( https://bugzilla.redhat.com/show_bug.cgi?id=693530 ), so Proxmox currently applies a quirk to work around this (removing cache=none, which makes the underlying QEMU use cache=writeback)."

there are even more severe issues with DirectIO and btrfs:
https://lore.kernel.org/linux-btrfs...24ae7146606c.1676684984.git.boris@bur.io/T/#u
 
Let me explain the problem from the btrfs side.

The O_DIRECT problem with btrfs is that the data checksum is calculated just before the page is submitted.
If the page's content changes during writeback, the result is a checksum mismatch.

Thus btrfs makes many of its operations wait for page writeback to finish before modifying the page. That's fine for page-cached IO, as btrfs has full control of the page and is able to wait for the IO.

But when O_DIRECT is involved, btrfs has no control over the source buffer (it can be a user space page), so if the content of the direct write changes while the IO is in flight, it will definitely cause a checksum mismatch.
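To make the race concrete, here is a minimal toy model in Python. It is not btrfs code: `submit_write` and `read_ok` are hypothetical names, `zlib.crc32` stands in for the real data checksum, and the "in flight" window is simulated by simply flipping a byte after the checksum has been taken.

```python
import zlib

def submit_write(buf: bytearray, mutate_in_flight: bool) -> tuple[int, bytes]:
    """Toy model of a checksummed write: returns (stored_csum, stored_data)."""
    csum = zlib.crc32(buf)      # checksum is computed just before submission
    if mutate_in_flight:
        buf[0] ^= 0xFF          # the writer changes the buffer while IO is in flight
    data = bytes(buf)           # what actually lands on disk
    return csum, data

def read_ok(csum: int, data: bytes) -> bool:
    """Verify the stored data against the stored checksum, as a read would."""
    return zlib.crc32(data) == csum

# Buffer left alone during the write: checksum and data agree.
stable = read_ok(*submit_write(bytearray(b"guest block"), mutate_in_flight=False))
# Buffer modified after the checksum was taken: later reads fail verification.
raced = read_ok(*submit_write(bytearray(b"guest block"), mutate_in_flight=True))
print(stable, raced)
```

With page-cached IO the filesystem can hold off the modification until the write completes, so only the first case occurs. With O_DIRECT the buffer belongs to someone else, and the second case becomes possible.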

For this particular case, I guess the problem is the XFS/EXT4 fs inside the VM. They are allowed to modify the page cache even while it's under writeback.
And since the VM is using no-cache, aka O_DIRECT, for the file, btrfs has to face exactly the worst-case scenario.

How to solve that? I have no idea.

Btrfs may be able to double-check the csum (compute it before and after submission and check that the two match), but that adds a lot of extra cost.
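A rough sketch of what such a double check could look like, again as a toy model and not as anything btrfs actually implements: `buffer_stable_during_io` is a hypothetical helper, `zlib.crc32` stands in for the real checksum, and the IO itself is only simulated.

```python
import zlib

def buffer_stable_during_io(buf: bytearray, mutate_in_flight: bool) -> bool:
    """Toy model: checksum the buffer before and after the (simulated) IO."""
    csum_before = zlib.crc32(buf)
    if mutate_in_flight:
        buf[-1] ^= 0x01             # simulated writer touching the buffer mid-IO
    csum_after = zlib.crc32(buf)    # second full pass over the data: the extra cost
    return csum_before == csum_after

quiet = buffer_stable_during_io(bytearray(4096), mutate_in_flight=False)
racy = buffer_stable_during_io(bytearray(4096), mutate_in_flight=True)
print(quiet, racy)
```

The second checksum pass touches every byte of every direct write again, which is where the extra cost comes from.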

Just disable csum for direct IO? A lot of valid direct IO from properly designed programs can benefit from data csum. And currently btrfs does its data checksumming per file, so it cannot just disable csum for individual direct IOs (but I guess we can enhance that in the future).

Go blame XFS/EXT4? Their egos are so strong that it will not result in any useful discussion.

So I guess the most reasonable solution is to make btrfs support partial nodatacsum.