PVE 6.1, BTRFS, and an SSD array.

Adam Talbot

Just figured I would document my saga of using BTRFS with Proxmox. This saga spans more than two years of experimentation and ripping out much of my hair.

My system:
Disks: 10x Toshiba HG6 enterprise-class 512GB SSDs
CPU: Ryzen 7 3700X
Motherboard: X470D4U
RAM: 32GB DDR4 ECC
Proxmox 6.1 (5.3.13-1-pve)

Since Proxmox 5, I have been trying to run VMs on BTRFS. The results are rather bad: everything from poor performance to corruption, segfaults, and kernel panics... VMs and BTRFS just don't seem to mix. Even testing on PVE 6.1 I still get the same errors. I even tried the 5.5-RC1 kernel and always get the same errors:

Code:
[18826.338460] BTRFS warning (device sdg1): csum failed root 5 ino 1852027 off 34015096832 csum 0xe73ebde6 expected csum 0xa77de7f0 mirror 1
[18826.341593] BTRFS info (device sdg1): read error corrected: ino 1852027 off 34015096832 (dev /dev/sdg1 sector 499402784)
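A quick way to see whether BTRFS itself is counting these as corruption, rather than just spamming dmesg, is to look at the per-device error counters and run a scrub (a sketch, assuming the filesystem is mounted at /data):
Code:
btrfs device stats /data      # per-device counters; corruption_errs and generation_errs are the interesting ones
btrfs scrub start -B /data    # -B stays in the foreground and prints a summary when done
btrfs scrub status /data      # error totals from the most recent scrub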

To make life even more fun, my OpenWrt-based VMs would just segfault while booting. That's even stranger considering I use the SquashFS-based image of OpenWrt.

I have 10 SSDs... my VMs' disk performance should haul! So I have two choices: ZFS zvols or BTRFS. But BTRFS is NOT an option if it's unstable.

My old hack was to expose a folder through Samba (CIFS) and run my VMs in that mount. This is a slow and nasty hack, but it worked for years: my VMs would run on BTRFS... Then my server died of old age, so I figured it would be a good time to revisit the BTRFS problem. I moved over my 10x SSDs and my LSI SAS2008 IT-mode cards.

My first step was simple: just format my SSDs and restore my VMs from backup.
Code:
mkfs.btrfs -m raid10 -d raid6 /dev/sda /dev/sdb /dev/sdc......
edit /etc/fstab
  UUID=12345678-18f7-4e3f-a94d-95887f50142a /data btrfs defaults,nofail,noatime,ssd,discard,compress=lzo 0 2
mkdir -p /data/vmfs_pve  
mount -a 
restore my VM's

Within seconds of starting my first VM, errors! "csum failed", "read error corrected". Looks like the bugs from many years ago still exist. I'm not all that surprised. I could put my old Samba hack back in place... YUCK!

What about dumping BTRFS and going with MD/XFS?
Code:
mdadm --create --level=6 --raid-devices=10 /dev/md0 /dev/sda /dev/sdb...........
mkfs.xfs -K -m reflink=1 /dev/md0
edit /etc/fstab  
  UUID=12345678-18f7-4e3f-a94d-95887f50142a /data xfs defaults,discard,noatime,nodiratime 0 2
restore my VM's...
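Before benchmarking, a quick check that the array and filesystem came up the way I expect (a sketch, assuming /dev/md0 is mounted at /data):
Code:
cat /proc/mdstat          # array assembled, state clean (or still resyncing)
mdadm --detail /dev/md0   # per-device status, chunk size, layout
xfs_info /data            # confirm reflink=1 actually made it into the filesystem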

Benchmarking:
Everyone has their own way of benchmarking disk performance. My way of testing my virtual machine file system (VMFS) is simple: PassMark's Disk Mark (Version 9, Build 1035, 64-bit). I use a Windows test VM for this benchmark: a clean install of Windows 10 Pro 1909 with the following specs:
Code:
4vCPU
8GB RAM (Balloon=0)
OVMF (UEFI BIOS)
VirtIO SCSI Controller 
vm-100-disk0.qcow2, discard=on, size=128G, ssd=1
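For reference, roughly the equivalent qm commands to reproduce that config (a sketch; the storage ID vmfs_pve for a directory storage on /data/vmfs_pve and the VMID 100 are assumptions, and OVMF additionally wants an EFI disk):
Code:
qm set 100 --cores 4 --memory 8192 --balloon 0    # 4 vCPUs, 8GB RAM, ballooning off
qm set 100 --bios ovmf --scsihw virtio-scsi-pci   # UEFI firmware, VirtIO SCSI controller
qm set 100 --scsi0 vmfs_pve:100/vm-100-disk0.qcow2,discard=on,ssd=1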

In my old hacked-up CIFS setup, I could never get all that good of a benchmark. Let's see how XFS on MD RAID6 fares. Higher numbers are better.
Disk Mark: 5830 (MDadm/XFS)
Disk Mark: 6977 (BTRFS/CIFS hack)

Well... that's not very encouraging! I spent a bunch of time making sure the basic stuff was set up correctly and that TRIM was working as expected, but I could never get my XFS on MD RAID6 setup to perform any better. At this point the BTRFS/CIFS hack is still the better option.
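Checks along these lines, for example, show whether discards actually make it through the md layer to the SSDs (assuming the array is /dev/md0 and mounted at /data):
Code:
lsblk --discard                              # non-zero DISC-GRAN/DISC-MAX means the layer advertises TRIM
cat /sys/block/md0/queue/discard_max_bytes   # 0 here means md is not passing discards down
fstrim -v /data                              # should report bytes trimmed rather than "not supported"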

What about ZFS? Every single time I ask for help with BTRFS, everyone calls me stupid and says I should be running ZFS.
Code:
zpool create -f ssd_pool raidz2 /dev/sda /dev/sdb....
zfs set mountpoint=/data ssd_pool
zfs set xattr=sa ssd_pool
zfs set acltype=posixacl ssd_pool
zfs set compression=lz4 ssd_pool
zfs set atime=off ssd_pool
zfs set relatime=off ssd_pool
zpool trim ssd_pool
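A quick check that the pool and properties look right (these are stock zpool/zfs commands; autotrim is optional and would replace the manual trim above):
Code:
zpool status ssd_pool                              # all ten disks online in the raidz2 vdev
zfs get compression,atime,xattr,acltype ssd_pool   # confirm the property changes took effect
zpool get autotrim ssd_pool                        # autotrim=on trims continuously instead of on demand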

But there is an important point here: I require reflink support, so I use a zvol formatted with XFS to get it.
Code:
zfs create -s -V 1024G ssd_pool/vmfs_pve
mkfs.xfs -K -m reflink=1 /dev/zvol/ssd_pool/vmfs_pve
edit /etc/fstab  
  UUID=12345678-18f7-4e3f-a94d-95887f50142a /data/vmfs_pve xfs defaults,discard,noatime,nodiratime 0 2
mount -a
and restore my VM's.... Again.
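To make sure reflinks actually work on the new filesystem before restoring anything, a quick test along these lines does the trick (file names are just placeholders):
Code:
dd if=/dev/urandom of=/data/vmfs_pve/ref_test bs=1M count=100
cp --reflink=always /data/vmfs_pve/ref_test /data/vmfs_pve/ref_test.clone   # fails with "Operation not supported" if reflinks are broken
rm /data/vmfs_pve/ref_test /data/vmfs_pve/ref_test.clone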

Disk Mark: 11005 (ZFS w/ compression on, RAIDZ2)

Better... about 58% faster than my BTRFS/CIFS hack. Looks like I might be moving over to ZFS! Wait, but what if I did a poor man's "zvol" setup on BTRFS?

Starting off with the normal setup of BTRFS.
Code:
mkfs.btrfs -m raid10 -d raid6 /dev/sda /dev/sdb /dev/sdc......
edit /etc/fstab
  UUID=12345678-18f7-4e3f-a94d-95887f50142a /data btrfs defaults,nofail,noatime,ssd,discard,compress=lzo 0 2
mkdir -p /data/vmfs_pve  
mount -a 

# But now let's build the fake block device:
dd if=/dev/zero of=/data/vmfs.img bs=1k seek=512M count=1  # This creates a ~512GiB sparse ("thin") file called vmfs.img
mkfs.xfs -K -m reflink=1 /data/vmfs.img
edit /etc/fstab  
    /data/vmfs.img /data/vmfs_pve xfs defaults,discard,noatime,nodiratime 0 2
mount -a
fstrim -v /data/vmfs_pve
Restore VM's and have fun
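To confirm the image really is thin-provisioned and only grows as the VMs write into it, something like this works:
Code:
ls -lh /data/vmfs.img     # apparent size: ~512G
du -h /data/vmfs.img      # actual space allocated on the BTRFS side, much smaller when freshly created
btrfs filesystem df /data # overall data/metadata allocation of the array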

PassMark Disk: 15162 (BTRFS w/ XFS block file on top, AKA a poor man's zvol)

Better! About 37% faster than the ZFS zvol, and about 2.6x the XFS on MD RAID6 result. NOT bad!

At this point I have settled on my poor man's zvol on top of BTRFS and am rather happy with the outcome. I have verified that fstrim/unmap/discard is working at all disk layers, and I still have reflink support!
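One way to sanity-check that discard propagates through every layer (assuming the image is attached via the loop mount from the fstab entry above): delete some data inside a VM, then run the following and watch the allocated space shrink all the way down.
Code:
fstrim -v /data/vmfs_pve     # trims the XFS inside the loop-mounted image, punching holes in vmfs.img
fstrim -v /data              # trims free space on the BTRFS array itself
du -h /data/vmfs.img         # allocated size of the image file should have shrunk
btrfs filesystem usage /data # allocated space on the SSDs should shrink as well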
 
Thanks for sharing your experience with that topic!

> Since Proxmox 5, I have been trying to run VMs on BTRFS. The results are rather bad: everything from poor performance to corruption, segfaults, and kernel panics... VMs and BTRFS just don't seem to mix. Even testing on PVE 6.1 I still get the same errors. I even tried the 5.5-RC1 kernel and always get the same errors:

We had the same experience here when we tried to integrate it (the installer even has "dead" code for it), so we decided against integration back then (3 or even 4 years ago). It's a bit sad that some of those basic underlying issues still seem to exist nowadays; I rooted hard for BTRFS back then.

> Better! About 37% faster than the ZFS zvol, and about 2.6x the XFS on MD RAID6 result. NOT bad!
>
> At this point I have settled on my poor man's zvol on top of BTRFS and am rather happy with the outcome. I have verified that fstrim/unmap/discard is working at all disk layers, and I still have reflink support!

Interesting approach, to say the least, and I mean, if it works for you: great :)
 
@t.lamprecht
I'm using BTRFS for the NAS drives in my Proxmox host and have never had any problem so far (I use BTRFS RAID1 mode). So I would really appreciate a native BTRFS storage plugin in Proxmox. I hope this will be realized soon (even if it's postponed on the roadmap at the moment).

Marco
 
> The problem with BTRFS is that O_DIRECT has issues, see https://bugzilla.redhat.com/show_bug.cgi?id=1914433 for example.
Yes, we know, and that's why we expose disabling copy-on-write for data (which is also mentioned as a workaround in the linked report); it's currently configurable per storage:
https://git.proxmox.com/?p=pve-storage.git;a=commitdiff;h=d3c5cf24876d2b0e1399f717e2f77eaf063ae7a7

And we have not defaulted to O_DIRECT for BTRFS storages since the initial integration:
https://git.proxmox.com/?p=qemu-server.git;a=commitdiff;h=0fe779a62cb4755b74fe83e8323495ee03d0176c

It naturally has its trade-offs, and having to decide this is also one of the bigger reasons we still call the BTRFS integration in Proxmox VE a tech preview.
> https://pve.proxmox.com/wiki/Storage:_BTRFS mentions that. I'm currently trying to find out where that page is linked in the wiki; I found it via Google.
>
> Apparently, that page is orphaned/lost: https://pve.proxmox.com/wiki/Special:WhatLinksHere/Storage:_BTRFS
It's actually from the reference docs (which also get mounted into the wiki); the original is here:
https://pve.proxmox.com/pve-docs/chapter-pvesm.html#storage_btrfs

Most of the storage plugin pages are not linked from another wiki article, but that doesn't mean that they are lost or orphaned.
 
> Yes, we know, and that's why we expose disabling copy-on-write for data (which is also mentioned as a workaround in the linked report); it's currently configurable per storage:
https://git.proxmox.com/?p=pve-storage.git;a=commitdiff;h=d3c5cf24876d2b0e1399f717e2f77eaf063ae7a7


Ah, good to know, thanks. It seems it can be set via pvesm and by editing storage.cfg, but it is not exposed in the web UI. OK.
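For anyone else looking for it, a minimal storage.cfg sketch along these lines should do it (assuming the per-storage switch from the linked commit is the nocow property on a btrfs storage entry; the storage ID and path here are placeholders):
Code:
btrfs: my-btrfs-store
        path /data
        content images,rootdir
        nocow 1
The generic equivalent outside the plugin would be chattr +C on the directory holding the images; it only affects files created after the flag is set, and files without copy-on-write also lose BTRFS checksumming and compression.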

Using O_DIRECT seems to disable compression, but having checksums for your data is one reason why you want to use BTRFS in the first place.
 
