Just figured I would document my saga of using BTRFS with Proxmox. This saga is over two years of experimentation and ripping out much of my hair.
My system:
Disks: 10x Toshiba HG6 enterprise-class 512GB SSDs
CPU: Ryzen 7 3700X
Motherboard: X470D4U
RAM: 32GB DDR4 ECC
Proxmox 6.1 (5.3.13-1-pve)
Since Proxmox 5, I have been trying to run VMs on BTRFS. The results are rather bad: poor performance, corruption, segfaults, kernel panics... VMs and BTRFS just don't seem to mix. Even testing on PVE 6.1 I still get the same errors, and I even tried the 5.5-RC1 kernel. Always the same errors:
Code:
[18826.338460] BTRFS warning (device sdg1): csum failed root 5 ino 1852027 off 34015096832 csum 0xe73ebde6 expected csum 0xa77de7f0 mirror 1
[18826.341593] BTRFS info (device sdg1): read error corrected: ino 1852027 off 34015096832 (dev /dev/sdg1 sector 499402784)
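Side note: if you want to know which file a csum warning like that is pointing at, the inode number from the log can be resolved back to a path. A quick sketch, assuming the filesystem is mounted at /data and the file lives in the top-level subvolume (root 5):
Code:
# map the "ino" from the kernel log back to a file name
btrfs inspect-internal inode-resolve 1852027 /data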
To make life even more fun, my OpenWrt-based VMs would just segfault while booting. That's even stranger considering I use the SquashFS-based image of OpenWrt.
I have 10 SSDs... my VMs' disk performance should haul! So I have two choices: ZFS zvols or BTRFS. But BTRFS is NOT an option if it's unstable.
My old hack was to expose a folder through Samba (CIFS) and run my VMs from that mount. It's a slow and nasty hack, but it worked for years and my VMs effectively ran on BTRFS... Then my server died of old age, so I figured it would be a good time to revisit the BTRFS problem. I moved over my 10x SSDs and my LSI SAS2008 IT-mode cards.
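For context, the old hack looked roughly like this (a rough sketch from memory; the share name, credentials and storage ID here are just placeholders): export the BTRFS-backed folder over Samba, mount it back over CIFS, and point a Proxmox directory storage at that mount.
Code:
# /etc/samba/smb.conf -- export the BTRFS-backed folder
[vmfs]
   path = /data/vmfs_pve
   read only = no
   valid users = root

# mount it back over CIFS and register it as a directory storage in Proxmox
mkdir -p /mnt/vmfs_cifs
mount -t cifs //127.0.0.1/vmfs /mnt/vmfs_cifs -o username=root,password=CHANGEME
pvesm add dir vmfs_cifs --path /mnt/vmfs_cifs --content images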
My first step was simple: just format the SSDs and restore my VMs from backup.
Code:
mkfs.btrfs -m raid10 -d raid6 /dev/sda /dev/sdb /dev/sdc......
# edit /etc/fstab
UUID=12345678-18f7-4e3f-a94d-95887f50142a /data btrfs defaults,nofail,noatime,ssd,discard,compress=lzo 0 2
mkdir -p /data/vmfs_pve
mount -a
# restore the VMs
Within seconds of starting my first VM, errors! "csum failed", "read error corrected"... Looks like the bugs from many years ago still exist. I'm not all that surprised. I could put my old Samba hack back in place... YUCK!
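If you hit the same thing, the per-device error counters and a scrub make it easy to confirm it's real corruption rather than a one-off. A quick sketch, assuming the pool is mounted at /data:
Code:
# running totals of read/write/corruption errors per device
btrfs device stats /data
# full scrub; -B keeps it in the foreground so you see the summary
btrfs scrub start -B /data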
What about dumping BTRFS and going with MD/XFS?
Code:
mdadm --create --level=6 --raid-devices=10 /dev/md0 /dev/sda /dev/sdb...........
mkfs.xfs -K -m reflink=1 /dev/md0
# edit /etc/fstab
UUID=12345678-18f7-4e3f-a94d-95887f50142a /data xfs defaults,discard,noatime,nodiratime 0 2
# restore the VMs...
Benchmarking:
Everyone has their own way of benchmarking disk performance. My way of testing my Virtual Machine File System (VMFS) is simple: PassMark's Disk Mark (Version 9, Build 1035, 64-bit). I use a Windows test VM for this benchmark, a simple clean install of Windows 10 Pro 1909 with the following specs:
Code:
4vCPU
8GB RAM (Balloon=0)
OVMF (UEFI BIOS)
VirtIO SCSI Controller
vm-100-disk0.qcow2, discard=on, size=128G, ssd=1
In my old hacked-up CIFS setup, I could never get all that good of a benchmark. Let's see how XFS on MD RAID6 fares. Higher numbers are better.
Disk Mark: 5830 (MDadm/XFS)
Disk Mark: 6977 (BTRFS/CIFS hack)
Well... that's not very encouraging! I spent a bunch of time making sure the basic stuff was right and that TRIM was working as expected, but I could never get my XFS MD RAID6 setup to perform any better. At this point the BTRFS/CIFS hack is still doing better.
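For what it's worth, these are the kinds of checks I mean (a rough sketch; device names are just examples):
Code:
# do the SSDs and the MD device advertise discard support?
lsblk --discard /dev/sda /dev/md0
# does a manual trim of the XFS mount actually do anything?
fstrim -v /data
# make sure the array is healthy and not resyncing while benchmarking
cat /proc/mdstat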
What about ZFS? Every single time I ask for help with BTRFS, everyone calls me stupid and says I should be running ZFS.
Code:
zpool create -f ssd_pool raidz2 /dev/sda /dev/sdb....
zfs set mountpoint=/data ssd_pool
zfs set xattr=sa ssd_pool
zfs set acltype=posixacl ssd_pool
zfs set compression=lz4 ssd_pool
zfs set atime=off ssd_pool
zfs set relatime=off ssd_pool
zpool trim ssd_pool
But there is an important point here: I require reflink support, which ZFS does not offer. So I use a zvol formatted with XFS to get reflinks back.
Code:
zfs create -s -V 1024G ssd_pool/vmfs_pve
mkfs.xfs -K -m reflink=1 /dev/zvol/ssd_pool/vmfs_pve
# edit /etc/fstab
UUID=12345678-18f7-4e3f-a94d-95887f50142a /data/vmfs_pve xfs defaults,discard,noatime,nodiratime 0 2
mount -a
# and restore the VMs... again
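A quick way to confirm the reflink part actually works on the new XFS (just a sketch; the path and disk name depend on your storage layout):
Code:
cd /data/vmfs_pve/images/100
# instant, space-free copy -- this fails loudly if reflinks are not supported
cp --reflink=always vm-100-disk0.qcow2 reflink-test.qcow2
rm reflink-test.qcow2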
Disk Mark: 11005 (ZFS w/ compression on, RAIDZ2)
Better... about 58% faster than my BTRFS/CIFS hack. Looks like I might be moving over to ZFS! But wait, what if I did a poor man's "zvol" setup on BTRFS?
Starting off with the normal setup of BTRFS.
Code:
mkfs.btrfs -m raid10 -d raid6 /dev/sda /dev/sdb /dev/sdc......
# edit /etc/fstab
UUID=12345678-18f7-4e3f-a94d-95887f50142a /data btrfs defaults,nofail,noatime,ssd,discard,compress=lzo 0 2
mkdir -p /data/vmfs_pve
mount -a
# But now let's build the fake block device:
# create a ~512GB sparse (thin) file called vmfs.img
dd if=/dev/zero of=/data/vmfs.img bs=1k seek=512M count=1
mkfs.xfs -K -m reflink=1 /data/vmfs.img
# edit /etc/fstab
/data/vmfs.img /data/vmfs_pve xfs defaults,discard,noatime,nodiratime 0 2
mount -a
fstrim -v /data/vmfs_pve
# restore the VMs and have fun
PassMark Disk Mark: 15162 (BTRFS with an XFS block file on top, AKA a poor man's zvol)
Better! About 38% faster than the ZFS zvol, and roughly 2.6x the XFS-on-MD-RAID6 result. NOT bad!
At this point I have settled on my poor man's zvol on top of BTRFS and am rather happy with the outcome. I have verified that fstrim/unmap/discard works at every disk layer, and I still have reflink support!
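One way to check that discard really flows all the way down (a rough sketch, assuming the paths used above):
Code:
# trim the XFS filesystem inside vmfs.img; the loop mount should punch holes in the backing file
fstrim -v /data/vmfs_pve
# compare apparent size vs real space used -- real usage should drop as guests discard blocks
du -h --apparent-size /data/vmfs.img
du -h /data/vmfs.img
# and confirm the SSDs themselves accept discards
lsblk --discard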