Proxmox btrfs support roadmap, as a fallback for possible ZFS on Linux licensing issues

EuroDomenii

Renowned Member
Sep 30, 2016
Why expressly link btrfs to ZFS, and not to some other filesystem? Let me explain.


PERFORMANCE VERSUS HA

Always, there’s a trade-off between performance and high availability.

With Ceph distributed storage you have HA, but there’s a performance penalty, induced by network latency.

With local LVM thin you have high performance, near bare-metal ext4, but no high availability (clustered LVM also works with network shared storage or mirrored LVM).

With ZFS, you can have the best of both worlds:
  • High local performance (with goodies like the ARC cache and the ZIL)

  • High availability, using near-continuous replication of the local ZFS filesystem to a remote server via incremental send/receive
The Proxmox team saw the importance of this and made the brilliant https://pve.proxmox.com/wiki/PVE-zsync

ZFS has a lot of features, but in this context, send/receive is where it shines.

Let me give you an example: https://ayufan.eu/projects/proxmox-ve-differential-backups/ is a smart patch for Proxmox, based on http://xdelta.org/

Unfortunately, doing a differential backup takes more time than the initial backup! (xdelta3 is slow because it scans all the files.)

On the other hand, I have done a paranoia test with a 100 GB LXC container on ZFS, holding more than 700,000 files. Using pve-zsync send/receive, an incremental backup covering another 8000 files takes at most 5 seconds! The magic is that ZFS sends only the difference between the last two snapshots, instead of scanning the whole subvolume. See the relevant code from pve-zsync:

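# Add the incremental flag only if a previous snapshot is known and snapshot_exist() confirms it.
if ($source->{last_snap} && snapshot_exist($source, $dest, $param->{method})) {
    push @$cmd, '-i', "$source->{all}\@$source->{last_snap}";
}
# The new snapshot is always the stream that gets sent.
push @$cmd, '--', "$source->{all}\@$source->{new_snap}";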
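
For readers who do not speak Perl, here is a minimal Python sketch of the same decision. It only assembles the argument list and runs nothing. The `zfs send` prefix, the function name and the dataset/snapshot names are my assumptions for illustration; they are not copied from pve-zsync.

```python
def build_zfs_send_cmd(dataset, new_snap, last_snap=None, dest_has_last_snap=False):
    """Return the argv list for a full or an incremental `zfs send`.

    Hypothetical helper: it mirrors the Perl excerpt above, nothing more.
    """
    cmd = ["zfs", "send"]  # assumed command prefix
    # Send a delta only when a previous snapshot is known and also present remotely.
    if last_snap and dest_has_last_snap:
        cmd += ["-i", f"{dataset}@{last_snap}"]
    # The new snapshot is always the stream being sent.
    cmd += ["--", f"{dataset}@{new_snap}"]
    return cmd


# Hypothetical dataset and snapshot names.
print(build_zfs_send_cmd("rpool/data/subvol-100-disk-1", "rep_new", "rep_old", True))
print(build_zfs_send_cmd("rpool/data/subvol-100-disk-1", "rep_new"))
```

The first call yields an incremental stream between the two snapshots, the second a full stream, which is why the 8000 new files replicate in seconds while the first sync takes much longer.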


BTRFS

Btrfs is the only other filesystem that has built-in incremental send/receive replication, like ZFS.
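
As a rough illustration of what the btrfs equivalent looks like, here is a small Python sketch that builds the `btrfs send` / `btrfs receive` argument lists without executing them. The function names and paths are hypothetical; the snapshots passed to `btrfs send` are assumed to be read-only, and the two commands would normally be connected by a pipe (for example over ssh).

```python
def build_btrfs_send_cmd(new_snap_path, parent_snap_path=None):
    """Return argv for `btrfs send`, incremental when a parent snapshot is given."""
    cmd = ["btrfs", "send"]
    if parent_snap_path:
        # -p names the parent snapshot, so only the difference is streamed.
        cmd += ["-p", parent_snap_path]
    cmd.append(new_snap_path)
    return cmd


def build_btrfs_receive_cmd(dest_dir):
    """Return argv for `btrfs receive` into a directory on the target filesystem."""
    return ["btrfs", "receive", dest_dir]


# Hypothetical snapshot paths, for illustration only.
print(build_btrfs_send_cmd("/mnt/pool/snaps/ct100-new", "/mnt/pool/snaps/ct100-old"))
print(build_btrfs_receive_cmd("/mnt/backup/snaps"))
```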

It used to be unstable, but now it seems to be ok.

BTRFS is not the best candidate for virtualization

For example, it suffers when there is heavy write activity in the middle of existing files, so it’s probably not the best candidate for virtualization (the virtual disks are updated in place at each write). http://www.virtualtothecore.com/en/2016-btrfs-really-next-filesystem/

https://phocean.net/2016/03/20/a-journey-with-btrfs.html reports data corruption.

BTRFS Performance

Btrfs doesn’t shine for databases https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

Performance is poor in http://www.ilsistemista.net/index.p...ith-kvm-a-storage-performance-comparison.html . However, this test is no longer relevant, since it used an old Linux kernel.
Update : http://www.phoronix.com/scan.php?page=article&item=4fs-linux-4649&num=4

My latest sysbench and fio tests showed good results using btrfs as directory storage in Proxmox, compared to ZFS and ext4.

Docker bets on BTRFS
https://docs.docker.com/engine/userguide/storagedriver/btrfs-driver/

Btrfs has been long hailed as the future of Linux filesystems. With full support in the mainline Linux kernel, a stable on-disk-format, and active development with a focus on stability, this is now becoming more of a reality. As far as Docker on the Linux platform goes, many people see the btrfs storage driver as a potential long-term replacement for the devicemapper storage driver.

So if this is OK for Docker, I guess it could also be OK for LXC on Proxmox.

Current state on Proxmox

Of course, you can already use btrfs as directory storage on Proxmox https://www.internalfx.com/how-to-use-a-btrfs-raid-on-your-proxmox-server/ , but you are missing all the Proxmox integration benefits (including snapshots).

ZFS on LINUX LICENSING OUTLINE

A short outline of the story is on Wikipedia: https://en.wikipedia.org/wiki/Commo...icense#CDDL.27d_ZFS_into_GPL.27d_Linux_kernel

- Ubuntu says “yes” https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-linux/

- Software Freedom Conservancy says “no” https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/ , even raising the possibility of seeking resolution from the courts, and giving the example of https://sfconservancy.org/copyleft-compliance/vmware-lawsuit-faq.html#why-lawsuit (funny thing: the first round was lost http://www.theregister.co.uk/2016/0...pl_breach_case_but_plaintiff_promises_appeal/ )

On the other hand, I guess they don’t really want to bring it to court https://sfconservancy.org/blog/2016/apr/11/fsf-zfs/ , it’s more like an ambitious license conflict: Richard M. Stallman says “Oracle should lead the way, as the largest single copyright holder in the ZFS codebase, to either relicense ZFS, or, if they prefer, publish an updated CDDL that is compatible with the GPLv2 and GPLv3 (which would be another viable route to solve the same problem).”

What is more interesting (see https://en.wikipedia.org/wiki/OpenZFS#History ) is that the initial port of ZFS to Linux dates from 2008 and the OpenZFS umbrella from 2013, but all the internet fuss started only in 2016, when the Ubuntu 16.04 LTS release included CDDL-licensed ZFS on Linux. Now, after one year, I am not aware of any public updates on this issue.

The above are internet links. My 2 cents is that Oracle really enjoys staying silent in this dispute. On the one hand, it lets the open source community do the work, while backporting innovations and fixes from OpenZFS into its closed-source ZFS distribution. On the other hand, by refusing to relicense ZFS under the GPL, it keeps the confusion around ZFS on Linux alive, boosting the marketing and sales of its enterprise ZFS-based products.

CONCLUSIONS


This boring legal stuff sounds like pure paranoia. ZFS is so good that I take my chances and go for it. Fortunately, the flexibility of Proxmox allows me to move the LXC containers to another storage, just in case.

On the other hand, what if? Integrating a new storage backend is not trivial, so why not put btrfs support on the Proxmox roadmap, even with low priority?

Finally, to quote FreeNAS : “Once you go ZFS, you will never want to go back.” :)
 
I totally agree with you @EuroDomenii.

The really sad story about ZFS and BTRFS is that both were developed and are still maintained by Oracle.

There is one big problem with ZFS though: direct I/O (O_DIRECT) is not implemented, so every application that desperately needs it will fail, e.g. Oracle Database on ZFS (which is also not supported, but hey, both are from the same vendor: Oracle).

ZFS and Docker also work very well together (besides the O_DIRECT problem), and BTRFS is going to be good once they implement RAID5/6 that is known not to fail (which is currently still not the case). ZFS on Linux is really a compromise because of the whole Solaris portability layer, e.g. the performance on low-end devices (e.g. on an RPi) is also much better with BTRFS than with ZFS.
 
While I agree it is good to have at least one "backup solution", I'm against anything that stinks like Oracle. That company has a very long tradition of burying good projects (e.g. mysql, openoffice, solaris, opensso, java, etc). It is a multi-billion corporation, and if they come to the conclusion that stabbing the whole open-source community in the back could raise their own profit a bit, they'll do it. It would not be so easy with btrfs (due to the GPL), but let's not forget they have a whole army of lawyers...
 
Excellent. Synology DSM 6.1 supports btrfs volumes and I want to use them to do hot backups using btrfs snapshots from btrfs volumes controlled by PVE5 to btrfs volumes on Synology NAS.

Why use Synology, when we have the Proxmox VE storage replication framework? At the moment it supports only ZFS https://pve.proxmox.com/wiki/Storage_Replication#_supported_storage_types, but it is a general framework. Since BTRFS has the same snapshotting and remote incremental send/receive capabilities as ZFS, I am sure it will be supported.

So, you won’t have just a backup, but a hot standby, combined with Proxmox High-Availability.
 
Why use Synology, when we have the Proxmox VE storage replication framework?

When Proxmox adds support for BTRFS I will happily use it.

This is the plan I have in mind:
  • server SSD drives for a postgres database running in an LXC container under Proxmox control. The volume is external to the container (not a loop/qcow/etc. volume) to improve performance and recovery times. This is already implemented and we are happy.
  • server internal drives for VM + storage. Already implemented; HA is missing because there is no btrfs support in Proxmox.
  • internal backup 6 times a day to the Synology NAS (implemented). Later it will change to hourly hot backups using btrfs snapshots.
  • external backup to the cloud using CrashPlan. It will work on the data stored on the Synology NAS.
I could change some parts if better alternatives are available.

* edited: replaced LXD with LXC. I am using LXC as described in ProxMox manual.
 
  • server SSD drives for a postgres database running in an LXD container under Proxmox control. The volume is external to the container (not a loop/qcow/etc. volume) to improve performance and recovery times. This is already implemented and we are happy.
Proxmox doesn't use LXD; it has its own implementation of LXC (with the pct command line).
IMHO, BTRFS subvolumes with the default Proxmox mount point implementation are the best option (I don't see why another kind of mount would be faster).
Be sure to use a separate btrfs filesystem for postgres, mounted with the nodatacow option, for performance reasons. You lose checksumming, but you still have snapshots plus remote send/receive.
 
Hello @EuroDomenii,

Proxmox doesn't use LXD; it has its own implementation of LXC (with the pct command line).

I got confused with the names. I am using LXC as described in ProxMox manual. I corrected the original answer.

IMHO, BTRFS subvolumes with the default Proxmox mount point implementation are the best option (I don't see why another kind of mount would be faster).

BTRFS is mounted with default options on the server. I am using these options to mount it in the container running the database instance:
Mount Point (mp0) /mnt/sdd/postgresql, mp=/mnt/postgresql
This mount type does not create a raw file with a volume; it maps a BTRFS subvolume directly inside the container.

/mnt/sdd has BTRFS and /mnt/sdd/postgresql is a subvolume. A script creates read-only snapshots every 6 hours and puts them in /mnt/sdd/snapshots/postgresql-yyyymmdd-hhMMSS/. They are transferred to a Synology unit and later deleted.
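
A minimal sketch of what such a snapshot script could compute, in Python. It only builds the `btrfs subvolume snapshot -r` argument list and does not run it. The function name is hypothetical, and reading the naming pattern as year-month-day, hour-minute-second is my assumption.

```python
from datetime import datetime


def snapshot_cmd(subvol, snap_dir, now):
    """Return argv for a read-only btrfs snapshot with a timestamped name."""
    # Assumed interpretation of the naming pattern: postgresql-YYYYMMDD-HHMMSS
    name = f"postgresql-{now:%Y%m%d-%H%M%S}"
    dest = f"{snap_dir}/{name}"
    # -r makes the snapshot read-only, which `btrfs send` requires later on.
    return ["btrfs", "subvolume", "snapshot", "-r", subvol, dest]


# Paths taken from the post above; the timestamp is whatever "now" is.
print(snapshot_cmd("/mnt/sdd/postgresql", "/mnt/sdd/snapshots", datetime.now()))
```

Scheduling (every 6 hours), the transfer to the NAS and the later cleanup are left out on purpose; they depend on the local setup.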


Be sure to use a separate btrfs filesystem for postgres, mounted with the nodatacow option, for performance reasons. You lose checksumming, but you still have snapshots plus remote send/receive.

I am using default mounting options (with CoW activated). Since I am using snapshots on the postgres subvolume, there is no point in mounting it as nodatacow. Besides, CoW is one of the btrfs features I like. I know it is slower than using nodatacow and I am fine with that.
 
Hello @EuroDomenii,
BTRFS is mounted with default options on the server. I am using these options to mount it in the container running the database instance:
Mount Point (mp0) /mnt/sdd/postgresql, mp=/mnt/postgresql
This mount type does not create a raw file with a volume; it maps a BTRFS subvolume directly inside the container.
The btrfs prototype for LXC doesn't use raw files either, see https://www.mail-archive.com/pve-devel@pve.proxmox.com/msg18747.html (“Use subvolumes with btrfs storage”).

Hello @EuroDomenii,
I am using default mounting options (with CoW activated). Since I am using snapshots on the postgres subvolume, there is no point in mounting it as nodatacow. Besides, CoW is one of the btrfs features I like. I know it is slower than using nodatacow and I am fine with that.

Nodatacow is a trade-off between performance and the checksumming features of CoW. Of course, it is your choice. The performance gain is usually < 5%, unless the workload is random writes to large database files, where the difference can become very large.
MariaDB recommends: "It's usually best to mount Btrfs with the nodatacow option, disabling copy-on-write, because COW causes fragmentation, disk thrashing, and CPU and RAM spikes when you have a lot of random writes."

With nodatacow, CoW is still used on metadata, so that mount behaves at least like ext4/xfs, but with snapshots and remote incremental send/receive. This is a great flexibility of BTRFS that ZFS doesn't have.

Regarding snapshots, you still have them, even if it is not obvious: "There seems to be some misunderstanding around how nodatacow works. Nodatacow doesn't prohibit snapshot use. Snapshots are still allowed and, of course, will cause CoW to happen when a write occurs, but only on the first write. Subsequent writes will not CoW again. This does mean you don't get CRC protection for data, though. Since most databases do this internally, that is probably no great loss." (Re: BTRFS for OLTP Databases http://www.spinics.net/lists/linux-btrfs/msg62715.html)
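
To make the "CoW only on the first write after a snapshot" behaviour from the quoted explanation concrete, here is a toy Python model. It is purely illustrative: the class and its fields are made up and have nothing to do with how btrfs tracks extents internally.

```python
class ToyNodatacowFile:
    """Toy model only: tracks which blocks are still shared with a snapshot."""

    def __init__(self, n_blocks):
        self.shared = [False] * n_blocks  # True = block also referenced by a snapshot
        self.cow_writes = 0
        self.in_place_writes = 0

    def snapshot(self):
        # After a snapshot, every block is referenced by both the file and the snapshot.
        self.shared = [True] * len(self.shared)

    def write(self, block):
        if self.shared[block]:
            # First write after the snapshot: the block must be copied once.
            self.cow_writes += 1
            self.shared[block] = False
        else:
            # Block is private again: overwrite in place, no copy.
            self.in_place_writes += 1


f = ToyNodatacowFile(4)
f.write(0)      # no snapshot yet: in place
f.snapshot()
f.write(0)      # first write after the snapshot: copied
f.write(0)      # subsequent write to the same block: in place again
f.write(1)      # different block, first write after the snapshot: copied
print(f.cow_writes, f.in_place_writes)
```

In this model the cost of a snapshot is paid once per touched block, which matches the point of the quote: frequent snapshots on a nodatacow database file bring some CoW back, but only for the first write to each block.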
 
With nodatacow, CoW is still used on metadata, so that mount behaves at least like ext4/xfs, but with snapshots and remote incremental send/receive. This is a great flexibility of BTRFS that ZFS doesn't have.


It is true, but at the same time, you can lose any data if errors happen. After such an event, you must find the last snapshot with correct data, and it will be very difficult to find the last usable data. It is also quite possible to discover too late that you have bad data, and what will you do in that case?
If losing your data is not a big problem in your case, btrfs without CoW is OK.

In any other case it is a bad decision.

In the end, the question is what you want to get: secure data and a safe backup, or only an activity to tick off ... OK, I have (bad) data and a backup of it (also bad).
 
FYI, Red Hat seems to be abandoning btrfs, as it was deprecated in the latest RHEL 7.4.

This might be serious because, as we know, Red Hat has quite a strong voice in Linux. Remember, it came up with systemd and, despite strong opposition, pushed it forward until it became the de facto standard init system...

There is no real alternative to btrfs. ZFS on Linux is poorly integrated, competes successfully against server services by eating all memory resources, and managing it is cumbersome compared with btrfs. The worst part is that there is no hope of better integration. And all the time there is that licensing liability risk...

We are Debian/Ubuntu fans. IMHO Red Hat is too risky because they are the main (only?) contributors to Fedora. They are able to do things like that (it seems pretty similar to what Microsoft did in the past, when they pushed for DAO, then discarded DAO and replaced it with RDO, then discarded RDO too, and you were paying all the bills for the code conversion and its bugs). Since Ubuntu is Debian based, they cannot do the same, because the Debian community is big enough to balance their influence.

Anyway, at this point it is more a matter of opinion and long term strategy. I bet btrfs will win because I see the project more similar to linux kernel development (and community and licensing).
 
It is true, but at the same time, you can lose any data if errors happen. After such an event, you must find the last snapshot with correct data, and it will be very difficult to find the last usable data. It is also quite possible to discover too late that you have bad data, and what will you do in that case?
If losing your data is not a big problem in your case, btrfs without CoW is OK.

Indeed. Because of this we are not using nodatacow. A postgresql subvolume on btrfs with cow and snapshots is what we adopted. For now.
 
It’s true that CoW filesystems are the best solution for data integrity. Btrfs with nodatacow is like the journaling filesystems xfs and ext4 (only for the database; you could have a different mount with CoW for other files), except that you still have snapshots and remote incremental send/receive.

It is true, but at the same time, you can lose any data, if some [...]
In the end, the question is what you want to get: secure data and a safe backup, or only an activity to tick off ... OK, I have (bad) data and a backup of it (also bad).

This is like saying: "never use xfs, ext4."

Let’s take the Facebook use case

Btrfs at Facebook is used for OS updates and glusterfs storage https://www.linux.com/news/learn/in...ok-uses-linux-and-btrfs-interview-chris-mason

But for databases they stick with XFS. See Chris Mason on YouTube, W3QRWUfBua8?t=2289:
“MySQL has all the other features that btrfs is provided…”

Confirmation here: “This does mean you don't get CRC protection for data, though. Since most databases do this internally, that is probably no great loss.” http://www.spinics.net/lists/linux-btrfs/msg62715.html
 
