Using deduplication for Dropbox on multiple VMs

RudeRubbish

New Member
Dec 7, 2016
I am planning to run several virtual machines that will all use the same Dropbox account, and I want to use the deduplication abilities of ZFS to minimize wasted disk space. I also want the Dropbox data on a separate physical device from the operating systems so that the two do not compete for read/write bandwidth.

So here is my plan:
Create a zpool for the express purpose of Dropbox.
Add to that zpool a mirror vdev made of 2x2TB drives.
Also add an M.2 drive to the pool for L2ARC.
Make two 800GB zvols on the mirror vdev.
Pass each of those zvols to the appropriate VM as a block device.
Set up Dropbox in each VM (the VMs themselves live on a separate pool) to store its data on the zvol that has been passed to that VM.

Does this plan make sense and will it work as I am describing?
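
For reference, the plan above could be sketched roughly as the commands below. This is only a dry run that prints the commands rather than executing them, and the pool name `dropbox`, the zvol names, and the `/dev/disk/by-id/...` paths are placeholders I made up, not values from this setup.

```shell
# Dry run: prints the commands instead of executing them.
# POOL, the zvol names and the /dev/disk/by-id paths are placeholders.
POOL=dropbox

plan() {
  # Mirror vdev made of the two 2TB drives
  echo "zpool create $POOL mirror /dev/disk/by-id/DISK_A /dev/disk/by-id/DISK_B"
  # M.2 drive as L2ARC (cache device)
  echo "zpool add $POOL cache /dev/disk/by-id/M2_DISK"
  # Deduplication enabled pool-wide, inherited by the zvols
  echo "zfs set dedup=on $POOL"
  # One 800G zvol per VM
  echo "zfs create -V 800G $POOL/vm-windows"
  echo "zfs create -V 800G $POOL/vm-linux"
}

plan
```

Each zvol would then show up on the host under `/dev/zvol/` and could be handed to its VM as a block device.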
 
Deduplication is tricky on ZFS :). ZFS uses block-level deduplication. Put simply, it keeps a database-like table of information about the deduplicated blocks. As long as this table fits in RAM, everything is fine. But if it grows to the point where it no longer fits entirely in RAM, every I/O operation becomes very slow.
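Also keep in mind that deduplication is close to a one-way road. You can switch the property off again, but that only affects new writes; blocks that are already deduplicated stay that way until they are rewritten or deleted ;)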
Some recommendations for when it is OK to use deduplication (my own thoughts):
- use it only if your data does not change very often
- you must have a large amount of RAM
- your data size is small

In many cases other options are better: ZFS compression or file-level deduplication.
 
Ok, since my Dropbox is over 500GB and it changes very frequently, that's probably not a good idea. File-level deduplication sounds great, but can it be done across zvols with different file system types (in my case, NTFS and Ext4)?
 
I guess that is not possible. But within a single zvol, NTFS can do file-level deduplication, and for Linux there are many different solutions.
 
@LnxBil : Because it is not recommended to put Dropbox on a network drive (although some people do, and instructions for it can be found).

@guletz: I was just doing some more reading about deduplication, and according to my calculations 2GB RAM should be enough to cover dedup on 800 GB data with 128k block size. That actually isn't so bad.
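
The arithmetic behind that figure can be sketched like this. It assumes the commonly quoted rule of thumb of about 320 bytes of RAM per dedup-table entry (that number is an assumption, it is not stated in this thread) and the worst case where every block is unique.

```shell
# Rule-of-thumb assumption: about 320 bytes of RAM per dedup-table entry.
DDT_ENTRY_BYTES=320

# ddt_bytes <data size in GiB> <block size in KiB>
# Worst case: every block is unique and needs its own table entry.
ddt_bytes() {
  blocks=$(( $1 * 1024 * 1024 / $2 ))   # GiB -> KiB, then divide by block size
  echo $(( blocks * DDT_ENTRY_BYTES ))
}

ddt_bytes 800 128   # prints 2097152000, roughly 2 GB
```

With 4 KiB blocks the same formula gives 32 times as much, which is why the block size matters so much for this estimate.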
 
@LnxBil : Because it is not recommended to put Dropbox on a network drive (although some people do, and instructions for it can be found).

Yes, sure: locking and the like become a problem if you run Dropbox on every client. But what if you run it only once, outside the guests, and share just the files directly?

I'm just curious: I didn't understand why someone would need a big Dropbox in the first place. What type of files or what type of service needs such an architecture? What do you want to achieve that cannot be achieved by using another technology for sharing files?

@guletz: I was just doing some more reading about deduplication, and according to my calculations 2GB RAM should be enough to cover dedup on 800 GB data with 128k block size. That actually isn't so bad.

On a zvol you will not achieve this, because there is another filesystem in between; it only works if you use LXC and therefore a ZFS filesystem directly. With a zvol, the guest puts its own filesystem on top, and for this to work the files would have to be aligned to 128K blocks and not fragmented. In the real world this cannot be guaranteed and is very unlikely.
 
You can also simulate whether your own data would benefit from deduplication.

The essence is this:

Code:
zdb -S <poolname>

so in my example:

Code:
root@proxmox4 ~ > time zdb -S rpool
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    48.0M    335G    271G    271G    48.0M    335G    271G    271G
     2    12.7M   87.4G   73.2G   73.3G    27.6M    191G    159G    160G
     4    6.66M   33.3G   30.5G   30.5G    33.4M    165G    152G    152G
     8    2.40M   11.0G   10.3G   10.4G    23.8M    109G    103G    103G
    16    1.02M   4.21G   4.14G   4.16G    21.4M   88.2G   87.0G   87.2G
    32     158K    691M    667M    669M    6.43M   27.9G   27.0G   27.1G
    64    24.2K    110M    107M    107M    2.10M   9.57G   9.27G   9.33G
   128    1.32K   6.60M   6.11M   6.12M     209K   1.01G    951M    953M
   256      315   1.32M   1.22M   1.23M     110K    474M    438M    441M
   512      125    524K    496K    500K    82.8K    349M    328M    331M
    1K       13     80K     52K     52K    17.5K    109M   69.9M   69.9M
    2K       10     53K     33K     40K    25.6K    139M   86.3M    102M
    8K        1    128K      4K      4K    9.53K   1.19G   38.1M   38.1M
   16K      652   2.55M   2.55M   2.55M    14.5M   57.9G   57.9G   57.9G
 Total    71.0M    471G    390G    390G     178M    986G    868G    869G

dedup = 2.23, compress = 1.14, copies = 1.00, dedup * compress / copies = 2.53


real    14m9.567s
user    14m44.960s
sys     5m42.744s
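
As far as I understand the zdb output, the summary line can be reproduced from the "Total" row: dedup is the referenced DSIZE divided by the allocated DSIZE, and compress is the referenced LSIZE divided by the referenced PSIZE. Treat that reading of the columns as an assumption; the small `ratio` helper below is only for illustration.

```shell
# ratio <numerator> <denominator>: print the quotient with two decimals
ratio() {
  LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Values taken from the "Total" row above, in GiB.
ratio 869 390   # referenced DSIZE / allocated DSIZE  -> 2.23 (dedup)
ratio 986 868   # referenced LSIZE / referenced PSIZE -> 1.14 (compress)
```

In other words, this pool would need about 390G on disk for 869G of referenced data if deduplication were enabled.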
 
Thanks!
I thought about having Dropbox on the host and then unison syncing the files with guests, but doesn't that just add one more layer of reading/writing data and create a third copy of the data?
I'm not specifically married to Dropbox, but I like having my files synced across multiple devices at home and work, bidirectionally and seamlessly, and Dropbox has done that for me without issue for several years.
Are you sure that ZVOLs can't take advantage of deduplication? According to this they can: https://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/
 
Are you sure that ZVOLs can't take advantage of deduplication?

Of course they can, but you'll have another filesystem in between, and for the best deduplication rates you need the same block size on your zvol as on your guest filesystem. Say you have 8K on your zvol and the default 4K on the ext4 inside your guest. Now imagine two guest filesystems holding the same 16K file, which consists of four 4K blocks, A to D.

Code:
Guest 1    x  x  |  A  B  |  C  D  |  x  x
Guest 2    x  A  |  B  C  |  D  x  |  x  y
ZVOL         a   |    b   |    c   |   d

Because the file is misaligned by one 4K block, the data cannot be deduplicated: every 8K block the zvol stores for Guest 1 differs from every one it stores for Guest 2.

In a file-based system the files would be identical; a zvol, however, is block-based.
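
The same effect can be played through with a toy model. In this sketch one letter stands for one 4K guest block, A to D are the shared file, and the lowercase letters are made-up filler that differs between the guests; it only compares block contents and says nothing about how ZFS stores them.

```shell
# One letter stands for one 4K guest block. A-D are the shared file,
# lowercase letters are unrelated data that differs between the guests.
GUEST1="pqABCDrs"
GUEST2="tABCDuvw"

# distinct_blocks <layout> <letters per zvol block>
distinct_blocks() {
  printf '%s\n' "$1" | fold -w "$2" | LC_ALL=C sort -u
}

# shared_blocks <layout1> <layout2> <letters per zvol block>
# Counts zvol blocks whose content appears in both guests.
shared_blocks() {
  { distinct_blocks "$1" "$3"; distinct_blocks "$2" "$3"; } |
    LC_ALL=C sort | uniq -d | wc -l | tr -d ' '
}

shared_blocks "$GUEST1" "$GUEST2" 2   # 8K zvol blocks: prints 0
shared_blocks "$GUEST1" "$GUEST2" 1   # 4K zvol blocks: prints 4
```

With 8K zvol blocks nothing matches, while with 4K blocks all four blocks of the file would be shared.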

I hope it's clear now what I meant.
 
Ah ok, that is a good thing to be mindful of. Is there anything that would stop me from using the same block size on everything?
Oh, wait I see. Ext4 has a max of 4kb and Windows, 64kb. With 4kb blocks deduplication doesn't make much sense for me.
It looks to me like there is no way to have Windows and Linux and ZFS all using filesystems with a large block size (32-128kb).. Too bad :(
 
Hmmm.. or maybe I could use a different Linux filesystem (maybe even ZFS) that uses larger blocks. I really want to use large blocks for this (64kb would be ideal). And I thought NTFS was 4kb by default too.
 
Hmmm.. or maybe I could use a different Linux filesystem (maybe even ZFS) that uses larger blocks. I really want to use large blocks for this (64kb would be ideal). And I thought NTFS was 4kb by default too.

Hopefully you will not run a database on those larger blocks, because you'll get a lot of write amplification: most databases use a block size of 8 KB. Changing one database block means reading a 64 KB block, modifying the 8 KB that actually changed, and writing the full 64 KB block back to disk.
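
As a quick worked number for that case (a simplified model that assumes each small write touches exactly one zvol block and ignores caching and write coalescing):

```shell
# write_amplification <volblocksize in KiB> <application write size in KiB>
# Simplified: one small write causes one full-block write.
write_amplification() {
  echo $(( $1 / $2 ))
}

write_amplification 64 8   # prints 8
```

So every 8 KB change would cost roughly eight times as much disk write traffic as the data that actually changed.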

And yes, aligning and optimising for best throughput in a virtualised environment is not an easy task. You need to consider a lot of parameters. It's even more complicated with deduplication (memory consumption increases as you reduce the volblocksize) and with recordsize, because of its variable nature.
 
Nope, not a database, lots of large text, csv, and media files. Multiple similar versions of them too. Which is why I want to use deduplication and big blocks.
 
Have you got a test environment in which you can test your setup? I often spin up some machines just for such reasons and test everything out. You can start with a "smaller" directory of your big inbox to test the deduplicability (is that a word?) and extrapolate the results to your big box.
 
I will once I do my build. My plan is to do a minimal build, test various configurations over a period of weeks or months, adding components after I've established a baseline. I plan to try both Proxmox/KVM and FreeBSD/Bhyve. But the more I can figure out in advance about how to configure them, the better.
 
