VZDump with flashcache on top of LVM

check-ict

Hello,

I use flashcache to speed up my storage. It works great; however, I noticed that VZDump can't make an LVM snapshot.

This is my setup:
MDADM: 4x 2TB eco disks in RAID5 = /dev/md0
MDADM: 2x 160GB SSDs in RAID1 = /dev/md1

Created an LVM VG "pve" on /dev/md0
Created an LVM LV "data" = /dev/mapper/pve-data
Created an ext3 filesystem on /dev/mapper/pve-data
(up to this point snapshot backups work)

Now I set up flashcache:
flashcache_create -p thru ssd /dev/md1 /dev/md0
Flashcache is loaded and a new device appears: /dev/mapper/ssd

If I mount /dev/mapper/ssd, it won't let me create snapshots. If I create a snapshot on /dev/mapper/pve-data (even while the SSD device is mounted), it works.

Is there some way I can tell vzdump to use /dev/mapper/pve-data (the LVM volume underneath flashcache)?
 
Would you please share how you went about compiling flashcache for the Proxmox kernel?
I may have some SSDs and time to test this myself in the next few weeks.

This is just a guess but I think you need to do things more like this:
Before creating the flashcache device you should unmount /var/lib/vz (/dev/mapper/pve-data)
Next you need to edit lvm.conf and create a filter telling LVM to ignore the underlying device (/dev/md0)
Something like this should work:
Code:
filter = [ "r|/dev/md0|", "a/.*/" ]
Now create the flash cache device:
Code:
flashcache_create -p thru ssd /dev/md1 /dev/md0
run pvscan
Verify that the pve group is only showing up on the flash cache device and not the underlying device.
Now you can mount /var/lib/vz (/dev/mapper/pve-data)
I would also expect that the snapshots in vzdump will work ok now.
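A rough consolidated sketch of the steps above (assuming the VG is named pve and sits on /dev/md0, the SSD mirror is /dev/md1, and pve-data is mounted on /var/lib/vz; adjust device names to your layout):
Code:
# stop anything using the volume and unmount it
umount /var/lib/vz

# in /etc/lvm/lvm.conf, tell LVM to ignore the raw md device:
#   filter = [ "r|/dev/md0|", "a/.*/" ]

# deactivate the VG so nothing holds /dev/md0 open
# (only possible if no other LV in pve is in use)
vgchange -an pve

# create the write-through cache device on top of the whole PV
flashcache_create -p thru ssd /dev/md1 /dev/md0

# rescan so LVM finds the pve VG via /dev/mapper/ssd instead of /dev/md0
pvscan
vgchange -ay pve

# remount the data volume
mount /dev/mapper/pve-data /var/lib/vz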
 
Hello e100

This is my build history:
Code:
apt-get install dkms build-essential git
git clone git://github.com/facebook/flashcache.git
cd flashcache/
make
make install
modprobe flashcache

After this, you can create a flashcache device:
flashcache_create -p thru ssd /dev/md1 /dev/mapper/pve-data
ssd = just a name for the cache device
/dev/md1 = the SSD (mine is in RAID1 for safety)
/dev/mapper/pve-data = the slow disk storage
-p thru = writethrough mode

After this you can mount /dev/mapper/ssd to /var/lib/vz
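If you want to double-check that the cache device exists and is active, something like this should work (just a sanity check):
Code:
# the flashcache target should show up in the device-mapper table
dmsetup table ssd
dmsetup status ssd

# the module also creates a sysctl directory per cache device
ls /proc/sys/dev/flashcache/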

You can remove flashcache by doing:
umount /var/lib/vz
dmsetup remove ssd
mount /dev/mapper/pve-data /var/lib/vz
Now you have all data again but without flashcache.

Writeback is also possible, but there are a few things to know.
flashcache_create -p back ssd /dev/md1 /dev/mapper/pve-data
mount /dev/mapper/ssd /var/lib/vz
umount /var/lib/vz
dmsetup remove ssd (can take a while, all dirty data gets written to disk)
flashcache_destroy /dev/md1 (if you want to discard the writeback cache)
flashcache_load /dev/md1 (if you want to load the cache config again)
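For writeback it can help to flush the dirty blocks before removing the device. A rough sketch, assuming your flashcache build exposes the do_sync sysctl and using the md1+pve-data cache name pattern shown further down in this thread (adjust to your devices):
Code:
# ask flashcache to clean all dirty blocks now (writeback only)
echo 1 > /proc/sys/dev/flashcache/md1+pve-data/do_sync

# dmsetup status reports the dirty block count; wait for it to drop before dmsetup remove
dmsetup status ssd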

Writeback has a problem when rebooting if MDADM/software RAID is used: it gives a kernel panic when trying to reboot or shut down. You can use an init script at shutdown to work around this:

Code:
#!/bin/sh
# Start or stop Flashcache

### BEGIN INIT INFO
# Provides:          flashcache
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Flashcache SSD caching
# Description:       Flashcache SSD caching
### END INIT INFO

PATH=/bin:/usr/bin:/sbin:/usr/sbin

flashcache_start() {
    # /var/lib/vz already mounted means the cache is already up
    if df -h | grep /var/lib/vz > /dev/null
    then
        echo "Flashcache already running"
    else
        flashcache_load /dev/md1
        mount /dev/mapper/ssd /var/lib/vz
        echo 1 > /proc/sys/dev/flashcache/md1+pve-data/fast_remove
        echo "Flashcache started"
    fi
}

flashcache_stop() {
    if df -h | grep /var/lib/vz > /dev/null
    then
        umount /var/lib/vz
        dmsetup remove ssd
        echo "Flashcache stopped"
    else
        echo "Flashcache not running"
    fi
}


case "$1" in
    start)
flashcache_start
    ;;

    stop)
flashcache_stop
    ;;

    restart)
        $0 stop
        $0 start
    ;;
esac

exit 0
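To wire the script into the boot and shutdown sequence, something like this should do it (assuming the script is saved as /etc/init.d/flashcache):
Code:
chmod +x /etc/init.d/flashcache
update-rc.d flashcache defaults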
 
Seems pretty simple.

Now that I have a better understanding of how this works, I have edited my post above, adding the proper flashcache_create command to make snapshots possible.

For snapshots to work, the whole physical volume must be on the flashcache device.
Before flashcache pvscan should output something like this:
Code:
PV /dev/md0   VG pve             lvm2
After flashcache pvscan should output something like this:
Code:
PV /dev/mapper/ssd   VG pve             lvm2

That is why it is necessary to set the filter in lvm.conf, so pvscan sees the volume on /dev/mapper/ssd and not on /dev/md0.
You would still mount /dev/mapper/pve-data to /var/lib/vz.
The difference is that /dev/mapper/pve-data is inside the pve physical volume, which now sits on /dev/mapper/ssd instead of /dev/md0.
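A quick way to verify where the volume group actually lives (output will vary with your setup):
Code:
# show which block device backs each PV and which devices each LV uses
pvs -o pv_name,vg_name
lvs -o lv_name,vg_name,devices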

If your system is a typical Proxmox setup this will not be possible, since pve-root and pve-swap are also on the pve volume group, unless you can get the flashcache setup applied at boot before any volumes are mounted.
 
Flashcache on bootup is too risky, so I will reinstall my server with two MDADM RAID sets: one for the OS and one for data.

Thanks for the info, I will try it next week.
 
Without flashcache bonnie++ gives me around 400 IOPS. With flashcache it gives between 5,000 and 10,000 IOPS, depending on the SSD speed and whether I use a RAID set or a single SSD.

Sequential write speed is sometimes slower because of the RAID penalty on the SSDs. Sequential read is very fast, especially when the data is cached on the SSD. Bonnie++ gives nice results.
 
Without flashcache bonnie++ gives me around 400 IOPS. With flashcache it gives between 5,000 and 10,000 IOPS, depending on the SSD speed and whether I use a RAID set or a single SSD.

Sequential write speed is sometimes slower because of the RAID penalty on the SSDs. Sequential read is very fast, especially when the data is cached on the SSD. Bonnie++ gives nice results.
Hi,
that sounds good. I guess Facebook uses this on many servers, so it should be production-safe, or not?

Udo
 
Hi e100,
do you have any performance figures comparing with and without flashcache (on HW RAID)?
Perhaps pveperf or more?

Udo

Not yet.
*If* a project I quoted gets approved, I will have dozens of SSDs and some Areca 1882ix-24 controllers to play with for a couple of days.
I will try to get some benchmarks with and without flashcache on top of hardware raid.

What sort of benchmarks do you suggest?

Do you think DRBD would be happy using a flashcache device as its underlying device?
Maybe I can test that too if this project gets approved.
 
Hi,
that sounds good. I guess Facebook uses this on many servers, so it should be production-safe, or not?

Udo

https://raw.github.com/facebook/flashcache/master/doc/flashcache-doc.txt
"It is important to note that in the first cut, cache writes are non-atomic, ie, the "Torn Page Problem" exists. In the event of a power failure or a failed write, part of the block could be written, resulting in a partial write. We have ideas on how to fix this and provide atomic cache writes (see the Futures section)."

Maybe you will uncover some corner case in your usage that Facebook does not hit.

Looks like using DRBD 8.4 would be preferred to get the best performance from flashcache and DRBD; see the video dietmar found.

There seem to be a number of issues logged:
https://github.com/facebook/flashcache/issues
 
Flashcache with writethrough has passed all my tests (power failures, RAID degradation/rebuild, etc.). Writeback, however, is very risky: it corrupts data easily and has problems with rebooting on Debian 5/6 and Ubuntu 10.04 LTS (12.04 works fine out of the box).

So I use writethrough in production now and am still experimenting with writeback. With some tweaks writeback is quite stable, but of course it always carries a higher risk.
 
Hello E100,

LVM is working with flashcache now. It can create snapshots now that pvscan shows /dev/mapper/ssd.

However I found a problem...
How can I make sure my flashcache script only stops after Proxmox has shut down all VMs?

If I don't unmount the flashcache device, I get a kernel panic when trying to restart/shutdown. Only Ubuntu 12.04 is able to reboot/shutdown without any script.

My script:
Code:
#!/bin/sh

# Start or stop Flashcache

### BEGIN INIT INFO
# Provides:          flashcache
# Required-Start:
# Required-Stop:     $remote_fs $network pvedaemon
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Flashcache SSD caching
# Description:       Flashcache SSD caching
### END INIT INFO

PATH=/bin:/usr/bin:/sbin:/usr/sbin

flashcache_start() {
    # /var/lib/vz already mounted means the cache is already up
    if df -h | grep /var/lib/vz > /dev/null
    then
        echo "Flashcache already running"
    else
        flashcache_load /dev/md2
        mount /dev/mapper/pve-data /var/lib/vz
        mount /dev/mapper/pve-backup /mnt/backup
        echo 1 > /proc/sys/dev/flashcache/md2+md1/fast_remove
        echo "Flashcache started"
    fi
}

flashcache_stop() {
    if df -h | grep /var/lib/vz > /dev/null
    then
        umount /mnt/backup
        umount /var/lib/vz
        dmsetup remove ssd
        echo "Flashcache stopped"
    else
        echo "Flashcache not running"
    fi
}


case "$1" in
    start)
flashcache_start
    ;;

    stop)
flashcache_stop
    ;;

    restart)
        $0 stop
        $0 start
    ;;
esac

exit 0

I want to keep Flashcache as simple as possible, so I hope it can all work without too many modifications.
 
You need to adjust the order in which things start and stop by editing the LSB header in the init script.
Likely you need to add entries to these lines:
# Required-Start:
# Required-Stop: $remote_fs $network pvedaemon

Then use update-rc.d to remove the old runlevel links and run it again to set the defaults.
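For example, something like this should re-register the script with the new ordering (assuming it is saved as /etc/init.d/flashcache):
Code:
# drop the existing runlevel links, then recreate them from the updated LSB header
update-rc.d -f flashcache remove
update-rc.d flashcache defaults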

I noticed you edited your post; originally you mentioned some LVM problems.
I assume you fixed those. If not, and for the benefit of other readers: you need to set a filter in lvm.conf telling LVM not to look at the underlying device.
That way LVM will only see the volume on the flashcache device and everything will then work well.
 
Hi e100,
do you have any performance figures comparing with and without flashcache (on HW RAID)?
Perhaps pveperf or more?

Udo

I did some benchmarks using a Crucial M4 256GB SSD and an Areca 1880 with 6 WD RE4 disks in RAID5.
flashcache in that configuration was slower than using the Areca array directly.

I also tested the same Areca array but with a RAM disk as the flashcache device.
This was to test the overhead of flashcache, since RAM is the fastest "SSD" possible.
The overhead of flashcache is significant: my writes were much slower with flashcache, even in write-around mode.

It is my opinion, based on a few hours of benchmarks, that flashcache is not worth the hassle if you already have a good hardware RAID controller.

Flashcache seems more suited to mostly read-heavy applications.

--The following is speculation since I lack the necessary hardware to test---

I estimate that from a performance standpoint this is likely to be true:
A PCIe SSD + mdadm + few mechanical disks + flashcache == few mechanical disks + good RAID controller with BBU

flashcache would have the advantage of more cache than the RAID card, so likely a longer burst of peak random IOPS.
The RAID card would have the advantage in availability/reliability/ease of use.
PCIe SSDs are costly, so I suspect the overall cost would be roughly identical.
 
Something is not right with flashcache. According to this test done on both RAID10 and RAID0 arrays, bcache is vastly superior in all use cases.
There are many times when flashcache is actually slower than the un-cached RAID array:
http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/

As flashcache has probably matured since last April, someone should rerun these benchmarks, preferably on a Proxmox VE server, from inside a VM.

Also it would be interesting to see if bcache can be loaded as a module into the Proxmox kernel.
 
