LizardFS anyone?

casalicomputers

Mar 14, 2015
Hello everybody,
has anyone already tried LizardFS (http://www.lizardfs.com) as an alternative distributed filesystem?
It looks promising and, at first glance, less complex than Ceph, since it seems it could be deployed on just two nodes.

I'm just wondering if it could be coupled with ZFS ..... :rolleyes:

Any feedback to share?
 
Really no one? :(
AFAIK, that is a fork of MooseFS...

Before I'd start playing with it, I would like to know if someone has ever used it with proxmox (or whatever) and how it performs.
 
Just looking into it myself now, so can't answer your question :)

Did you give it a try?
 
LizardFS gets positive reviews for performance, features, and stability; however, I haven't been able to find much documentation on it yet. LizardFS was based on MooseFS when it went open source back in May 2008, so I believe the documentation for MooseFS is applicable. I don't know why LizardFS forked from MooseFS.
 
LizardFS gets positive reviews for performance, features, and stability; however, I haven't been able to find much documentation on it yet. LizardFS was based on MooseFS when it went open source back in May 2008, so I believe the documentation for MooseFS is applicable. I don't know why LizardFS forked from MooseFS.

HA, I think. With MooseFS you don't get failover or backup for the primary metadata server unless you buy a subscription, so if it goes down, you're stuffed.

I'm a bit concerned as to whether I can run chunk and metadata servers on the Proxmox nodes themselves; mem/CPU requirements are probably too high.

The user list seems very quiet too, only a few posts per month. Not a good sign.
 
I finally got round to setting up a test bed.
3 Debian containers (1 per Proxmox Node)
- 256GB of storage, backed by ZFS RAID10
- 2 GB RAM
- 2 Cores
- 1 master, 2 shadow servers
- chunkserver
- webserver
- 1 Gb Ethernet

Ran the FUSE client on the Proxmox servers (3 nodes)

Took me several hours to get set up while I got used to the system components. Once you get the hang of it, it makes sense. A lot simpler than Ceph, a little more complex than Gluster. The docs have improved a *lot* since I last looked at them. It would go a lot easier next time.

Did some initial testing with a couple of VMs; performance was surprisingly good. Writes were getting 110 MB/s on replica 2 and 80 MB/s on replica 3. Raw reads were ridiculously high, 2800+ MB/s, and IOPS were pretty good too.

Tools for checking the status are quite good - command line and a nice web page showing servers, disks and chunk status. Easy to restart a chunkserver and watch the healing process.

The master metaserver is a single point of failure and it's a real pain to manually promote shadow servers. I set up a custom HA solution as described here:

https://sourceforge.net/p/lizardfs/mailman/lizardfs-users/thread/118344664.xhctZtnkdd@debstor/#msg33189560

using ucarp and network up/down scripts.

It actually works pretty well: kill or restart the master server, writes pause for a few seconds, and then it transparently fails over to another node. A couple of times I managed to have two master servers active, but only one was being used.


The goals are amazingly flexible and useful. You can set separate replication levels for individual files and directories, so VMs on the same cluster can have replica 2, 3, 4, etc., or various EC encodings. And you can change them on the fly, *including* changing a standard replica goal to an EC goal. The system just starts copying, encoding and deleting chunks as needed to fit the new requirements. It's fascinating to watch. It auto-balances disks as well. Expanding storage and replacing disks is a doddle, unlike Gluster, which can be quite painful.
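For a rough feel of what a goal change costs in raw disk, the arithmetic can be sketched like this. It assumes an N-copy replica goal stores N full copies, and that an EC goal with k data parts and m parity parts stores (k + m) / k bytes per byte of user data. The function names are made up for illustration and are not part of LizardFS.

```python
from fractions import Fraction


def replica_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data for an N-copy replica goal."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return Fraction(copies)


def ec_overhead(data_parts: int, parity_parts: int) -> Fraction:
    """Raw bytes stored per byte of user data for an EC(k, m) goal (assumed model)."""
    if data_parts < 1 or parity_parts < 0:
        raise ValueError("need k >= 1 data parts and m >= 0 parity parts")
    return Fraction(data_parts + parity_parts, data_parts)


def raw_needed_gb(user_gb: int, overhead: Fraction) -> float:
    """Raw capacity needed to hold user_gb of data under the given overhead."""
    return float(user_gb * overhead)


if __name__ == "__main__":
    # Compare a 256 GB VM store under a few hypothetical goals.
    for label, overhead in [
        ("replica 2", replica_overhead(2)),
        ("replica 3", replica_overhead(3)),
        ("EC(3, 2)", ec_overhead(3, 2)),
    ]:
        print(f"{label}: {raw_needed_gb(256, overhead):.1f} GB raw")
```

Under this model, moving a file from replica 3 to an EC goal with 3 data and 2 parity parts cuts its raw footprint from 3x to about 1.67x while still tolerating the loss of two parts.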


Quorum is ... interesting: it will keep writing so long as at least the master metaserver is up. According to the docs, chunks are versioned so that it knows which ones are the latest when chunkservers come back up. I did notice that when I managed to create a block of "missing" chunks (i.e. force writes to one chunkserver only, then take it down), the system blocked further writes to those chunks until that chunkserver was back up, so I was unable to create a split-brain.
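That blocking behaviour is what you'd expect from versioned chunks. Here is a toy model of the idea (not LizardFS code; the class and method names are invented, and the real master is certainly more involved): each write bumps the chunk version on the servers that took part, and a later write is refused unless at least one online server holds the latest version.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ChunkState:
    """Toy view of one chunk as a master might track it."""
    latest_version: int = 0
    # chunkserver name -> version of the copy it holds
    copies: Dict[str, int] = field(default_factory=dict)

    def valid_copies(self, online: Set[str]) -> Set[str]:
        """Online servers whose copy matches the latest known version."""
        return {s for s in online if self.copies.get(s) == self.latest_version}

    def write(self, online: Set[str]) -> bool:
        """Apply a write on the up-to-date online copies; refuse if there are none."""
        current = self.valid_copies(online)
        if not current:
            return False  # only stale copies reachable: block instead of forking history
        self.latest_version += 1
        for server in current:
            self.copies[server] = self.latest_version
        return True


if __name__ == "__main__":
    chunk = ChunkState(copies={"cs1": 0, "cs2": 0})
    print(chunk.write({"cs1"}))         # only cs1 is up, so it alone gets version 1
    print(chunk.write({"cs2"}))         # cs1 down, cs2 is stale, so the write is refused
    print(chunk.write({"cs1", "cs2"}))  # cs1 is back, so writes resume
```

Because the stale copy is never accepted as a base for new writes, two diverging histories of the same chunk can't form in this model, which matches the "unable to create a split-brain" observation.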

Mildly concerned re this review on MooseFS:

http://pl.atyp.us/hekafs.org/index.php/2012/11/trying-out-moosefs/

"The MooseFS numbers are higher than is actually possible on the single GigE that client had. These numbers are supposed to include fsync time, and the GlusterFS numbers reflect that, but the MooseFS numbers keep going as data continues to be buffered in memory. That’s neither correct nor sustainable. The way that MooseFS ignores fsync reminds me of another thing I noticed a while ago: it ignores O_SYNC too. I verified this by looking at the code and seeing where O_SYNC got stripped out, and now my tests show the same effect"

That was from 2012 though, hopefully lizardfs has improved on that code. Will ask on the userlist.
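One quick sanity check for this kind of result: compare the measured throughput against what the link can physically carry, and time a write that includes the fsync. A minimal sketch, assuming a POSIX-like system; the function names are mine, and random data is used so a compressing backend such as ZFS can't shrink the payload.

```python
import os
import time


def line_rate_mb_s(link_gbit: float) -> float:
    """Upper bound in MB/s for a link of link_gbit Gbit/s, ignoring protocol overhead."""
    return link_gbit * 1000.0 / 8.0


def exceeds_line_rate(measured_mb_s: float, link_gbit: float) -> bool:
    """True if a measurement is higher than the wire could deliver (so it hit a cache)."""
    return measured_mb_s > line_rate_mb_s(link_gbit)


def fsync_write_mb_s(path: str, total_bytes: int = 64 * 1024 * 1024,
                     block_size: int = 1024 * 1024) -> float:
    """Write about total_bytes to path, fsync, and return MB/s including the fsync time."""
    block = os.urandom(block_size)
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < total_bytes:
            written += os.write(fd, block)
        os.fsync(fd)  # a filesystem that ignores this will report inflated numbers
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return written / 1e6 / elapsed
```

With a single gigabit link the ceiling is 125 MB/s, so a 2800 MB/s read has to be coming out of a cache, and a write figure above the ceiling on a network mount suggests the fsync was not honoured end to end.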

I'm certainly intrigued; quite promising, I think. Next step for me will be running chunkservers on real hardware and using some VMs on it in anger.
 
I've done a lot more testing since, it hasn't worked out so well. Everything is hunky dory until you actually introduce some issues and it turns out the metadata servers are really flaky. I can reliably corrupt every running VM by power-cycling the master metadata server.

Copy paste from my post to the lizard list:

**********************************************************************************
Have just finished up a week of stress testing lizardfs for HA purposes with a variety of setups, and it hasn't fared too well. In short, when under load I see massive data loss when hard resets are simulated. I've detailed the results below, plus my HA scripts.

I'm not trying to knock lizardfs. I find it very powerful and useful and would welcome any suggestions for improvements. I really want to make it work for us.



Hardware:

3 Compute nodes, each with
  • 3 TB WD Red * 4 in ZFS Raid 10
  • dedicated SSD Log device
  • 64 GB RAM
  • 1 Gb * 2 Bond (balance-rr) dedicated to lizardfs
  • 1 Gb Public IP



I've set up a floating IP for mfsmaster that is handled via keepalived, with scripts to handle master promotion/demotion. keepalived juggles the IP well, passing it between nodes as needed and running the promote/demote script.



Chunkservers function well, taking them up/down, adding/removing disks on the fly is not an issue. Quite impressive and very nice.



OTOH, the metadata servers seem quite fragile. Several times I've observed chunks go missing when a master is downed and a shadow takes over; a high IO load seems to be the biggest indicator for this.

Also, they regularly fail to start after a node reset, even after a "mfsmetarestore -a". I regularly see this error:



mfsmaster -d
[ OK ] configuration file /etc/mfs/mfsmaster.cfg loaded
[ OK ] changed working directory to: /var/lib/mfs
[ OK ] lockfile /var/lib/mfs/.mfsmaster.lock created and locked
[ OK ] initialized sessions from file /var/lib/mfs/sessions.mfs
[ OK ] initialized exports from file /etc/mfs/mfsexports.cfg
[ OK ] initialized topology from file /etc/mfs/mfstopology.cfg
[WARN] goal configuration file /etc/mfs/mfsgoals.cfg not found - using default goals; if you don't want to define custom goals create an empty file /etc/mfs/mfsgoals.cfg to disable this warning
[ OK ] loaded charts data file from /var/lib/mfs/stats.mfs
[....] connecting to Master
[ OK ] master <-> metaloggers module: listen on *:9419
[ OK ] master <-> chunkservers module: listen on *:9420
[ OK ] master <-> tapeservers module: listen on (*:9424)
[ OK ] main master server module: listen on *:9421
[ OK ] open files limit: 10000
[ OK ] mfsmaster daemon initialized properly
mfsmaster[6453]: connected to Master
mfsmaster[6453]: metadata downloaded 364545B/0.003749s (97.238 MB/s)
mfsmaster[6453]: changelog.mfs.1 downloaded 1143154B/0.024354s (46.939 MB/s)
mfsmaster[6453]: changelog.mfs.2 downloaded 0B/0.000001s (0.000 MB/s)
mfsmaster[6453]: sessions downloaded 2762B/0.000365s (7.567 MB/s)
mfsmaster[6453]: opened metadata file /var/lib/mfs/metadata.mfs
mfsmaster[6453]: loading objects (files,directories,etc.) from the metadata file
mfsmaster[6453]: loading names from the metadata file
mfsmaster[6453]: loading deletion timestamps from the metadata file
mfsmaster[6453]: loading extra attributes (xattr) from the metadata file
mfsmaster[6453]: loading access control lists from the metadata file
mfsmaster[6453]: loading quota entries from the metadata file
mfsmaster[6453]: loading file locks from the metadata file
mfsmaster[6453]: loading chunks data from the metadata file
mfsmaster[6453]: checking filesystem consistency of the metadata file
mfsmaster[6453]: connecting files and chunks
mfsmaster[6453]: calculating checksum of the metadata
mfsmaster[6453]: metadata file /var/lib/mfs/metadata.mfs read (26 inodes including 13 directory inodes and 13 file inodes, 10548 chunks)
mfsmaster[6453]: running in shadow mode - applying changelogs from /var/lib/mfs
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoull
Aborted



Oddly, the only thing that seems to stop it happening is restarting the *master* server it is connecting to.



Worst case is simulating a catastrophic power failure, which I did by invoking a hard reset on each server ("echo b > /proc/sysrq-trigger"). I did this with a lizardfs system fully healed, with 1 master and two shadows. 3 VMs were running on each node, a light load for our system.



When it came back up one shadow master failed to start and 683 chunks were missing, a range from every VM that was running.

As it stands, that's unusable for a production system, from our perspective anyway. We can't be manually hand-holding services and restoring backups every time a node throws a wobbly, and it happens even in the best of data centers, let alone an understaffed SMB like ourselves.

As far as I can tell, the metadata servers, and maybe the chunkservers, aren't properly flushing data to disk. That's good for performance, but bad for data integrity.
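For reference, this is roughly what "properly flushing" means for a file that has to survive a hard reset: fsync the data, atomically rename it into place, then fsync the directory. It's a generic POSIX pattern sketched in Python, not how LizardFS actually writes its metadata, and `durable_write` is a hypothetical helper name.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Replace path with data so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a partial write
            view = view[written:]
        os.fsync(fd)  # force the file contents to stable storage
    finally:
        os.close(fd)

    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems

    # The rename itself lives in the directory, so flush that too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping either fsync is faster, which is the trade-off described above: the write returns sooner, but a power cut can leave a truncated or missing file.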



my keepalived conf file:

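global_defs {
    notification_email {
        admin@softlog.com.au
    }
    notification_email_from lb-alert@brian.softlog
    smtp_server smtp.emailsrvr.com
    smtp_connect_timeout 30
}

vrrp_instance VI_1 {
    # Note: the keepalived docs say nopreempt only takes effect
    # when the initial state is BACKUP.
    state MASTER
    interface bond0
    virtual_router_id 51
    priority 60
    nopreempt
    smtp_alert
    advert_int 1
    # Floating addresses that follow the active mfsmaster
    virtual_ipaddress {
        10.10.10.249/24
        192.168.5.249/24
    }
    # Called with <TYPE> <NAME> <STATE> on every state change
    notify "/etc/mfs/keepalived_notify.sh"
}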




the script file it calls

#!/bin/bash
# keepalived notify script, called as: <TYPE> <NAME> <STATE>

TYPE=$1
NAME=$2
STATE=$3

logger -t lizardha -s "Notify args = $*"

# Restart mfsmaster with whatever config /etc/mfs/mfsmaster.cfg points at,
# cleaning up after an unclean shutdown first.
restart_master_server() {
    logger -t lizardha -s "Stopping lizardfs-master service"
    systemctl stop lizardfs-master.service
    if [ -f /var/lib/mfs/metadata.mfs.lock ]; then
        logger -t lizardha -s "Lock file found, assuming bad shutdown"

        logger -t lizardha -s "killing all mfsmaster"
        killall -9 mfsmaster

        logger -t lizardha -s "Removing lock file"
        rm -f /var/lib/mfs/metadata.mfs.lock

        logger -t lizardha -s "Running mfsmetarestore -a"
        if ! /usr/sbin/mfsmetarestore -a; then
            logger -t lizardha -s "mfsmetarestore operation FAILED, check logs."
        fi
    fi
    logger -t lizardha -s "Starting lizardfs-master service"
    systemctl start lizardfs-master.service
    systemctl restart lizardfs-cgiserv.service
    logger -t lizardha -s "done."
}

# MASTER gets the master config; every other state falls back to shadow.
case "$STATE" in
    "MASTER")
        logger -t lizardha -s "MASTER state"
        ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    "BACKUP")
        logger -t lizardha -s "BACKUP state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    "STOP")
        logger -t lizardha -s "STOP state"
        # Do nothing for now
        # ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        # systemctl stop lizardfs-master.service
        exit 0
        ;;
    "FAULT")
        logger -t lizardha -s "FAULT state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    *)
        logger -t lizardha -s "unknown state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 1
        ;;
esac
 
What I didn't mention on the list, because it didn't seem politic :), is that the same cluster was also running a gluster 3.8.7 volume, sharded, replica 3. It endured all the same stress tests and never missed a beat. Its hosted VMs all autostarted and it healed itself within a few minutes after reboots.

Might be time to revisit sheepdog.
 
Hi
Can you link me your thread on lizard mailing list?

I'm still evaluating gluster and lizard and I've seen major drawbacks on both systems

Lizard lacks HA and its master servers seem to be unreliable; gluster is at version 3 but has some data loss bugs that are unacceptable for software at version 3.
They should be unacceptable even for a beta (you can't expand a sharded volume or you corrupt files)
 
Sourceforge unfortunately:

https://sourceforge.net/p/lizardfs/mailman/lizardfs-users/?viewmonth=201612

A lot of discussion is actually via their github issues page:

https://github.com/lizardfs/lizardfs/issues

They have since started their own forums as well:

https://lizardfs.org/forum

Pretty quiet though.

I did make considerable progress in getting a reliable HA solution together using keepalived scripts, it survived hard resets and rebooting primary servers quite well.

Sticking with gluster for now:
  • Gluster's active/active meta server approach is *very* robust. Even with keepalived, shadow servers and meta loggers I'm reluctant to trust a single primary metaserver; failover still feels a bit unreliable.
  • Gluster has much better performance on replica 3 writes and IOPS

However my needs are somewhat specialised - SMB with just three mixed compute/storage servers which is a perfect fit for a gluster rep 3 volume. If I needed to run more nodes and dynamically change my storage then I would be seriously considering LizardFS/MooseFS as they are insanely flexible in that regard and I did find it very reliable when it came to adding disks and changing file goals.

Whereas gluster is very inflexible in node/brick geometries, and quite frankly the thought of adding extra bricks to a node is quite scary.

Before anyone mentions ceph: performance on small clusters is crap, it's a PITA to admin, and it feels like there are a lot of issues with OSDs and monitors dying under pressure in current releases. We don't have unlimited funds to throw at memory, CPU and high-end IBM SSDs.


Apparently the LizardFS team is working on a direct qemu interface which should improve performance a fair bit - no more fuse. But again, very quiet on that front.
 
My biggest concern about LizardFS/MooseFS is this: since the client writes to a single chunkserver and that chunkserver then replicates to the next one, what happens if the "primary" chunkserver fails between the client write and the first replication?

The client will receive an ACK from the chunkserver (because the write was successful), but after that the chunkserver has crashed and the data is still not replicated. This means data loss and no one is notified (no I/O error on the client, because the data was written and the client disconnected).
 
I think the write is not confirmed until all chunks are written; not 100% sure on that, though.

Also with erasure coding the client writes to all chunkservers simultaneously, rather than the chained writes with std replica.
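To make the difference concrete, here's a toy simulation of the two acknowledgement policies being discussed for chained replica writes. It isn't LizardFS or MooseFS code and says nothing about which policy they actually use; `chained_write` and its parameters are invented for illustration.

```python
from typing import List, Optional, Tuple


def chained_write(servers: List[str], crash_after: Optional[int],
                  ack_after_all: bool) -> Tuple[bool, List[str]]:
    """Simulate one write passed down a chain of chunkservers.

    crash_after is the index of a server that dies right after storing the
    data but before forwarding it (None means nobody crashes).
    Returns (client_got_ack, surviving servers that hold the data).
    """
    if not servers:
        raise ValueError("need at least one chunkserver")
    holders: List[str] = []
    acked = False
    for index, server in enumerate(servers):
        holders.append(server)
        if index == 0 and not ack_after_all:
            acked = True  # early ACK: client is told "done" after the first copy
        if crash_after == index:
            holders.remove(server)  # the crashed server's copy is gone with it
            break
    else:
        acked = True  # the whole chain stored the data
    return acked, holders


if __name__ == "__main__":
    chain = ["cs1", "cs2", "cs3"]
    print(chained_write(chain, crash_after=0, ack_after_all=False))
    print(chained_write(chain, crash_after=0, ack_after_all=True))
```

With an early ACK the first call reports success even though no surviving server holds the data, which is the silent-loss case raised above. If the ACK is only sent once every copy is written, the same crash leaves the client without an ACK, so it sees an error and can retry.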
 
I don't like EC
Too complicated in case of disaster recovery. You have to recreate a file from shards and from Erasure codes
 
I think the write is not confirmed until all chunks are written; not 100% sure on that, though.

Also with erasure coding the client writes to all chunkservers simultaneously, rather than the chained writes with std replica.

and what would happen in case of a catastrophic failure of the master during a write?

the shadow may be a little out of date, and this leads to data loss.
 
the shadow may be a little out of date, and this leads to data loss.


... but not so much data. If I understand the documentation correctly, there will be no data loss: when a client needs to write something, it receives the list of chunkservers and inodes where the data will be sent. So once the client has received this info, no data is lost if the master goes down.
Only new write tasks will be delayed until a shadow is promoted to master.
Like I said, this is only speculation on my part, but this week I will test this scenario.

For other reasons as well, I think lizardfs is an interesting option. I have started to test lizardfs, and at first look it is a very nice tool for me.
If there are others who are interested in this subject, please let me know, so I can share my own experiences with it.
 
Sorry to resurrect such an old thread, but I'm just now getting a handful of distributed filesystems set up to test and lizardfs is in the mix. The official 3.13 RC1 packages are a little broken, so I spent some time yesterday merging fix PRs and building unofficial Debian packages.

If anyone is interested in re-visiting lizardfs now that proper HA is available I've pushed the .debs to github under foundObjects/lizardfs.

Apparently, based on what I'm reading, their HA system works fairly well and filesystem performance is excellent. I'm just getting everything configured now, it'll be a while before I can make any comparisons vs gluster, ceph or moosefs.
 