LizardFS anyone?

Discussion in 'Proxmox VE: Installation and configuration' started by casalicomputers, Apr 27, 2015.

  1. casalicomputers

    Joined:
    Mar 14, 2015
    Messages:
    63
    Likes Received:
    0
    Hello everybody,
    has anyone already tried LizardFS (http://www.lizardfs.com) as an alternative distributed filesystem?
    It looks promising and, at first glance, less complex than ceph, since it seems it can be deployed on just two nodes.

    I'm just wondering if it could be coupled with ZFS ..... :rolleyes:

    Any feedback to share?
     
  2. casalicomputers

    Joined:
    Mar 14, 2015
    Messages:
    63
    Likes Received:
    0
    Really no one? :(
    AFAIK, that is a fork of MooseFS...

    Before I start playing with it, I'd like to know whether anyone has used it with Proxmox (or anything else, really) and how it performs.
     
  3. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    Just looking into it myself now, so can't answer your question :)

    Did you give it a try?
     
  4. Erk

    Erk Member

    Joined:
    Dec 11, 2009
    Messages:
    149
    Likes Received:
    3
    LizardFS gets positive reviews for performance, features, and stability; however, I haven't been able to find much documentation on it yet. LizardFS is based on MooseFS, which went open source back in May 2008, so I believe the MooseFS documentation is applicable. I don't know the reason why LizardFS forked from MooseFS.
     
    #4 Erk, Dec 6, 2015
    Last edited: Dec 6, 2015
  5. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    HA, I think. With MooseFS you don't get failover or backup of the primary metadata server unless you buy a subscription, so if it goes down, you're stuffed.

    I'm a bit concerned about whether I can run the chunk and metadata servers on the Proxmox nodes themselves; the memory/CPU requirements are probably too high.

    The user list seems very quiet too, only a few posts per month - not a good sign.
     
  6. morph027

    morph027 Active Member

    Joined:
    Mar 22, 2013
    Messages:
    424
    Likes Received:
    52
    Hm, interested too. Might set up a lab env and try to crash things ;)
     
  7. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    I finally got round to setting up a test bed.
    3 Debian containers (1 per Proxmox Node)
    - 256GB of storage, backed by ZFS RAID10
    - 2 GB RAM
    - 2 Cores
    - 1 master, 2 shadow servers
    - chunkserver
    - webserver
    - 1GB Ethernet

    Ran the FUSE client on the Proxmox servers (3 nodes).
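
    For anyone wanting to reproduce the layout: the chunkservers just store chunks on whatever local directories you list in /etc/mfs/mfshdd.cfg, so ZFS-backed storage is no problem (which also answers the earlier ZFS question), and the Proxmox nodes mount the filesystem with mfsmount. Roughly what it looks like - paths and the service name here are mine, adjust to taste:

    # on each chunkserver: chunk store directories, one path per line in mfshdd.cfg
    # (on bare metal this could be a dedicated ZFS dataset, e.g.
    #  zfs create -o mountpoint=/srv/lizardfs rpool/lizardfs)
    echo "/srv/lizardfs" >> /etc/mfs/mfshdd.cfg
    systemctl restart lizardfs-chunkserver.service

    # on each Proxmox node: mount via the FUSE client, pointing at the master
    mkdir -p /mnt/lizardfs
    mfsmount /mnt/lizardfs -H mfsmaster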

    Took me several hours to get set up while I got used to the system components; once you get the hang of it, it makes sense. A lot simpler than ceph, a little more complex than gluster. The docs have improved a *lot* since I last looked at them. It would go a lot easier next time.

    Did some initial testing with a couple of VMs; performance was surprisingly good. Writes got 110 MB/s on replica 2 and 80 MB/s on replica 3. Raw reads were ridiculously high (2800+ MB/s) and IOPS were pretty good too.

    Tools for checking the status are quite good - command line and a nice web page showing servers, disks and chunk status. Easy to restart a chunkserver and watch the healing process.
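
    The CLI side is the lizardfs-admin tool (subcommand names from memory, so double-check against the man page) and the web page is the cgiserv that ships with it, by default on port 9425:

    # overall status, connected chunkservers, and chunk/goal health
    lizardfs-admin info mfsmaster 9421
    lizardfs-admin list-chunkservers mfsmaster 9421
    lizardfs-admin chunks-health mfsmaster 9421

    # web UI: http://<master>:9425/mfs.cgi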

    The master metaserver is a single point of failure and it's a real pain to manually promote shadow servers. I set up a custom HA solution as described here:

    https://sourceforge.net/p/lizardfs/mailman/lizardfs-users/thread/118344664.xhctZtnkdd@debstor/#msg33189560

    using ucarp and network up/down scripts.
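
    The gist, for anyone who doesn't want to dig through the list thread: ucarp floats a virtual IP for mfsmaster between the nodes and fires a script whenever it takes or releases the address, and that script swaps the master/shadow config and restarts the service. Something along these lines (addresses, secret and script paths made up for the example):

    # run on every metadata node; ucarp elects one holder of the VIP
    ucarp --interface=eth0 --srcip=10.10.10.11 --vhid=51 --pass=secret \
          --addr=10.10.10.250 \
          --upscript=/etc/mfs/vip-up.sh --downscript=/etc/mfs/vip-down.sh

    # vip-up.sh (ucarp passes the interface as $1): take the VIP, promote to master
    #     ip addr add 10.10.10.250/24 dev "$1"
    #     ... switch mfsmaster.cfg to the master personality, restart lizardfs-master ...
    # vip-down.sh: drop the VIP and demote back to shadow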

    It actually works pretty well: kill or restart the master server, writes pause for a few seconds, and then it transparently fails over to another node. A couple of times I managed to have two master servers active, but only one was being used.


    The goals are amazingly flexible and useful - you can set separate replication levels for individual files and directories, so VMs on the same cluster can have replica 2, 3, 4, etc. or various EC encodings. And you can change them on the fly, *including* changing a standard replica goal to an EC goal. The system just starts copying, encoding and deleting chunks as needed to fit the new requirements. It's fascinating to watch. It auto-balances disks as well. Expanding storage and replacing disks is a doddle, unlike gluster, which can be quite painful.
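
    For reference, goals are just named entries in mfsgoals.cfg on the master, applied per file or directory with the client tool. Something like this - the names are mine and I'd double-check the exact syntax against the docs:

    # /etc/mfs/mfsgoals.cfg on the master:  id  name : definition
    2 two_copies   : _ _
    3 three_copies : _ _ _
    8 ec_3_1       : $ec(3,1)

    # reload the master, then set goals per directory from a client mount
    mfsmaster reload
    lizardfs setgoal -r three_copies /mnt/lizardfs/vms/important
    lizardfs setgoal -r ec_3_1 /mnt/lizardfs/vms/bulk
    lizardfs getgoal -r /mnt/lizardfs/vms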


    Quorum is ... interesting; it will keep writing so long as at least the master metaserver is up. According to the docs, chunks are versioned so that it knows which ones are the latest when chunkservers come back up. I did notice that when I managed to create a block of "missing" chunks - i.e. force writes to one chunkserver only, then take it down - the system blocked further writes to those chunks until that chunkserver was back up, so I was unable to create a split-brain.

    Mildly concerned re this review on MooseFS:

    http://pl.atyp.us/hekafs.org/index.php/2012/11/trying-out-moosefs/

    "The MooseFS numbers are higher than is actually possible on the single GigE that client had. These numbers are supposed to include fsync time, and the GlusterFS numbers reflect that, but the MooseFS numbers keep going as data continues to be buffered in memory. That’s neither correct nor sustainable. The way that MooseFS ignores fsync reminds me of another thing I noticed a while ago: it ignores O_SYNC too. I verified this by looking at the code and seeing where O_SYNC got stripped out, and now my tests show the same effect"

    That was from 2012 though, hopefully lizardfs has improved on that code. Will ask on the userlist.
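
    A quick and dirty way to check whether that's still the case is to compare buffered and synchronous writes over the mount; if the sync run reports rates a single GigE link can't physically sustain, fsync/O_SYNC is probably still being swallowed somewhere:

    # buffered write - expect this to be fast, possibly faster than the wire
    dd if=/dev/zero of=/mnt/lizardfs/ddtest bs=1M count=1024
    # synchronous write - should drop to what the network/disks can actually sustain
    dd if=/dev/zero of=/mnt/lizardfs/ddtest bs=1M count=1024 oflag=sync
    rm /mnt/lizardfs/ddtest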

    I'm certainly intrigued, quite promising I think. Next step for me will be running chunkservers on real hardware and using some VM's on it in anger.
     
    vkhera likes this.
  8. morph027

    morph027 Active Member

    Joined:
    Mar 22, 2013
    Messages:
    424
    Likes Received:
    52
    Thanks for the writeup. This makes me even more curious to set up a test env ;)
     
  9. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    Cool, look fwd to seeing what you think of it.
     
  10. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    I've done a lot more testing since, and it hasn't worked out so well. Everything is hunky dory until you actually introduce some faults, and it turns out the metadata servers are really flaky. I can reliably corrupt every running VM by power-cycling the master metadata server.

    Copy paste from my post to the lizard list:

    **********************************************************************************
    Have just finished up a week of stress-testing LizardFS for HA purposes, with a variety of setups, and it hasn't fared too well. In short, under load I see massive data loss when hard resets are simulated. I detail the results in the following, plus my HA scripts.

    I'm not trying to knock LizardFS - I find it very powerful and useful and would welcome any suggestions for improvements. I really want to make it work for us.



    Hardware:

    3 Compute nodes, each with
    • 3 TB WD Red * 4 in ZFS Raid 10
    • dedicated SSD Log device
    • 64 GB RAM
    • 1 GbE * 2 bond (balance-rr) dedicated to LizardFS
    • 1 GbE public IP



    I've set up a floating IP for mfsmaster that is handled via keepalived, with scripts to handle master promotion/demotion. keepalived juggles the IP well, passing it between nodes as needed and running the promote/demote script.



    Chunkservers function well; taking them up/down and adding/removing disks on the fly is not an issue. Quite impressive and very nice.
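
    For the record, adding or retiring a disk is just an edit to /etc/mfs/mfshdd.cfg followed by a chunkserver reload; a path prefixed with * is marked for removal and the cluster replicates its chunks elsewhere before you pull it. That's the MooseFS-inherited behaviour as I understand it, so verify before trusting it with real data:

    # add a new disk/directory to a chunkserver
    echo "/srv/lizardfs2" >> /etc/mfs/mfshdd.cfg
    mfschunkserver reload

    # mark an existing disk for removal and let it drain
    sed -i 's|^/srv/lizardfs$|*/srv/lizardfs|' /etc/mfs/mfshdd.cfg
    mfschunkserver reload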



    OTOH, the metadata servers seem quite fragile. Several times I've observed chunks go missing when a master is downed and a shadow takes over; a high IO load seems to be the biggest trigger for this.

    They also regularly fail to start after a node reset, even after a "mfsmetarestore -a"; I regularly see this error:



    mfsmaster -d
    [ OK ] configuration file /etc/mfs/mfsmaster.cfg loaded
    [ OK ] changed working directory to: /var/lib/mfs
    [ OK ] lockfile /var/lib/mfs/.mfsmaster.lock created and locked
    [ OK ] initialized sessions from file /var/lib/mfs/sessions.mfs
    [ OK ] initialized exports from file /etc/mfs/mfsexports.cfg
    [ OK ] initialized topology from file /etc/mfs/mfstopology.cfg
    [WARN] goal configuration file /etc/mfs/mfsgoals.cfg not found - using default goals; if you don't want to define custom goals create an empty file /etc/mfs/mfsgoals.cfg to disable this warning
    [ OK ] loaded charts data file from /var/lib/mfs/stats.mfs
    [....] connecting to Master
    [ OK ] master <-> metaloggers module: listen on *:9419
    [ OK ] master <-> chunkservers module: listen on *:9420
    [ OK ] master <-> tapeservers module: listen on (*:9424)
    [ OK ] main master server module: listen on *:9421
    [ OK ] open files limit: 10000
    [ OK ] mfsmaster daemon initialized properly
    mfsmaster[6453]: connected to Master
    mfsmaster[6453]: metadata downloaded 364545B/0.003749s (97.238 MB/s)
    mfsmaster[6453]: changelog.mfs.1 downloaded 1143154B/0.024354s (46.939 MB/s)
    mfsmaster[6453]: changelog.mfs.2 downloaded 0B/0.000001s (0.000 MB/s)
    mfsmaster[6453]: sessions downloaded 2762B/0.000365s (7.567 MB/s)
    mfsmaster[6453]: opened metadata file /var/lib/mfs/metadata.mfs
    mfsmaster[6453]: loading objects (files,directories,etc.) from the metadata file
    mfsmaster[6453]: loading names from the metadata file
    mfsmaster[6453]: loading deletion timestamps from the metadata file
    mfsmaster[6453]: loading extra attributes (xattr) from the metadata file
    mfsmaster[6453]: loading access control lists from the metadata file
    mfsmaster[6453]: loading quota entries from the metadata file
    mfsmaster[6453]: loading file locks from the metadata file
    mfsmaster[6453]: loading chunks data from the metadata file
    mfsmaster[6453]: checking filesystem consistency of the metadata file
    mfsmaster[6453]: connecting files and chunks
    mfsmaster[6453]: calculating checksum of the metadata
    mfsmaster[6453]: metadata file /var/lib/mfs/metadata.mfs read (26 inodes including 13 directory inodes and 13 file inodes, 10548 chunks)
    mfsmaster[6453]: running in shadow mode - applying changelogs from /var/lib/mfs
    terminate called after throwing an instance of 'std::invalid_argument'
    what(): stoull
    Aborted



    Oddly, the only thing that seems to stop it happening is restarting the *master* server it is connecting to.
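
    When -a doesn't cut it, the documented fallback (per the MooseFS docs, so treat the exact flags with some suspicion) is rebuilding the metadata by hand from the last saved image plus the changelogs:

    cd /var/lib/mfs
    mfsmetarestore -m metadata.mfs.back -o metadata.mfs changelog.*.mfs
    systemctl start lizardfs-master.service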



    The worst case is simulating a catastrophic power failure, which I did by invoking a hard reset on each server ("echo b > /proc/sysrq-trigger"). I did this with a LizardFS system fully healed, with 1 master and two shadows. 3 VMs were running on each node - a light load for our system.



    When it came back up one shadow master failed to start and 683 chunks were missing, a range from every VM that was running.

    As it stands, that's unusable for a production system, from our perspective anyway. We can't be manually hand-holding services and restoring backups every time a node throws a wobbly, and that happens even in the best of data centres, let alone an understaffed SMB like ourselves.

    As far as I can tell the metadata servers, and maybe the chunkservers, aren't properly flushing data to disk. That's good for performance, but bad for data integrity.



    my keepalived conf file:

    global_defs {
        notification_email {
            admin@softlog.com.au
        }
        notification_email_from lb-alert@brian.softlog
        smtp_server smtp.emailsrvr.com
        smtp_connect_timeout 30
    }

    vrrp_instance VI_1 {
        state MASTER
        interface bond0
        virtual_router_id 51
        priority 60
        nopreempt
        smtp_alert
        advert_int 1
        virtual_ipaddress {
            10.10.10.249/24
            192.168.5.249/24
        }
        notify "/etc/mfs/keepalived_notify.sh"
    }




    the script file it calls

    #!/bin/bash

    TYPE=$1
    NAME=$2
    STATE=$3

    logger -t lizardha -s "Notify args = $*"

    function restart_master_server() {
        logger -t lizardha -s "Stopping lizardfs-master service"
        systemctl stop lizardfs-master.service
        if [ -f /var/lib/mfs/metadata.mfs.lock ]; then
            logger -t lizardha -s "Lock file found, assuming bad shutdown"

            logger -t lizardha -s "killing all mfsmaster"
            killall -9 mfsmaster

            logger -t lizardha -s "Removing lock file"
            rm /var/lib/mfs/metadata.mfs.lock

            logger -t lizardha -s "Running mfsmetarestore -a"
            /usr/sbin/mfsmetarestore -a
            if [ $? -ne 0 ]; then
                logger -t lizardha -s "mfsmetarestore operation FAILED, check logs."
            fi
        fi
        logger -t lizardha -s "Starting lizardfs-master service"
        systemctl start lizardfs-master.service
        systemctl restart lizardfs-cgiserv.service
        logger -t lizardha -s "done."
    }

    case $STATE in
        "MASTER") logger -t lizardha -s "MASTER state"
            ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg
            restart_master_server
            exit 0
            ;;
        "BACKUP") logger -t lizardha -s "BACKUP state"
            ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
            restart_master_server
            exit 0
            ;;
        "STOP") logger -t lizardha -s "STOP state"
            # Do nothing for now
            # ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
            # systemctl stop lizardfs-master.service
            exit 0
            ;;
        "FAULT") logger -t lizardha -s "FAULT state"
            ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
            restart_master_server
            exit 0
            ;;
        *) logger -t lizardha -s "unknown state"
            ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
            restart_master_server
            exit 1
            ;;
    esac
     
  11. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    What I didn't mention on the list, because it didn't seem politic :), is that the same cluster was also running a gluster 3.8.7 volume, sharded, replica 3. It endured all the same stress tests and never missed a beat. Its hosted VMs all autostarted and it healed itself within a few minutes after reboots.
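
    For anyone curious, a sharded replica 3 gluster volume of that sort is nothing exotic; it's created with something like this (brick paths made up, option names per the 3.8-era docs):

    gluster volume create vmstore replica 3 \
        node1:/tank/bricks/vmstore node2:/tank/bricks/vmstore node3:/tank/bricks/vmstore
    gluster volume set vmstore features.shard on
    gluster volume set vmstore features.shard-block-size 64MB
    gluster volume start vmstore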

    Might be time to revisit sheepdog.
     
  12. Alessandro 123

    Joined:
    May 22, 2016
    Messages:
    594
    Likes Received:
    19
    Hi
    Can you link me to your thread on the lizard mailing list?

    I'm still evaluating gluster and lizard, and I've seen major drawbacks in both systems.

    Lizard lacks HA and its master servers seem to be unreliable; gluster is at version 3 but has data-loss bugs that are unacceptable for software at version 3.
    They would be unacceptable even for a beta (you can't expand a sharded volume without corrupting files).
     
  13. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    Sourceforge unfortunately:

    https://sourceforge.net/p/lizardfs/mailman/lizardfs-users/?viewmonth=201612

    A lot of discussion is actually via their github issues page:

    https://github.com/lizardfs/lizardfs/issues

    They have since started their own forums as well:

    https://lizardfs.org/forum

    Pretty quiet though.

    I did make considerable progress in getting a reliable HA solution together using keepalived scripts; it survived hard resets and rebooting primary servers quite well.

    Sticking with gluster for now:
    • Gluster's active/active metadata approach is *very* robust. Even with keepalived, shadow servers and metaloggers, I'm reluctant to trust a single primary metaserver; failover still feels a bit unreliable.
    • Gluster has much better performance on replica 3 writes and IOPS.

    However, my needs are somewhat specialised - an SMB with just three mixed compute/storage servers, which is a perfect fit for a gluster replica 3 volume. If I needed to run more nodes and dynamically change my storage, then I would seriously consider LizardFS/MooseFS, as they are insanely flexible in that regard, and I did find it very reliable when it came to adding disks and changing file goals.

    Gluster, by contrast, is very inflexible in node/brick geometries, and quite frankly the thought of adding extra bricks to a node is quite scary.

    Before anyone mentions ceph - performance on small clusters is crap, it's a PITA to admin, and it feels like there are a lot of issues with OSDs and monitors dying under pressure in current releases. We don't have unlimited funds to throw at memory, CPU and high-end IBM SSDs.


    Apparently the LizardFS team is working on a direct qemu interface which should improve performance a fair bit - no more fuse. But again, very quiet on that front.
     
  14. Alessandro 123

    Joined:
    May 22, 2016
    Messages:
    594
    Likes Received:
    19
    My biggest concern about LizardFS/MooseFS is this: since the client writes to a single chunkserver and that chunkserver then replicates to the next one, what would happen if the "primary" chunkserver fails between the client write and the first replication?

    The client will receive an ACK from the chunkserver (because the write was successful), but after that the chunkserver crashes and the data has still not been replicated. This means data loss and no one is notified (no I/O error on the client, because the data was written and the client disconnected).
     
  15. blackpaw

    blackpaw Member

    Joined:
    Nov 1, 2013
    Messages:
    230
    Likes Received:
    2
    I think the write is not confirmed until all chunk copies are written; not 100% sure on that though.

    Also, with erasure coding the client writes to all chunkservers simultaneously, rather than the chained writes used with standard replica goals.
     
  16. Alessandro 123

    Joined:
    May 22, 2016
    Messages:
    594
    Likes Received:
    19
    I don't like EC.
    Too complicated in case of disaster recovery: you have to recreate a file from the shards and the erasure codes.
     
  17. Alessandro 123

    Joined:
    May 22, 2016
    Messages:
    594
    Likes Received:
    19
    And what would happen in case of a catastrophic failure of the master during a write?

    The shadow may be slightly outdated, and that leads to data loss.
     
  18. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    929
    Likes Received:
    124

    ... but not that much data. If I understand the documentation correctly, there should be no data loss: when a client needs to write something, it receives all the chunkservers and inodes the data will be sent to. So once the client has received this info, it is not a data loss if the master goes down.
    Only new write tasks will be delayed until a shadow is promoted to master.
    Like I said, this is only speculation on my part, but this week I will test this scenario.

    For other reasons I think that LizardFS is an interesting option. I have started to test LizardFS, and at first look it is a very nice tool for me.
    If there are others who are interested in this subject, please let me know, so I can share my own experiences with it.
     
  19. arglebargle

    arglebargle New Member

    Joined:
    Apr 19, 2019
    Messages:
    2
    Likes Received:
    0
    Sorry to resurrect such an old thread, but I'm just now getting a handful of distributed filesystems set up to test, and LizardFS is in the mix. The official 3.13 RC1 packages are a little broken, so I spent some time yesterday merging fix PRs and building unofficial Debian packages.

    If anyone is interested in revisiting LizardFS now that proper HA is available, I've pushed the .debs to GitHub under foundObjects/lizardfs.

    Apparently, based on what I'm reading, their HA system works fairly well and filesystem performance is excellent. I'm just getting everything configured now; it'll be a while before I can make any comparisons vs gluster, ceph or moosefs.
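
    For anyone who would rather build than trust random .debs off the internet, the recipe is basically: check out the 3.13 RC1 tag, merge the fix PRs you care about, and build the Debian packages from the tree. Roughly (PR numbers deliberately left out, pick the ones you need; this assumes the in-tree debian/ packaging works on your release):

    git clone https://github.com/lizardfs/lizardfs.git && cd lizardfs
    git checkout v3.13.0-rc1          # or whatever the RC1 tag is named
    # GitHub exposes pull requests as refs, so a fix PR can be merged directly
    git fetch origin pull/<PR-number>/head && git merge --no-edit FETCH_HEAD
    # install build deps and build binary packages
    sudo apt-get build-dep .
    dpkg-buildpackage -b -us -uc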
     
    #19 arglebargle, Apr 19, 2019
    Last edited: Apr 19, 2019
  20. arnaudd

    arnaudd New Member

    Joined:
    Aug 4, 2017
    Messages:
    11
    Likes Received:
    0
    Hello, I'd be interested in testing your debs and doing some testing too.
     