I've done a lot more testing since then, and it hasn't worked out so well. Everything is hunky-dory until you actually introduce some faults, and it turns out the metadata servers are really flaky: I can reliably corrupt every running VM by power-cycling the master metadata server.
Copy-paste from my post to the LizardFS mailing list:
**********************************************************************************
I've just finished up a week of stress-testing LizardFS for HA purposes with a variety of setups, and it hasn't fared too well. In short, under load I see massive data loss when hard resets are simulated. The results are detailed below, along with my HA scripts.
I'm not trying to knock LizardFS - I find it very powerful and useful and would welcome any suggestions for improvements. I really want to make it work for us.
Hardware:
3 Compute nodes, each with
- 3 TB WD Red x 4 in ZFS RAID 10
- dedicated SSD log (SLOG) device
- 64 GB RAM
- 1 GbE x 2 bond (balance-rr) dedicated to LizardFS
- 1 GbE public link
I've set up a floating IP for mfsmaster, handled via keepalived with scripts to manage master promotion/demotion. keepalived juggles the IP well, passing it between nodes as needed and running the promote/demote script.
Chunkservers function well: taking them up and down and adding or removing disks on the fly is not an issue (a rough sketch of the procedure is below). Quite impressive and very nice.
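For reference, this is roughly how I take a disk in and out of a live chunkserver. The paths are just examples, and I'm going from memory on the mfshdd.cfg convention (a leading '*' marks a disk for removal), so treat it as a sketch rather than gospel:
# Add a new disk by listing its mount point in mfshdd.cfg (example path).
echo "/mnt/chunks/disk5" >> /etc/mfs/mfshdd.cfg
# Mark an existing disk for removal with a leading '*' so its chunks get replicated away.
sed -i 's|^/mnt/chunks/disk2$|*/mnt/chunks/disk2|' /etc/mfs/mfshdd.cfg
# Tell the chunkserver to pick up the change without taking it down.
mfschunkserver reload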
OTOH, the metadata servers seem quite fragile. Several times I've observed chunks go missing when a master is downed and a shadow takes over; high I/O load seems to be the biggest predictor of this.
They also regularly fail to start after a node reset, even after an "mfsmetarestore -a"; I regularly see this error:
mfsmaster -d
[ OK ] configuration file /etc/mfs/mfsmaster.cfg loaded
[ OK ] changed working directory to: /var/lib/mfs
[ OK ] lockfile /var/lib/mfs/.mfsmaster.lock created and locked
[ OK ] initialized sessions from file /var/lib/mfs/sessions.mfs
[ OK ] initialized exports from file /etc/mfs/mfsexports.cfg
[ OK ] initialized topology from file /etc/mfs/mfstopology.cfg
[WARN] goal configuration file /etc/mfs/mfsgoals.cfg not found - using default goals; if you don't want to define custom goals create an empty file /etc/mfs/mfsgoals.cfg to disable this warning
[ OK ] loaded charts data file from /var/lib/mfs/stats.mfs
[....] connecting to Master
[ OK ] master <-> metaloggers module: listen on *:9419
[ OK ] master <-> chunkservers module: listen on *:9420
[ OK ] master <-> tapeservers module: listen on (*:9424)
[ OK ] main master server module: listen on *:9421
[ OK ] open files limit: 10000
[ OK ] mfsmaster daemon initialized properly
mfsmaster[6453]: connected to Master
mfsmaster[6453]: metadata downloaded 364545B/0.003749s (97.238 MB/s)
mfsmaster[6453]: changelog.mfs.1 downloaded 1143154B/0.024354s (46.939 MB/s)
mfsmaster[6453]: changelog.mfs.2 downloaded 0B/0.000001s (0.000 MB/s)
mfsmaster[6453]: sessions downloaded 2762B/0.000365s (7.567 MB/s)
mfsmaster[6453]: opened metadata file /var/lib/mfs/metadata.mfs
mfsmaster[6453]: loading objects (files,directories,etc.) from the metadata file
mfsmaster[6453]: loading names from the metadata file
mfsmaster[6453]: loading deletion timestamps from the metadata file
mfsmaster[6453]: loading extra attributes (xattr) from the metadata file
mfsmaster[6453]: loading access control lists from the metadata file
mfsmaster[6453]: loading quota entries from the metadata file
mfsmaster[6453]: loading file locks from the metadata file
mfsmaster[6453]: loading chunks data from the metadata file
mfsmaster[6453]: checking filesystem consistency of the metadata file
mfsmaster[6453]: connecting files and chunks
mfsmaster[6453]: calculating checksum of the metadata
mfsmaster[6453]: metadata file /var/lib/mfs/metadata.mfs read (26 inodes including 13 directory inodes and 13 file inodes, 10548 chunks)
mfsmaster[6453]: running in shadow mode - applying changelogs from /var/lib/mfs
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoull
Aborted
Oddly, the only thing that seems to stop this from happening is restarting the *master* server the shadow is connecting to.
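The stoull abort looks like the shadow is choking on a malformed number while replaying a changelog. Assuming the changelogs use the usual "<version>: <data>" line format, something like this should flag the offending entries (just a sketch, adjust the path to your DATA_PATH):
# Print any changelog line (with file name and line number) that doesn't start
# with a numeric version followed by ": ".
awk '!/^[0-9]+: / { print FILENAME ":" FNR ": " $0 }' /var/lib/mfs/changelog.mfs*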
The worst case is simulating a catastrophic power failure, which I did by invoking a hard reset on each server ("echo b > /proc/sysrq-trigger"). I did this with a fully healed LizardFS system: 1 master and two shadows, with 3 VMs running on each node - a light load for our system.
When the cluster came back up, one shadow master failed to start and 683 chunks were missing, spanning every VM that had been running.
As it stands, that's unusable for a production system, from our perspective anyway. We can't be manually hand-holding services and restoring backups every time a node throws a wobbly - and that happens even in the best of data centres, let alone at an understaffed SMB like ourselves.
As far as I can tell, the metadata servers, and maybe the chunkservers, aren't properly flushing data to disk. That's good for performance, but bad for data integrity.
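For reference, the resets were driven with something along these lines (hostnames are placeholders for our three nodes; the ssh sessions obviously die as soon as the trigger fires):
# Hard-reset all nodes at once - no sync, no unmount - as a stand-in for pulling the power.
for node in node1 node2 node3; do
    ssh root@"$node" 'echo b > /proc/sysrq-trigger' &
done
wait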
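One way to sanity-check that theory - a sketch only, assuming strace is installed and mfsmaster is running - is to watch whether the master actually issues fsync/fdatasync while changelog entries are being written:
# Count sync-related syscalls made by the running master; interrupt with Ctrl-C
# to get the summary. Few or no fsync/fdatasync calls under heavy writes would
# back up the "not flushing" theory.
strace -f -c -e trace=fsync,fdatasync,sync_file_range -p "$(pidof mfsmaster)"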
My keepalived conf file:
global_defs {
    notification_email {
        admin@softlog.com.au
    }
    notification_email_from lb-alert@brian.softlog
    smtp_server smtp.emailsrvr.com
    smtp_connect_timeout 30
}

vrrp_instance VI_1 {
    state MASTER
    interface bond0
    virtual_router_id 51
    priority 60
    nopreempt
    smtp_alert
    advert_int 1
    virtual_ipaddress {
        10.10.10.249/24
        192.168.5.249/24
    }
    notify "/etc/mfs/keepalived_notify.sh"
}
The notify script it calls:
#!/bin/bash
# keepalived notify script: called with $1 = type (GROUP|INSTANCE), $2 = instance name, $3 = new state.
TYPE=$1
NAME=$2
STATE=$3

logger -t lizardha -s "Notify args = $*"

# Stop the master, recover from an unclean shutdown if a stale lock file is
# present, then start it again under whichever config mfsmaster.cfg points to.
function restart_master_server() {
    logger -t lizardha -s "Stopping lizardfs-master service"
    systemctl stop lizardfs-master.service
    if [ -f /var/lib/mfs/metadata.mfs.lock ]; then
        logger -t lizardha -s "Lock file found, assuming bad shutdown"
        logger -t lizardha -s "killing all mfsmaster"
        killall -9 mfsmaster
        logger -t lizardha -s "Removing lock file"
        rm /var/lib/mfs/metadata.mfs.lock
        logger -t lizardha -s "Running mfsmetarestore -a"
        /usr/sbin/mfsmetarestore -a
        if [ $? -ne 0 ]; then
            logger -t lizardha -s "mfsmetarestore operation FAILED, check logs."
        fi
    fi
    logger -t lizardha -s "Starting lizardfs-master service"
    systemctl start lizardfs-master.service
    systemctl restart lizardfs-cgiserv.service
    logger -t lizardha -s "done."
}

# Point mfsmaster.cfg at the master or shadow config depending on the new
# keepalived state, then restart the service to apply it.
case $STATE in
    "MASTER")
        logger -t lizardha -s "MASTER state"
        ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    "BACKUP")
        logger -t lizardha -s "BACKUP state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    "STOP")
        logger -t lizardha -s "STOP state"
        # Do nothing for now
        # ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        # systemctl stop lizardfs-master.service
        exit 0
        ;;
    "FAULT")
        logger -t lizardha -s "FAULT state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 0
        ;;
    *)
        logger -t lizardha -s "unknown state"
        ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
        restart_master_server
        exit 1
        ;;
esac
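keepalived passes the notify script three arguments (type, instance name, new state), so the promote/demote path can be exercised by hand without actually failing the VIP over - handy for testing, though bear in mind it really does restart the master service. For example:
# Simulate a promotion and then a demotion on this node, then check the tagged log output.
/etc/mfs/keepalived_notify.sh INSTANCE VI_1 MASTER
/etc/mfs/keepalived_notify.sh INSTANCE VI_1 BACKUP
journalctl -t lizardha -n 20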