hello
We've set up a pair of Proxmox VE servers as a high-availability system using DRBD and heartbeat. This uses DRBD in Primary/Secondary mode.
We are running 2 OpenVZ containers and 2 KVM (LVM-backed) VMs. We may still have some changes to make, but we have done a lot of failover tests and things are working well. We can unplug the primary server, and within a few minutes the second server becomes primary and brings up our 4 VMs.
We use Supermicro servers with 3ware RAID-10.
See this page, which I got a lot of info from: http://wiki.openvz.org/HA_cluster_with_DRBD_and_Heartbeat
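During the failover tests it's handy to see which node currently holds the DRBD Primary role. A minimal check, assuming the resource is named r2 as in the config below:
Code:
# overall DRBD status, on either node
cat /proc/drbd
# or just the role of our resource; prints e.g. Primary/Secondary
drbdadm role r2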
/etc/drbd.conf
Code:
global { usage-count yes; }
common { syncer { rate 100M; } }

resource r2 {
    protocol C;
    handlers {
        # "pri-on-incon-degr": This handler is called if the node is primary,
        # degraded and if the local copy of the data is inconsistent.
        pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 30;
    }
    disk {
        on-io-error detach;
        fencing resource-only;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "my-secret";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on proxmox1 {
        device /dev/drbd2;
        disk /dev/vg1/data;
        address 10.0.7.19:7790;
        meta-disk internal;
    }
    on proxmox2 {
        device /dev/drbd2;
        disk /dev/vg1/data;
        address 10.0.7.16:7790;
        meta-disk internal;
    }
}
# /etc/init.d/drbd reload # for changes to take effect
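If you are creating the DRBD resource from scratch, the usual DRBD 8.x bring-up looks roughly like this. A sketch only; read the DRBD users guide (linked at the end) before running it, since --overwrite-data-of-peer destroys the peer's copy of the data:
Code:
# on both nodes: write metadata and bring the resource up
drbdadm create-md r2
drbdadm up r2
# on ONE node only: force it Primary and start the initial sync
drbdadm -- --overwrite-data-of-peer primary r2
# watch the sync progress
cat /proc/drbd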
Install heartbeat:
Code:
aptitude install heartbeat
/etc/ha.d/ha.cf
Code:
# /etc/ha.d/ha.cf
# /etc/init.d/heartbeat reload # for changes to take effect
# 2010-03-08 added. see wiki. we had a bug and 80GB of coredumps filled the disk
coredumps false
use_logd on
baud 19200
# Heartbeat cluster members
node proxmox1
node proxmox2
# Heartbeat communication timing
keepalive 1
warntime 10
deadtime 30
initdead 60
# Heartbeat communication paths
udpport 694
ucast eth1 10.0.7.19
ucast eth1 10.0.7.16
ucast eth0 10.100.100.19
ucast eth0 10.100.100.16
# 2010-04-05 commented out as we do not have cable connected:
#serial /dev/ttyS0
# Don't fail back automatically
auto_failback off
# Monitoring of network connection to default gateway
ping 10.100.100.2
respawn hacluster /usr/lib64/heartbeat/ipfail
# /etc/init.d/heartbeat reload # for changes to take effect
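Once heartbeat is running you can sanity-check the cluster with cl_status, which ships with the heartbeat package. A quick sketch:
Code:
cl_status hbstatus              # is heartbeat running on this node?
cl_status listnodes             # list the cluster members
cl_status nodestatus proxmox2   # active or dead?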
/etc/ha.d/haresources
Code:
# /etc/ha.d/haresources (heartbeat resource list, not a script)
proxmox1 fbc6 \
drbddisk::r2 \
Filesystem::/dev/drbd2::/data::ext3 \
fbc6 \
10.100.100.6 \
apache2 \
pvedaemon \
vz \
qm-fbc \
cron \
MailTo::put-your-address-here
# /etc/init.d/heartbeat reload # for changes to take effect
# notes:
# cron added 2010-04-03. pve cron scripts can only run on the Primary.
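Each entry in haresources maps to a script in /etc/ha.d/resource.d/ (or /etc/init.d/), called with the ::-separated fields as arguments plus start or stop. You can test a resource by hand the same way heartbeat calls it, as the Filesystem lines in the daemon.log excerpt further down show. A sketch, using our r2 mount:
Code:
# mount and unmount the DRBD filesystem exactly as heartbeat would
/etc/ha.d/resource.d/Filesystem /dev/drbd2 /data ext3 start
/etc/ha.d/resource.d/Filesystem /dev/drbd2 /data ext3 stop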
/etc/ha.d/resource.d/fbc6
Code:
#!/bin/bash
# /etc/ha.d/resource.d/fbc6
#
# I call this 2x from haresources as heartbeat and vz have
# some kind of confusion with mounts .. see wiki
/fbc/bin/update-rc-fbc6 # remove init.d scripts controlled by heartbeat.
# testing
##echo "testing "| mail -s "$0 $HOSTNAME /etc/ha.d/resource.d/fbc6" rob
# I think the /home and /bkup umounts are caused by drbd and heartbeat not getting along with pve/vz
umount /home
mount /home
# if /bkup is used in any VZs then uncomment:
#umount /bkup
#mount /bkup
exit 0
/etc/ha.d/resource.d/qm-fbc
Code:
#!/bin/bash
# /etc/ha.d/resource.d/qm-fbc
# stop the KVM guests from here; else it takes 180 secs
# using 'qemu-server' in haresources
CMD="$1"
#qm $CMD 105
#qm $CMD 106
case "$CMD" in
start)
/etc/init.d/qemu-server start ;
;;
stop)
qm stop 105 ;
qm stop 106 ;
/etc/init.d/qemu-server stop ;
;;
esac
exit 0
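It's worth timing this script by hand before trusting it in a failover; the VMIDs 105 and 106 are the ones from the script above:
Code:
# stop the KVM guests the way heartbeat would, and time it
time /etc/ha.d/resource.d/qm-fbc stop
/etc/ha.d/resource.d/qm-fbc start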
run this now and from cron hourly:
/fbc/bin/update-rc-fbc6
Code:
#!/bin/bash
# /fbc/bin/update-rc-fbc6 # you may want this in /usr/local/bin.
# *********************************
# this is used by
# /etc/ha.d/resource.d/fbc6
# and later from an hourly cron job (see the cron sketch below)
#
# this is needed as deb updates sometimes change init.d starts
#
# ********************************
if [ "${EUID}" -ne 0 ]; then
echo "$0: must be root."
exit 1
fi
update-rc.d -f pvedaemon remove
update-rc.d -f apache2 remove
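For the hourly cron run mentioned above, a /etc/cron.d entry like this does it (a sketch; the file name is up to you):
Code:
# /etc/cron.d/update-rc-fbc6
5 * * * * root /fbc/bin/update-rc-fbc6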
prepare folders
on both nodes:
Code:
umount /var/lib/vz
mv /var/lib/vz /var/lib/vz.orig
vi /etc/fstab # comment out /var/lib/vz mount:
# this is on drbd, mounted by heartbeat:
#/dev/pve/data /var/lib/vz.orig ext3 defaults 0 1
# only on Primary node:
mount /dev/pve/data /var/lib/vz.orig
/etc and /var/lib changes
Code:
# on both nodes:
####### /etc
mv /etc/vz /etc/vz.orig
ln -s /data/etc/vz /etc/vz
mv /etc/pve /etc/pve.orig
ln -s /data/etc/pve /etc/pve
mv /etc/qemu-server /etc/qemu-server.orig
ln -s /data/etc/qemu-server /etc/qemu-server
######## /var/lib
mv /var/lib/vzquota /var/lib/vzquota.orig
ln -s /data/var/lib/vzquota /var/lib/vzquota
mv /var/lib/vzctl /var/lib/vzctl.orig
ln -s /data/var/lib/vzctl /var/lib/vzctl
mv /var/lib/pve-manager /var/lib/pve-manager.orig
ln -s /data/var/lib/pve-manager /var/lib/pve-manager
copy etc and lib
on Primary do this:
Code:
mount /dev/drbd2 /data
mkdir -p /data/etc /data/var/lib/vz
# /etc
rsync -a /etc/vz.orig/ /data/etc/vz/
rsync -a /etc/pve.orig/ /data/etc/pve/
rsync -a /etc/qemu-server.orig/ /data/etc/qemu-server/
# /var/lib
rsync -a /var/lib/vz.orig/ /data/var/lib/vz/ # may take a while
rsync -a /var/lib/vzquota.orig/ /data/var/lib/vzquota/
rsync -a /var/lib/vzctl.orig/ /data/var/lib/vzctl/
rsync -a /var/lib/pve-manager.orig/ /data/var/lib/pve-manager/
umount /data
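Before starting heartbeat for the first time, you can double-check the copy from the Primary. A quick sketch; /data has to be mounted by hand here since heartbeat isn't managing it yet:
Code:
mount /dev/drbd2 /data
ls /data/etc/vz /data/etc/pve /data/var/lib/vz
umount /data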
/etc/ha.d/authkeys
Code:
auth 1
1 sha1 PutYourSuperSecretKeyHere
Make sure the file is mode 600 (chmod 600 /etc/ha.d/authkeys) or heartbeat will refuse to start.
Finally, you can now start heartbeat on both nodes:
Code:
/etc/init.d/heartbeat start
There are some issues with the heartbeat Filesystem resource and OpenVZ. If you have VMID.mount scripts doing mounts inside containers, those mounts may get unmounted when heartbeat stops. This happens with NFS and local mounts. Check /var/log/daemon.log and look for something like this:
Code:
Apr 3 14:30:24 proxmox2 ResourceManager[14751]: [16812]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 /var/lib/vz ext3 stop
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16854]: INFO: Running stop for /dev/drbd1 on /var/lib/vz
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16864]: INFO: Trying to unmount /var/lib/vz
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16866]: INFO: unmounted /var/lib/vz/root/102/home successfully
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16867]: INFO: Trying to unmount /var/lib/vz
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16869]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGTERM
Apr 3 14:30:24 proxmox2 Filesystem[16824]: [16871]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:25 proxmox2 Filesystem[16824]: [16874]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGTERM
Apr 3 14:30:25 proxmox2 Filesystem[16824]: [16876]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:26 proxmox2 Filesystem[16824]: [16879]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGTERM
Apr 3 14:30:26 proxmox2 Filesystem[16824]: [16881]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:27 proxmox2 Filesystem[16824]: [16884]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGKILL
Apr 3 14:30:27 proxmox2 Filesystem[16824]: [16886]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:28 proxmox2 Filesystem[16824]: [16889]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGKILL
Apr 3 14:30:28 proxmox2 Filesystem[16824]: [16891]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:29 proxmox2 Filesystem[16824]: [16894]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc; trying cleanup with SIGKILL
Apr 3 14:30:29 proxmox2 Filesystem[16824]: [16896]: INFO: No processes on /var/lib/vz/root/102/fbc were signalled
Apr 3 14:30:30 proxmox2 ntpd[16694]: synchronized to 128.113.28.67, stratum 2
Apr 3 14:30:30 proxmox2 ntpd[16694]: kernel time sync status change 0001
Apr 3 14:30:30 proxmox2 Filesystem[16824]: [16898]: ERROR: Couldn't unmount /var/lib/vz/root/102/fbc, giving up!
Apr 3 14:30:30 proxmox2 Filesystem[16824]: [16899]: INFO: Trying to unmount /var/lib/vz
Apr 3 14:30:30 proxmox2 Filesystem[16824]: [16902]: INFO: unmounted /var/lib/vz/root/102/bkup successfully
In our case /home is an NFS mount, so in haresources (via the fbc6 script) we do a umount/mount to get around the problem.
If you've got questions, ask. We've used Debian, DRBD and heartbeat for a while, and I can answer questions about our setup.
See http://www.drbd.org/users-guide/ for great info on DRBD.