Question about LVM and DRBD implementation

mario

Hi everybody,
I'm new to Proxmox. I would like to create a cluster with two nodes and HA features.
I have two HP DL140 servers, each with one RAID 1 array, and I have installed Proxmox on both nodes. I have read all the wikis about DRBD, HA, fencing and the Two-Node High Availability Cluster.
I also found this how-to: http://www.nedproductions.biz/wiki/c...ntranet-part-3.
What I don't understand: should I prepare DRBD on a raw partition and then create a VG on the DRBD volume, or should I first create an LV, then create the DRBD device on it, and after that create a VG on the DRBD device?
I will be grateful for any help.
 
I suggest using a raw disk or a partition for the DRBD storage.

I believe it is possible to put DRBD on top of LVM and then put LVM on top of the DRBD device again.
But doing so is likely complicated (I assume it requires more filters in lvm.conf), and then you have two layers of LVM, which will impact performance.

I have set up all of our DRBD nodes very similarly to what the Proxmox wiki describes, including using two DRBD volumes to make split-brain recovery simpler.
http://pve.proxmox.com/wiki/DRBD
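
Roughly, the layering on each node then ends up like this (a minimal sketch; the resource name r0, the VG name drbd0vg and the backing partition are placeholders, adjust them to your setup):
Code:
        # create DRBD metadata and bring the resource up on both nodes
        # (the backing partition is whatever you defined in the resource file)
        drbdadm create-md r0
        drbdadm up r0
        # after the initial sync, on one node only: put LVM directly on the DRBD device
        pvcreate /dev/drbd0
        vgcreate drbd0vg /dev/drbd0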

There are bugs in the DRBD module of the Proxmox kernel; if you want to avoid those bugs, please read this thread:
http://forum.proxmox.com/threads/9376-Bug-in-DRBD-causes-split-brain-already-patched-by-DRBD-devs
 
e100, thanks very much for the answer.
Now I know what to do. I will reduce the size of the pve VG after the standard Proxmox installation and prepare two additional partitions, then set up one DRBD resource on each partition (one for the first node and one for the second node).
After that I will create a new VG on each DRBD device.
If I understood right, after that I can add these VGs in the web GUI.
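For reference, if I add the VG from the GUI as shared LVM storage, I expect it to end up as an entry roughly like this in /etc/pve/storage.cfg (just a sketch; the storage ID drbd0-vg and VG name drbd0vg are placeholders):
Code:
        lvm: drbd0-vg
                vgname drbd0vg
                shared 1
                content images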
How much space do I need to leave in the VG, when creating a Windows VM with a raw disk on an LV, to have snapshot functionality?
 
The default snapshot size is 1GB per VM disk.

The snapshot needs to be larger than the amount of data that might change in the VM during the backup.

You can set the snapshot size in /etc/vzdump.conf or in the GUI on the storage.
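
For example, a 30GB snapshot size would be an entry like this (a sketch; the 'size' value is in MB):
Code:
        # /etc/vzdump.conf (excerpt) - LVM snapshot size in MB
        size: 30720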

Once you have decided on a snapshot size, you can then determine how much free space you need.

I use 20 to 30GB for the snapshot size.
At 30GB, if my VM has 3 disks, I need 90GB of free space.
 
Hi Mario,
this depends on how many writes happen inside the VM during backup time. Normally 4GB of free space is enough, but if you want to be on the safe side, you should perhaps leave 8GB free.

Udo
 
So I started setting up the cluster with two nodes.

<?xml version="1.0"?>
<cluster name="zeus-cluster" config_version="6">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1" expected_votes="1">
  </cman>
  <clusternodes>
    <clusternode name="zeus11" votes="1" nodeid="1"/>
    <clusternode name="zeus12" votes="1" nodeid="2"/>
  </clusternodes>
</cluster>
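
For reference, the cluster itself was created with the usual pvecm commands, roughly like this (a sketch; 192.168.1.11 is the first node's address, as shown in the output further down):
Code:
        # on the first node (zeus11)
        pvecm create zeus-cluster
        # on the second node (zeus12), pointing at the first node
        pvecm add 192.168.1.11
        # check membership on either node
        pvecm status
        pvecm nodes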
Temporarily without fencing.
I did everything like in the wiki, but now the cluster is not working. I mean, it works when I kick cman on the second node, but only for a few seconds.
In those 5 seconds I see on the first node:
Node Sts Inc Joined Name
1 M 104 2012-05-21 01:51:51 zeus11
2 M 212 2012-05-21 03:11:06 zeus12.iwt.local

and on the second node:
Node Sts Inc Joined Name
1 X 0 zeus11
2 M 216 2012-05-21 03:13:31 zeus12

After 5 seconds I have on the first node:
Node Sts Inc Joined Name
1 M 104 2012-05-21 01:51:51 zeus11
2 X 220 Node2

and on the second node:
cman_tool: Cannot open connection to cman, is it running ?

In the web GUI on the first node I see only the first node; in the web GUI on the second node I see two nodes, but the first one has a red mark.
Do you have any idea what to do or what to check?
 
I mean, I added the second node to the cluster and it looks like this:

root@zeus12:~# pvecm add 192.168.1.11
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]
waiting for quorum...OK
generating node certificates
merge known_hosts file
restart services
Restarting PVE Daemon: pvedaemon.
Restarting web server: apache2 ... waiting .
successfully added node 'zeus12' to cluster.
root@zeus12:~#

and then it works for 5-7 seconds and stops; then, after starting it again:

root@zeus12:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]
root@zeus12:~# pvecm n
Node Sts Inc Joined Name
1 X 0 zeus11
2 M 368 2012-05-21 09:22:47 zeus12
root@zeus12:~# pvecm status
Version: 6.2.0
Config Version: 10
Cluster Name: zeus-cluster
Cluster Id: 2996
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags: 2node Error
Ports Bound: 0
Node name: zeus12
Node ID: 2
Multicast addresses: 239.192.11.191
Node addresses: 192.168.1.12
 
Oof... Yesterday I made some mistakes in the cluster config (I was restarting the first node after adding two_node="1" to cluster.conf) and I lost a lot of time.
Today I reinstalled PVE on both nodes, and everything works fine.
I'm just starting with PVE, so do you have any advice for working with a two-node cluster + DRBD + HA?
 
OK. The cluster is working and fencing is working, but I have something strange with DRBD and the syncer speed.
There are 2 x DL140; DRBD has its own dedicated peer-to-peer 1Gb/s network. The disks are SATA in RAID 1.

version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root@nodeA, 2012-05-23 02:16:31
0: cs:SyncSource ro:primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:4346680 nr:0 dw:2953088 dr:1393820 al:722 bm:85 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:416631316
[>....................] sync'ed: 0.4% (406864/408224)M
finish: 83:17:28 speed: 1,380 (1,272) K/sec

Here is my config for r0:
resource r0 {
        protocol C;
        startup {
                wfc-timeout 15; # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "password";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on zeus11 {
                device /dev/drbd0;
                disk /dev/cciss/c0d0p3;
                address 10.10.10.11:7788;
                meta-disk internal;
        }
        on zeus12 {
                device /dev/drbd0;
                disk /dev/cciss/c0d0p3;
                address 10.10.10.12:7788;
                meta-disk internal;
        }
}

Do you have any advice on what to check?
 
Hi Mario,
you should first test the network throughput between the hosts with iperf.
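For example (a quick sketch, using the DRBD link addresses from your config):
Code:
        # on zeus11: start the iperf server
        iperf -s
        # on zeus12: run the client over the dedicated DRBD link
        iperf -c 10.10.10.11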

What does your syncer section in global_common.conf look like? I have these values for a 10Gb link:
Code:
        syncer {
                rate 150000;
                verify-alg sha1;
        }
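For a dedicated 1Gb link the same section would simply use a lower rate, for example (a sketch; the exact value depends on how much bandwidth you want the resync to take):
Code:
        syncer {
                rate 100M;      # roughly the usable maximum of a 1Gb/s link
                verify-alg sha1;
        }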
If you use a crossover cable between the hosts you can also disable the encryption... but your bad sync rates must have another source...

Udo
 
Udo, thanks for the advice.
I did some tests and reconfigured my global, r0 and r1 resource configs; now it's better, but still slow.
I don't know what speed I should expect from a direct 1Gb/s link and RAID 1 on SATA with the HP E200 controller in a DL140.
Below are my test results for the link and the sync, and my configs.

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.10.10.11 port 5001 connected with 10.10.10.12 port 57081
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec

version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root@zeus11, 2012-05-23 02:16:31
0: cs:SyncSource ro:primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:1116504 nr:0 dw:0 dr:1124552 al:0 bm:66 lo:3 pe:79 ua:64 ap:0 ep:1 wo:b oos:418313876
[>....................] sync'ed: 0.3% (408508/409588)M
finish: 11:22:56 speed: 10,192 (10,740) K/sec
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:260083864

global {
        usage-count yes;
        # minor-count dialog-refresh disable-ip-verification
}
common {
        protocol C;
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }
        startup {
                wfc-timeout 15; # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configurati$
                degr-wfc-timeout 60;
                become-primary-on both;
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        }
        disk {
                # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
                # no-disk-drain no-md-flushes max-bio-bvecs
        }
        net {
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
                # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
                # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
        }
        syncer {
                rate 120M;
                # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        }
}

resource r0 {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        meta-disk internal;
        on zeus11 {
                address 10.10.10.11:7788;
        }
        on zeus12 {
                address 10.10.10.12:7788;
        }
}

resource r1 {
        device /dev/drbd1;
        disk /dev/cciss/c0d0p4;
        meta-disk internal;
        on zeus11 {
                address 10.10.10.11:7789;
        }
        on zeus12 {
                address 10.10.10.12:7789;
        }
}

Maybe I have something else wrong in my configuration?
 
Hi, perhaps I can tell you what is wrong ;)

My experience tells me never to use an HP RAID controller. I had 2 HP servers and changed the internal "cciss" RAID controller to an Adaptec with a BBU, and everything was blazing fast.

Just my 2 cents.
 
Macday, you're probably right, but right now I have no possibility to change the RAID controllers.
I read a little and tried turning on the cache for the physical drives in the E200 controller.
Now it is much better. The transfer rate has risen from 10MB/s to 60MB/s - that is close to the actual read/write speed of a SATA drive in RAID 1.
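In case it is useful for someone else: on these Smart Array controllers the physical drive write cache can be toggled from the OS with hpacucli, something like this (the slot number is an assumption, check it with the first command):
Code:
        # list the controllers and their slot numbers
        hpacucli ctrl all show
        # enable the physical drive write cache (slot number is an example)
        hpacucli ctrl slot=0 modify drivewritecache=enable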

GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root@zeus12, 2012-05-23 02:13:11
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:419420308 dw:419420308 dr:0 al:0 bm:25596 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:SyncSource ro:primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:262304 nr:0 dw:0 dr:270664 al:0 bm:15 lo:1 pe:81 ua:64 ap:0 ep:1 wo:b oos:259831832
[>....................] sync'ed: 0.1% (253740/253988)M
finish: 1:08:37 speed: 63,008 (63,008) K/s

I would be grateful for some additional advice about global_common.conf for DRBD (mine is above) in a cluster with 2 nodes.
 
Thanks everybody for the help :).
Now it will be a test environment and we will see.
 
