Current DRBD model

sel

I've been using a couple of PVE clusters with DRBD between two servers for about a year. My conclusion is that this isn't stable enough for production. From time to time something happens and the mirror breaks. It might be a network problem or something else. I then have to manually resolve the split brain (normally by stopping all virtual servers on one side, backing up, and restoring on the other side).

I would like to suggest that the team (maybe for 2.1) look into adding a model where each logical volume is mirrored individually. This also makes it possible to change which physical server holds the mirror of a given virtual server. This should be something that's done through the web UI, and should not require using a CLI.

A primary/secondary model would also be practical, and even a primary/(secondary, secondary) setup would be possible. This would also make it possible to live-migrate to a server that does not have a mirror of the given volume to begin with.

I realize that this requires a lot of work, but I think it might be worth it.
 
Hi,
I also have two clusters with primary/primary DRBD, and sometimes the connection of a DRBD resource breaks as well (mostly during PVE backups). But I don't think that's a network problem. Because I always use two DRBD resources (one for server A, one for server B), I have no trouble resolving a split-brain condition: if one DRBD resource goes split brain, the other keeps working fine!
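In drbd.conf that layout looks roughly like this (only a sketch; resource names, disks, addresses and hostnames are just examples):

resource r0 {   # backs the VG for VMs that normally run on server A
  protocol C;
  on nodeA { device /dev/drbd0; disk /dev/sdb1; address 10.0.7.1:7788; meta-disk internal; }
  on nodeB { device /dev/drbd0; disk /dev/sdb1; address 10.0.7.2:7788; meta-disk internal; }
}
resource r1 {   # backs the VG for VMs that normally run on server B
  protocol C;
  on nodeA { device /dev/drbd1; disk /dev/sdc1; address 10.0.7.1:7789; meta-disk internal; }
  on nodeB { device /dev/drbd1; disk /dev/sdc1; address 10.0.7.2:7789; meta-disk internal; }
}
# one LVM volume group per resource:
#   vgcreate drbd0-vg /dev/drbd0   (only used by VMs running on server A)
#   vgcreate drbd1-vg /dev/drbd1   (only used by VMs running on server B)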
I also think an active/passive configuration is much more reliable, but not very flexible, because you can't use it for online migration (except when only one VM is on the storage). In my eyes it's only usable as a "hot cold-spare" in case one server dies.

Udo
 
Hi,
I also have two clusters with primary/primary DRBD, and sometimes the connection of a DRBD resource breaks as well (mostly during PVE backups). But I don't think that's a network problem. Because I always use two DRBD resources (one for server A, one for server B), I have no trouble resolving a split-brain condition: if one DRBD resource goes split brain, the other keeps working fine!

I wish I had done the same, but for now I'm stuck with a common DRBD volume for both servers.

I also think an active/passive configuration is much more reliable, but not very flexible, because you can't use it for online migration (except when only one VM is on the storage). In my eyes it's only usable as a "hot cold-spare" in case one server dies.

Udo

Well, that's part of my point: by putting the DRBD layer on top of an LVM volume, rather than having the DRBD device as a PV in a volume group, you would have much more flexibility than with the current model. It would also fit much better with the new cluster model.
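Per VM disk that could look something like this (just a sketch; VG, LV, resource and host names are made up):

# one backing LV per virtual disk, created identically on both nodes
lvcreate -L 32G -n vm-101-disk-1 local-vg

# one small DRBD resource on top of that LV (drbd.conf excerpt)
resource r101 {   # backs the disk of VM 101
  protocol C;
  on nodeA { device /dev/drbd101; disk /dev/local-vg/vm-101-disk-1; address 10.0.7.1:7801; meta-disk internal; }
  on nodeB { device /dev/drbd101; disk /dev/local-vg/vm-101-disk-1; address 10.0.7.2:7801; meta-disk internal; }
}
# the VM uses /dev/drbd101 directly; moving the mirror to another node
# only means recreating this one small resource there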
 
I have 12 Proxmox 1.9 servers set up as 6 clusters, each using two DRBD volumes in production.
The oldest pair are about 2 years old.
I also have two more running 2.0; I think LVM handling is better in 2.0 and we will see fewer split brains as a result.

A few split brains have happened.
Most happen when I am mucking with things; humans and their mistakes.

A few happened during vzdump sessions and kernel panics (twice from bad RAM).

It is so infrequent that I have yet to memorize the commands.
So I pull up the DRBD manual and follow the manual split-brain recovery section.
No worries, nothing to be scared of; it's just a normal thing when using DRBD.
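For reference, that procedure boils down to a few commands (8.3 syntax; "r0" stands for whatever resource went split brain, and the victim is the node whose changes you are willing to throw away):

# on the split-brain victim (its changes will be discarded):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# on the surviving node, only if it is in StandAlone state:
drbdadm connect r0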

You do not need to be stuck with a single DRBD volume.
It is risky to reconfigure production systems, so maybe the risk is too much.
If you want to change, I will post an untested, use-at-your-own-risk procedure.
The ability to recover from split brain by running a few commands might be worth the effort.
 
...
You do not need to be stuck with a single DRBD volume.
It is risky to reconfigure production systems, so maybe the risk is too much.
If you want to change, I will post an untested, use-at-your-own-risk procedure.
The ability to recover from split brain by running a few commands might be worth the effort.
Hi e100,
I see this like you do. Especially DRBD on top of LVM doesn't sound very good: no easy extension (extending the LV yes, but not the DRBD resource), and the LV must be active on both nodes (perhaps also not the best idea?).
But one question about split-brain resolution: when a split-brain condition happens, I was only able to resolve it by invalidating the resource on the server that doesn't use this VG. Due to the size of the DRBD devices this can take a long time (my biggest device is 4.5T, and that's mostly where split brain occurs).
Are you able to resolve the split brain without invalidating?
Then please post the trick.

Udo
 
Hi e100,

Are you able to resolve the split brain without invalidating?
Then please post the trick.

Udo

I invalidate the node that does not have VMs running for that DRBD volume.
It only invalidates the extents that are marked out of sync, not the entire volume.
That is explained here: http://www.drbd.org/users-guide-8.3/s-resolve-split-brain.html
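If you want to see beforehand how much would actually be resynced, something like this should do (assuming the usual 8.3 status interfaces; "r0" is a placeholder):

cat /proc/drbd      # the oos: counter is the amount (in KiB) marked out of sync
drbdadm cstate r0   # connection state (StandAlone/WFConnection after a split brain)
drbdadm dstate r0   # local/peer disk state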

I have my monitoring system (Zabbix) configured to notify us the moment a split-brain occurs.

All of my DRBD volumes are about 1TB.
On my nodes using dual 1 Gig replication I keep the sync speed set to about 30MB/sec, and a typical split brain is fixed in a few minutes.
On my nodes with Infiniband for replication I keep sync speed at 100MB/sec and usually see split-brain fixed in under a minute.
I keep this set low so I do not drastically impact write performance during sync.
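That rate is just the syncer setting in drbd.conf, for example (the numbers are only the ones mentioned above; "r0" is a placeholder):

resource r0 {
  syncer {
    rate 30M;      # ~30MB/sec on the dual 1 Gig replication links
    # rate 100M;   # what I use on the Infiniband nodes
  }
}
# apply the change without downtime:
#   drbdadm adjust r0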

4.5TB is a rather large DRBD volume.
If I needed that much space I might consider using two DRBD volumes and assigning disks to the VM from both, rather than having one giant DRBD volume.
The time needed to do a full resync would be the driving factor on making that decision.
I prefer to keep the full resync to a few hours or less while limited to a speed that does not drastically impact IO performance of the source node.

A few other tips:
1. Never, ever put two DRBD volumes on the same physical mechanical disks; all you do is create unnecessary random IO.
2. Use a Battery Backed Write Cache RAID Card OR put the DRBD metadata on a separate disk, preferably an SSD (a config sketch follows this list).
see: http://www.drbd.org/users-guide-8.3/ch-internals.html#s-metadata
3. Tuning is important: http://www.drbd.org/users-guide-8.3/p-performance.html
4. Tuning related to resync: http://www.drbd.org/users-guide-8.3/s-activity-log.html
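For tip 2, external metadata is just a per-host setting in drbd.conf, roughly like this (disk and host names are placeholders; see the metadata chapter linked above for sizing):

resource r0 {
  on nodeA {
    device    /dev/drbd0;
    disk      /dev/sdb1;        # data on the mechanical array
    meta-disk /dev/sdc1[0];     # metadata on a separate (SSD) partition
    address   10.0.7.1:7788;
  }
  # ... same layout on nodeB ...
}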
 
I invalidate the node that does not have VMs running for that DRBD volume.
It only invalidates the extents that are marked out of sync, not the entire volume.
That is explained here: http://www.drbd.org/users-guide-8.3/s-resolve-split-brain.html
Thanks for that link - I will try it this way the next time a split brain occurs. Hope it helps!
I have my monitoring system (Zabbix) configured to notify us the moment a split-brain occurs.
Here, Icinga does this job.
All of my DRBD volumes are about 1TB.
On my nodes using dual 1 Gig replication I keep the sync speed set to about 30MB/sec, and a typical split brain is fixed in a few minutes.
On my nodes with Infiniband for replication I keep sync speed at 100MB/sec and usually see split-brain fixed in under a minute.
I keep this set low so I do not drastically impact write performance during sync.
I use Dolphin NICs (20Gb) or 10Gb NICs for the connection. My syncer rate is higher (250MB with the Dolphin NICs); perhaps I'll use less if split-brain resolution works without resyncing the full content.
4.5TB is a rather large DRBD volume.
If I needed that much space I might consider using two DRBD volumes and assigning disks to the VM from both, rather than having one giant DRBD volume.
The time needed to do a full resync would be the driving factor on making that decision.
I prefer to keep the full resync to a few hours or less while limited to a speed that does not drastically impact IO performance of the source node.
Yes, 4.5TB is a lot, but it's one big fileserver - I don't know if I'd gain much by using two RAID sets (in my case I have six disks in RAID 10). Fewer disks give less speed.
A few other tips:
1. Never, ever put two DRBD volumes on the same physical mechanical disks; all you do is create unnecessary random IO.
Why not? What is the difference between one big DRBD resource with writes going to two VM disks that lie on different areas of the disk, and two DRBD resources being written to at the same time?
I use one RAID set (with a fast RAID controller + BBU) with two RAID volumes on it - one for server A, the other for server B (like a_sata_r0, b_sata_r1, a_sas_r2 ...). The RAID controller is responsible for the correct write order.
2. Use a Battery Backed Write Cache RAID Card OR put the DRBD metadata on a separate disk, preferably an SSD.
see: http://www.drbd.org/users-guide-8.3/ch-internals.html#s-metadata
I use RAID + BBU and at first wanted to speed up the config by putting the metadata on an SSD, but I read on the Linbit site that this brings no benefit (can't find the URL right now).
Internal metadata is the right choice.
Right - I have tried tuning with this manual. This is also the reason why I don't use plain gigabit NICs with DRBD.

Udo
 
I use Dolphin NICs (20Gb) or 10Gb NICs for the connection. My syncer rate is higher (250MB with the Dolphin NICs); perhaps I'll use less if split-brain resolution works without resyncing the full content.
I chose Infiniband because I could get dual-port 10G PCIe cards for about $30 on eBay, and there are lots of cheap options for switches.
I hope to someday move from DRBD to Ceph or Sheepdog, and a 10G network will be great for that too.


Yes, 4.5TB is a lot, but it's one big fileserver - I don't know if I'd gain much by using two RAID sets (in my case I have six disks in RAID 10). Fewer disks give less speed.
Sounds like you have little choice with so few disks.

Why not? What is the difference between one big DRBD resource with writes going to two VM disks that lie on different areas of the disk, and two DRBD resources being written to at the same time?
I use one RAID set (with a fast RAID controller + BBU) with two RAID volumes on it - one for server A, the other for server B (like a_sata_r0, b_sata_r1, a_sas_r2 ...). The RAID controller is responsible for the correct write order.
I too thought that it would not matter, so I decided to test this myself.
My servers all have 12 disks.
Using 12 disks in RAID 6 with two DRBD volumes was the fastest.
However, it was also the most inconsistent: one VM doing lots of random reading or writing would make all of them suffer.
Doing a resync made all of them suffer, not just the volume being synced.

I switched to two RAID 5 arrays of 6 disks each.
Max read and write are a little less.
But whatever is going on on DRBD0 does not impact DRBD1 and vice versa.
Overall, if you consider the max throughput of DRBD0 and DRBD1 combined, it is a little faster.
Resync of one does not bother the other.
Now I can have random IO intensive tasks on both servers at the same time without one slowing everything down.

Maybe a more accurate #1 tip is:
When possible, avoid putting two DRBD volumes on the same physical mechanical disks, since doing so causes additional random IO.

In your case, with fewer disks, it might not be beneficial.

I use RAID + BBU and at first wanted to speed up the config by putting the metadata on an SSD, but I read on the Linbit site that this brings no benefit (can't find the URL right now).
Internal metadata is the right choice.
Yes, with a BBU, internal is the correct choice.
According to what I have read, external metadata performs better when using mechanical disks without a BBU cache.
 
Thanks for that link - I will try it this way the next time a split brain occurs. Hope it helps!
"drbdadm -- --discard-my-data connect resource" work very well - this weekend it's happens again and to sync the split-brain of the 4.5T-volume takes less than one minute (12 h split-brain)!
Very good...

Udo
 
"drbdadm -- --discard-my-data connect resource" work very well - this weekend it's happens again and to sync the split-brain of the 4.5T-volume takes less than one minute (12 h split-brain)!
Very good...

Udo

...I'm running a similar configuration (2 nodes, 2 DRBD devices) with
after-sb-0pri discard-node-[node-name];
in drbd.conf, pointing at the 'unused' node in each resource. Isn't this the same as above, but "automatic"?
 
...I'm running a similar configuration (2 nodes, 2 DRBD devices) with
after-sb-0pri discard-node-[node-name];
in drbd.conf, pointing at the 'unused' node in each resource. Isn't this the same as above, but "automatic"?
Hi,
I don't know, but it can't run automatically, because you first need to stop using the resource (vgchange -a n drbd-vg).

Udo
 
Hi,
I don't know, but it can't run automatically, because you first need to stop using the resource (vgchange -a n drbd-vg).

Udo

Hi Udo and e100 (the masters of this forum),

I see in this link that DRBD offers options for automatic recovery; I believe the option "Graceful recovery from split brain if one host has had no intermediate changes" would be good:
http://www.drbd.org/users-guide/s-split-brain-notification-and-recovery.html
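If I read the docs right, that maps to the discard-zero-changes policy in the net section, roughly like this (untested on my side, so treat it as a sketch; "r0" is a placeholder):

resource r0 {
  net {
    after-sb-0pri discard-zero-changes;   # if one host had no changes, sync from the other (otherwise no automatic resolution)
    after-sb-1pri discard-secondary;      # one node was primary: the secondary discards its changes
    after-sb-2pri disconnect;             # both primary: no automatic action
  }
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";   # still send a notification
  }
}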

What do you think? ... I have not yet tested it.

Best regards
Cesar