Bug in DRBD causes split-brain, already patched by DRBD devs

e100

I've seen this issue a few times on 1.X and for the first time on 2.0 today.

It usually happens during snapshot backups, I assume because of the increased IO more than anything.
The telltale error appears on the node that is not doing the backup.

Code:
block drbd0: magic?? on data m: 0x3eabc3a5 c: 512 l: 97
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
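
For anyone wanting to check their own nodes, here is a minimal sketch that filters a kernel log for the two messages above. The function name `drbd_protocol_errors` is made up for this example, and it assumes your log lines look like the ones quoted in this post.

```shell
# Keep only DRBD log lines that mention the bad magic or a ProtocolError state.
# Reads log text on stdin; the function name is hypothetical.
drbd_protocol_errors() {
    grep -E 'drbd[0-9]+: .*(magic\?\?|ProtocolError)'
}

# Possible use on the node that is NOT running the backup:
#   dmesg | drbd_protocol_errors
```

If this prints nothing, the node has most likely not hit this particular error since the kernel log was last cleared.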

This issue goes back some time, and it is easy to find references to it on these forums.

I believe this is a bug in DRBD that was recently fixed:
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=95153072a19dfef10a2cde98c0719cf0f5d72a68


That commit mentions:
We assumed only bios with bi_idx == 0 would end up in drbd_make_request().

That is wrong.

At least device mapper, in __clone_and_map(), may submit clones only covering a partial bio, but sharing the original bvec, by adjusting bi_idx and relevant other bio members of the clone.
We used __bio_for_each_segment() in various places, even though that is documented as

* drivers should not use the __ version unless they _really_ want to
* run through the entire bio and not just pending pieces

Impact: we would send the full bio bvec, even for the clone with bi_idx > 0, which will cause data corruption on the peer (because we submit wrong data at the clone offset), and will cause a DRBD protocol error,


Would it be possible to get DRBD code updated in Proxmox so this bug can finally be put to rest?


 

Please file a bug at bugzilla.openvz.org. The OpenVZ kernel includes the drbd module, so it is better to ask there directly.
 
Apologies if this is the wrong thread, but which command shows the version of DRBD supported by the kernel?
 
Hi,
modinfo is a good tool for that:
Code:
modinfo drbd
filename:       /lib/modules/2.6.32-11-pve/kernel/drivers/block/drbd/drbd.ko
alias:          block-major-147-*
license:        GPL
version:        8.3.10
description:    drbd - Distributed Replicated Block Device v8.3.10
author:         Philipp Reisner <phil@linbit.com>, Lars Ellenberg <lars@linbit.com>
srcversion:     A52DAA74FC64F74BC4127FD
depends:        
vermagic:       2.6.32-11-pve SMP mod_unload modversions 
parm:           minor_count:Maximum number of drbd devices (1-256) (uint)
parm:           disable_sendpage:bool
parm:           allow_oos:DONT USE! (bool)
parm:           cn_idx:uint
parm:           proc_details:int
parm:           enable_faults:int
parm:           fault_rate:int
parm:           fault_count:int
parm:           fault_devs:int
parm:           usermode_helper:string
Udo
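
If you only want the version number, a small sketch like the following pulls the `version:` field out of modinfo-style output. The function name `drbd_version_from_modinfo` is invented for illustration, and it assumes the output layout shown above.

```shell
# Print the "version:" field from modinfo-style output read on stdin.
# Matches the field name exactly, so "srcversion:" is not picked up.
drbd_version_from_modinfo() {
    awk '$1 == "version:" { print $2; exit }'
}

# Possible use on a node (assumes the drbd module is installed):
#   modinfo drbd | drbd_version_from_modinfo
# modinfo can also select a single field itself:
#   modinfo -F version drbd
```

Once the module is loaded, the first line of /proc/drbd should report the same version.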
 
e100,
thanks for making the bug report. Hopefully DRBD gets updated in the kernel soon.

To prevent this issue I'll use suspend backups for KVM guests on DRBD. Have you tried doing that to prevent split-brain?
 
I have not tried that but I suspect it would work.
It is really only an inconvenience to me since repairing the split brain is so simple. I like my snapshot backups and will stick to using them.
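
For readers who have not repaired a split brain before, here is a hedged sketch of the usual manual recovery in DRBD 8.3 syntax. It only prints the commands so nothing is discarded by accident. The function name `split_brain_steps` and the resource name `r0` are placeholders, not anything taken from this thread.

```shell
# Print the manual split-brain recovery commands for one resource (DRBD 8.3 syntax).
# "victim" is the node whose changes are thrown away; "survivor" keeps its data.
split_brain_steps() {
    role="$1"   # victim | survivor
    res="$2"    # resource name from your DRBD config, e.g. r0 (placeholder)
    case "$role" in
        victim)
            echo "drbdadm secondary $res"
            echo "drbdadm -- --discard-my-data connect $res"
            ;;
        survivor)
            echo "drbdadm connect $res"
            ;;
        *)
            echo "usage: split_brain_steps victim|survivor RESOURCE" >&2
            return 1
            ;;
    esac
}

# Example: show what to run on the node whose data will be discarded.
#   split_brain_steps victim r0
```

Decide carefully which node is the victim, since its changes made after the split are lost. The survivor only needs the connect step if it is not already waiting for a connection.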

We could also download DRBD code then compile a newer userland and kernel module.
Only issue is every kernel update would require you to remember to re-compile the DRBD module.
 
Is it possible to ask the Proxmox team to make the latest version of DRBD, and the matching kernel, available in their repositories? :D
 
Building the DRBD userland and module is not too difficult. I might give this a try Monday when we reinstall 2.0 onto two old 1.9 servers.

This is untested but should work:
Code:
mkdir drbd
cd drbd
apt-get install git-core git-buildpackage fakeroot debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf flex pve-headers-2.6.32-11-pve module-assistant
git clone http://git.drbd.org/drbd-8.3.git
cd drbd-8.3
git checkout drbd-8.3.13rc1   
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i drbd8-module-source_8.3.13rc1-0_all.deb drbd8-utils_8.3.13rc1-0_amd64.deb
module-assistant auto-install drbd8
reboot


NOTE: at the time of this post drbd-8.3.13rc1 is the latest tag in 8.3 and is the lowest version that contains the fix for the bug that causes the split-brain.

If you update the module using this method you need to re-run this command if you install a new kernel in the future:
Code:
module-assistant auto-install drbd8
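
Since it is easy to forget the rebuild after a kernel update, here is a rough sketch that lists installed kernels with no drbd module at all. The function name `kernels_missing_drbd` is made up, and it assumes modules live under /lib/modules/<kernel version>/ as in the modinfo output earlier in the thread. It cannot tell an old stock module from a rebuilt one, so treat it as a first check only.

```shell
# List kernel versions under a modules root that contain no drbd.ko file.
# The root defaults to /lib/modules; the argument exists mainly for testing.
kernels_missing_drbd() {
    modules_root="${1:-/lib/modules}"
    for dir in "$modules_root"/*/; do
        [ -d "$dir" ] || continue
        # Match drbd.ko and compressed variants such as drbd.ko.gz.
        if [ -z "$(find "$dir" -name 'drbd.ko*' -print 2>/dev/null)" ]; then
            basename "$dir"
        fi
    done
}

# Possible use after installing a new kernel:
#   kernels_missing_drbd
```

Any kernel it prints would need the module-assistant command from this post run for it before DRBD will load there.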
 
Wow e100, this is very interesting!
This man has plenty of experience.

Let me ask a couple of questions:

1- If I want DRBD version 8.4.1, what would the procedure be? Is it only a matter of changing the git command?
2- And if I later want to update DRBD and the module, what would the procedure be?

Best regards
Cesar
 
8.4 should be like this:
Code:
mkdir drbd
cd drbd
apt-get install git-core git-buildpackage fakeroot debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf flex pve-headers-2.6.32-11-pve module-assistant
git clone http://git.drbd.org/drbd-8.4.git 
cd drbd-8.4 
git checkout drbd-8.4.1    
dpkg-buildpackage -rfakeroot -b -uc 
cd .. 
dpkg -i drbd8-module-source_8.4.1-0_all.deb drbd8-utils_8.4.1-0_amd64.deb 
module-assistant auto-install drbd8 
reboot


To update, go into the drbd-8.4 directory, pull the latest changes with "git pull", then check out the version you want. The rest is about the same.
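
The update step can be sketched roughly as below. One caveat: after checking out a tag the clone is on a detached HEAD, where a plain "git pull" may refuse to run, so this sketch uses "git fetch --tags" instead. The helper names `run` and `drbd_checkout_tag` and the DRY_RUN variable are invented for this example.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Move the clone in the current directory to a given tag.
# Stops at the first failing step when not in dry-run mode.
drbd_checkout_tag() {
    tag="$1"
    run git fetch --tags origin || return 1
    run git checkout "$tag"
}

# Possible use from inside the drbd-8.4 directory, previewing first:
#   DRY_RUN=1 drbd_checkout_tag drbd-8.4.1
#   drbd_checkout_tag drbd-8.4.1
```

After the checkout, the dpkg-buildpackage, dpkg -i and module-assistant steps from the instructions above would be repeated with the new version number in the .deb file names.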

8.3 gets bug fixes until December 2012
8.4 gets bug fixes and new features
Should not be an issue to upgrade to the latest 8.3 or 8.4 from the version that is currently in the Proxmox Kernel.

The bug that causes the split-brain is not fixed in 8.4.1.
It is fixed in the 8.4 code, here is the commit:
http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=90e21c9b9bb6bdf3669502318afbdd79997b5c52

But that only exists in the main 8.4 tree so you would basically be running the latest development version if you wanted to be free of the bug on 8.4.
So until they make a new release in 8.4 it might be best to use 8.3.13rc1, or wait for the module to get updated in the Proxmox kernel.
 
Note there's a Bugzilla entry for this problem at https://bugzilla.proxmox.com/show_bug.cgi?id=185

I have similar problems where the default DRBD 8.3.10 constantly loops a complete disc transfer over and over again, which was fixed in 8.3.11.

Niall

Hi ned14,

Can you explain the scenario in which you are running DRBD?
(For example, 2 partitions for 2 Proxmox nodes with several VMs on each one.)

Best regards
Cesar
 
It's in this scenario: http://www.nedproductions.biz/wiki/...uster-running-over-an-openvpn-intranet-part-3

In short, it's unidirectional protocol A over an OpenVPN-created subnet straddling a NATed ADSL connection, so about as far from what DRBD should be used for as you can get. There are two DRBD partitions, one from internet to local and the other from local to internet. They are both unidirectional as the 80ms ping time makes anything else impractical.

Syncing works mostly fine from internet node to local node once you've heavily bumped up timeouts. The problem with the dirty bitmap getting reset on every connection only happens when going from local to internet. There was a post on the DRBD forums about this, and it was confirmed as a bug in 8.3.10.

Niall
 
I can confirm these instructions work perfectly with the drbd 8.3.13 release.

Thanks very much for posting them. Saved me figuring it out for myself.

Niall
 
e100: I have a question about drbd8-utils.

Does drbd8-utils contain a kernel module or just the management programs?
 
