I've seen this issue a few times on 1.X and for the first time on 2.0 today.
It usually happens during snapshot backups; I assume that's because of the increased IO more than anything.
The error that is the big clue appears on the node that is not doing the backup:
Code:
block drbd0: magic?? on data m: 0x3eabc3a5 c: 512 l: 97
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
This issue goes back some time; it's easy to find references to it on these forums.
I believe this is a bug in DRBD that was recently fixed:
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=95153072a19dfef10a2cde98c0719cf0f5d72a68
That commit mentions:
We assumed only bios with bi_idx == 0 would end up in drbd_make_request().
That is wrong.
At least device mapper, in __clone_and_map(), may submit clones only covering a partial bio, but sharing the original bvec, by adjusting bi_idx and relevant other bio members of the clone.
We used __bio_for_each_segment() in various places, even though that is documented as
* drivers should not use the __ version unless they _really_ want to
* run through the entire bio and not just pending pieces
Impact: we would send the full bio bvec, even for the clone with bi_idx > 0, which will cause data corruption on the peer (because we submit wrong data at the clone offset), and will cause a DRBD protocol error,
Would it be possible to get the DRBD code in Proxmox updated so this bug can finally be put to rest?