Bug in DRBD causes split-brain, already patched by DRBD devs

udo · May 15, 2012

bread-baker said:
e100: I have a question about drbd8-utils .

do drbd8-utils contain a kernel module or just the management programs?

Hi,
only userland:

Code:

# dpkg -c /root/drbd8-utils_8.3.10-0_amd64.deb
drwxr-xr-x root/root         0 2012-02-10 10:22 ./
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/drbd/
-rwxr-xr-x root/root      3532 2012-02-10 10:22 ./usr/lib/drbd/notify.sh
-rwxr-xr-x root/root      3015 2012-02-10 10:22 ./usr/lib/drbd/outdate-peer.sh
-rwxr-xr-x root/root     19465 2012-02-10 10:22 ./usr/lib/drbd/crm-fence-peer.sh
-rwxr-xr-x root/root      2173 2012-02-10 10:22 ./usr/lib/drbd/snapshot-resync-target-lvm.sh
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/resource.d/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/resource.d/linbit/
-rwxr-xr-x root/root     27900 2012-02-10 10:22 ./usr/lib/ocf/resource.d/linbit/drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/doc/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/doc/drbd8-utils/
-rw-r--r-- root/root       218 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/TODO.Debian
-rw-r--r-- root/root       711 2012-02-10 10:21 ./usr/share/doc/drbd8-utils/changelog.Debian.gz
-rw-r--r-- root/root       420 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/README.Debian
-rw-r--r-- root/root     12442 2012-02-10 10:21 ./usr/share/doc/drbd8-utils/changelog.gz
-rw-r--r-- root/root       902 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/copyright
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/man5/
-rw-r--r-- root/root     14648 2012-02-10 10:22 ./usr/share/man/man5/drbd.conf.5.gz
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/man8/
-rw-r--r-- root/root      2300 2012-02-10 10:22 ./usr/share/man/man8/drbdmeta.8.gz
-rw-r--r-- root/root      1204 2012-02-10 10:22 ./usr/share/man/man8/drbddisk.8.gz
-rw-r--r-- root/root      1295 2012-02-10 10:22 ./usr/share/man/man8/drbd.8.gz
-rw-r--r-- root/root     13373 2012-02-10 10:22 ./usr/share/man/man8/drbdsetup.8.gz
-rw-r--r-- root/root      3962 2012-02-10 10:22 ./usr/share/man/man8/drbdadm.8.gz
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/sbin/
-rwxr-xr-x root/root      6389 2012-02-10 10:22 ./usr/sbin/drbd-overview
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lock/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lib/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lib/drbd/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./sbin/
-rwxr-xr-x root/root     77912 2012-02-10 10:22 ./sbin/drbdmeta
-rwxr-xr-x root/root    150128 2012-02-10 10:22 ./sbin/drbdadm
-rwxr-xr-x root/root     68432 2012-02-10 10:22 ./sbin/drbdsetup
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/bash_completion.d/
-rw-r--r-- root/root      4514 2012-02-10 10:22 ./etc/bash_completion.d/drbdadm
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/drbd.d/
-rw-r--r-- root/root      1418 2012-02-10 10:22 ./etc/drbd.d/global_common.conf
-rw-r--r-- root/root       133 2012-02-10 10:22 ./etc/drbd.conf
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/xen/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/xen/scripts/
-rwxr-xr-x root/root      8047 2012-02-10 10:22 ./etc/xen/scripts/block-drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/init.d/
-rwxr-xr-x root/root      6459 2012-02-10 10:22 ./etc/init.d/drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/ha.d/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/ha.d/resource.d/
-rwxr-xr-x root/root      1167 2012-02-10 10:22 ./etc/ha.d/resource.d/drbdupper
-rwxr-xr-x root/root      3162 2012-02-10 10:22 ./etc/ha.d/resource.d/drbddisk
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/udev/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/udev/rules.d/
-rw-r--r-- root/root       649 2012-02-10 10:22 ./etc/udev/rules.d/65-drbd.rules
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-emergency-shutdown.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-out-of-sync.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-split-brain.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/unsnapshot-resync-target-lvm.sh -> snapshot-resync-target-lvm.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/crm-unfence-peer.sh -> crm-fence-peer.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-on-incon-degr.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-emergency-reboot.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-io-error.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-lost.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-lost-after-sb.sh -> notify.sh

Udo

bread-baker · May 15, 2012

I thought it was just user land.

then when a new pve kernel is installed, we do not need to build a module . correct?

[ AFAIR rebuilding a module was suggested in one of these drbd threads].

e100 · May 18, 2012

bread-baker said:
I thought it was just user land.

then when a new pve kernel is installed, we do not need to build a module . correct?

[ AFAIR rebuilding a module was suggested in one of these drbd threads].

In my instructions above I described how to create the debian packages for userland utils and the kernel module.
The kernel module package installs some source.
Then you compile the module using module assist: module-assistant auto-install drbd8
If you install a new kernel just re-run the module-assistant command then reboot.

The DRBD 8.3.10 module in the Proxmox kernel has multiple bugs.
I have not had a single issue since I started using 8.3.13-rc1 from source.
We now have ten nodes running 8.3.13 (release) with no problems.

I hate having to compile kernel modules for production systems.
But in this case the stability gained is worth the effort.
At least debian has module-assistant that makes this easy

bread-baker · May 19, 2012

module-assistantOK I have this part done:

Code:

mkdir drbd
cd drbd
apt-get install git-core git-buildpackage fakeroot debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf flex pve-headers-2.6.32-11-pve module-assistant
git clone [URL]http://git.drbd.org/drbd-8.3.git[/URL]
cd drbd-8.3
git checkout drbd-8.3.13rc1
dpkg-buildpackage -rfakeroot -b -uc
mkdir drbd
cd drbd
apt-get install git-core git-buildpackage fakeroot debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf flex   pve-headers-2.6.32-11-pve pve-headers-2.6.32-12-pve module-assistant
git clone [URL]http://git.drbd.org/drbd-8.3.git[/URL]
cd drbd-8.3
git checkout drbd-8.3.13rc1
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i drbd8-module-source_8.3.13rc1-0_all.deb drbd8-utils_8.3.13rc1-0_amd64.deb
module-assistant auto-install drbd8

That built a module for the running kernel .

I had already installed but not booted 2.6.32-12-pve . We are running 2.6.32-11.pve kernel.

Question : how can a drbd module build for 2.6.32-12-pve while using 2.6.32-11-pve ? I'd like to build and install the module for version 12 before rebooting.

I've been trying to use the -k option for module-assistant. But have had no luck yet.

e100 · May 19, 2012

After installing the 2.6.32-12-pve kernel, here is how you install the drbd module before reboot:

Code:

aptitude install pve-headers-2.6.32-12-pve
module-assistant -l 2.6.32-12-pve auto-install drbd8

e100 · May 19, 2012

Also you should not install 8.3.13-rc1, 8.3.13 release is out.

git checkout drbd-8.3.13

Also the created deb packages will be named differently.

macday · May 19, 2012

thank you very much e100...

...could you tell me if yout solution works also on pve 1.9 ?

I´m too afraid to update my 3tb-sas-drbd-cluster

cheers
macday

e100 · May 20, 2012

Not tired on 1.9 but it should work provided you change the kernel versions wherever they appear in the commands.

macday · May 21, 2012

...at least I´m thinking about upgrading my DRBD-PVE-Cluster.

Can I do it this way (?):

1. Move all VM´s to one node
2. Do a fresh install (2.1) on the second node
3. Creating a fresh DRBD on the second node (with local storage only)
4. vzdump one VM after the other from 1.9 to the new 2.1
5. Do a fresh install (2.1) on the first node
6. Synchronize DRBD-Volumes from the second node (while VM´s on the second node are online)
7. Joining first node to the Cluster
8. Be happy ?

thanks for your opinions

macday · May 21, 2012

macday said:
...at least I´m thinking about upgrading my DRBD-PVE-Cluster.

Can I do it this way (?):

1. Move all VM´s to one node
2. Do a fresh install (2.1) on the second node
3. Creating a fresh DRBD on the second node (with local storage only)
4. vzdump one VM after the other from 1.9 to the new 2.1
5. Do a fresh install (2.1) on the first node
6. Synchronize DRBD-Volumes from the second node (while VM´s on the second node are online)
7. Joining first node to the Cluster
8. Be happy ?

thanks for your opinions

ok, a simpler question

is it possible to create drbd with one host only (fill it up wih data) and add the second host later and sync the data in background ?

e100 · May 21, 2012

macday said:
ok, a simpler question

is it possible to create drbd with one host only (fill it up wih data) and add the second host later and sync the data in background ?

Yes, but be careful when pairing the other node, if you make a mistake you can easily obliterate your good data by invalidating the wrong side.

macday · May 21, 2012

ok I see now...that means issuing

drbdadm -- --overwrite-data-of-peer primary r0

only from the host with the data on it, right?

but ist the volumegroup online while the sync takes place? the wiki tells me to wait until the first sync is finished...

dear e100, would you upgrade the whole drbd-cluster inplace as an other option ?

e100 · May 21, 2012

macday said:
ok I see now...that means issuing only from the host with the data on it, right?

but ist the volumegroup online while the sync takes place? the wiki tells me to wait until the first sync is finished...

dear e100, would you upgrade the whole drbd-cluster inplace as an other option ?

You can not follow the wiki since you only have one node.
So you get it up on one node working in primary.
When you add the other node it should split brain
Then follow wiki for split brain recovery to invalidate the correct node.

I would use that method only if I had no other choice and I would be 100% sure I had good backups first too.
We upgraded one DRBD pair to 2.X, the copied VMs from 1.9 DRBD pair into the upgraded 2.x pair.
Repeat till all are updated (last pair gets upgraded to 2.1 today)
This only worked out because we happened to get a couple new servers right before 2.0 was released, without the extra servers that method would not have been possible.

We are getting off-topic, if you have more questions please start a new thread.

cesarpk · Sep 6, 2012

please anybody can help me?

1- First i execute in both nodes: "service drbd start"
and i see in the screen of one node:
Starting DRBD resources:[ d(r0) s(r0) n(r0) ].0: State change failed: (-10) State change was refused by peer node
Command '/sbin/drbdsetup 0 primary' terminated with exit code 11
.

2- After i only execute: "cat /proc/drbd"
and i see in the screen:
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by phil@fat-tyre, 2011-01-28 12:17:35
0: cs:Connected ro

rimary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:1196 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Note: my file r0.res has not been modified, and after making some changes on DRBD i get this error

What i need to know for repair this problem?

RobFantini · Sep 6, 2012

is this a new set up?

per this:

Code:

 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----

it looks like drbd is normal.

e100 · Sep 6, 2012

cesarpk said:
please anybody can help me?

1- First i execute in both nodes: "service drbd start"
and i see in the screen of one node:
Starting DRBD resources:[ d(r0) s(r0) n(r0) ].0: State change failed: (-10) State change was refused by peer node
Command '/sbin/drbdsetup 0 primary' terminated with exit code 11

RobFantini said:
it looks like drbd is normal.

I agree with RobFantini, this looks like DRBD is normal.
Not 100% sure, but it seems like the error mentioned might occur if you tried to start DRBD when DRBD had previously been started.

What exactly is your problem?

cesarpk · Sep 6, 2012

Hi e100 and RobFantini

Thank you very much for your answers, always very nice to receive feedback from those who know.

Always using Proxmox VE 2.x, at first I did a format to one HDD and then was integrated it with DRBD without change any configuration files. From that moment I started to see some strange behaviors in DRBD:

1- At boot of both Nodes with the service DRBD enabled:
Before: DRBD start immediately without any message error.
After: DRBD start with one error message - the order times for wfc-timeout and degr-wfc-timeout.

2- At service DRBD start manually without start the service DRBD on initial boot of both Nodes:
Before: DRBD start immediately without any message error.
After: Sometimes the error message code 11 that i mentioned above in this post.
Always wait 15 sec. and view on the screen the countdown of DRBD.

But after of some tests works fine both LVM partitions on top of DRBD, as restore, live migration, HA, etc.
I mention all this in case anyone may help you

Best regards
Cesar

RobFantini · Sep 6, 2012

Does the issue occur when both nodes are powered up or restarting at the same time?

PS:
Have you searched the drbd users mail list? http://www.gossamer-threads.com/lists/drbd/users/

And the users guide is here: http://www.drbd.org/users-guide-8.3/

cesarpk · Sep 6, 2012

RobFantini said:
Does the issue occur when both nodes are powered up or restarting at the same time?

PS:
Have you searched the drbd users mail list? http://www.gossamer-threads.com/lists/drbd/users/

And the users guide is here: http://www.drbd.org/users-guide-8.3/

Thanks RobFatini for your quick response and the links

And the answer is yes. Before: these messages never showed and no matter what time booting the second node. and if I start only one node, then "DRBD" displays "wfc-timeout" time in countdown - that is well.

Today smartd show error message and i will be change the hard disk (ahhhhh !!! more work to), after i will be testing and I will be commenting on the outcome.

Best Regards
Cesar

cesarpk · Sep 8, 2012

cesarpk said:
Thanks RobFatini for your quick response and the links

And the answer is yes. Before: these messages never showed and no matter what time booting the second node. and if I start only one node, then "DRBD" displays "wfc-timeout" time in countdown - that is well.

Today smartd show error message and i will be change the hard disk (ahhhhh !!! more work to), after i will be testing and I will be commenting on the outcome.

Best Regards
Cesar

Hello fellow

Now i installed the new HDD and DRBD work well but with the same non critical errors that i mentioned above in this post.

I like to know if I start both nodes with the service drbd enabled without initiate VMs, and after, I execute " ls /dev/drbd* ", should I see the groups LVM: drbdvg0 and drbdvg1?. Because i don't see these groups, and after if I execute " pvscan" these groups are show for this command. And if after execute " vgchange -ay " then the command " ls /dev/drbd* are show these lvm groups.

Best Regards
Cesar

Bug in DRBD causes split-brain, already patched by DRBD devs

Distinguished Member

Member

Famous Member

Member

Famous Member

Famous Member

Member

Famous Member

Member

Member

Famous Member

Member

Famous Member

Renowned Member

Famous Member

Famous Member

Renowned Member

Famous Member

Renowned Member

Renowned Member

We value your privacy