Bug in DRBD causes split-brain, already patched by DRBD devs

e100: I have a question about drbd8-utils.

Does drbd8-utils contain a kernel module, or just the management programs?
Hi,
only userland:
Code:
# dpkg -c /root/drbd8-utils_8.3.10-0_amd64.deb
drwxr-xr-x root/root         0 2012-02-10 10:22 ./
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/drbd/
-rwxr-xr-x root/root      3532 2012-02-10 10:22 ./usr/lib/drbd/notify.sh
-rwxr-xr-x root/root      3015 2012-02-10 10:22 ./usr/lib/drbd/outdate-peer.sh
-rwxr-xr-x root/root     19465 2012-02-10 10:22 ./usr/lib/drbd/crm-fence-peer.sh
-rwxr-xr-x root/root      2173 2012-02-10 10:22 ./usr/lib/drbd/snapshot-resync-target-lvm.sh
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/resource.d/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/lib/ocf/resource.d/linbit/
-rwxr-xr-x root/root     27900 2012-02-10 10:22 ./usr/lib/ocf/resource.d/linbit/drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/doc/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/doc/drbd8-utils/
-rw-r--r-- root/root       218 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/TODO.Debian
-rw-r--r-- root/root       711 2012-02-10 10:21 ./usr/share/doc/drbd8-utils/changelog.Debian.gz
-rw-r--r-- root/root       420 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/README.Debian
-rw-r--r-- root/root     12442 2012-02-10 10:21 ./usr/share/doc/drbd8-utils/changelog.gz
-rw-r--r-- root/root       902 2012-02-10 09:56 ./usr/share/doc/drbd8-utils/copyright
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/man5/
-rw-r--r-- root/root     14648 2012-02-10 10:22 ./usr/share/man/man5/drbd.conf.5.gz
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/share/man/man8/
-rw-r--r-- root/root      2300 2012-02-10 10:22 ./usr/share/man/man8/drbdmeta.8.gz
-rw-r--r-- root/root      1204 2012-02-10 10:22 ./usr/share/man/man8/drbddisk.8.gz
-rw-r--r-- root/root      1295 2012-02-10 10:22 ./usr/share/man/man8/drbd.8.gz
-rw-r--r-- root/root     13373 2012-02-10 10:22 ./usr/share/man/man8/drbdsetup.8.gz
-rw-r--r-- root/root      3962 2012-02-10 10:22 ./usr/share/man/man8/drbdadm.8.gz
drwxr-xr-x root/root         0 2012-02-10 10:22 ./usr/sbin/
-rwxr-xr-x root/root      6389 2012-02-10 10:22 ./usr/sbin/drbd-overview
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lock/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lib/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./var/lib/drbd/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./sbin/
-rwxr-xr-x root/root     77912 2012-02-10 10:22 ./sbin/drbdmeta
-rwxr-xr-x root/root    150128 2012-02-10 10:22 ./sbin/drbdadm
-rwxr-xr-x root/root     68432 2012-02-10 10:22 ./sbin/drbdsetup
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/bash_completion.d/
-rw-r--r-- root/root      4514 2012-02-10 10:22 ./etc/bash_completion.d/drbdadm
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/drbd.d/
-rw-r--r-- root/root      1418 2012-02-10 10:22 ./etc/drbd.d/global_common.conf
-rw-r--r-- root/root       133 2012-02-10 10:22 ./etc/drbd.conf
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/xen/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/xen/scripts/
-rwxr-xr-x root/root      8047 2012-02-10 10:22 ./etc/xen/scripts/block-drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/init.d/
-rwxr-xr-x root/root      6459 2012-02-10 10:22 ./etc/init.d/drbd
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/ha.d/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/ha.d/resource.d/
-rwxr-xr-x root/root      1167 2012-02-10 10:22 ./etc/ha.d/resource.d/drbdupper
-rwxr-xr-x root/root      3162 2012-02-10 10:22 ./etc/ha.d/resource.d/drbddisk
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/udev/
drwxr-xr-x root/root         0 2012-02-10 10:22 ./etc/udev/rules.d/
-rw-r--r-- root/root       649 2012-02-10 10:22 ./etc/udev/rules.d/65-drbd.rules
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-emergency-shutdown.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-out-of-sync.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-split-brain.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/unsnapshot-resync-target-lvm.sh -> snapshot-resync-target-lvm.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/crm-unfence-peer.sh -> crm-fence-peer.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-on-incon-degr.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-emergency-reboot.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-io-error.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-lost.sh -> notify.sh
lrwxrwxrwx root/root         0 2012-02-10 10:22 ./usr/lib/drbd/notify-pri-lost-after-sb.sh -> notify.sh
Udo
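
As a side note, a quick way to confirm for yourself that a .deb ships no kernel module is to grep its contents for compiled modules (a minimal sketch, reusing the package path from the listing above; no output means userland-only):
Code:
# list the package contents and keep only compiled kernel modules (*.ko);
# empty output confirms the package contains userland tools only
dpkg -c /root/drbd8-utils_8.3.10-0_amd64.deb | grep '\.ko'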
 
I thought it was just userland.

Then, when a new pve kernel is installed, we do not need to build a module, correct?

[AFAIR rebuilding a module was suggested in one of these DRBD threads.]
 
In my instructions above I described how to create the Debian packages for the userland utils and the kernel module.
The kernel module package installs the module source.
Then you compile the module using module-assistant: module-assistant auto-install drbd8
If you install a new kernel, just re-run the module-assistant command and then reboot.
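
For illustration, a minimal sketch of that re-run (assuming the drbd8-module-source package from the steps above is already installed):
Code:
# rebuild and install the drbd8 module from the installed drbd8-module-source package
module-assistant auto-install drbd8
# then reboot so the freshly built module is loaded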

The DRBD 8.3.10 module in the Proxmox kernel has multiple bugs.
I have not had a single issue since I started using 8.3.13-rc1 from source.
We now have ten nodes running 8.3.13 (release) with no problems.

I hate having to compile kernel modules for production systems.
But in this case the stability gained is worth the effort.
At least Debian has module-assistant, which makes this easy :cool:
 
OK, I have this part done:
Code:
mkdir drbd
cd drbd
apt-get install git-core git-buildpackage fakeroot debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf flex   pve-headers-2.6.32-11-pve pve-headers-2.6.32-12-pve module-assistant
git clone http://git.drbd.org/drbd-8.3.git
cd drbd-8.3
git checkout drbd-8.3.13rc1
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i drbd8-module-source_8.3.13rc1-0_all.deb drbd8-utils_8.3.13rc1-0_amd64.deb
module-assistant auto-install drbd8

That built a module for the running kernel.

I had already installed, but not booted into, 2.6.32-12-pve. We are running the 2.6.32-11-pve kernel.

Question: how can I build a DRBD module for 2.6.32-12-pve while running 2.6.32-11-pve? I'd like to build and install the module for the -12 kernel before rebooting.


I've been trying to use the -k option for module-assistant, but have had no luck yet.
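
One approach worth trying (a sketch only, not verified on this exact setup): module-assistant can be pointed at a non-running kernel with -l, which takes the target kernel version, provided the matching headers are installed:
Code:
# headers for the installed-but-not-yet-booted kernel
apt-get install pve-headers-2.6.32-12-pve
# build and install the drbd8 module for that kernel version instead of the running one
module-assistant -l 2.6.32-12-pve auto-install drbd8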
 
...at last I'm thinking about upgrading my DRBD-PVE cluster.

Can I do it this way (?):

1. Move all VMs to one node
2. Do a fresh install (2.1) on the second node
3. Create a fresh DRBD on the second node (with local storage only)
4. vzdump the VMs one after another from 1.9 to the new 2.1
5. Do a fresh install (2.1) on the first node
6. Synchronize the DRBD volumes from the second node (while the VMs on the second node are online)
7. Join the first node to the cluster
8. Be happy?

Thanks for your opinions
 
OK, a simpler question ;)

Is it possible to create DRBD with one host only (fill it up with data) and add the second host later, syncing the data in the background?
 
OK, I see now... that means issuing
drbdadm -- --overwrite-data-of-peer primary r0
only from the host with the data on it, right?

But is the volume group online while the sync takes place? The wiki tells me to wait until the first sync is finished...

Dear e100, would you upgrade the whole DRBD cluster in place as another option?
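
For reference, a minimal sketch of that initial promotion on the host holding the data (r0 is the resource name used above; the primary device is generally usable while the background resync runs):
Code:
# on the node with the data: take primary and push a full sync to the empty peer
drbdadm -- --overwrite-data-of-peer primary r0
# watch the resync progress
watch cat /proc/drbd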
 
You cannot follow the wiki since you only have one node.
So you get it up and working in primary on one node.
When you add the other node it should split-brain.
Then follow the wiki's split-brain recovery to invalidate the correct node (the new, empty one).
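
For reference, the manual split-brain recovery described in the DRBD user's guide boils down to the following (a sketch, assuming the resource is called r0 and the newly added, empty node is the one whose data should be discarded):
Code:
# on the node whose data is to be thrown away (the new, empty node):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0
# on the node that holds the good data:
drbdadm connect r0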

I would use that method only if I had no other choice, and I would be 100% sure I had good backups first, too.
We upgraded one DRBD pair to 2.x, then copied the VMs from a 1.9 DRBD pair onto the upgraded 2.x pair.
Repeat until all are updated (the last pair gets upgraded to 2.1 today).
This only worked out because we happened to get a couple of new servers right before 2.0 was released; without the extra servers that method would not have been possible.

We are getting off-topic; if you have more questions please start a new thread.
 
Please, can anybody help me?

1- First I execute on both nodes: "service drbd start"
and on the screen of one node I see:
Starting DRBD resources:[ d(r0) s(r0) n(r0) ].0: State change failed: (-10) State change was refused by peer node
Command '/sbin/drbdsetup 0 primary' terminated with exit code 11
.

2- Then I only execute: "cat /proc/drbd"
and on the screen I see:
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by phil@fat-tyre, 2011-01-28 12:17:35
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:1196 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Note: my r0.res file has not been modified, and after making some changes to DRBD I get this error.

What do I need to know to repair this problem?
 
Is this a new setup?

per this:
Code:
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----

it looks like DRBD is normal.
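
For completeness, the same state can be checked with the tools shipped in drbd8-utils (a minimal sketch, using the resource name r0 from the post above):
Code:
# overall status of all resources
cat /proc/drbd
drbd-overview
# connection state, roles and disk state of a single resource
drbdadm cstate r0
drbdadm role r0
drbdadm dstate r0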
 
I agree with RobFantini; this looks like DRBD is normal.
Not 100% sure, but it seems like the error mentioned might occur if you tried to start DRBD when it was already running.

What exactly is your problem?
 
Hi e100 and RobFantini,

Thank you very much for your answers; it is always very nice to receive feedback from those who know.

As always, using Proxmox VE 2.x: at first I formatted one HDD and then integrated it with DRBD without changing any configuration files. From that moment I started to see some strange behavior in DRBD:

1- At boot of both nodes with the DRBD service enabled:
Before: DRBD started immediately without any error message.
After: DRBD starts with one error message - about the wait times for wfc-timeout and degr-wfc-timeout.

2- When starting the DRBD service manually, without starting it on the initial boot of both nodes:
Before: DRBD started immediately without any error message.
After: Sometimes the exit code 11 error that I mentioned above in this post.
It always waits 15 sec. and shows the DRBD countdown on the screen.

But after some tests, both LVM partitions on top of DRBD work fine for restore, live migration, HA, etc.
I mention all this in case anyone can help.

Best regards,
Cesar
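
For context, the wfc-timeout and degr-wfc-timeout messages come from the startup section of the DRBD configuration; a minimal sketch with illustrative values (not a recommendation for this particular cluster):
Code:
# /etc/drbd.d/global_common.conf -- startup section (illustrative values only)
common {
    startup {
        wfc-timeout      120;   # seconds the init script waits for the peer on normal startup
        degr-wfc-timeout  60;   # shorter wait when the cluster was previously degraded
    }
}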
 
Does the issue occur when both nodes are powered up or restarting at the same time?


PS:
Have you searched the drbd users mail list? http://www.gossamer-threads.com/lists/drbd/users/

And the users guide is here: http://www.drbd.org/users-guide-8.3/

Thanks RobFantini for your quick response and the links.

And the answer is yes. Before, these messages never showed, no matter when the second node was booted; and if I start only one node, then DRBD displays the "wfc-timeout" countdown - that is fine.

Today smartd showed an error message and I will have to change the hard disk (ahhhhh!!! more work to do); afterwards I will do some testing and report back on the outcome.

Best regards,
Cesar
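
As an aside, before swapping the disk it can be worth confirming what smartd is reporting (a minimal sketch; /dev/sda is only an example device name):
Code:
# overall SMART health verdict and the detailed attribute table
smartctl -H /dev/sda
smartctl -A /dev/sda    # check Reallocated_Sector_Ct and Current_Pending_Sector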
 
Hello fellows,

Now I have installed the new HDD and DRBD works well, but with the same non-critical errors that I mentioned above in this post.

I would like to know: if I start both nodes with the drbd service enabled, without starting any VMs, and then execute "ls /dev/drbd*", should I see the LVM groups drbdvg0 and drbdvg1? Because I don't see these groups; yet if I execute "pvscan", the groups are shown by that command, and after executing "vgchange -ay" the command "ls /dev/drbd*" does show these LVM groups.

Best regards,
Cesar
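
For reference, a minimal sketch of the check sequence on such a setup (the VG names drbdvg0 and drbdvg1 are taken from the post; logical volumes appear under their VG name rather than under /dev/drbd*):
Code:
# the DRBD block devices themselves
ls /dev/drbd*
# detect the physical volumes on top of DRBD and activate the volume groups
pvscan
vgchange -ay drbdvg0 drbdvg1
# the logical volumes then show up under their volume group directories
ls /dev/drbdvg0/ /dev/drbdvg1/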
 