Is it safe to use wfc-timeout in DRBD configuration?

Hi guys,

I read the Proxmox wiki page on DRBD configuration and found the following option there: "wfc-timeout 15". It seems very dangerous to use this option in production. I'll try to explain with an example.
1. We have two running machines with DRBD (machine1 and machine2).
2. For some reason we have to stop them, and we do (machine1 goes down first and machine2 several minutes after).
3. Now machine2 has up-to-date data but machine1 doesn't.
4. We start machine1. When the boot process is complete, it tries to connect to machine2.
5. Machine1 can't find machine2 within 15 seconds and brings up the DRBD device as is.
6. Several minutes later we start machine2, and machine2 gets synchronized from machine1.
7. In the end we have two machines with OLD data, and we have lost all the up-to-date information from machine2.
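
As far as I know, "wfc-timeout 0" means wait for the peer forever (the DRBD startup script itself says "0 sec -> wait forever"), so a safer startup section would look something like this:

Code:
startup {
        # 0 = wait for the peer forever instead of giving up after 15 seconds
        wfc-timeout      0;
        degr-wfc-timeout 60;
        become-primary-on both;
}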

What do you think, guys?

Cheers,
Stas
 
In this situation a split-brain will occur and the machines will not be synchronized automatically. You will have to choose manually which node to synchronize from.
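
For reference, manual recovery after a split-brain usually looks roughly like this (assuming resource r0, and that you have decided which node's changes to throw away):

Code:
# on the split-brain victim (the node whose changes you discard)
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# on the survivor (the node whose data you keep)
drbdadm connect r0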
 
I have investigated this with the following configuration from the wiki.

Code:
global { usage-count no; }
common { syncer { rate 30M; } }
resource r0 {
        protocol C;
        startup {
                wfc-timeout  15;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "my-secret";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on proxmox-105 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.105:7788;
                meta-disk internal;
        }
        on proxmox-106 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.106:7788;
                meta-disk internal;
        }
}
 
Just look at this, guys.

1. Create a partition (10GB) on drbd0.
2. Stop DRBD on virt2.
3. Remove the old partition (10GB) and create a new one (4GB) on drbd0 (on virt1).
4. Stop DRBD on virt1.
5. Start DRBD on virt2.
6. Wait 15 seconds and start DRBD on virt1.
7. Show the partitions on drbd0: WE HAVE ONLY THE FIRST PARTITION AND HAVE LOST THE LAST ONE.

Code:
global { usage-count no; }
common { syncer { rate 30M; } }
resource r0 {
        protocol C;
        startup {
                wfc-timeout  15;
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "xxxxxxxxxxxxxxxxxx";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on virt1 {
                device /dev/drbd0;
                disk /dev/pve/drbd;
                address 172.27.1.1:7788;
                meta-disk internal;
        }
        on virt2 {
                device /dev/drbd0;
                disk /dev/pve/drbd;
                address 172.27.1.2:7788;
                meta-disk internal;
        }
}
Code:
virt1:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-11-17 09:36:06
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:104854368 nr:0 dw:4 dr:105179620 al:1 bm:6400 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

---

virt2:~# cat /proc/drbd 
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-11-17 09:36:06
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:268 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

---

virt1:~# fdisk -l /dev/drbd0

Disk /dev/drbd0: 107.3 GB, 107370868736 bytes
255 heads, 63 sectors/track, 13053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

      Device Boot      Start         End      Blocks   Id  System
/dev/drbd0p1               1        1245    10000431   83  Linux

-------------------------------------------------------------------------------------------

virt2:~# /etc/init.d/drbd stop
Stopping all DRBD resources:.

-------------------------------------------------------------------------------------------

virt1:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-11-17 09:36:06
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:104854368 nr:0 dw:8 dr:105179860 al:1 bm:6400 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4

---

(change the partition table with cfdisk and check)
virt1:~# fdisk -l /dev/drbd0

Disk /dev/drbd0: 107.3 GB, 107370868736 bytes
255 heads, 63 sectors/track, 13053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

      Device Boot      Start         End      Blocks   Id  System
/dev/drbd0p1               1         498     4000153+  83  Linux

-------------------------------------------------------------------------------------------

virt1:~# /etc/init.d/drbd stop
Stopping all DRBD resources:.

-------------------------------------------------------------------------------------------

virt2:~# /etc/init.d/drbd start
Starting DRBD resources:DRBD module version: 8.3.2
[ d(r0) s(r0) n(r0) ]DRBD module version: 8.3.2
DRBD module version: 8.3.2
..........
***************************************************************
 DRBD's startup script waits for the peer node(s) to appear.
 - In case this node was already a degraded cluster before the
   reboot the timeout is 60 seconds. [degr-wfc-timeout]
 - If the peer was available before the reboot the timeout will
   expire after 15 seconds. [wfc-timeout]
   (These values are for resource 'r0'; 0 sec -> wait forever)
 To abort waiting enter 'yes' [  15]:
.

---

virt2:~# cat /proc/drbd 
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-11-17 09:36:06
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:0 dr:268 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

-------------------------------------------------------------------------------------------

virt1:~# /etc/init.d/drbd start
Starting DRBD resources:DRBD module version: 8.3.2
[ d(r0) s(r0) n(r0) ]DRBD module version: 8.3.2
DRBD module version: 8.3.2
.

---

virt1:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-11-17 09:36:06
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:4 dw:4 dr:264 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

-------------------------------------------------------------------------------------------

virt1:~# fdisk -l /dev/drbd0

Disk /dev/drbd0: 107.3 GB, 107370868736 bytes
255 heads, 63 sectors/track, 13053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

      Device Boot      Start         End      Blocks   Id  System
/dev/drbd0p1               1        1245    10000431   83  Linux

---

virt2:~# fdisk -l /dev/drbd0

Disk /dev/drbd0: 107.3 GB, 107370868736 bytes
255 heads, 63 sectors/track, 13053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

      Device Boot      Start         End      Blocks   Id  System
/dev/drbd0p1               1        1245    10000431   83  Linux
 
Your test is not relevant to the way Proxmox uses DRBD. For example, you make changes to the disk layout while DRBD is not running, which never happens in Proxmox. I would suggest that you configure DRBD as in the Proxmox wiki and then run your tests.
 
Your test is not relevant to the way Proxmox uses DRBD. For example, you make changes to the disk layout while DRBD is not running, which never happens in Proxmox. I would suggest that you configure DRBD as in the Proxmox wiki and then run your tests.
It can happen when I reboot my servers (for example, if I have to shut both of them down for two hours). In that case I must always keep in mind which one shut down first and which one second.
 
It can happen when I reboot my servers (for example, if I have to shut both of them down for two hours). In that case I must always keep in mind which one shut down first and which one second.

As I said, if you set up DRBD as written in the Proxmox wiki, the situation you described won't happen. When you shut down your server, your VMs are also shut down and no changes occur to the DRBD device after DRBD stops. I suggest that you do real tests: create two VMs, one on each node, with disks residing on the DRBD device, then do your shutdowns and see what happens.
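
For example, roughly following the wiki's approach (the volume group name here is just an example):

Code:
# on one node only, once /proc/drbd shows Connected and UpToDate/UpToDate
pvcreate /dev/drbd0
vgcreate drbdvg /dev/drbd0
# then add 'drbdvg' as LVM storage in the Proxmox web interface
# and create one test VM on each node with its disk on that storage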
 
As I said, if you set up DRBD as written in the Proxmox wiki, the situation you described won't happen. When you shut down your server, your VMs are also shut down and no changes occur to the DRBD device after DRBD stops. I suggest that you do real tests: create two VMs, one on each node, with disks residing on the DRBD device, then do your shutdowns and see what happens.

In the top post I wrote out my DRBD configuration. It's TOTALLY identical to the one in the wiki.
 
In the top post I wrote out my DRBD configuration. It's TOTALLY identical to the one in the wiki.
Just imagine.

1. We have two servers, virt1 and virt2.
2. We have two virtual servers, virtserv1 and virtserv2, and each of them uses drbd0.
3. We have a power problem and a UPS sends a shutdown command to virt1.
4. virt2 keeps working and virtserv2 writes something to drbd0.
5. Later another UPS sends the command to virt2 and it goes down too.
6. The next steps are as I described before.
 
Just imagine.

1. We have two servers, virt1 and virt2.
2. We have two virtual servers, virtserv1 and virtserv2, and each of them uses drbd0.
3. We have a power problem and a UPS sends a shutdown command to virt1.
4. virt2 keeps working and virtserv2 writes something to drbd0.
5. Later another UPS sends the command to virt2 and it goes down too.
6. The next steps are as I described before.

Yes, test it this way and then post what happens, because in the test you performed you made modifications to the DRBD device while the server was running but the DRBD service was not; that is not the same as above.
 
Unfortunately I can't do it at the moment because these are production servers. I'm thinking about using DRBD on them, but now I'm a little worried because of what my investigation showed.
 
Yes, test it this way and then post what happens, because in the test you performed you made modifications to the DRBD device while the server was running but the DRBD service was not; that is not the same as above.
I made the modifications to the DRBD device on the server which still had DRBD running. Why do you think stopping the DRBD service and stopping the server are different? Do you mean that the Proxmox service does something to DRBD while it's shutting down? I don't think so.
 
I made the modifications to the DRBD device on the server which still had DRBD running. Why do you think stopping the DRBD service and stopping the server are different? Do you mean that the Proxmox service does something to DRBD while it's shutting down? I don't think so.

No.
If you want to test DRBD without shutting down the servers, create your VMs, then disconnect the network which DRBD uses for synchronization and make whatever modifications you like. Reconnect the network and see what happens.
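
A rough sketch, assuming the synchronization link is eth1 on both nodes:

Code:
# cut the replication link on one node while the VMs keep running
ifdown eth1

# let the VMs write on both sides for a while, then reconnect
ifup eth1

# and watch how DRBD reports the resulting split-brain
cat /proc/drbd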
 
No.
If you want to test DRBD without shutting down the servers, create your VMs, then disconnect the network which DRBD uses for synchronization and make whatever modifications you like. Reconnect the network and see what happens.
I think that's the wrong way, because it differs from the shutdown process.
I think the right way (more similar to a shutdown) is:
1. Stop DRBD on virt1.
2. Disconnect the network on virt1.
3. Make modifications on virt2.
4. Stop DRBD on virt2.
5. Disconnect the network on virt2.
6. Start the network on virt1.
7. Start DRBD on virt1.
8. Start the network on virt2.
9. Start DRBD on virt2.
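
Roughly, as shell commands (eth1 as the DRBD synchronization link is just my assumption):

Code:
# virt1
/etc/init.d/drbd stop
ifdown eth1

# virt2: make the modifications to /dev/drbd0, then
/etc/init.d/drbd stop
ifdown eth1

# virt1
ifup eth1
/etc/init.d/drbd start

# virt2
ifup eth1
/etc/init.d/drbd start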

What do you think?
 
I think that's the wrong way, because it differs from the shutdown process.
I think the right way (more similar to a shutdown) is:
1. Stop DRBD on virt1.
2. Disconnect the network on virt1.
3. Make modifications on virt2.
4. Stop DRBD on virt2.
5. Disconnect the network on virt2.
6. Start the network on virt1.
7. Start DRBD on virt1.
8. Start the network on virt2.
9. Start DRBD on virt2.

What do you think?

Why do you insist on stopping DRBD? If you disconnect the network, DRBD can't synchronize with the other node; it is the same as if it were shut down. If you have a VM running you can't stop DRBD, because the DRBD device is in use.
 
Why do you insist on stopping DRBD?
Because when I stop a server it works in this way:
1. umount devices if mounted
2. stop drbd
3. stop network
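
You can check this order in the shutdown runlevel (assuming the standard Debian sysvinit layout that Proxmox uses):

Code:
# shutdown scripts run in the order of their link names
ls /etc/rc0.d/ | grep -Ei 'drbd|network|umount'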

... If you have a VM running you can't stop DRBD, because the DRBD device is in use.
That's not true. I tried it: with the device in use on one server, I successfully stopped the DRBD service on the other one.
 
I'm not really sure, but I think it works in the following way at the moment:
1. Both servers are shut down.
2. I start one of them. Because of "wfc-timeout 15;" and "become-primary-on both;", this server becomes PRIMARY after 15 seconds.
3. I start the other one, and because of "after-sb-1pri discard-secondary;", all data is synced to it from the first one. Then this server becomes PRIMARY too.
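
To check what actually happens at each step, as far as I know the standard drbdadm status calls are:

Code:
drbdadm cstate r0   # connection state: WFConnection, Connected, ...
drbdadm dstate r0   # disk state: UpToDate/DUnknown, UpToDate/UpToDate, ...
drbdadm role r0     # roles: Primary/Unknown, Primary/Primary, ...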
 
