KVM on top of DRBD and out of sync: long term investigation results

giner · Apr 15, 2014

e100 said:
Having swap inconsistent would be bad for live migrations.

It is possible to use different modes for different virtual drives so we can move swap to another drive.

Could you share your benchmark results?

e100 · Nov 22, 2014

directsync does NOT prevent this problem.

I've recently noticed a larger number of inconsistencies on upgraded, faster hardware, so I enabled data-integrity-alg on a couple of nodes.
A few hours later DRBD broke and split-brained with the error:
buffer modified by upper layers during write

This is the precise problem that giner described in the first post.
All of my VMs have directsync set for all of their VM disks, so it is obvious that directsync does NOT prevent the buffer from being modified while DRBD writes are in flight.

writethrough, in my experience, does not perform well so I am not sure what I will do to prevent this problem.

I wonder if DRBD 8.4 in the 3.10 kernel is less susceptible to this problem.

For reference the cause of this problem is explained by Lars, one of the main DRBD developers, here:
http://lists.linbit.com/pipermail/drbd-user/2014-February/020607.html
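
As a toy illustration of what the "buffer modified by upper layers during write" check catches, here is a minimal sketch. It is not DRBD's code: the function name, the 4 KiB page, and the use of SHA-1 are assumptions made for the example. The sender checksums a buffer when the write is submitted, the buffer changes before its contents are copied to the wire, and the receiver's checksum no longer matches.

```python
import hashlib


def send_with_checksum(buf: bytearray, mutate_in_flight: bool) -> bool:
    """Return True if the receiver's checksum matches the sender's.

    Hypothetical helper for illustration only; it mimics the idea behind
    a per-request integrity checksum, not DRBD's actual implementation.
    """
    # The sender computes a digest over the buffer at submission time.
    digest_at_submit = hashlib.sha1(buf).hexdigest()

    # An "upper layer" (for example a guest that is still writing to the
    # same page) changes the buffer while the write is in flight.
    if mutate_in_flight:
        buf[0] ^= 0xFF

    # What actually goes on the wire is the buffer's current content.
    payload = bytes(buf)

    # The receiver recomputes the digest over what it received.
    return hashlib.sha1(payload).hexdigest() == digest_at_submit


stable = send_with_checksum(bytearray(b"A" * 4096), mutate_in_flight=False)
raced = send_with_checksum(bytearray(b"A" * 4096), mutate_in_flight=True)
print("unchanged buffer verifies:", stable)
print("buffer changed in flight verifies:", raced)
```

Without such a checksum the mismatch goes unnoticed at write time, and the two nodes can end up with different block contents that only a later verify run reports.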

giner · Nov 22, 2014

This is very unusual. I've never experienced this problem since I switched to cache=writethrough.
Can you try to reproduce the issue with data-integrity-alg disabled? In theory data-integrity-alg can produce false positives.

Stanislav

macday · Nov 22, 2014

e100 said:
directsync does NOT prevent this problem.

I also still have the same problem, mostly with slower storage (near-line SAS), and I am on the 3.10 kernel. I am now trying to switch to writethrough and to lower al-extents on the slower storage, because after a split-brain there is a lot of resync to do. The strange thing is that, on the same host, the much faster SAS storage on the same RAID controller, which sees more traffic, does not have any problems.

e100 · Nov 23, 2014

Yes, we are seeing lots of out of sync blocks when we run a verify.
I only enabled data-integrity-alg to confirm that it is being caused by the buffer being modified while DRBD writes are in flight.

Testing if writethrough prevents this problem is now on my todo list.
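
For tracking out-of-sync blocks before and after such a test, a minimal sketch like the following could read the per-device `oos:` counter that DRBD 8.x prints in /proc/drbd. The sample text below is made up for illustration, the helper name is hypothetical, and the assumption that the counter is in KiB should be checked against the documentation for your DRBD version.

```python
import re
from typing import Dict

# Illustrative sample in the style of /proc/drbd on DRBD 8.x.
# The numbers are invented; on a real node you would read the file instead.
SAMPLE_PROC_DRBD = """\
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:1024 nr:2048 dw:3072 dr:4096 al:5 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:128
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
"""


def out_of_sync_kib(proc_text: str) -> Dict[int, int]:
    """Map each DRBD minor number to its out-of-sync counter."""
    result: Dict[int, int] = {}
    minor = None
    for line in proc_text.splitlines():
        # A device block starts with "<minor>: cs:<state> ...".
        header = re.match(r"\s*(\d+): cs:", line)
        if header:
            minor = int(header.group(1))
        # The statistics line of the block carries "oos:<number>".
        counter = re.search(r"\boos:(\d+)", line)
        if counter and minor is not None:
            result[minor] = int(counter.group(1))
    return result


for device, kib in sorted(out_of_sync_kib(SAMPLE_PROC_DRBD).items()):
    print(f"drbd{device}: {kib} KiB out of sync")
```

On a real node the text would come from `open("/proc/drbd").read()`. The counter only reflects blocks DRBD already knows are out of sync, so it is meaningful after a verify run has finished.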

cesarpk · Nov 23, 2014

Hi to all

A long time ago I was talking with Lars, and he told me that "data-integrity-alg" should only be used for testing and never in a production environment, because network packets are typically modified along the way (in my case "upper layers" refers to the hardware inside the same PC, since I use DRBD NIC-to-NIC). Since I removed that directive (data-integrity-alg), my problems are over.

Moreover, these practices are always better:
1) DRBD over NIC-to-NIC connections (I often use "balance-rr" bonding and jumbo frames to double the link speed)
2) Don't use a password for the replication (on a NIC-to-NIC connection nobody can see the transmission); you also get faster replication and lower CPU usage.
3) LVM on top of DRBD is the best way to get fast disk access.
4) The PVE host should use the "deadline" I/O scheduler, while a Linux guest should use "noop" (the guest does no reordering of its own)
5) Virtio-block as the disk driver in the guest (maybe not everyone knows this).
6) In my particular case, I have used DRBD 8.4.4 for a long time, and soon I will have 8.4.5 on other PVE servers. I have never had problems, and automatic verification of the DRBD storage runs once a week. (I believe the latest versions of DRBD are better, with fewer bugs and better optimizations, and I always hear Lars say the same to many people.)

Best regards
Cesar

macday · Nov 23, 2014

Hi Cesar,

Thanks for your post. Could you help me? Here is my resource configuration, as printed by drbdadm dump:

# /etc/drbd.conf
common {
    net {
        protocol C;
        sndbuf-size 512k;
        max-buffers 128k;
        max-epoch-size 8000;
    }
    disk {
        disk-barrier no;
        disk-flushes no;
        md-flushes no;
        resync-rate 500M;
        al-extents 3389;
    }
}

# resource r0 on fbo-vm-02: not ignored, not stacked
# defined at /etc/drbd.d/r0.res:1
resource r0 {
    on fbo-vm-01 {
        device /dev/drbd0 minor 0;
        disk /dev/sdb1;
        meta-disk internal;
        address ipv4 10.0.100.1:7788;
    }
    on fbo-vm-02 {
        device /dev/drbd0 minor 0;
        disk /dev/sdb1;
        meta-disk internal;
        address ipv4 10.0.100.2:7788;
    }
    options {
        on-no-data-accessible suspend-io;
    }
    net {
        protocol C;
        max-buffers 8000;
        max-epoch-size 8000;
        unplug-watermark 16;
        sndbuf-size 0;
        cram-hmac-alg sha1;
        verify-alg md5;
        shared-secret my-secret;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        disk-flushes no;
        disk-barrier no;
    }
    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
}

# resource r1 on fbo-vm-02: not ignored, not stacked
# defined at /etc/drbd.d/r1.res:1
resource r1 {
    on fbo-vm-01 {
        device /dev/drbd1 minor 1;
        disk /dev/sdc1;
        meta-disk internal;
        address ipv4 10.0.101.1:7788;
    }
    on fbo-vm-02 {
        device /dev/drbd1 minor 1;
        disk /dev/sdc1;
        meta-disk internal;
        address ipv4 10.0.101.2:7788;
    }
    options {
        on-no-data-accessible suspend-io;
    }
    net {
        protocol C;
        max-buffers 8000;
        max-epoch-size 8000;
        unplug-watermark 16;
        sndbuf-size 0;
        cram-hmac-alg sha1;
        verify-alg md5;
        shared-secret my-secret;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        disk-flushes no;
        disk-barrier no;
    }
    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
        become-primary-on both;
    }
}

Now my questions: is verify-alg similar to data-integrity-alg? And how can I tune my config?

Thanks in advance
Mac

cesarpk · Nov 23, 2014

Hi macday

These are my suggestions:

1) Read everything at this link, the DRBD tuning recommendations (including the following pages on the same topic; these recommendations should be your bible):
http://www.drbd.org/users-guide/s-throughput-tuning.html

At first glance, I would correct the points below.
Notes:
a) I know this because I have read all of the DRBD tuning recommendations.
b) I presume that you are using a DRBD 8.4.x version.
c) I don't remember the exact syntax, so please check it before applying my suggestions.

2) Where it says: sndbuf-size 512k;
It should say: sndbuf-size 0; (DRBD will then size the buffer automatically as needed)

3) Where it says: max-buffers 128k; max-epoch-size 8000;
It should say: max-buffers 8000; max-epoch-size 8000; (DRBD recommends that these values be equal)

4) Where it says: disk-barrier no; disk-flushes no; md-flushes no;
It should say: disk-flushes no; md-flushes no; (It is better to remove disk-barrier. As for "disk-flushes no;" and "md-flushes no;", these options are only good when you have a hardware RAID controller with a BBU (battery backup unit); the goal is not to lose data if the server suddenly powers off or the VM crashes.)

5) Where it says: options { ...
It should also say: cpu-mask 0; (so that DRBD uses all cores and threads of your processors)

6) Where it says: on-no-data-accessible suspend-io;
It should say: on-no-data-accessible io-error; (that is the default). If you also add "disk { on-io-error detach; }", then on a disk failure DRBD will do the reads and writes on the disk of the other node, so your VM will keep working without problems on the same node (disk access may be slower).

7) Where it says: "cram-hmac-alg" and "shared-secret"
It should say: nothing; don't use these directives. Remember that the DRBD connection should be NIC-to-NIC without a switch in the middle (you will get a faster DRBD link). Moreover, if your NICs support a configurable MTU (jumbo frames), it is better to set it to the maximum value the NIC supports.

8) Where it says: allow-two-primaries yes;
It should say: I guess it only needs to say: allow-two-primaries;

9) Where it says: disk-flushes no; disk-barrier no;
It should say: I am not sure that this is the correct syntax; please check the documentation for your version.

Miscellaneous suggestions:
10) Use a NIC (or NICs) dedicated to DRBD only.
11) Enable these handlers to get immediate alerts by email:
split-brain "/usr/lib/drbd/notify-split-brain.sh some@emailAddress.com";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh some@emailAddress.com";
(the recipient of these messages does not have to be root, but you should check that it works)

Whenever you can, use bonding in PVE with balance-rr exclusively for DRBD, preferably with two identical NICs. (Bonding can be configured in the PVE GUI; the MTU can only be set from the CLI, and only after you have verified that balance-rr works well.) You then get these advantages:
12) Double the speed of your DRBD link.
13) If one NIC of the balance-rr bond is disconnected or fails, DRBD keeps working, but with slower network communication (and obviously slower disk access for the VM).

Good luck with your DRBD configuration.

Best regards
Cesar

e100 · Nov 24, 2014

cesarpk said:
2) Don't use a password for the replication (on a NIC-to-NIC connection nobody can see the transmission); you also get faster replication and lower CPU usage.

Authentication happens only once, during the initial connection, not on every single request, so it does not impact performance.
Turning it off will not make DRBD faster but having it enabled makes your data a little safer.

I can see maybe not using it if you are NIC-to-NIC, but if you are using a switch then authentication should be enabled.

cesarpk · Nov 24, 2014

Hi e100, it is a pleasure to greet you again (you are the great master of masters).

Many thanks for the clarification. Maybe I am wrong, but as I understand it, if we have a shared key in DRBD, every transmission is encrypted. If it is not too much trouble, do you have a link that covers this topic?

About DRBD and authentication when a switch is in the middle, I agree with you; and if possible it is better to avoid the switch, so as not to add another single point of failure.

I wish you even more success than you already enjoy.

Best regards
Cesar

e100 · Nov 27, 2014

You need to specify the HMAC algorithm to enable peer authentication at all. You are strongly encouraged to use peer authentication.

Source: http://www.drbd.org/users-guide/re-drbdconf.html

I've been using DRBD since around 2005, and I remember reading in an example drbd.conf that the auth is performed only once.
The only references I can find are old example drbd.conf files; search Google for:
"Authentication is only done once" DRBD

Maybe that's not true anymore, but it sure would be silly and inefficient to perform the auth on every single request.

giner · Nov 27, 2014

Hello,

macday said:
...
resync-rate 500M;
...

Just a remark: such a big resync-rate can cause I/O starvation and big issues afterwards.

Stanislav

macday · Nov 27, 2014

Thanks Stanislav... but what value should I use on a 10GbE NIC-to-NIC connection?

giner · Nov 27, 2014

This is more about the I/O subsystem than about the network. I would choose no more than 1/3 or even 1/4 of the maximum possible I/O throughput.

cesarpk · Nov 27, 2014

@macday:
The DRBD team recommends setting the synchronization rate to about one third of the disk throughput or the NIC throughput, whichever is slower. That way, replication can use the rest of the bandwidth and disk throughput.
Please see this link:
http://www.drbd.org/users-guide/s-configure-sync-rate.html
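
As a rough worked example of that rule of thumb, here is a minimal sketch. The function name is hypothetical, and the throughput figures are assumptions for illustration (a 10GbE link taken as roughly 1250 MB/s raw, and an array assumed to sustain 300 MB/s); measure your own hardware before picking a value.

```python
def suggested_resync_rate(disk_mb_s: float, net_mb_s: float,
                          divisor: float = 3.0) -> float:
    """Return a resync rate in MB/s: the slower of disk and network
    throughput, divided by `divisor` (3 for one third, 4 for one quarter)."""
    return min(disk_mb_s, net_mb_s) / divisor


# Assumed figures, not measurements: the disk array is the bottleneck here.
one_third = suggested_resync_rate(disk_mb_s=300, net_mb_s=1250)
one_quarter = suggested_resync_rate(disk_mb_s=300, net_mb_s=1250, divisor=4.0)

print(f"resync-rate {one_third:.0f}M;    (one third)")
print(f"resync-rate {one_quarter:.0f}M;    (one quarter)")
```

With these assumed numbers the result is far below 500M, because the disks, not the 10GbE link, set the limit.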

cesarpk · Nov 27, 2014

@macday:
Ah... I forgot to tell you that DRBD 8.4.x has options that dynamically adjust the replication and synchronization speeds on the fly. The goal is to give replication more bandwidth when it needs it, and to give synchronization more bandwidth when replication does not need it. One requirement is to configure the minimum and maximum rates for synchronization.

macday · Nov 28, 2014

Thanks to all... do you have an ideal config file structure and content (DRBD common and resources)?

cesarpk · Nov 29, 2014

macday said:
Thanks to all... do you have an ideal config file structure and content (DRBD common and resources)?

It depends largely on the version you are using.

macday · Nov 29, 2014

I'm using kernel 3.10.5-pve with the DRBD tools and modules 8.4.3. Thanks.

cesarpk · Nov 29, 2014

macday said:
I'm using kernel 3.10.5-pve with the DRBD tools and modules 8.4.3. Thanks.

You should also be careful to avoid data loss, so I use this strategy to get better performance while avoiding data loss:

A) Give the PVE server plenty of RAM to assign to the VMs. Since the guest OS can then do its own read caching, there is less disk activity, and in general the VM runs faster.

B) In KVM, I use "directsync" for the virtual disk, which does no disk caching on the host. Since the VM is already caching, we don't need two cache layers (wasting RAM).

C) Use "deadline" as the I/O scheduler on your PVE servers, and "noop" in your Linux guests (an IBM recommendation).

D) Today I started testing this option (don't do it if your system has little RAM):
- Configure the PVE host OS so that it avoids swapping to disk: vm.swappiness=0

About DRBD:
How you configure the disk section in DRBD depends on whether or not your PVE server has a RAID controller in writeback mode with a BBU. Technically, it comes down to enabling or disabling these DRBD options:

disk { ...
disk-flushes; md-flushes; disk-barrier no; # for HDDs, or hardware RAID controllers without writeback mode enabled on their logical disks.
disk-flushes no; md-flushes no; disk-barrier no; # only for hardware RAID controllers with writeback mode enabled on their logical disks.
... }

My configuration is almost never the same from one server to another, because I adjust the DRBD configuration to the RAID controller configuration (if the server has one), the type of array, whether its cache is enabled for the virtual disks in use, the disk speed, and the network speed.

Here is an example DRBD configuration for 7.2k RPM SATA or near-line SAS disks (in my case connected to a hardware RAID controller in RAID-1, but with no cache enabled for this virtual disk, which in terms of access speed is almost the same as having no RAID controller):

The "global_common.conf" file:

Code:

global {
    usage-count no;
}
common {
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
    startup {
        wfc-timeout 30; degr-wfc-timeout 20; outdated-wfc-timeout 15;
    }
    options {
        cpu-mask 0;
    }
    disk {
    }
    net {
        sndbuf-size 0; unplug-watermark 16; max-buffers 8000; max-epoch-size 8000; verify-alg sha1;
    }
}

The "r1.res" file:

Code:

resource r1 {
    protocol C;
    startup {
        become-primary-on both;
    }
    disk {
        on-io-error detach; al-extents 1801; resync-rate 25M; c-plan-ahead 20; c-min-rate 25M; c-max-rate 60M; c-fill-target 128k;
        disk-flushes; md-flushes; disk-barrier no;
    }
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    volume 11 {
        device /dev/drbd11;
        disk /dev/sdb1;
        meta-disk internal;
    }
    on pve5 {
        address 10.1.1.50:7788;
    }
    on pve6 {
        address 10.1.1.51:7788;
    }
}

I hope this information helps you.

Best regards
Cesar

Edit: you may also want to see this link:
http://blogs.linbit.com/p/469/843-random-writes-faster/
