DRBD disconnects at higher network loads !?

plewka

Member
Sep 28, 2009
I did some UDP packet-loss testing under higher load on all 4 links that connect our PVE cluster machines.

One of the links is normally used exclusively for DRBD.
One machine is a dual Xeon, the other a single Phenom X6, using the latest Intel Ethernet boards.
I increased the kernel buffer sizes to get below 1% lost UDP packets (using iperf with 4x500 Mbit).

As soon as I start the test I can see the stack auto-balancing. For a few seconds
the machine behaves badly (delays, CPU load, lost packets) and then easily handles the traffic for
>10 min.

We use HW RAID1 together with DRBD in a primary/secondary configuration (to prevent split brain).
We switch to primary/primary if necessary. Single resource...
When I start the test everything else continues to work OK (NFS, SMB etc.), but DRBD fails.
It keeps trying to sync for the whole duration of the transfer, but fails until the
network load returns to normal.

Any ideas how to fix this? Increase the send buffer (sndbuf-size)?

Many thanks in advance!

===============


Mar 18 13:48:08 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:08 >server< kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Mar 18 13:48:08 >server< kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Mar 18 13:48:08 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:08 >server< kernel: block drbd0: Starting asender thread (from drbd0_receiver [10517])
Mar 18 13:48:08 >server< kernel: block drbd0: data-integrity-alg: <not-used>
Mar 18 13:48:08 >server< kernel: block drbd0: drbd_sync_handshake:
Mar 18 13:48:08 >server< kernel: block drbd0: self A6831F2AF3D1F7C7:C1E9A0F43ABC78DB:A34C21E79E669ACB:4F595A3CE5C86B6F bits:61 flags:0
Mar 18 13:48:08 >server< kernel: block drbd0: peer C1E9A0F43ABC78DA:0000000000000000:A34C21E79E669ACA:4F595A3CE5C86B6F bits:0 flags:0
Mar 18 13:48:08 >server< kernel: block drbd0: uuid_compare()=1 by rule 70
Mar 18 13:48:08 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Mar 18 13:48:08 >server< kernel: block drbd0: pdsk( UpToDate -> Outdated )
Mar 18 13:48:19 >server< kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
Mar 18 13:48:19 >server< kernel: block drbd0: asender terminated
Mar 18 13:48:19 >server< kernel: block drbd0: Terminating drbd0_asender
Mar 18 13:48:19 >server< kernel: block drbd0: Connection closed
Mar 18 13:48:19 >server< kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Mar 18 13:48:19 >server< kernel: block drbd0: receiver terminated
Mar 18 13:48:19 >server< kernel: block drbd0: Restarting drbd0_receiver
Mar 18 13:48:19 >server< kernel: block drbd0: receiver (re)started
Mar 18 13:48:19 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:22 >server< kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Mar 18 13:48:22 >server< kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Mar 18 13:48:22 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:22 >server< kernel: block drbd0: Starting asender thread (from drbd0_receiver [10517])
Mar 18 13:48:22 >server< kernel: block drbd0: data-integrity-alg: <not-used>
Mar 18 13:48:22 >server< kernel: block drbd0: drbd_sync_handshake:
Mar 18 13:48:22 >server< kernel: block drbd0: self A6831F2AF3D1F7C7:C1E9A0F43ABC78DB:A34C21E79E669ACB:4F595A3CE5C86B6F bits:397 flags:0
Mar 18 13:48:22 >server< kernel: block drbd0: peer C1E9A0F43ABC78DA:0000000000000000:A34C21E79E669ACA:4F595A3CE5C86B6F bits:49 flags:0
Mar 18 13:48:22 >server< kernel: block drbd0: uuid_compare()=1 by rule 70
Mar 18 13:48:22 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Mar 18 13:48:23 >server< kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Mar 18 13:48:23 >server< kernel: block drbd0: Began resync as SyncSource (will sync 1588 KB [397 bits set]).
Mar 18 13:48:24 >server< kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 1588 K/sec)
Mar 18 13:48:24 >server< kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
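To make the timing in the log above easier to read, here is a rough Python sketch that pulls out the conn( ... ) state changes and reports how long each connection lasted between the handshake (WFReportParams) and the drop to NetworkFailure. The function names are made up for illustration, the sample only contains the conn( ) lines from the excerpt, and it assumes all entries fall on the same day.

```python
import re
from datetime import datetime

# conn( ) lines copied from the log excerpt above.
LOG = """\
Mar 18 13:48:08 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:08 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:08 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Mar 18 13:48:19 >server< kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
Mar 18 13:48:19 >server< kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Mar 18 13:48:19 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:22 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:22 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Mar 18 13:48:23 >server< kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Mar 18 13:48:24 >server< kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
"""

# Time of day, then the old and new connection state.
CONN_RE = re.compile(r"(\d\d:\d\d:\d\d) .*?conn\( (\w+) -> (\w+) \)")


def conn_transitions(log_text):
    """Return (time, old_state, new_state) for every conn( ) change."""
    result = []
    for line in log_text.splitlines():
        match = CONN_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            result.append((stamp, match.group(2), match.group(3)))
    return result


def failure_durations(transitions):
    """For each NetworkFailure, return (state it failed in, seconds since handshake)."""
    failures = []
    handshake = None
    for stamp, old, new in transitions:
        if new == "WFReportParams":
            handshake = stamp
        elif new == "NetworkFailure" and handshake is not None:
            failures.append((old, (stamp - handshake).total_seconds()))
    return failures


if __name__ == "__main__":
    for state, seconds in failure_durations(conn_transitions(LOG)):
        print(f"dropped in {state} after {seconds:.0f} s")
```

On this excerpt the only failure happens while the connection is still in WFBitMapS, i.e. before the resync itself has started.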

global { usage-count no; }
common { syncer { rate 50M; } }
resource r0 {
    protocol C;
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    startup {
        wfc-timeout 15; # wfc-timeout can be dangerous (http://forum.proxmox.com/th$
        degr-wfc-timeout 300;
        become-primary-on both;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "my-secret";
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 512k;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on rzsv1360 {
        device /dev/drbd0;
        disk /dev/sdb3;
        address 192.168.42.1:7788;
        meta-disk internal;
    }
    on rzsv0690 {
        device /dev/drbd0;
        disk /dev/sdb3;
        address 192.168.42.2:7788;
        meta-disk internal;
    }
}
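
For reference, a small Python sketch that lists which timing-related net options are not set explicitly in the config above and are therefore left at their defaults. The option names in TIMING_OPTIONS are taken from my reading of the DRBD 8.3 drbd.conf man page, not from this post, so treat that list as an assumption; the parser is a simplified helper written for this config only, not a general drbd.conf parser.

```python
# net section copied from the config above.
NET_SECTION = """\
net {
cram-hmac-alg sha1;
shared-secret "my-secret";
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 512k;
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
"""

# Assumption: timing-related net options as named in the DRBD 8.3 man page.
TIMING_OPTIONS = ("timeout", "connect-int", "ping-int", "ping-timeout", "ko-count")


def parse_net_options(text):
    """Collect 'key value;' pairs from the first net { ... } block."""
    options = {}
    inside = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("net") and line.endswith("{"):
            inside = True
        elif inside and line == "}":
            break
        elif inside and line.endswith(";"):
            key, _, value = line[:-1].partition(" ")
            options[key] = value.strip() or True  # bare flags become True
    return options


def unset_timing_options(options):
    """Return the timing options that the config does not set explicitly."""
    return [name for name in TIMING_OPTIONS if name not in options]


if __name__ == "__main__":
    opts = parse_net_options(NET_SECTION)
    print("sndbuf-size:", opts.get("sndbuf-size"))
    print("left at default:", ", ".join(unset_timing_options(opts)))
```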