DRBD disconnects at higher network loads !?

plewka

I ran some UDP packet-loss tests under higher load on all 4 links that connect our PVE cluster machines.

One of the links is normally used exclusively for DRBD.
One machine is a dual Xeon, the other a single Phenom X6, using the latest Intel Ethernet boards.
I increased the kernel's buffering to get below 1% lost UDP packets (using iperf with 4x500 Mbit/s).

As soon as I start the test I can see the stack autobalancing. For a few seconds
the machine behaves badly (delays, CPU load, lost packets) and then easily handles the traffic for
>10 min.

We use HW RAID1 together with DRBD in a primary/secondary configuration (to prevent split brain).
We switch to primary/primary if necessary. Single resource...
When I start the test everything else continues to work OK (NFS, SMB etc.), but DRBD fails.
It keeps trying to sync for the whole duration of the transfer, but fails until the
network load returns to normal.

Any ideas how to fix this? Increase the send buffer (sndbuf-size)?

Many thanks in advance!

===============


Mar 18 13:48:08 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:08 >server< kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Mar 18 13:48:08 >server< kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Mar 18 13:48:08 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:08 >server< kernel: block drbd0: Starting asender thread (from drbd0_receiver [10517])
Mar 18 13:48:08 >server< kernel: block drbd0: data-integrity-alg: <not-used>
Mar 18 13:48:08 >server< kernel: block drbd0: drbd_sync_handshake:
Mar 18 13:48:08 >server< kernel: block drbd0: self A6831F2AF3D1F7C7:C1E9A0F43ABC78DB:A34C21E79E669ACB:4F595A3CE5C86B6F bits:61 flags:0
Mar 18 13:48:08 >server< kernel: block drbd0: peer C1E9A0F43ABC78DA:0000000000000000:A34C21E79E669ACA:4F595A3CE5C86B6F bits:0 flags:0
Mar 18 13:48:08 >server< kernel: block drbd0: uuid_compare()=1 by rule 70
Mar 18 13:48:08 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Mar 18 13:48:08 >server< kernel: block drbd0: pdsk( UpToDate -> Outdated )
Mar 18 13:48:19 >server< kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )
Mar 18 13:48:19 >server< kernel: block drbd0: asender terminated
Mar 18 13:48:19 >server< kernel: block drbd0: Terminating drbd0_asender
Mar 18 13:48:19 >server< kernel: block drbd0: Connection closed
Mar 18 13:48:19 >server< kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Mar 18 13:48:19 >server< kernel: block drbd0: receiver terminated
Mar 18 13:48:19 >server< kernel: block drbd0: Restarting drbd0_receiver
Mar 18 13:48:19 >server< kernel: block drbd0: receiver (re)started
Mar 18 13:48:19 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 18 13:48:22 >server< kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Mar 18 13:48:22 >server< kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Mar 18 13:48:22 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 18 13:48:22 >server< kernel: block drbd0: Starting asender thread (from drbd0_receiver [10517])
Mar 18 13:48:22 >server< kernel: block drbd0: data-integrity-alg: <not-used>
Mar 18 13:48:22 >server< kernel: block drbd0: drbd_sync_handshake:
Mar 18 13:48:22 >server< kernel: block drbd0: self A6831F2AF3D1F7C7:C1E9A0F43ABC78DB:A34C21E79E669ACB:4F595A3CE5C86B6F bits:397 flags:0
Mar 18 13:48:22 >server< kernel: block drbd0: peer C1E9A0F43ABC78DA:0000000000000000:A34C21E79E669ACA:4F595A3CE5C86B6F bits:49 flags:0
Mar 18 13:48:22 >server< kernel: block drbd0: uuid_compare()=1 by rule 70
Mar 18 13:48:22 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Mar 18 13:48:23 >server< kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
Mar 18 13:48:23 >server< kernel: block drbd0: Began resync as SyncSource (will sync 1588 KB [397 bits set]).
Mar 18 13:48:24 >server< kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 1588 K/sec)
Mar 18 13:48:24 >server< kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
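To see how long the connection survives before it drops, the conn() state changes in a log like the one above can be pulled out and timed. This is only a rough sketch: the function names are made up for illustration, the year is a placeholder (syslog does not record one), and it assumes a single DRBD device as in this setup.

```python
import re
from datetime import datetime

# Matches kernel log lines of the form shown above, e.g.
# "Mar 18 13:48:19 >server< kernel: block drbd0: ... conn( WFBitMapS -> NetworkFailure )"
CONN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*?block (?P<dev>drbd\d+): "
    r".*?conn\( (?P<old>\w+) -> (?P<new>\w+) \)"
)


def conn_transitions(lines, year=2000):
    """Return (timestamp, device, old_state, new_state) for every conn() change.

    Syslog timestamps carry no year, so `year` is an arbitrary placeholder;
    it only affects results if the log spans a year boundary.
    """
    found = []
    for line in lines:
        m = CONN_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            found.append((ts, m["dev"], m["old"], m["new"]))
    return found


def time_before_failure(transitions):
    """For each change into NetworkFailure, report the state that was left
    and how many seconds the connection had been in that state.

    Assumes all transitions belong to one device (single resource).
    """
    entered = None  # time the current state was entered
    report = []
    for ts, _dev, old, new in transitions:
        if new == "NetworkFailure" and entered is not None:
            report.append((old, (ts - entered).total_seconds()))
        entered = ts
    return report


# A few lines copied from the log above.
SAMPLE = [
    "Mar 18 13:48:08 >server< kernel: block drbd0: conn( Unconnected -> WFConnection )",
    "Mar 18 13:48:08 >server< kernel: block drbd0: conn( WFConnection -> WFReportParams )",
    "Mar 18 13:48:08 >server< kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )",
    "Mar 18 13:48:08 >server< kernel: block drbd0: pdsk( UpToDate -> Outdated )",
    "Mar 18 13:48:19 >server< kernel: block drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure )",
]

if __name__ == "__main__":
    for state, secs in time_before_failure(conn_transitions(SAMPLE)):
        print(f"left {state} after {secs:.0f} s")
```

On the excerpt this reports that the connection sat in WFBitMapS for 11 seconds before going to NetworkFailure. Running it over a longer log from the test period would show whether the drop always happens in the same state and after the same delay.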

global { usage-count no; }
common { syncer { rate 50M; } }
resource r0 {
    protocol C;
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }
    startup {
        wfc-timeout 15; # wfc-timeout can be dangerous (http://forum.proxmox.com/th$
        degr-wfc-timeout 300;
        become-primary-on both;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "my-secret";
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 512k;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on rzsv1360 {
        device /dev/drbd0;
        disk /dev/sdb3;
        address 192.168.42.1:7788;
        meta-disk internal;
    }
    on rzsv0690 {
        device /dev/drbd0;
        disk /dev/sdb3;
        address 192.168.42.2:7788;
        meta-disk internal;
    }
}
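For a feeling of how much the sndbuf-size of 512k in the config above buys, here is a back-of-the-envelope calculation of how long that buffer takes to drain at line rate. It is only a sketch under assumptions: the link speed of 1 Gbit/s is a guess (the post does not state it), "512k" is taken as 512 * 1024 bytes, and the function name is made up for illustration.

```python
def buffer_drain_time_ms(buffer_bytes, link_bits_per_s):
    """Milliseconds needed to send `buffer_bytes` at `link_bits_per_s`."""
    return buffer_bytes * 8 / link_bits_per_s * 1000


SNDBUF_BYTES = 512 * 1024    # sndbuf-size 512k (assuming k = 1024 bytes)
LINK_BPS = 1_000_000_000     # assumption: 1 Gbit/s link, not stated in the post

if __name__ == "__main__":
    ms = buffer_drain_time_ms(SNDBUF_BYTES, LINK_BPS)
    print(f"{SNDBUF_BYTES} bytes drain in about {ms:.2f} ms at {LINK_BPS} bit/s")
```

Under these assumptions the buffer holds roughly 4 ms worth of traffic at full line rate, which is small compared with the delays of several seconds seen at the start of the test. Whether a larger send buffer alone would help is therefore not obvious from this number.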
 
