DRBD flaky on 3.10 kernel

e100

Renowned Member
Nov 6, 2010
Columbus, Ohio
ulbuilder.wordpress.com
At night, when vzdump runs, DRBD hits a timeout error and disconnects.

The timeout always occurs on the node running vzdump.
It seems to happen when vzdump is reading data quickly because it compresses well (300-500 MB/sec).
I run with two DRBD volumes, the one that times out is often the one we are NOT backing up from.
Sometimes both DRBD volumes will timeout at nearly the same time.
Kernel messages:
Code:
Dec 18 02:51:48 vm5 kernel: [665981.852082] drbd drbd1: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Dec 18 02:51:48 vm5 kernel: [665981.852097] drbd drbd1: asender terminated
Dec 18 02:51:48 vm5 kernel: [665981.852098] drbd drbd1: Terminating drbd_a_drbd1
Dec 18 02:51:48 vm5 kernel: [665981.852153] block drbd1: new current UUID 5BA77824A3677D65:FD6644C8DEF156FB:3A44E5292E0230D3:3A43E5292E0230D3
Dec 18 02:51:48 vm5 kernel: [665981.884100] drbd drbd1: Connection closed
Dec 18 02:51:48 vm5 kernel: [665981.884159] drbd drbd1: conn( Timeout -> Unconnected )
Dec 18 02:51:48 vm5 kernel: [665981.884161] drbd drbd1: receiver terminated

Backup log:
Code:
221: Dec 18 02:50:36 INFO: status: 75% (193546878976/257698037760), sparse 33% (86667313152), duration 1388, 373/0 MB/s
221: Dec 18 02:50:43 INFO: status: 76% (196155670528/257698037760), sparse 34% (89276104704), duration 1395, 372/0 MB/s
221: Dec 18 02:50:50 INFO: status: 77% (198786023424/257698037760), sparse 35% (91906457600), duration 1402, 375/0 MB/s
221: Dec 18 02:50:56 INFO: status: 78% (201095577600/257698037760), sparse 36% (94216011776), duration 1408, 384/0 MB/s
221: Dec 18 02:51:02 INFO: status: 79% (203586666496/257698037760), sparse 37% (96707100672), duration 1414, 415/0 MB/s
221: Dec 18 02:51:09 INFO: status: 80% (206316371968/257698037760), sparse 38% (99436806144), duration 1421, 389/0 MB/s
221: Dec 18 02:51:31 INFO: status: 81% (208765190144/257698037760), sparse 39% (101885624320), duration 1443, 111/0 MB/s
221: Dec 18 02:51:40 INFO: status: 82% (211773030400/257698037760), sparse 40% (104893464576), duration 1452, 334/0 MB/s
221: Dec 18 02:51:45 INFO: status: 83% (214275719168/257698037760), sparse 41% (107396153344), duration 1457, 500/0 MB/s
221: Dec 18 02:51:52 INFO: status: 84% (216481005568/257698037760), sparse 42% (109601439744), duration 1464, 315/0 MB/s
221: Dec 18 02:52:10 INFO: status: 85% (219055259648/257698037760), sparse 43% (112175693824), duration 1482, 143/0 MB/s
221: Dec 18 02:52:20 INFO: status: 86% (221872914432/257698037760), sparse 44% (114993348608), duration 1492, 281/0 MB/s
221: Dec 18 02:52:27 INFO: status: 87% (224285949952/257698037760), sparse 45% (117406384128), duration 1499, 344/0 MB/s
221: Dec 18 02:52:35 INFO: status: 88% (226912370688/257698037760), sparse 46% (120032804864), duration 1507, 328/0 MB/s
221: Dec 18 02:52:42 INFO: status: 89% (229424365568/257698037760), sparse 47% (122544799744), duration 1514, 358/0 MB/s
221: Dec 18 02:52:49 INFO: status: 90% (232107409408/257698037760), sparse 48% (125227843584), duration 1521, 383/0 MB/s
221: Dec 18 02:52:57 INFO: status: 91% (234941841408/257698037760), sparse 49% (128062275584), duration 1529, 354/0 MB/s
221: Dec 18 02:53:04 INFO: status: 92% (237454360576/257698037760), sparse 50% (130574794752), duration 1536, 358/0 MB/s
221: Dec 18 02:53:10 INFO: status: 93% (239703687168/257698037760), sparse 51% (132824121344), duration 1542, 374/0 MB/s
221: Dec 18 02:53:17 INFO: status: 94% (242433261568/257698037760), sparse 52% (135553695744), duration 1549, 389/0 MB/s
221: Dec 18 02:53:35 INFO: status: 95% (244858355712/257698037760), sparse 53% (137978789888), duration 1567, 134/0 MB/s
221: Dec 18 02:53:51 INFO: status: 96% (247513612288/257698037760), sparse 54% (140634046464), duration 1583, 165/0 MB/s
221: Dec 18 02:54:05 INFO: status: 97% (250172407808/257698037760), sparse 55% (143292841984), duration 1597, 189/0 MB/s
221: Dec 18 02:54:13 INFO: status: 98% (252913123328/257698037760), sparse 56% (146033557504), duration 1605, 342/0 MB/s
221: Dec 18 02:54:23 INFO: status: 99% (255221301248/257698037760), sparse 57% (148341735424), duration 1615, 230/0 MB/s
221: Dec 18 02:54:30 INFO: status: 100% (257698037760/257698037760), sparse 58% (150818467840), duration 1622, 353/0 MB/s
221: Dec 18 02:54:30 INFO: transferred 257698 MB in 1622 seconds (158 MB/s)
221: Dec 18 02:54:30 INFO: archive file size: 55.79GB
221: Dec 18 02:54:31 INFO: Finished Backup of VM 221 (00:27:04)

Both volumes are on the same Areca 1880 card but use different disks.

The exact same hardware running the 2.6.32 kernel worked fine.
We have updated ten nodes to 3.10, and six have had this issue.
Servers with 24GB RAM have had the issue most often,
64GB RAM less frequently,
128GB RAM never.

I did install the latest DRBD utilities along with the 8.4.5 kernel module.
I've also tried adjusting timeouts in DRBD.
Neither has helped.
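For reference, the timeout knobs I've been adjusting live in the net section of the resource config; a sketch with illustrative values (r0 is a placeholder resource name, and timeout/ping-timeout are given in tenths of a second):

```
resource r0 {
    net {
        timeout      60;   # 6.0 s before an unacknowledged request counts as a timeout
        ko-count     7;    # disconnect the peer after this many consecutive timeouts
        ping-int     10;   # seconds between keep-alive pings
        ping-timeout 10;   # 1.0 s to wait for a ping reply
    }
}
```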

One thing I did notice is that the servers where this occurred most often were the ones with ionice=8 set in /etc/vzdump.conf.
Changing them to ionice=7 seems to have reduced the occurrence of these timeouts but not eliminated them.
The backup disk uses CFQ; the DRBD volumes are set to deadline.
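For anyone comparing notes, the active scheduler is the bracketed name in sysfs and can be switched per device at runtime; a sketch assuming sdb is the backup disk and sdc backs the DRBD volumes (device names are placeholders):

```
# show the active scheduler (the bracketed entry):
cat /sys/block/sdb/queue/scheduler
# backup disk on cfq, DRBD backing disk on deadline:
echo cfq      > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler
```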

Anyone have suggestions?
 
At night, when vzdump runs, DRBD hits a timeout error and disconnects.

The timeout always occurs on the node running vzdump.
It seems to happen when vzdump is reading data quickly because it compresses well (300-500 MB/sec).
I run with two DRBD volumes, the one that times out is often the one we are NOT backing up from.
Sometimes both DRBD volumes will timeout at nearly the same time.

...
Servers with 24GB RAM have had the issue most often,
64GB RAM less frequently,
128GB RAM never.
Hi e100,
so the issue is related to memory (buffers?)? Strange!

Have you tried using a higher read-ahead buffer for the DRBD volumes?
Code:
echo 4096 > /sys/block/drbd0/queue/read_ahead_kb

If you reduce the vzdump bandwidth with "bwlimit", does that change anything?
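bwlimit goes in /etc/vzdump.conf and takes a value in KB/s; for example (the value here is just a starting point to test with):

```
# /etc/vzdump.conf -- cap backup read bandwidth at ~100 MB/s (value in KB/s)
bwlimit: 100000
```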

Udo
 
Hi Udo,

Thanks for the suggestions, this is becoming quite frustrating.

I think buffer/cache has something to do with it. My suspicion is that something changed in the I/O scheduler to cause this; any idea what might have changed?
I've managed to improve the situation by mounting my backup disk with barrier=1 and data=ordered, combined with ionice=7 in vzdump.conf.
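For anyone wanting to try the same mount options, the fstab entry would look something like this (device, mount point, and filesystem type are placeholders for my actual setup):

```
# /etc/fstab -- backup disk with write barriers and ordered journal data
/dev/sdb1  /mnt/backup  ext4  defaults,barrier=1,data=ordered  0  2
```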

Just made some changes to some DRBD timeouts this morning.
If that does not seem to help over the next couple of days I'll try your suggestions.
 
Hi e100

Do you have an error message for the NIC that DRBD uses saying "link down" or "down link" (I don't remember the exact text) in your kern.log or syslog file?

I have been running kernel pve-kernel-3.10.0-5-pve with DRBD 8.4.5 for a month on four servers, with a script that runs a verification of the DRBD resources once a week (verify-alg sha1), and I have never had problems with replication or with vzdump; DRBD has been solid as a rock with protocol C. I always run DRBD over two NICs bonded in round-robin and connected NIC-to-NIC; the bonded NICs are 1 Gb/s on two of the servers and 10 Gb/s on the other two, always with MTU 9000.
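The weekly verification is just a cron entry that calls drbdadm; something like this (r0 is a placeholder for the resource name, and verify-alg must already be set in the resource config):

```
# /etc/cron.d/drbd-verify -- online verify of resource r0 every Sunday at 03:00
0 3 * * 0  root  /sbin/drbdadm verify r0
```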

Moreover, my backup servers are PC workstations with PVE and NFS installed, using the deadline I/O scheduler. Hardware-wise, each backup PC (one per PVE node) has a RAID-1 of 7.2k rpm SATA HDDs, and the NIC used for all the backups from all my PVE nodes is 10 Gb/s.
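The NIC-to-NIC bond is set up in /etc/network/interfaces; roughly like this (interface names and the address are placeholders):

```
# round-robin bond dedicated to DRBD replication, jumbo frames enabled
auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    bond-slaves eth2 eth3
    bond-mode balance-rr
    bond-miimon 100
    mtu 9000
```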

I hope that helps.

Best regards
Cesar
 