At night when vzdump runs, DRBD receives a timeout error and disconnects.
The timeout always occurs on the node running vzdump.
It seems to happen when vzdump is reading data quickly because the data compresses well (300-500 MB/s).
I run two DRBD volumes; the one that times out is often the one we are NOT backing up from.
Sometimes both DRBD volumes will timeout at nearly the same time.
Kernel messages:
Code:
Dec 18 02:51:48 vm5 kernel: [665981.852082] drbd drbd1: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Dec 18 02:51:48 vm5 kernel: [665981.852097] drbd drbd1: asender terminated
Dec 18 02:51:48 vm5 kernel: [665981.852098] drbd drbd1: Terminating drbd_a_drbd1
Dec 18 02:51:48 vm5 kernel: [665981.852153] block drbd1: new current UUID 5BA77824A3677D65:FD6644C8DEF156FB:3A44E5292E0230D3:3A43E5292E0230D3
Dec 18 02:51:48 vm5 kernel: [665981.884100] drbd drbd1: Connection closed
Dec 18 02:51:48 vm5 kernel: [665981.884159] drbd drbd1: conn( Timeout -> Unconnected )
Dec 18 02:51:48 vm5 kernel: [665981.884161] drbd drbd1: receiver terminated
Backup log:
Code:
221: Dec 18 02:50:36 INFO: status: 75% (193546878976/257698037760), sparse 33% (86667313152), duration 1388, 373/0 MB/s
221: Dec 18 02:50:43 INFO: status: 76% (196155670528/257698037760), sparse 34% (89276104704), duration 1395, 372/0 MB/s
221: Dec 18 02:50:50 INFO: status: 77% (198786023424/257698037760), sparse 35% (91906457600), duration 1402, 375/0 MB/s
221: Dec 18 02:50:56 INFO: status: 78% (201095577600/257698037760), sparse 36% (94216011776), duration 1408, 384/0 MB/s
221: Dec 18 02:51:02 INFO: status: 79% (203586666496/257698037760), sparse 37% (96707100672), duration 1414, 415/0 MB/s
221: Dec 18 02:51:09 INFO: status: 80% (206316371968/257698037760), sparse 38% (99436806144), duration 1421, 389/0 MB/s
221: Dec 18 02:51:31 INFO: status: 81% (208765190144/257698037760), sparse 39% (101885624320), duration 1443, 111/0 MB/s
221: Dec 18 02:51:40 INFO: status: 82% (211773030400/257698037760), sparse 40% (104893464576), duration 1452, 334/0 MB/s
221: Dec 18 02:51:45 INFO: status: 83% (214275719168/257698037760), sparse 41% (107396153344), duration 1457, 500/0 MB/s
221: Dec 18 02:51:52 INFO: status: 84% (216481005568/257698037760), sparse 42% (109601439744), duration 1464, 315/0 MB/s
221: Dec 18 02:52:10 INFO: status: 85% (219055259648/257698037760), sparse 43% (112175693824), duration 1482, 143/0 MB/s
221: Dec 18 02:52:20 INFO: status: 86% (221872914432/257698037760), sparse 44% (114993348608), duration 1492, 281/0 MB/s
221: Dec 18 02:52:27 INFO: status: 87% (224285949952/257698037760), sparse 45% (117406384128), duration 1499, 344/0 MB/s
221: Dec 18 02:52:35 INFO: status: 88% (226912370688/257698037760), sparse 46% (120032804864), duration 1507, 328/0 MB/s
221: Dec 18 02:52:42 INFO: status: 89% (229424365568/257698037760), sparse 47% (122544799744), duration 1514, 358/0 MB/s
221: Dec 18 02:52:49 INFO: status: 90% (232107409408/257698037760), sparse 48% (125227843584), duration 1521, 383/0 MB/s
221: Dec 18 02:52:57 INFO: status: 91% (234941841408/257698037760), sparse 49% (128062275584), duration 1529, 354/0 MB/s
221: Dec 18 02:53:04 INFO: status: 92% (237454360576/257698037760), sparse 50% (130574794752), duration 1536, 358/0 MB/s
221: Dec 18 02:53:10 INFO: status: 93% (239703687168/257698037760), sparse 51% (132824121344), duration 1542, 374/0 MB/s
221: Dec 18 02:53:17 INFO: status: 94% (242433261568/257698037760), sparse 52% (135553695744), duration 1549, 389/0 MB/s
221: Dec 18 02:53:35 INFO: status: 95% (244858355712/257698037760), sparse 53% (137978789888), duration 1567, 134/0 MB/s
221: Dec 18 02:53:51 INFO: status: 96% (247513612288/257698037760), sparse 54% (140634046464), duration 1583, 165/0 MB/s
221: Dec 18 02:54:05 INFO: status: 97% (250172407808/257698037760), sparse 55% (143292841984), duration 1597, 189/0 MB/s
221: Dec 18 02:54:13 INFO: status: 98% (252913123328/257698037760), sparse 56% (146033557504), duration 1605, 342/0 MB/s
221: Dec 18 02:54:23 INFO: status: 99% (255221301248/257698037760), sparse 57% (148341735424), duration 1615, 230/0 MB/s
221: Dec 18 02:54:30 INFO: status: 100% (257698037760/257698037760), sparse 58% (150818467840), duration 1622, 353/0 MB/s
221: Dec 18 02:54:30 INFO: transferred 257698 MB in 1622 seconds (158 MB/s)
221: Dec 18 02:54:30 INFO: archive file size: 55.79GB
221: Dec 18 02:54:31 INFO: Finished Backup of VM 221 (00:27:04)
Both volumes are on the same Areca 1880 card but use different disks.
The exact same hardware worked fine on the 2.6.32 kernel.
We have updated ten nodes to 3.10; six have had this issue.
Servers with 24GB RAM have had the issue most often, servers with 64GB less frequently, and servers with 128GB never.
I installed the latest DRBD utilities along with the 8.4.5 kernel module, and I've tried adjusting timeouts in DRBD.
Neither has helped.
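For reference, the timeout-related knobs in DRBD 8.4 live in the net section of the resource config. A sketch of the kind of settings I experimented with (resource name and values are examples, not a recommendation):

Code:
# /etc/drbd.d/r0.res (excerpt) -- example values only
resource r0 {
  net {
    timeout      60;   # unit is 0.1s, so 6s before the peer is considered dead
    ko-count     10;   # knock-out count before giving up on a stuck peer
    ping-timeout 10;   # unit is 0.1s; time to wait for a keep-alive reply
  }
}

Changes like these can be applied to a running resource with drbdadm adjust r0.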
One thing I did notice is that the servers where this occurred most often were the ones with ionice=8 set in /etc/vzdump.conf.
Changing them to ionice=7 seems to have reduced the occurrence of these timeouts, but not eliminated them.
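For anyone else checking this: if I understand vzdump's scale correctly, 0-7 map to CFQ best-effort priorities (7 = lowest) and 8 drops the backup into the idle class, which may be what starves DRBD here. The change is a single line:

Code:
# /etc/vzdump.conf (excerpt)
ionice: 7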
The backup disk is using the CFQ scheduler; the DRBD volumes are set to deadline.
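The scheduler assignment can be checked and changed at runtime via sysfs (the device names here are examples, substitute your own):

Code:
# Show the active scheduler for the backup disk (active one is in brackets)
cat /sys/block/sdb/queue/scheduler
# Switch a DRBD backing device to deadline
echo deadline > /sys/block/sdc/queue/scheduler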
Anyone have suggestions?