Proxmox 4.2 DRBD: Node does not reconnect after reboot/connection loss

Jospeh Huber

Apr 18, 2016
Hello all,

I am new to DRBD but not new to Proxmox ;-)
We have a 3-node DRBD9 cluster on Proxmox 4.2, set up as described in the Wiki article here: https://pve.proxmox.com/wiki/DRBD9.
My versions: proxmox-ve: 4.2-64 (running kernel: 4.4.16-1-pve), drbdmanage: 0.97-1

The DRBD9 storage is available and I have two LXC containers with HA on it. HA migration and failover work as expected.
But if one node gets restarted or a connection loss happens, it never reconnects to DRBD again.
I have also set up "post-up drbdadm adjust all" in /etc/network/interfaces.
The Wiki says that a "drbdadm adjust all" or "drbdadm adjust-with-progress all" should do the job... but not for me. It does nothing, even when invoked manually.
I also did not find anything helpful here: https://www.drbd.org/en/doc/users-guide-90/s-node-failure.
If I recreate the VMs from a backup everything is fine again, but I don't think that is the way to solve the problem ;-)
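For reference, this is roughly what the workaround in /etc/network/interfaces looks like (a sketch; eth1 and the address are placeholders for the actual replication link):

```shell
# /etc/network/interfaces (fragment) -- eth1 and 10.10.10.2 are placeholders
auto eth1
iface eth1 inet static
    address 10.10.10.2
    netmask 255.255.255.0
    # try to re-establish all DRBD connections once the link is up
    post-up drbdadm adjust all || true
```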

Any ideas?

P.S. My plan, once this problem is solved, is to run some smaller test systems first, and if that works I'd like to use it in production.

Here some data:

Code:
root@vmhost2:~# drbd-overview
  0:.drbdctrl/0      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
  1:.drbdctrl/1      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
100:vm-108-disk-1/0  Conn(vmhost5,vmhost2)/C'ng(vmhost1) Prim(vmhost2)/Unkn(vmhost1)/Seco(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)
101:vm-132-disk-1/0  Conn(vmhost2,vmhost5)/C'ng(vmhost1) Seco(vmhost2)/Unkn(vmhost1)/Prim(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)

root@vmhost1:~#  drbdmanage list-nodes
+---------------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                               | State |
|---------------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    500756 |                                                               |    ok |
| vmhost2 |    510976 |    506734 |                                                               |    ok |
| vmhost5 |    510976 |    500756 |                                                               |    ok |
+---------------------------------------------------------------------------------------------------------+

A) The disconnected node:
drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost2 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone

vm-132-disk-1 role:Secondary
  disk:Outdated
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone



B) The connected node:
root@vmhost2:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost1 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Secondary
    peer-disk:UpToDate

vm-132-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Primary
    peer-disk:UpToDate
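As a quick way to spot this state across all resources, a small helper (hypothetical, not part of DRBD) can filter the `drbdsetup status` output for peers stuck in StandAlone:

```shell
# detect_standalone: read `drbdsetup status` output on stdin and print
# "<resource> <peer>" for every connection that is stuck in StandAlone
detect_standalone() {
    awk '
        /^[^ ]/                 { res = $1 }        # unindented line: resource name
        /connection:StandAlone/ { print res, $1 }   # indented peer line
    '
}

# usage on a node: drbdsetup status | detect_standalone
```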
 
If you update to the current version, this should no longer happen.
 
The problem occurred again after a reboot. The system with DRBD had been up for 65 days.

proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
...
drbdmanage: 0.97.3-1


I have to execute this on all nodes:
drbdmanage export-res "*"; drbdadm adjust all

Then it reconnects again.

=> Fortunately, this was not reproducible after several reboots...
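For anyone else stuck in this state: the two commands can be pushed to every node in one pass over SSH. A sketch, assuming the hostnames from this thread and working root SSH keys between the nodes:

```shell
#!/bin/bash
# Re-export the drbdmanage resource files and re-adjust DRBD on every node.
# The node list is an assumption based on the hostnames in this thread.
NODES="vmhost1 vmhost2 vmhost5"

for node in $NODES; do
    echo "== $node =="
    ssh -o BatchMode=yes "root@$node" 'drbdmanage export-res "*"; drbdadm adjust all'
done
```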
 
After several reboots and upgrades, I cannot get some disks of my 3-node cluster to connect and sync again.

I have tried several different approaches, but nothing helps:
Code:
drbdmanage list-nodes
+--------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                        | State |
|--------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    366727 |                                                        |    ok |
| vmhost2 |    510976 |    365858 |                                                        |    ok |
| vmhost5 |    510976 |    370917 |                                                        |    ok |
+--------------------------------------------------------------------------------------------------+

Node 1
drbdsetup status vm-103-disk-1
vm-103-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:Connecting
  vmhost5 connection:Connecting

Node 2
vm-103-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost5 role:Primary
    peer-disk:UpToDate

Node 3
vm-103-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost2 role:Secondary
    peer-disk:UpToDate

I tried a manual split-brain recovery (drbdmanage export-res "*"; drbdadm adjust all)... but nothing helps.
Any ideas?

On the stale node:
drbdadm disconnect vm-103-disk-1
drbdadm connect --discard-my-data vm-103-disk-1
On the good node:
drbdadm connect vm-103-disk-1

It also seems that I have some stale data in my configuration... and I can't fix it:
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
# executed on all three nodes ...
drbdmanage remove-resource vm-107-disk-1 --force
drbdmanage export-res "*"; drbdadm adjust-with-progress all
WARNING:root:Could not read configuration file '/etc/drbdmanaged.cfg'
Operation completed successfully
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
There is no 'on' section for hostname 'vmhost1' named in the connection-mesh
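A possible cleanup for the stale .res file (an assumption based on the error above, not an official drbdmanage procedure; keep a backup copy first) would be to move the leftover file aside and re-export:

```shell
# on each node that still shows the stale resource (vm-107-disk-1 here)
mv /var/lib/drbd.d/drbdmanage_vm-107-disk-1.res /root/   # move the leftover file aside
drbdmanage export-res "*"                                # regenerate the .res files
drbdadm adjust-with-progress all                         # re-adjust from the fresh configs
```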
 
I have the same problem on the same version (4.3): connection lost, status stuck in Connecting... Outdated... it never reconnects.
I will try to upgrade to version 4.4 and see what happens...
 
Sad to say, the problems still exist: my resource is in StandAlone even after upgrading my nodes to Proxmox 4.4-12. The DRBDmanage license is back to GPL status, so please, Proxmox, help...
One thing I can see is that drbdmanage was not updated to the latest version.
 
I registered specifically to suggest a solution to you guys.

Set up SSH root-key authentication between the nodes.
Then add hooks for the network interfaces (Ansible Jinja configs):
Interface shutdown hook:
Code:
# nano ifdown_drbd.j2
#!/bin/bash
# When the DRBD NIC goes down, stop any running repair script.

if [ "${IFACE}" == "{{ nic }}" ]; then
    killall -r '/opt/drbd' &> /dev/null
fi
exit 0

Interface up hook:
Code:
# nano ifup_drbd.j2
#!/bin/bash

# When the DRBD NIC comes up, start the repair script with a 10-minute timeout.
if [ -x /sbin/drbdadm ] && [ "${IFACE}" == "{{ nic }}" ]; then
    timeout 600 /opt/drbd &> /dev/null &
fi
exit 0

And the repair script itself:
Code:
# nano opt_drbd.j2
#!/bin/bash

### After a network restart, DRBD cluster nodes become StandAlone without any reconnect attempts.
# give the cluster time to degrade
sleep 5
# wait until the second node is reachable over SSH
while ! ssh -q -o "BatchMode=yes" -o StrictHostKeyChecking=accept-new {{ remote_ip }} exit; do
    sleep 10
done
# retry the repair until the peers report UpToDate
while ! drbdadm status | grep -q "peer-disk:UpToDate"; do
    drbdadm adjust all
    timeout 5 ssh {{ remote_ip }} drbdadm adjust all
    sleep 10
done
# clear the network failure from the cluster resource manager
sleep 10
crm resource cleanup

This covers both cases: a degraded and a non-degraded cluster after a network restart.
You are welcome!
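To wire the rendered templates into ifupdown, installing them like this should work (a sketch; the Debian-style /etc/network hook directories are assumed):

```shell
# after rendering the Jinja templates with Ansible, install the hooks:
install -m 0755 opt_drbd    /opt/drbd                     # the repair script itself
install -m 0755 ifup_drbd   /etc/network/if-up.d/drbd     # runs on interface up
install -m 0755 ifdown_drbd /etc/network/if-down.d/drbd   # runs on interface down
```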