Proxmox 4.2 DRBD: Node does not reconnect after reboot/connection loss

Jospeh Huber

Apr 18, 2016
Hello all,

I am new to DRBD but not new to Proxmox ;-)
We have a 3-node DRBD9 cluster on Proxmox 4.2, set up as described in the Wiki article here: https://pve.proxmox.com/wiki/DRBD9.
My versions: proxmox-ve: 4.2-64 (running kernel: 4.4.16-1-pve), drbdmanage: 0.97-1

The DRBD9 storage is available and I have two LXC containers with HA on it. HA migration and failover work as expected.
But if one node gets restarted or a connection loss happens, it never reconnects to DRBD again.
I have also set up "post-up drbdadm adjust all" in /etc/network/interfaces.
The Wiki says that a "drbdadm adjust all" or "drbdadm adjust-with-progress all" should do the job... but not for me. It does nothing, even when invoked manually.
I also did not find anything helpful here: https://www.drbd.org/en/doc/users-guide-90/s-node-failure.
If I recreate the VMs from a backup everything is fine again, but I don't think that is the way to solve the problem ;-)
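For reference, this is roughly what the workaround in /etc/network/interfaces looks like (a sketch; eth1 and the address are placeholders for the actual replication link):

```shell
# /etc/network/interfaces (fragment) -- eth1 and 10.10.10.2 are placeholders
auto eth1
iface eth1 inet static
    address 10.10.10.2
    netmask 255.255.255.0
    # try to re-establish all DRBD connections once the link is up
    post-up drbdadm adjust all || true
```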

Any ideas?

P.S. My plan, once this problem is solved, is to run some smaller test systems first, and if that works I'd like to use it in production.

Here some data:

Code:
root@vmhost2:~# drbd-overview
  0:.drbdctrl/0      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
  1:.drbdctrl/1      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
100:vm-108-disk-1/0  Conn(vmhost5,vmhost2)/C'ng(vmhost1) Prim(vmhost2)/Unkn(vmhost1)/Seco(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)
101:vm-132-disk-1/0  Conn(vmhost2,vmhost5)/C'ng(vmhost1) Seco(vmhost2)/Unkn(vmhost1)/Prim(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)

root@vmhost1:~#  drbdmanage list-nodes
+---------------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                               | State |
|---------------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    500756 |                                                               |    ok |
| vmhost2 |    510976 |    506734 |                                                               |    ok |
| vmhost5 |    510976 |    500756 |                                                               |    ok |
+---------------------------------------------------------------------------------------------------------+

A) The disconnected node:
drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost2 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone

vm-132-disk-1 role:Secondary
  disk:Outdated
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone



B) The connected node:
root@vmhost2:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost1 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Secondary
    peer-disk:UpToDate

vm-132-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Primary
    peer-disk:UpToDate
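As a quick way to spot this state across all resources, a small helper (hypothetical, not part of DRBD) can filter the `drbdsetup status` output for peers stuck in StandAlone:

```shell
# detect_standalone: read `drbdsetup status` output on stdin and print
# "<resource> <peer>" for every connection that is stuck in StandAlone
detect_standalone() {
    awk '
        /^[^ ]/                 { res = $1 }        # unindented line: resource name
        /connection:StandAlone/ { print res, $1 }   # indented peer line
    '
}

# usage on a node: drbdsetup status | detect_standalone
```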
 
If you update to the current version, this should no longer happen.
 
The problem occurred again after a reboot. The system with DRBD had been up for 65 days.

proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
...
drbdmanage: 0.97.3-1


I have to execute this on all nodes:
drbdmanage export-res "*"; drbdadm adjust all

Then it reconnects again.

=> Fortunately, this was not reproducible after several reboots...
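For anyone else stuck in this state: the two commands can be pushed to every node in one pass over SSH. A sketch, assuming the hostnames from this thread and working root SSH keys between the nodes:

```shell
#!/bin/bash
# Re-export the drbdmanage resource files and re-adjust DRBD on every node.
# The node list is an assumption based on the hostnames in this thread.
NODES="vmhost1 vmhost2 vmhost5"

for node in $NODES; do
    echo "== $node =="
    ssh -o BatchMode=yes "root@$node" 'drbdmanage export-res "*"; drbdadm adjust all'
done
```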
 
After several reboots and upgrades, I cannot get some disks of my 3-node cluster to connect and sync again.

I have tried several different approaches, but nothing helps:
Code:
drbdmanage list-nodes
+--------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                        | State |
|--------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    366727 |                                                        |    ok |
| vmhost2 |    510976 |    365858 |                                                        |    ok |
| vmhost5 |    510976 |    370917 |                                                        |    ok |
+--------------------------------------------------------------------------------------------------+

Node 1
drbdsetup status vm-103-disk-1
vm-103-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:Connecting
  vmhost5 connection:Connecting

Node 2
vm-103-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost5 role:Primary
    peer-disk:UpToDate

Node 3
vm-103-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost2 role:Secondary
    peer-disk:UpToDate

I tried a manual split-brain recovery (drbdmanage export-res "*"; drbdadm adjust all)... but nothing helps.
Any ideas?

On the stale node:
drbdadm disconnect vm-103-disk-1
drbdadm connect --discard-my-data vm-103-disk-1
On the good node:
drbdadm connect vm-103-disk-1

It also seems that I have some stale data in my configuration... and I can't fix it:
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
# executed on all three nodes ...
drbdmanage remove-resource vm-107-disk-1 --force
drbdmanage export-res "*"; drbdadm adjust-with-progress all
WARNING:root:Could not read configuration file '/etc/drbdmanaged.cfg'
Operation completed successfully
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
There is no 'on' section for hostname 'vmhost1' named in the connection-mesh
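A possible cleanup for the stale .res file (an assumption based on the error above, not an official drbdmanage procedure; keep a backup copy first) would be to move the leftover file aside and re-export:

```shell
# on each node that still shows the stale resource (vm-107-disk-1 here)
mv /var/lib/drbd.d/drbdmanage_vm-107-disk-1.res /root/   # move the leftover file aside
drbdmanage export-res "*"                                # regenerate the .res files
drbdadm adjust-with-progress all                         # re-adjust from the fresh configs
```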
 
I have the same problem on the same version (4.3): connection lost, status stuck in Connecting... Outdated... it never reconnects.
I will try to upgrade to version 4.4 and see what happens...
 
Sad to say, the problems still exist: my resource is in StandAlone even after upgrading my nodes to Proxmox 4.4-12. The DRBDmanage license is back to GPL status, so please, Proxmox, help...
One thing I can see is that drbdmanage was not updated to the latest version.
 
I registered specifically to suggest a solution to you guys.

Set up SSH root-key authentication between the nodes.
Then add hooks for the network interfaces (Ansible Jinja configs):
Interface shutdown hook:
Code:
# nano ifdown_drbd.j2
#!/bin/bash
# When the DRBD NIC goes down, stop any running repair script.

if [ "${IFACE}" == "{{ nic }}" ]; then
    killall -r '/opt/drbd' &> /dev/null
fi
exit 0

Interface up hook:
Code:
# nano ifup_drbd.j2
#!/bin/bash

# When the DRBD NIC comes up, start the repair script with a 10-minute timeout.
if [ -x /sbin/drbdadm ] && [ "${IFACE}" == "{{ nic }}" ]; then
    timeout 600 /opt/drbd &> /dev/null &
fi
exit 0

And the repair script itself:
Code:
# nano opt_drbd.j2
#!/bin/bash

### After a network restart, DRBD cluster nodes become StandAlone without any reconnect attempts.
# give the cluster time to degrade
sleep 5
# wait until the second node is reachable over SSH
while ! ssh -q -o "BatchMode=yes" -o StrictHostKeyChecking=accept-new {{ remote_ip }} exit; do
    sleep 10
done
# retry the repair until the peers report UpToDate
while ! drbdadm status | grep -q "peer-disk:UpToDate"; do
    drbdadm adjust all
    timeout 5 ssh {{ remote_ip }} drbdadm adjust all
    sleep 10
done
# clear the network failure from the cluster resource manager
sleep 10
crm resource cleanup

This covers both cases: a degraded and a non-degraded cluster after a network restart.
You are welcome!
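To wire the rendered templates into ifupdown, installing them like this should work (a sketch; the Debian-style /etc/network hook directories are assumed):

```shell
# after rendering the Jinja templates with Ansible, install the hooks:
install -m 0755 opt_drbd    /opt/drbd                     # the repair script itself
install -m 0755 ifup_drbd   /etc/network/if-up.d/drbd     # runs on interface up
install -m 0755 ifdown_drbd /etc/network/if-down.d/drbd   # runs on interface down
```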