Proxmox 4.2 DRBD: Node does not reconnect after reboot/connection loss

Jospeh Huber

Well-Known Member
Apr 18, 2016
Hello all,

I am new to DRBD but not new to Proxmox ;-)
We have a 3-node DRBD9 cluster setup with Proxmox 4.2, as described in the wiki article here: https://pve.proxmox.com/wiki/DRBD9.
My versions: proxmox-ve: 4.2-64 (running kernel: 4.4.16-1-pve), drbdmanage: 0.97-1

The DRBD9 storage is available and I have two LXC containers with HA on it. HA migration and failover work as expected.
But if one node gets restarted or a connection loss happens, it never reconnects to DRBD.
I have also set up "post-up drbdadm adjust all" in /etc/network/interfaces, see the sketch below.
The wiki says that a "drbdadm adjust all" or "drbdadm adjust-with-progress all" should do the job... but not for me. It does nothing, even when invoked manually.
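For reference, the relevant part of /etc/network/interfaces looks roughly like this (interface name and addresses are placeholders, not the actual values from this cluster):

Code:
# DRBD replication interface -- name and addresses are placeholders
auto eth1
iface eth1 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        # re-adjust all DRBD resources once the link is back up
        post-up drbdadm adjust all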
Also, I did not find anything here https://www.drbd.org/en/doc/users-guide-90/s-node-failure.
If I recreate the VMs from a backup everything is fine again, but I don't think that is the way to solve the problem ;-)

Any ideas?

P.S. My plan, once this problem is solved, is to run some smaller test systems on it first, and if that works, to use it in production.

Here some data:

Code:
root@vmhost2:~# drbd-overview
  0:.drbdctrl/0      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
  1:.drbdctrl/1      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
100:vm-108-disk-1/0  Conn(vmhost5,vmhost2)/C'ng(vmhost1) Prim(vmhost2)/Unkn(vmhost1)/Seco(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)
101:vm-132-disk-1/0  Conn(vmhost2,vmhost5)/C'ng(vmhost1) Seco(vmhost2)/Unkn(vmhost1)/Prim(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)

root@vmhost1:~#  drbdmanage list-nodes
+---------------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                               | State |
|---------------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    500756 |                                                               |    ok |
| vmhost2 |    510976 |    506734 |                                                               |    ok |
| vmhost5 |    510976 |    500756 |                                                               |    ok |
+---------------------------------------------------------------------------------------------------------+

A) The disconnected node (vmhost1):
root@vmhost1:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost2 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone

vm-132-disk-1 role:Secondary
  disk:Outdated
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone



B) The connected node (vmhost2):
root@vmhost2:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost1 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Secondary
    peer-disk:UpToDate

vm-132-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Primary
    peer-disk:UpToDate
 
If you update to the current version, this should no longer happen.
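To see which DRBD packages are installed and pull in the current versions, something along these lines should do (assuming the standard Proxmox repositories are configured):

Code:
# show the installed drbdmanage/drbd versions
pveversion -v | grep -i drbd
dpkg -l | grep -i drbd

# update to the current packages
apt-get update && apt-get dist-upgrade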
 
The problem occurred again after a reboot. The system with DRBD had been up for 65 days.

proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
...
drbdmanage: 0.97.3-1


I have to execute the following on all nodes:
drbdmanage export-res "*"; drbdadm adjust all

Then it reconnects again.
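Sketched as a loop over the nodes (this assumes passwordless root ssh between the hosts, which may not match your setup):

Code:
# run the same recovery step on every node of the cluster
for node in vmhost1 vmhost2 vmhost5; do
    ssh root@$node 'drbdmanage export-res "*"; drbdadm adjust all'
done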

=> Fortunately, it was not reproducible after several reboots...
 
After several reboots and upgrades I cannot get some disks of my 3-node cluster to sync and reconnect.

I have tried several different approaches but nothing helps:
Code:
drbdmanage list-nodes
+--------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                        | State |
|--------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    366727 |                                                        |    ok |
| vmhost2 |    510976 |    365858 |                                                        |    ok |
| vmhost5 |    510976 |    370917 |                                                        |    ok |
+--------------------------------------------------------------------------------------------------+

Node 1
drbdsetup status vm-103-disk-1
vm-103-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:Connecting
  vmhost5 connection:Connecting

Node 2
vm-103-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost5 role:Primary
    peer-disk:UpToDate

Node 3
vm-103-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost2 role:Secondary
    peer-disk:UpToDate

I also tried drbdmanage export-res "*"; drbdadm adjust all again, and a manual split-brain recovery... but nothing helps:

stale node: drbdadm disconnect vm-103-disk-1
stale node: drbdadm connect --discard-my-data vm-103-disk-1
good node:  drbdadm connect vm-103-disk-1

Any ideas?
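For completeness, this is only how I watch whether a resync actually starts after those commands, nothing beyond what is already shown above:

Code:
# on the node that discarded its data, watch the resync progress
watch -n2 'drbdsetup status vm-103-disk-1'
# condensed view across all resources
drbd-overview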

It seems that I also have some stale data in my configuration... I can't fix this! Even after removing the old resource and re-exporting, the error about it comes back:

Code:
# executed on all three nodes ...
drbdmanage remove-resource vm-107-disk-1 --force
drbdmanage export-res "*"; drbdadm adjust-with-progress all

WARNING:root:Could not read configuration file '/etc/drbdmanaged.cfg'
Operation completed successfully
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
There is no 'on' section for hostname 'vmhost1' named in the connection-mesh
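For comparison, a drbdmanage-generated resource file ties the connection-mesh to matching "on" sections, roughly like this (host names, addresses and device paths are placeholders, not taken from my cluster):

Code:
resource vm-107-disk-1 {
    connection-mesh {
        hosts vmhost1 vmhost2 vmhost5;   # every host listed here ...
    }
    on vmhost1 {                         # ... needs a matching "on" section
        node-id 0;
        volume 0 {
            device      minor 107;
            disk        /dev/drbdpool/vm-107-disk-1_00;
            meta-disk   internal;
        }
        address ipv4 10.10.10.1:7007;
    }
    # "on vmhost2 { ... }" and "on vmhost5 { ... }" follow the same pattern
}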
 
I have the same problem on the same version, 4.3: connection lost, status Connecting... Outdated... it never reconnects.
I will try to upgrade to version 4.4 and see what happens...
 
Sad to say, the problems still exist: my resource is in StandAlone even after upgrading my nodes to Proxmox 4.4-12. The DRBDmanage license is back to GPL, so please, Proxmox, help...
One thing I can see is that drbdmanage was not updated to the latest version.
 
