Proxmox 4.2 DRBD: Node does not reconnect after reboot/connection loss

Jospeh Huber

Well-Known Member
Apr 18, 2016
Hello all,

I am new to DRBD but not new to Proxmox ;-)
We have a 3-node DRBD9 cluster setup with Proxmox 4.2, as described in the wiki article here: https://pve.proxmox.com/wiki/DRBD9.
My versions: proxmox-ve: 4.2-64 (running kernel: 4.4.16-1-pve), drbdmanage: 0.97-1

The DRBD9 storage is available and I have two LXC containers with HA on it. HA migration and failover work as expected.
But if one node gets restarted or a connection loss happens, it never reconnects to DRBD.
I have also set up "post-up drbdadm adjust all" in /etc/network/interfaces, see the sketch below.
The wiki says that a "drbdadm adjust all" or "drbdadm adjust-with-progress all" should do the job... but not for me. It does nothing, even when invoked manually.
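For reference, the relevant part of /etc/network/interfaces looks roughly like this (interface name and addresses are placeholders, not the actual values from this cluster):

Code:
# DRBD replication interface -- name and addresses are placeholders
auto eth1
iface eth1 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        # re-adjust all DRBD resources once the link is back up
        post-up drbdadm adjust all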
Also, I did not find anything here https://www.drbd.org/en/doc/users-guide-90/s-node-failure.
If I recreate the VMs from a backup everything is fine again, but I don't think that is the way to solve the problem ;-)

Any ideas?

P.S. My plan, once this problem is solved, is to run some smaller test systems on it first, and if that works, to use it in production.

Here some data:

Code:
root@vmhost2:~# drbd-overview
  0:.drbdctrl/0      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
  1:.drbdctrl/1      Connected(3*)                       Secondary(3*)                             UpTo(vmhost2)/UpTo(vmhost5,vmhost1)
100:vm-108-disk-1/0  Conn(vmhost5,vmhost2)/C'ng(vmhost1) Prim(vmhost2)/Unkn(vmhost1)/Seco(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)
101:vm-132-disk-1/0  Conn(vmhost2,vmhost5)/C'ng(vmhost1) Seco(vmhost2)/Unkn(vmhost1)/Prim(vmhost5) UpTo(vmhost2)/Inco(vmhost1)/UpTo(vmhost5)

root@vmhost1:~#  drbdmanage list-nodes
+---------------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                               | State |
|---------------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    500756 |                                                               |    ok |
| vmhost2 |    510976 |    506734 |                                                               |    ok |
| vmhost5 |    510976 |    500756 |                                                               |    ok |
+---------------------------------------------------------------------------------------------------------+

A) The disconnected node (vmhost1):
root@vmhost1:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost2 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone

vm-132-disk-1 role:Secondary
  disk:Outdated
  vmhost2 connection:StandAlone
  vmhost5 connection:StandAlone



B) The connected node (vmhost2):
root@vmhost2:~# drbdsetup status
.drbdctrl role:Secondary
  volume:0 disk:UpToDate
  volume:1 disk:UpToDate
  vmhost1 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate
  vmhost5 role:Secondary
    volume:0 peer-disk:UpToDate
    volume:1 peer-disk:UpToDate

vm-108-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Secondary
    peer-disk:UpToDate

vm-132-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:Connecting
  vmhost5 role:Primary
    peer-disk:UpToDate
 
If you update to the current version, this should no longer happen.
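To see which DRBD packages are installed and pull in the current versions, something along these lines should do (assuming the standard Proxmox repositories are configured):

Code:
# show the installed drbdmanage/drbd versions
pveversion -v | grep -i drbd
dpkg -l | grep -i drbd

# update to the current packages
apt-get update && apt-get dist-upgrade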
 
The problem occurred again after a reboot. The system with DRBD had been up for 65 days.

proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
...
drbdmanage: 0.97.3-1


I have to execute the following on all nodes:
drbdmanage export-res "*"; drbdadm adjust all

Then it reconnects again.
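Sketched as a loop over the nodes (this assumes passwordless root ssh between the hosts, which may not match your setup):

Code:
# run the same recovery step on every node of the cluster
for node in vmhost1 vmhost2 vmhost5; do
    ssh root@$node 'drbdmanage export-res "*"; drbdadm adjust all'
done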

=> Fortunately, it was not reproducible after several reboots...
 
After several reboots and upgrades I cannot get some disks of my 3-node cluster to sync and reconnect.

I have tried several different approaches but nothing helps:
Code:
drbdmanage list-nodes
+--------------------------------------------------------------------------------------------------+
| Name    | Pool Size | Pool Free |                                                        | State |
|--------------------------------------------------------------------------------------------------|
| vmhost1 |    510976 |    366727 |                                                        |    ok |
| vmhost2 |    510976 |    365858 |                                                        |    ok |
| vmhost5 |    510976 |    370917 |                                                        |    ok |
+--------------------------------------------------------------------------------------------------+

Node 1
drbdsetup status vm-103-disk-1
vm-103-disk-1 role:Secondary
  disk:Inconsistent
  vmhost2 connection:Connecting
  vmhost5 connection:Connecting

Node 2
vm-103-disk-1 role:Secondary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost5 role:Primary
    peer-disk:UpToDate

Node 3
vm-103-disk-1 role:Primary
  disk:UpToDate
  vmhost1 connection:StandAlone
  vmhost2 role:Secondary
    peer-disk:UpToDate

I also tried drbdmanage export-res "*"; drbdadm adjust all again, and a manual split-brain recovery... but nothing helps:

stale node: drbdadm disconnect vm-103-disk-1
stale node: drbdadm connect --discard-my-data vm-103-disk-1
good node:  drbdadm connect vm-103-disk-1

Any ideas?
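For completeness, this is only how I watch whether a resync actually starts after those commands, nothing beyond what is already shown above:

Code:
# on the node that discarded its data, watch the resync progress
watch -n2 'drbdsetup status vm-103-disk-1'
# condensed view across all resources
drbd-overview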

It seems that I also have some stale data in my configuration... I can't fix this! Even after removing the old resource and re-exporting, the error about it comes back:

Code:
# executed on all three nodes ...
drbdmanage remove-resource vm-107-disk-1 --force
drbdmanage export-res "*"; drbdadm adjust-with-progress all

WARNING:root:Could not read configuration file '/etc/drbdmanaged.cfg'
Operation completed successfully
/var/lib/drbd.d/drbdmanage_vm-107-disk-1.res:2: in resource vm-107-disk-1:
There is no 'on' section for hostname 'vmhost1' named in the connection-mesh
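For comparison, a drbdmanage-generated resource file ties the connection-mesh to matching "on" sections, roughly like this (host names, addresses and device paths are placeholders, not taken from my cluster):

Code:
resource vm-107-disk-1 {
    connection-mesh {
        hosts vmhost1 vmhost2 vmhost5;   # every host listed here ...
    }
    on vmhost1 {                         # ... needs a matching "on" section
        node-id 0;
        volume 0 {
            device      minor 107;
            disk        /dev/drbdpool/vm-107-disk-1_00;
            meta-disk   internal;
        }
        address ipv4 10.10.10.1:7007;
    }
    # "on vmhost2 { ... }" and "on vmhost5 { ... }" follow the same pattern
}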
 
I have the same problem on the same version, 4.3: connection lost, status Connecting... Outdated... it never reconnects.
I will try to upgrade to version 4.4 and see what happens...
 
Sad to say, the problems still exist: my resource is in StandAlone even after upgrading my nodes to Proxmox 4.4-12. The DRBDmanage license is back to GPL, so please, Proxmox, help...
One thing I can see is that drbdmanage was not updated to the latest version.
 
