DRBD Diskless after 48 hours

Kamyk

New Member
Oct 16, 2013
Hi all,

I have a problem with my Proxmox cluster and the DRBD replication between the nodes. I set everything up and at first it all works fine: the cluster runs perfectly, migration and backups of all VMs work, and the VM storage is on LVM on top of DRBD, with each DRBD volume group being 1 TB. But after some time (roughly 48 hours) one of the DRBD devices goes Diskless:

Code:
0:r0  Connected Primary/Primary UpToDate/Diskless C r----- lvm-pv: drbdvg0 931.29g 861.00g 
1:r1  Connected Primary/Primary UpToDate/UpToDate C r----- lvm-pv: drbdvg1 931.29g 0g

In dmesg I see:

Code:
block drbd0: Starting worker thread (from cqueue [2626])
block drbd0: open("/dev/sdb1") failed with -16
block drbd0: drbd_bm_resize called with capacity == 0
block drbd0: worker terminated
block drbd0: Terminating worker thread
block drbd1: Starting worker thread (from cqueue [2626])
block drbd1: disk( Diskless -> Attaching ) 
block drbd1: Found 4 transactions (70 active extents) in activity log.
block drbd1: Method to ensure write ordering: barrier
block drbd1: max BIO size = 131072
block drbd1: drbd_bm_resize called with capacity == 1953064672
block drbd1: resync bitmap: bits=244133084 words=3814580 pages=7451
block drbd1: size = 931 GB (976532336 KB)
block drbd1: bitmap READ of 7451 pages took 37 jiffies
block drbd1: recounting of set bits took additional 36 jiffies
block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> UpToDate ) 
block drbd1: attached to UUIDs 70A8363B4F73C19E:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7
block drbd0: Starting worker thread (from cqueue [2626])
block drbd0: conn( StandAlone -> Unconnected ) 
block drbd0: Starting receiver thread (from drbd0_worker [2661])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection ) 
block drbd1: conn( StandAlone -> Unconnected ) 
block drbd1: Starting receiver thread (from drbd1_worker [2649])
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection ) 
block drbd0: Handshake successful: Agreed network protocol version 96
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams ) 
block drbd0: Starting asender thread (from drbd0_receiver [2670])
block drbd0: data-integrity-alg: <not-used>
block drbd0: max BIO size = 4096
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) 
block drbd1: Handshake successful: Agreed network protocol version 96
block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd1: conn( WFConnection -> WFReportParams ) 
block drbd1: Starting asender thread (from drbd1_receiver [2674])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self 70A8363B4F73C19E:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7 bits:0 flags:0
block drbd1: peer 7D727C5A8840067D:70A8363B4F73C19F:43AC9F762F8AF4F7:43AB9F762F8AF4F7 bits:0 flags:0
block drbd1: uuid_compare()=-1 by rule 50
block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate ) 
block drbd0: role( Secondary -> Primary ) 
block drbd1: role( Secondary -> Primary ) 
DLM (built Oct 14 2013 08:10:28) installed
block drbd1: conn( WFBitMapT -> WFSyncUUID ) 
block drbd1: updated sync uuid 70A9363B4F73C19F:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) 
block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd1: updated UUIDs 7D727C5A8840067D:0000000000000000:70A9363B4F73C19F:70A8363B4F73C19F
block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
block drbd1: bitmap WRITE of 7451 pages took 20 jiffies
block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
ip_tables: (C) 2000-2006 Netfilter Core Team

My drbd configuration looks like this:

- global_common.conf
Code:
global {
  usage-count yes;
  # minor-count dialog-refresh disable-ip-verification
}

common {
  protocol C;

  handlers {
    # The following 3 handlers were disabled due to #576511.
    # Please check the DRBD manual and enable them, if they make sense in your setup.
    # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";

    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
    # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
  }

  startup {
    # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    wfc-timeout 15;
    degr-wfc-timeout 15;
    become-primary-on both;
  }

  disk {
    # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
    # no-disk-drain no-md-flushes max-bio-bvecs
  }

  net {
    # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
    # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
    # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
    cram-hmac-alg sha1;
    shared-secret "my-secret";
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }

  syncer {
    # rate after al-extents use-rle cpu-mask verify-alg csums-alg
    rate 1000M;
  }
}

And the resource looks like this: r0.res
Code:
# This is the resource used for the shared GFS2 partition.
resource r0 {
  # This is the block device path.
  device    /dev/drbd0;

  # We'll use the normal internal metadisk (takes about 32MB/TB)
  meta-disk internal;

  # This is the `uname -n` of the first node
  on node1 {
    # The 'address' has to be the IP, not a hostname. This is the
    # node's SN (bond1) IP. The port number must be unique among
    # resources.
    address   10.0.0.12:7788;

    # This is the block device backing this resource on this node.
    disk    /dev/sdb1;
  }
  # Now the same information again for the second node.
  on node2 {
    address   10.0.0.13:7788;
    disk    /dev/sdb1;
  }
}
I have tried a lot of things, but I'm out of ideas now. What happened, and why? Do you have any suggestions? Could a disk be broken on one of the servers?

I would be very grateful for any help and answers.

Best,
Rafal
 
Hi Rafal,
I have only ever seen DRBD volumes disconnect like this with buggy network connections (drivers/switches).
Do you have a NIC dedicated to DRBD, or is it shared with other services?
If it is DRBD-only, have you tested a crossover cable between the nodes?

What kind of network do you have? A syncer rate of 1000M is very, very high.
I use the following syncer entry for 10GbE:
Code:
        syncer {
                rate 150000;
                verify-alg sha1;
        }
Perhaps you should also try the verify-alg...
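Once verify-alg is set you can also start an online verify by hand, for example (assuming your resource is still called r0):
Code:
# start an online verify of resource r0; progress shows up in /proc/drbd,
# any out-of-sync blocks are reported in the kernel log
drbdadm verify r0
watch cat /proc/drbd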

Udo
 
I have experienced problems with DRBD when using bonding over 3 links in round-robin mode: 2 ports were on the same quad-port network card and 1 on another card (I wanted the bond to survive if one of the cards failed). With a high sync rate it happens that packets sometimes arrive in a different order than they were sent, and there is a kernel option to compensate for this: /proc/sys/net/ipv4/tcp_reordering. I set it to the maximum (127); I am not sure exactly what the value means, but the bigger it is, the more reordering it seems to be able to compensate for. I am not sure whether this applies to your setup, it is just something that happened in mine. For example, whenever I ran a verify/check on DRBD, the system eventually crashed completely because of this reordering issue.
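For completeness, the setting can be changed at runtime and made persistent roughly like this (a sketch only; 127 is just the value that happened to work here, not a general recommendation):
Code:
# change at runtime
sysctl -w net.ipv4.tcp_reordering=127
# make it persistent across reboots
echo "net.ipv4.tcp_reordering = 127" >> /etc/sysctl.conf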
 
I'm replying to this old thread because I ran into the same issue when setting up 10 partitions for DRBD. After many hours, the solution here was to reject each backing partition in /etc/lvm/lvm.conf like this:

/etc/lvm/lvm.conf

filter = [ "r|/dev/sda1|","r|/dev/sda2|","r|/dev/sda3|","r|/dev/sda4|","r|/dev/sda5|","r|/dev/sda6|","r|/dev/sda7|", "r|/dev/sda8|","r|/dev/sda9|","r|/dev/sda10|", "r|/dev/sda11|","r|/dev/sda12|","r|/dev/disk/|", "r|/dev/block/|", "a/.*/" ]



Prior to that I had a wildcard set for /dev/sda, and that did not work the way I had it set up.

The wildcard issue was intermittent: on reboot, random resources would show up Diskless in /proc/drbd.
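A related variant, often suggested for LVM-on-DRBD setups, is to accept only the DRBD devices and reject everything else, so LVM never opens the backing partition before DRBD has attached it (which would also match the open("/dev/sdb1") failed with -16, i.e. EBUSY, in the first post). This is only a sketch; adjust the pattern to your own devices:
Code:
# /etc/lvm/lvm.conf -- example filter, not the exact one used above
filter = [ "a|/dev/drbd.*|", "r|.*|" ]
Run pvscan afterwards to confirm that only the DRBD physical volumes are detected.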
 
