Hi,
I have been testing 2.6.18 with DRBD for a few weeks. It has been stable, and live migration works like a charm.
However, today I checked the DRBD status by chance and found that one of my DRBD volumes apparently went into split brain a few days ago. I have followed the guide on pve.proxmox.com, but I clearly need to set up notifications, since this can apparently happen out of nowhere. I cannot find anything suspicious leading up to the event. The kernel logs from server #1 (p1) and server #2 (p2) are shown below.
I will also post the DRBD details to the DRBD mailing list. As far as I understand, the "Digest integrity" error can occur in rare circumstances or when hardware is defective. Normally DRBD should recover by itself, but I guess the split brain occurs when running Primary/Primary with VMs running on both servers. Since I have been testing, I am not 100% sure whether I only had VMs running on one server for this DRBD device.
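For the notification part, one low-tech option is to scan the kernel log for the two messages that showed up here. The following is only a minimal sketch: the function name `find_drbd_alerts` and the pattern table are made up for illustration, and the patterns are copied from the log lines in this post, so other DRBD versions may word them differently. DRBD's own split-brain helper (the `drbdadm split-brain` call visible in the log) is probably the better hook for real alerting.

```python
import re

# Patterns copied from the kernel messages quoted in this post.
# Assumption: other DRBD versions may phrase these messages differently.
ALERT_PATTERNS = {
    "digest": re.compile(r"block (drbd\d+): Digest integrity check FAILED"),
    "split-brain": re.compile(r"block (drbd\d+): Split-Brain detected"),
}


def find_drbd_alerts(lines):
    """Return (kind, device, line) tuples for every alarming DRBD message."""
    alerts = []
    for line in lines:
        for kind, pattern in ALERT_PATTERNS.items():
            match = pattern.search(line)
            if match:
                alerts.append((kind, match.group(1), line.strip()))
    return alerts
```

Feeding it the lines of the kernel log from a cron job and mailing any non-empty result would be one way to wire it up; the log path and the mail step depend on the local setup and are left out on purpose.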
The thing that worries me is that I am still allowed to do live migration even though DRBD is running in Primary/Unknown on both servers. I guess the only thing PVE checks is the "Shared" tick set when adding the volume.
Will DRBD status be integrated for HA?
I guess it should be possible to make a fix in /usr/sbin/qmigrate to take it into account?
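To make the idea concrete, here is a rough sketch of the kind of check I have in mind before a migration is allowed. It is not how qmigrate works today and it is written in Python only for illustration. It assumes the `/proc/drbd` status line format of the 8.x series (`N: cs:... ro:... ds:...`); the function names `parse_proc_drbd` and `safe_to_migrate` and the sample text are hypothetical.

```python
import re

# Assumed /proc/drbd status line layout (DRBD 8.x), for example:
#  2: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
STATUS_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)")


def parse_proc_drbd(text):
    """Map each DRBD minor number to its connection, role and disk state."""
    status = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            minor, cs, ro, ds = match.groups()
            status[int(minor)] = {"cs": cs, "ro": ro, "ds": ds}
    return status


def safe_to_migrate(status, minor):
    """Allow migration only for a connected, dual-primary, fully synced device."""
    dev = status.get(minor)
    if dev is None:
        return False  # unknown or unconfigured device: refuse
    return (
        dev["cs"] == "Connected"
        and dev["ro"] == "Primary/Primary"
        and dev["ds"] == "UpToDate/UpToDate"
    )


# Illustrative input only, modelled on the states described in this post.
SAMPLE = """\
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
 2: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
"""

status = parse_proc_drbd(SAMPLE)
print(safe_to_migrate(status, 0), safe_to_migrate(status, 2))
```

With a check like this, the drbd2 device from my logs (StandAlone, Primary/Unknown) would be refused, while a healthy device would still migrate.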
Best regards,
Bo
Code (server #1, p1):
Sep 1 21:02:38 p1 kernel: block drbd2: Digest integrity check FAILED.
Sep 1 21:02:38 p1 kernel: block drbd2: error receiving Data, l: 4140!
Sep 1 21:02:38 p1 kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Sep 1 21:02:38 p1 kernel: block drbd2: asender terminated
Sep 1 21:02:38 p1 kernel: block drbd2: Terminating asender thread
Sep 1 21:02:38 p1 kernel: block drbd2: Creating new current UUID
Sep 1 21:02:38 p1 kernel: block drbd2: Connection closed
Sep 1 21:02:38 p1 kernel: block drbd2: conn( ProtocolError -> Unconnected )
Sep 1 21:02:38 p1 kernel: block drbd2: receiver terminated
Sep 1 21:02:38 p1 kernel: block drbd2: Restarting receiver thread
Sep 1 21:02:38 p1 kernel: block drbd2: receiver (re)started
Sep 1 21:02:38 p1 kernel: block drbd2: conn( Unconnected -> WFConnection )
Sep 1 21:02:38 p1 kernel: block drbd2: Handshake successful: Agreed network protocol version 91
Sep 1 21:02:38 p1 kernel: block drbd2: Peer authenticated using 20 bytes of 'sha1' HMAC
Sep 1 21:02:38 p1 kernel: block drbd2: conn( WFConnection -> WFReportParams )
Sep 1 21:02:38 p1 kernel: block drbd2: Starting asender thread (from drbd2_receiver [8949])
Sep 1 21:02:38 p1 kernel: block drbd2: data-integrity-alg: sha1
Sep 1 21:02:38 p1 kernel: block drbd2: drbd_sync_handshake:
Sep 1 21:02:38 p1 kernel: block drbd2: self FAD15BFCF355A9C5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:0 flags:0
Sep 1 21:02:38 p1 kernel: block drbd2: peer 74C295FE5A299DD5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:7 flags:0
Sep 1 21:02:38 p1 kernel: block drbd2: uuid_compare()=100 by rule 90
Sep 1 21:02:38 p1 kernel: block drbd2: Split-Brain detected, dropping connection!
Sep 1 21:02:38 p1 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
Sep 1 21:02:39 p1 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
Sep 1 21:02:39 p1 kernel: block drbd2: conn( WFReportParams -> Disconnecting )
Sep 1 21:02:39 p1 kernel: block drbd2: error receiving ReportState, l: 4!
Sep 1 21:02:39 p1 kernel: block drbd2: asender terminated
Sep 1 21:02:39 p1 kernel: block drbd2: Terminating asender thread
Sep 1 21:02:39 p1 kernel: block drbd2: Connection closed
Sep 1 21:02:39 p1 kernel: block drbd2: conn( Disconnecting -> StandAlone )
Sep 1 21:02:39 p1 kernel: block drbd2: receiver terminated
Sep 1 21:02:39 p1 kernel: block drbd2: Terminating receiver thread
Code (server #2, p2):
Sep 1 21:02:38 p2 kernel: block drbd2: sock was shut down by peer
Sep 1 21:02:38 p2 kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Sep 1 21:02:38 p2 kernel: block drbd2: short read expecting header on sock: r=0
Sep 1 21:02:38 p2 kernel: block drbd2: meta connection shut down by peer.
Sep 1 21:02:38 p2 kernel: block drbd2: asender terminated
Sep 1 21:02:38 p2 kernel: block drbd2: Terminating asender thread
Sep 1 21:02:38 p2 kernel: block drbd2: Creating new current UUID
Sep 1 21:02:38 p2 kernel: block drbd2: Connection closed
Sep 1 21:02:38 p2 kernel: block drbd2: conn( BrokenPipe -> Unconnected )
Sep 1 21:02:38 p2 kernel: block drbd2: receiver terminated
Sep 1 21:02:38 p2 kernel: block drbd2: Restarting receiver thread
Sep 1 21:02:38 p2 kernel: block drbd2: receiver (re)started
Sep 1 21:02:38 p2 kernel: block drbd2: conn( Unconnected -> WFConnection )
Sep 1 21:02:38 p2 kernel: block drbd2: Handshake successful: Agreed network protocol version 91
Sep 1 21:02:38 p2 kernel: block drbd2: Peer authenticated using 20 bytes of 'sha1' HMAC
Sep 1 21:02:38 p2 kernel: block drbd2: conn( WFConnection -> WFReportParams )
Sep 1 21:02:38 p2 kernel: block drbd2: Starting asender thread (from drbd2_receiver [8906])
Sep 1 21:02:38 p2 kernel: block drbd2: data-integrity-alg: sha1
Sep 1 21:02:38 p2 kernel: block drbd2: drbd_sync_handshake:
Sep 1 21:02:38 p2 kernel: block drbd2: self 74C295FE5A299DD5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:7 flags:0
Sep 1 21:02:38 p2 kernel: block drbd2: peer FAD15BFCF355A9C5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:0 flags:0
Sep 1 21:02:38 p2 kernel: block drbd2: uuid_compare()=100 by rule 90
Sep 1 21:02:38 p2 kernel: block drbd2: Split-Brain detected, dropping connection!
Sep 1 21:02:38 p2 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
Sep 1 21:02:39 p2 kernel: block drbd2: meta connection shut down by peer.
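Looking at the `self` and `peer` lines of the handshake, only the first UUID field differs between the two nodes, which fits the "Creating new current UUID" message logged on both sides just before. The small sketch below extracts those fields and checks exactly that. It is my reading of the log, not a reimplementation of DRBD's `uuid_compare()`; the function names `parse_uuid_lines` and `diverged` are invented for illustration.

```python
import re

# Matches the "self ..." / "peer ..." handshake lines shown in the logs above.
UUID_LINE = re.compile(
    r"block (drbd\d+): (self|peer) ((?:[0-9A-F]+:){3}[0-9A-F]+) bits:(\d+)"
)


def parse_uuid_lines(lines):
    """Return {'self': [four UUIDs], 'peer': [four UUIDs]} from handshake lines."""
    result = {}
    for line in lines:
        match = UUID_LINE.search(line)
        if match:
            result[match.group(2)] = match.group(3).split(":")
    return result


def diverged(uuids):
    """True when the first UUIDs differ while all remaining UUIDs are identical."""
    if "self" not in uuids or "peer" not in uuids:
        return False
    self_u, peer_u = uuids["self"], uuids["peer"]
    return self_u[0] != peer_u[0] and self_u[1:] == peer_u[1:]


LOG = [
    "Sep 1 21:02:38 p1 kernel: block drbd2: self FAD15BFCF355A9C5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:0 flags:0",
    "Sep 1 21:02:38 p1 kernel: block drbd2: peer 74C295FE5A299DD5:1D714E0E2AF45CA3:443B58EFC77E89EF:81102D203587BE84 bits:7 flags:0",
]

print(diverged(parse_uuid_lines(LOG)))
```

So both nodes started from the same data generation and then each generated its own new UUID while disconnected, which is presumably why DRBD refuses to pick a winner by itself.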