I've seen this issue a few times on 1.X and for the first time on 2.0 today.
It usually happens during snapshot backups; I assume that's because of the increased IO more than anything.
The error that is the big clue appears on the node that is not doing the backup:
Code:
block drbd0: magic?? on data m: 0x3eabc3a5 c: 512 l: 97
block drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
This issue goes back some time; it's easy to find references to it on these forums.
I believe this is a bug in DRBD that was recently fixed:
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=95153072a19dfef10a2cde98c0719cf0f5d72a68
That commit mentions:
We assumed only bios with bi_idx == 0 would end up in drbd_make_request().
That is wrong.
At least device mapper, in __clone_and_map(), may submit clones only covering a partial bio, but sharing the original bvec, by adjusting bi_idx and relevant other bio members of the clone.
We used __bio_for_each_segment() in various places, even though that is documented as
* drivers should not use the __ version unless they _really_ want to
* run through the entire bio and not just pending pieces
Impact: we would send the full bio bvec, even for the clone with bi_idx > 0, which will cause data corruption on the peer (because we submit wrong data at the clone offset), and will cause a DRBD protocol error,
Would it be possible to get the DRBD code in Proxmox updated so this bug can finally be put to rest?