"Move Disk" data corruption on 4.3

I'm not fully up to speed on Ceph at the moment, but I wanted to share a few thoughts.

Firstly, I've used "move disk" (including delete original) a LOT, on v2, v3 and v4 clusters. I've found it very reliable, so if something like what you describe here is causing corruption, that's not what I'd call typical behavior. A lot of the disks I've moved have run smaller databases, including MySQL and mail databases. Not exactly massive examples, though.

Secondly, my examples are with storage backed by ZFS served over NFS. Not directly relevant here, but I wanted to be clear about it.

Thirdly, a thought comes to mind. What I've seen a few times in this thread is that the data showing corruption is old data, not recently written data. Perhaps that data was already corrupt before the move, and the move simply prompted MySQL and the mail databases to actually re-read it, at which point they noticed it was corrupt. Silent corruption/bit rot can happen if your storage isn't set up to counter it (which is why I use ZFS). I haven't read all the details in this thread, so you may have covered this already. If not, perhaps consider it; there's a quick way to check at the end of this post.

Fourthly, perhaps some more info about the storage is in order? Also, how big are these VMs that are failing to move disk successfully?
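
If it helps, this is roughly how I'd rule out pre-existing silent corruption on my setup. The pool name "tank" and the MySQL credentials are just placeholders for whatever you're running:

zpool scrub tank                               # re-reads every block and verifies it against its checksum
zpool status -v tank                           # non-zero CKSUM counts or an "errors:" list mean bad data was found
mysqlcheck -u root -p --all-databases --check  # inside the guest: read-only integrity check of the tables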
 
We, too, have used it hundreds of times and have had only a handful of problems.

Ceph is designed for large-scale storage, so it does replicate and regularly scrub data looking for bitrot.
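For reference, this is more or less how we confirm that scrubbing is actually running and clean (the PG id below is only an example):

ceph health detail        # any PGs flagged inconsistent by scrubbing show up here
ceph pg dump              # the per-PG dump includes the last scrub / deep-scrub timestamps
ceph pg deep-scrub 2.3f   # force an immediate deep scrub of a single PG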

"Move disk" isn't a disruptive operation, so it doesn't cause the guest to check/reread anything.

Although the data is not recently written, the corruption definitely occurs at the time of the move. In all cases we were able to go to the previous day's backup (which in one case was an hour earlier) and recover uncorrupted versions of the data.

In addition to the two possibilities I mentioned earlier, I suppose the third possibility is that there is an external factor (e.g. high CPU load, high network activity, sudden RAM pressure, cosmic rays) we have not detected that somehow interferes. There are various other reports of that general nature, e.g. VMs spontaneously dying if a server gets too severely overloaded. We don't have any evidence of such problems, but there could always be something we missed.

Most of our disk images are in the 32-128GiB range.
 
Something interesting that we discovered late yesterday...

After we finished the ceph upgrades and shuffled everything back into its proper place using Move Disk without the delete option, we gave it 24-48 hours to make sure all was working and then went through to purge all the unused disk images.

During that process, we got several timeouts on the "Remove" operation. In general, if you wait long enough, the operation completes behind the scenes and the image disappears from the VM config. In some cases it didn't, and we had to repeat the delete.

Once we had completed that process, deleting probably 60 or so images, we found that about five images were still present on the rbd list but no longer listed in the matching VM's config file.

It is a little odd that the removes take so long, as deleting one of the extraneous images with "rbd rm" takes only a few seconds.
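
For anyone else cleaning up after this, here's roughly how we cross-checked; the pool name, VM id and config path are examples for a stock PVE layout, so adjust for your own:

rbd ls -p rbd                                                                      # every image Ceph still holds
grep -ho 'vm-[0-9]*-disk-[0-9]*' /etc/pve/nodes/*/qemu-server/*.conf | sort -u     # every image a VM config references
rbd rm rbd/vm-123-disk-2                                                           # remove an orphan explicitly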

It's impossible to say for certain, but I *think* these images correspond to cases where there were timeouts and we moved on to remove another image without waiting the few minutes for the previous operation to fully complete. (Note that there is no progress indicator of any kind after a timeout.)

So perhaps, somehow, when we were moving images with "delete" enabled, a problem occurred when we moved one image during the "hidden background delete" of another image.

This would explain why we haven't seen the problem since. As we no longer "delete source" and now limit moves to one per Proxmox cluster at a time, there is no potential for overlap.
 
Do you have the task log from when the delete failed?
I'm aware of a librbd bug on old librbd versions, with messages like "rbd: error: image still has watchers", where closing the image (before the delete) did not correctly free the connections to Ceph, and the delete then failed.
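
If you hit it again, you can check whether a watcher is still attached to the image before the delete. Something like this (pool/image names are just examples, and very old rbd builds may not have "rbd status"):

rbd status rbd/vm-123-disk-1                            # lists any clients still watching the image
rbd info rbd/vm-123-disk-1 | grep block_name_prefix     # fallback: prints e.g. rbd_data.<id>
rados -p rbd listwatchers rbd_header.<id>               # substitute the <id> from the line above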
 
The delete operation (on its own, separate from a move) doesn't appear to leave anything in the task log.
 
Hi.
It's strange, but after upgrading the nodes to pve-manager/4.3-10/7230e60f (running kernel 4.4.19-1-pve), there are no problems with moving disks.
 
