"Move Disk" data corruption on 4.3

jdw

Renowned Member
Sep 22, 2011
117
4
83
Data corruption appears to have occurred while using "Move Disk" on a KVM VM running Ubuntu Xenial that is a database server. The MySQL server crashed during the migration and refused to start, citing InnoDB checksum errors in several tables, many of which had not been written to in months.

The move was from an old Ceph Firefly cluster to a new Ceph Jewel cluster (both on separate/dedicated hardware).

The data was rolled back.

But we have a lot of data to move, so this is a serious concern. "Move Disk" doesn't have the best track record in this area. Is it perhaps acting up again?

# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1
 
Yes, it's large, so I put it here: http://pastebin.com/bn4VnTf8

Moving disk images is very common for us; we have done a lot of RBD-to-RBD moves since the last instance of this issue a couple of years ago, which is why an issue like this sucks the oxygen right out of the room.
 
This is strange; I don't see any error in the task log. (If there were read/write block errors on the source or destination, the job would fail automatically.)

Did your MySQL crash occur during the migration, or at the end? (Maybe storage overload?)
 
Data corruption appears to have occurred while using "Move Disk" on a KVM VM running Ubuntu Xenial that is a database server. The MySQL server crashed during the migration and refused to start, citing InnoDB checksum errors in several tables, many of which had not been written to in months.

The move was from an old Ceph Firefly cluster to a new Ceph Jewel cluster (both on separate/dedicated hardware).

The data was rolled back.

But we have a lot of data to move, so this is a serious concern. "Move Disk" doesn't have the best track record in this area. Is it perhaps acting up again?

Isn't the simplest explanation that the corruption was caused by the crash of the MySQL server? Disk migration can cause a lot of I/O contention, which can lead to system instability.
 
Yes, it's large, so I put it here: http://pastebin.com/bn4VnTf8

Moving disk images is very common for us; we have done a lot of RBD-to-RBD moves since the last instance of this issue a couple of years ago, which is why an issue like this sucks the oxygen right out of the room.
Is qemu-guest-agent active on the VMs?
If so, then the fsfreeze command from move disk should effectively have stopped all writes to disk at the end of the move.
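For reference, a quick way to check both sides (just a sketch; 101 is a placeholder VMID):
Code:
# grep '^agent' /etc/pve/qemu-server/101.conf    # host side: "agent: 1" means the option is enabled
# systemctl status qemu-guest-agent              # inside the guest: is the agent service actually running?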
 
Is qemu-guest-agent active on the VMs?
If so, then the fsfreeze command from move disk should effectively have stopped all writes to disk at the end of the move.

fsfreeze is not used in drive mirror (only for snapshots currently), but a sync is done internally by qemu when the disks are swapped, so it shouldn't be a problem even without the guest agent.
I have done this on a lot of databases (MySQL, PostgreSQL, SQL Server) without any corruption.
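If you want to watch the mirror job yourself, it is visible from the monitor (a sketch; 101 is a placeholder VMID and the exact output depends on the qemu version):
Code:
# qm monitor 101
qm> info block-jobs    # shows the running mirror job and how much has been copied
qm> quit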
 
Yes, the old disk was removed due to "delete source."

No, the corruption was not caused by MySQL crashing, as the corrupt tables had not been written to in many weeks.

We have now found a second case on another MySQL VM where the process didn't crash until half an hour after the migration (presumably when it tried to access the corrupted table).

I'm not sure what "storage overload" refers to, but both Ceph clusters are lightly loaded and made up entirely of Intel DC S37XX SSDs, so they definitely weren't overloaded by a simple disk move.

And if disk migration can cause system instability, then the feature should be fixed or removed.
 
Also worth noting: we have two Proxmox clusters, one running 3.4 and one running 4.3. The 3.4 cluster did a lot more migrations yesterday, including up to 5 at once, and thus far they are all fine. The 4.3 cluster was doing them later, one at a time, and we have found multiple problems with corrupted data. It really seems like an issue with 4.3.
 
What are the cache settings for the Ceph pool itself?
What filesystem is used inside the VM, and what mount options are used?

Another thing: do you use asynchronous writes for the database? If so, updates could be lost.
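For example, the usual MySQL durability settings can be checked like this (a sketch; a value of 1 for both means every commit is flushed to disk):
Code:
# mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"
# mysql -e "SHOW VARIABLES LIKE 'sync_binlog'"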
 
As referenced in the thread linked above, rbd_cache is set based on the Qemu cache setting. If you are referring to some other cache setting, please be more specific as this is the only relevant setting I am aware of.
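For what it's worth, the per-disk cache mode that librbd keys off is just what is on the disk line in the VM config (a sketch; 101 is a placeholder VMID, and no explicit cache= option means the Proxmox default of no cache):
Code:
# grep -E '^(virtio|scsi|ide|sata)[0-9]+:' /etc/pve/qemu-server/101.conf    # look for a cache=... option on each disk line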

The filesystem is ext4 (rw,relatime,data=ordered).

And to reiterate, the database did not write to the suddenly-corrupted tables for weeks prior to the corruption. The MySQL process crashed because of the corruption; the corruption did not occur because the MySQL process crashed.

As there were no writes, pending or otherwise, this issue is unlikely to be cache-related.
 
Unless you are claiming MySQL backdates file timestamps, the corrupted files were not written to for weeks before they became corrupted. The modification times and sizes were identical to the backups; only the contents differed. Please stop trying to blame this on MySQL.
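For anyone who wants to reproduce that comparison, it boils down to something like this (a sketch; the paths are placeholders, and innochecksum needs mysqld stopped):
Code:
# stat -c '%y %s %n' /var/lib/mysql/mydb/mytable.ibd /backup/mysql/mydb/mytable.ibd    # mtime and size match
# md5sum /var/lib/mysql/mydb/mytable.ibd /backup/mysql/mydb/mytable.ibd                # contents differ
# innochecksum /var/lib/mysql/mydb/mytable.ibd                                         # flags the pages with bad checksums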

Spirit:

# dpkg -l | egrep -i '(ceph|rbd|rados)'
ii ceph-common 0.80.7-2+deb8u1 amd64 common utilities to mount and interact with a ceph storage cluster
ii libcephfs1 0.80.7-2+deb8u1 amd64 Ceph distributed file system client library
ii librados2 0.80.7-2+deb8u1 amd64 RADOS distributed object store client library
ii librados2-perl 1.0-3 amd64 Perl bindings for librados
ii librbd1 0.80.7-2+deb8u1 amd64 RADOS block device client library
ii python-ceph 0.80.7-2+deb8u1 amd64 Python libraries for the Ceph distributed filesystem
 
Not sure it's related, but you could try to edit
/usr/share/perl5/PVE/Storage/RBDPlugin.pm

and remove
Code:
 sparseinit => { base => 1, current => 1},
from
sub volume_has_feature

and restart the pvedaemon service.
This is a new feature in Proxmox 4 that tries to produce a sparse drive on the target storage after a move disk.
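Something along those lines, in other words (a sketch; keep a copy of the original file, since the next package update will overwrite the change):
Code:
# cp /usr/share/perl5/PVE/Storage/RBDPlugin.pm /root/RBDPlugin.pm.bak
# nano /usr/share/perl5/PVE/Storage/RBDPlugin.pm    # delete the sparseinit line inside sub volume_has_feature
# systemctl restart pvedaemon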
 
# dpkg -l | egrep -i '(ceph|rbd|rados)'
ii ceph-common 0.80.7-2+deb8u1 amd64 common utilities to mount and interact with a ceph storage cluster
ii libcephfs1 0.80.7-2+deb8u1 amd64 Ceph distributed file system client library
ii librados2 0.80.7-2+deb8u1 amd64 RADOS distributed object store client library
ii librados2-perl 1.0-3 amd64 Perl bindings for librados
ii librbd1 0.80.7-2+deb8u1 amd64 RADOS block device client library
ii python-ceph 0.80.7-2+deb8u1 amd64 Python libraries for the Ceph distributed filesystem


You haven't updated librbd to Jewel (10.2.0)? (I mean on your Proxmox 4.3 node.)
 
Unless you are claiming MySQL backdates file timestamps, the corrupted files were not written to for weeks before they became corrupted. The modification times and sizes were identical to the backups; only the contents differed. Please stop trying to blame this on MySQL.
Count to 10 and then try reading once more what I wrote.
 