"Move Disk" data corruption on 4.3

May I join your conversation.....

We have a PVE cluster with 3 nodes (all of them 4.3-3) and use an NFS share based on FreeNAS.
Not long ago we discovered "move disk".......It was very cool....it gives us the opportunity to migrate a VM from one storage to another without "backup/restore".
As a test we moved about 10 VMs.....some of them Windows, some Linux based......but!...after 10-12 hours....2 of them crashed.....they were: our wiki....some pages were lost.....and freeradius-dhcp.......
On both VMs the trouble was with MySQL.
On both, cache = Write through.
Disk type - virtio.

Is this reproducible?

Did you use the "Delete Source" option?

How many VM's did you move at a time?

Does your PVE cluster have ECC?

Do both your source and target fileservers have ECC?

Are there any network errors on the interface of any involved server or switch port?

Were you able to get "before" and "after" views of a corrupted file to see exactly what the corruption looks like? (I.e. is it all zeroes, random noise, data from some other file or image, etc.) This is one of the big things I wish I had -- besides the ability to reproduce the problem on non-production data -- that I don't yet have.
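
If anyone does still have both a backup copy and the corrupted file, a simple byte-level comparison would answer that question. A rough sketch (file names here are only placeholders):

Code:
# compare a known-good copy from backup against the corrupted file
cmp -l wiki_page.frm.bak wiki_page.frm | head -20    # offset plus the differing byte values (octal)
xxd wiki_page.frm.bak > good.hex
xxd wiki_page.frm     > bad.hex
diff good.hex bad.hex | less                         # shows whether the damage is zeroes, noise, or foreign data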
 
Is this reproducible?
In progress.....we don't want to run tests on the "hot" cluster, so we are installing another cluster for testing.
Did you use the "Delete Source" option?
Yes
How many VM's did you move at a time?
One by one......no parallel moves.
Does your PVE cluster have ECC?
Do both your source and target fileservers have ECC?
Yes to all.....there are also hardware controllers with BBU and cache.
Are there any network errors on the interface of any involved server or switch port?
No errors on network.....

It's strange that all the other VMs without MySQL moved fine.....even Win7 with D-View Cam.....and online streaming.
 
It is definitely strange that this (so far) affects only servers running MySQL.

But in our case, in addition to files that happen not to have been written, even some .frm files got corrupted and those are never written unless the database schema is manually changed, which definitely wasn't the case for us and probably wasn't the case for you.

None of my tests with moving busy MySQL servers have resulted in corruption. However, the MySQL servers in our case that got corruption were relatively idle, and the files that got corrupted (both .frm and .ibd) weren't being used at the time, so maybe heavy workload is not the right direction.

Could somebody maybe explain (or point to a reference for) how "move disk" actually works?
 
Please let go of the idea that MySQL is somehow writing to files that aren't being written to, and that this is somehow causing corruption. Your understanding of what, when and how MySQL writes is not correct.

The binlog is a special-purpose feature used for replication and has nothing to do with this.
 
I'm sorry you feel that way.

The frm files that are being corrupted are structural files that have a special purpose and are not written during ordinary operation. They do not contain table data. They do not contain indexes. They are not part of the binlog. They are not part of the innodb transaction log. They are very small files that contain only static information about the table definitions.

None of that is opinion or cleverness, it is just fact. And if the facts do not match your theory, it is the theory that has to be revised, not the facts.
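
You can verify this on any running MySQL server just by looking at timestamps: the .frm files keep the mtime of the last schema change, while the InnoDB data and log files are touched constantly. A quick sketch, assuming a default /var/lib/mysql layout:

Code:
# .frm files are only rewritten on CREATE/ALTER TABLE, so their mtimes stay old
ls -l --time-style=long-iso /var/lib/mysql/*/*.frm | sort -k6,7 | tail
# compare with the constantly written InnoDB data and log files
ls -l --time-style=long-iso /var/lib/mysql/ib_logfile* /var/lib/mysql/*/*.ibd | sort -k6,7 | tail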
 
On my side, my migrations from NFS to Ceph for VMs with MySQL were with xfs as the filesystem, cache=none, and I don't use the delete source option.
I haven't had any errors (with around 100 VMs with MySQL).
This is with Proxmox 4.3 and the latest jewel librbd, on a Ceph jewel cluster with the latest tunables.
 
Right now, I am a little suspicious of the "delete source" option, though I do not have a strong basis for that. It's mainly just that we immediately stopped using it as a safety precaution, and suddenly we can no longer reproduce the issue. Also Black Knight MHT used it and had the problem, and you did not use it and did not have the problem. But that's correlation, not causation.

The other theory I'm working on is that it may not be the disk or the write cache getting corrupted, it may be the VM's buffer cache. There is no evidence for or against that either, but it would explain how read-only files got "corrupted." Also it would be easy to test if we could just reproduce the stupid problem. :-( Hopefully we will be able to try migrating some read-intensive workloads today and see what happens.
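
One way to tell those two apart, if we ever manage to reproduce it, would be something like the following inside the guest (paths are just examples): checksum the suspect files, drop the page cache so the next read has to come from the virtual disk, and checksum again. Bad then good means it was the guest's buffer cache; bad both times means the corruption really is on disk.

Code:
md5sum /var/lib/mysql/wiki/*.frm      # read via whatever is currently cached
sync
echo 3 > /proc/sys/vm/drop_caches     # drop the guest's page cache
md5sum /var/lib/mysql/wiki/*.frm      # forces a fresh read from the virtual disk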
 
OK, people. Let's live in peace. Nobody here is "cleverer than anyone else in the universe"....we simply have some trouble with the move disk function.
And I'm trying to understand whether it is safe to use.
Does anybody know how it works with the various caches (RAID, disk, host system, VM)? And why does it happen only with MySQL? I checked my cameras....a disk with 140 GB of encrypted video.....I moved it 2 times......everything was alright.... and the VMs with MySQL crashed after 10-12 hours of running....
 
BHM, what error message did MySQL give you when it crashed, and which specific files were corrupted in your case?
 
Sorry....but I had no time to check and dig deeper into the MySQL problem....I just restored from backup ((((((((
I'll try to run more tests tomorrow on the test cluster.
 
Yes, my situation was much the same. :( Being focused on recovery, I did not gather nearly the amount of information in retrospect it would be good to have now.
 
Right now, I am a little suspicious of the "delete source" option, though I do not have a strong basis for that. It's mainly just that we immediately stopped using it as a safety precaution, and suddenly we can no longer reproduce the issue. Also Black Knight MHT used it and had the problem, and you did not use it and did not have the problem. But that's correlation, not causation.

I can't tell if it could be related, but I never use it, for safety.
The original disk is deleted when the qemu drive-mirror block job is finished, so pending writes are normally flushed.
(But maybe the zero init could be related, I'm not sure here.)
I'll try to build a firefly cluster for testing and reproduce your problem.

(BTW, if you can test with the zero init commented out in the rbd plugin, it could help to debug, if you can reproduce the problem easily.)
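
For reference, the same operation is available on the command line, and the steps are roughly these (VM id, disk and storage names are only examples; check "qm help move_disk" for the exact option names on your version):

Code:
qm move_disk 104 virtio0 ceph-pool             # copy, keep the old volume as "unused"
qm move_disk 104 virtio0 ceph-pool --delete    # copy, then delete the source volume
# Under the hood qemu runs a drive-mirror block job: the whole volume is copied while
# new guest writes are mirrored to both sides; when the job reports "ready" it is
# completed so the VM switches to the new volume, and only then can the old one be deleted.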
 
I have finished building my jewel and firefly clusters; I'm beginning to debug.

one on the default 0.80.7. We promptly got a "function not implemented" error trying to create an image on the Firefly cluster, or on trying to migrate an image from a Jewel cluster to a Firefly cluster.
Ok, I'm able to reproduce this.

The rbd command from ceph-jewel tries to create new volumes with all the new features enabled.
You can add, in /etc/ceph/ceph.conf on your Proxmox host:
"rbd default features = 1"

Then you should have the same behaviour as the rbd command from firefly.
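
If it helps anyone else who hits the "function not implemented" error, this is roughly how to apply and verify it (pool and image names are just examples):

Code:
# /etc/ceph/ceph.conf on the proxmox host, in the [global] or [client] section:
#     rbd default features = 1      # layering only, like firefly-created images
# then check what a newly created image actually gets:
rbd create testpool/feature-check --size 1024
rbd info testpool/feature-check     # the "features:" line should now show only layering
rbd rm testpool/feature-check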

And live migrating any VMs from the Proxmox-jewel node to the Proxmox-firefly node fails:

Can't reproduce, works fine for me.
I can migrate a VM from a source host with librbd jewel to a destination host with librbd firefly.
I can migrate a VM from a source host with librbd firefly to a destination host with librbd jewel.

Both with the Ceph storage on firefly, or on jewel with firefly tunables.



I have done a lot of disk moves tonight, firefly to jewel and jewel to firefly, with both librbd jewel and firefly, running a MySQL benchmark during the migration, and I can't reproduce the error.

librbd firefly was 0.80.11 from the ceph.com firefly repository.

Not sure it's related, but an OpenStack user reported an fs error today with librbd 0.80.7:
http://www.spinics.net/lists/ceph-users/msg32017.html
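
(If anyone wants to generate a comparable load for their own tests, a basic sysbench OLTP run against a throwaway database during the move is enough; database name, credentials and table size below are made up:)

Code:
sysbench --test=oltp --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=1000000 prepare
sysbench --test=oltp --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=1000000 --num-threads=8 --max-time=3600 --max-requests=0 run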
 
So far I also have not been able to reproduce the problem, although I haven't had as much time for testing as I would want. Still haven't conducted the read-workload tests I hope to try. As many of the tests involve reinstalling Proxmox over remote IPMI at a glacial pace, it's a very slow process. :-(
 
Another server that had its disks moved around the same time popped up with serious filesystem corruption today. This was an email server, not MySQL, and again the files that got corrupted were largely static system configuration files that had not been updated in months -- years in a couple of cases -- rather than the (busy) mail spool. The problem was thus not noticed until we attempted to apply a security update to the base system and got a system crash from the corruption it ran into.

It is reasonably likely that this happened when the disks were moved, but that's not a certainty. It is, to my knowledge, the only time in recent history when any disk write activity even peripherally related to those config files occurred.

Still no luck doing it on purpose. :-(
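
For anyone wanting to check their own guests for this kind of silent damage, verifying installed files against the package checksums is a quick first pass. On a Debian-based guest something like this works (debsums has to be installed; locally edited conffiles will also be flagged, so review the output):

Code:
apt-get install debsums
debsums -sa      # -s: only report failures, -a: include configuration files
# on RPM-based guests, "rpm -Va" gives a similar verification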
 
Just to follow up on this, after updating our Ceph clusters and Proxmox nodes to Jewel 10.2.3, we have moved over 100 disk images without incident or (detected) corruption, including dozens of MySQL servers.

The only other change we have made is that we also changed policy to forbid use of "Delete Source" when moving disks; all moves must now wait 24 hours before deleting disks.

It's hard not to conclude that the problem is related to one of two things:

- A bug in librbd on Ceph 0.80.7 that gets installed on Proxmox 4.1, which must be fixed in the 0.80.8 that comes with Proxmox 3.x.

- A bug in "Delete Source."

Black Knight MHT reported a similar issue using NFS, and stated that they also used Delete Source when it happened. If the causes are the same (not proven), then that would tend to implicate Delete Source rather than Ceph.
 
