"Move Disk" data corruption on 4.3

There is definitely some versioning weirdness.

On the 3.4 servers, where this problem did not occur, ceph 0.80.8 is installed from the enterprise.proxmox.com repository.

On the 4.3 servers, where this problem did occur, ceph 0.80.7 is installed from the Debian repository; nothing shows up from the enterprise.proxmox.com repository.

Ceph 0.80 is Firefly, which is ancient and exactly what we're trying to get rid of, so that's fundamentally scary. But why does 4.x have an older version than 3.x? Are the packages missing from the PVE enterprise repository? (For some reason I thought Proxmox was based on at least the Ceph Hammer client libraries.)

Is it even remotely safe to change the ceph client libraries out from under Proxmox? apt-cache shows a lot of version-specific dependencies.
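For reference, here's the kind of thing I mean (using pve-qemu-kvm as one example consumer; adjust for your install):

Code:
# list everything that depends on librbd1
apt-cache rdepends librbd1
# show the versioned dependencies qemu carries
apt-cache show pve-qemu-kvm | grep -i '^Depends'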
 
From my latest 4.3 using the enterprise repo:
Code:
dpkg -l | egrep -i '(ceph|rbd|rados)'
ii  ceph-common                      0.80.8-1~bpo70+1               amd64        common utilities to mount and interact with a ceph storage cluster
ii  libcephfs1                       0.80.7-2+deb8u1                amd64        Ceph distributed file system client library
ii  librados2                        0.80.8-1~bpo70+1               amd64        RADOS distributed object store client library
ii  librados2-perl                   1.0-3                          amd64        Perl bindings for librados
ii  librbd1                          0.80.8-1~bpo70+1               amd64        RADOS block device client library
ii  python-ceph                      0.80.8-1~bpo70+1               amd64        Python libraries for the Ceph distributed filesystem
Remember, 3.4 is based on wheezy, which is now under LTS support (no longer under regular Debian maintenance). Maybe this repo has been updated with newer backported versions?
 
apt-cache policy shows the PVE enterprise repo is simply not offering any ceph packages on 4.3.
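One way to confirm that wholesale is to grep the repo's package index directly (the list filename pattern is an assumption about how apt names its cached indexes):

Code:
grep -Ei '^Package:.*(ceph|rbd|rados)' \
    /var/lib/apt/lists/enterprise.proxmox.com_*_Packages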

For 3.4:

Code:
# apt-cache policy librbd1
librbd1:
Installed: 0.80.8-1~bpo70+1
Candidate: 0.80.8-1~bpo70+1
Version table:
*** 0.80.8-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
100 /var/lib/dpkg/status
0.80.6-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.80.5-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.9-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.7-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.5-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.4-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.3-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.2-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67.1-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages
0.67-1~bpo70+1 0
500 https://enterprise.proxmox.com/debian/ wheezy/pve-enterprise amd64 Packages

For 4.3:

Code:
# apt-cache policy librbd1
librbd1:
Installed: 0.80.7-2+deb8u1
Candidate: 0.80.7-2+deb8u1
Version table:
*** 0.80.7-2+deb8u1 0
500 http://ftp.us.debian.org/debian/ jessie/main amd64 Packages
100 /var/lib/dpkg/status

spirit, can you run this on yours to see if perhaps you are picking up something from the nosub repository, or at least see where exactly it is coming from?
 
From the changelog:
Code:
ceph (0.80.8-1~bpo70+1) wheezy; urgency=low

*

-- Jenkins Build <gary.lowell@inktank.com> Wed, 14 Jan 2015 10:19:38 -0800

ceph (0.80.8-1) stable; urgency=low

* New upstream release

librbd1 on 3.4 is packaged by Proxmox, while librbd1 on 4.3 comes directly from the Debian repository, so I think the different version numbers are of no concern.
 
If you use Ceph Jewel, you need to update librbd on your host node yourself!

Add the Debian Jewel ceph repo in /etc/apt/sources.list.d/ceph.list:

Code:
deb http://download.ceph.com/debian-jewel/ jessie main
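A minimal sketch of the whole sequence that implies (the key URL is the standard Ceph release key; the package selection is an assumption, upgrade whichever ceph client libraries you actually have installed):

Code:
# fetch and trust the Ceph release key
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
# point apt at the Jewel repo for jessie
echo "deb http://download.ceph.com/debian-jewel/ jessie main" > /etc/apt/sources.list.d/ceph.list
# upgrade only the client libraries, leaving the rest of PVE alone
apt-get update
apt-get install --only-upgrade librbd1 librados2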


I'm surprised that you are able to connect to your Jewel cluster at all.
Using an old librbd with a newer ceph cluster version is not supported or tested by the Ceph team.

Using a new librbd with an old ceph cluster is OK and tested.
 
Whoops, I thought spirit posted the 0.80.8 versions.

Mir, where did *you* get 0.80.8 on 4.3?

Also, I read the release notes for 0.80.8 vs. 0.80.7, and indeed none of the client-side fixes (and there weren't many) sound relevant, so I agree 0.80.7 vs. 0.80.8 is not likely to be a big deal.
 
I'm surprised that you are able to connect to your Jewel cluster at all.
Using an old librbd with a newer ceph cluster version is not supported or tested by the Ceph team.

Using a new librbd with an old ceph cluster is OK and tested.

What about using new librbd with proxmox in violation of version-specific dependencies? Is that tested?

That is not an area where I want to blaze experimental new trails.
 
What about using new librbd with proxmox in violation of version-specific dependencies? Is that tested?
The packages from Inktank are built on and for Debian jessie, the same way packages are made for jessie-backports, so this is a "supported" install.
 
Whether or not the ceph versioning is a problem, it is a separate problem and unlikely to be the cause of this issue. This is not a new situation. These two SSD-based ceph clusters are our smallest ones. Our largest, 55TB storage cluster is also running Jewel and has been under heavy load for some time (including mass migrations to and from it during the upgrade, albeit driven from 3.4) without incident. Before that, the Jewel clusters ran Hammer for a long time.

And, to reiterate, 3.4 can run 5 migrations at once with no issue. 4.3 corrupts with one. All on the same clusters.

All of the available evidence points at Proxmox 4.3 being the only factor present in all of the corruption cases.

The new-for-4.x sparse thing seems worth testing, though thus far I have not found a safe way to test it.
 
Per the Ceph developers, mismatched client/server versions should not cause any issues without supplemental stupidity (such as using features or tunables not supported by the older client), and they work very hard to preserve compatibility.

And even in the case of supplemental stupidity, errors would be the expected result, not silent data corruption.

(They did add the caveat that Firefly is so old that they no longer test it.)

By contrast, we did set up a test with two Proxmox 4.3 nodes, one running Jewel and one on the default 0.80.7. We promptly got a "function not implemented" error trying to create an image on the Firefly cluster, or trying to migrate an image from a Jewel cluster to a Firefly cluster. And live migrating any VM from the Proxmox-Jewel node to the Proxmox-Firefly node fails: "Error: start failed."

So the conservative approach appears to be the right one in situations with multiple ceph clusters running different versions, as is inherently the case in a mid-upgrade situation like ours.

The 0.80.8 thing is still bothering me as well, as the Debian jessie repositories are all showing me 0.80.7, including on a freshly-installed Debian jessie VM I spun up just now. Nothing I can find indicates that 0.80.8 can be obtained from the Debian jessie main repository.
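For reference, a quick way to double-check exactly which versions each configured repo offers:

Code:
# list every librbd1 version visible to apt, per repository
apt-cache madison librbd1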

So, right now, there are two main possible causes of the corruption:

1) A problem in Ceph 0.80.7 that does not exist in 0.80.8.
2) The sparse thing suggested by spirit.

Currently my efforts are focused on finding a repro that does not involve corrupting live production data. So far, no luck.
 
A "repro" is a way to reliably reproduce the problem so it can be further examined and resolved. (And if the repro is any good, other people will be able to follow it and reproduce the problem as well.)

As it stands, nothing I do outside of production has been able to reproduce the problem, up to and including sparing out the machine it actually happened on and setting up a VM that just saturates its I/O doing InnoDB inserts while the disk migrates back and forth between the same two ceph clusters.
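For the curious, that repro attempt was shaped roughly like this (the VMID, the storage names, and the mysqlslap knobs here are placeholders, not our exact setup):

Code:
# on the host: bounce the disk between the two ceph storages in a loop
while true; do
    qm move_disk 101 virtio0 ceph-ssd-a -delete 1
    qm move_disk 101 virtio0 ceph-ssd-b -delete 1
done

# inside the VM: keep InnoDB writing as hard as it can
mysqlslap --concurrency=16 --iterations=1000 \
    --auto-generate-sql --auto-generate-sql-load-type=write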

Whatever the problem is, it must depend on an external factor not yet identified.
 
Per the Ceph developers, mismatched client/server versions should not cause any issues without supplemental stupidity (such as using features or tunables not supported by the older client), and they work very hard to preserve compatibility.
Have you configured the tunables on your Jewel cluster to be compatible with the Firefly librbd?

http://docs.ceph.com/docs/jewel/rados/operations/crush-map/

This matters because Hammer has the CRUSH_V4 tunable and Jewel has CRUSH_V5, and these are not compatible with the Firefly librbd.
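To illustrate (a sketch only; changing tunables on a live cluster triggers data movement, so don't run this casually):

Code:
# show which tunable profile/features the cluster currently requires
ceph osd crush show-tunables
# relax the profile to a level a firefly client understands
ceph osd crush tunables firefly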

(Just to be sure: you don't use krbd, right?)
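(krbd is a per-storage flag in /etc/pve/storage.cfg, so it's quick to check; this assumes your rbd storage entries start with "rbd:" as usual:)

Code:
grep -A 6 '^rbd:' /etc/pve/storage.cfg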

And even in the case of supplemental stupidity, errors would be the expected result, not silent data corruption.
Again, qemu drive-mirror aborts if any I/O error (read or write) occurs during the migration (on the source or the destination).

By contrast, we did set up a test with two Proxmox 4.3 nodes, one running Jewel and one on the default 0.80.7. We promptly got a "function not implemented" error trying to create an image on the Firefly cluster, or trying to migrate an image from a Jewel cluster to a Firefly cluster. And live migrating any VM from the Proxmox-Jewel node to the Proxmox-Firefly node fails: "Error: start failed."
This is strange. For image creation, the command executed by Proxmox is the same.
The live migration failure seems strange too, since it's only a new qemu process being started with its own librbd version.
(But of course, you can't migrate from a new qemu version to an old qemu version.)
 
May I join your conversation...

We have a PVE cluster with 3 nodes (all of them 4.3-3) and use an NFS share based on FreeNAS.
Not long ago we discovered "Move disk" for ourselves. It was very cool and gave us the ability to migrate VMs from one storage to another without backup/restore.
As a test we moved about 10 VMs, some Windows, some Linux based... but! After 10-12 hours, 2 of them crashed: our wiki (some pages were lost) and our freeradius-dhcp server.
On both VMs there were troubles with MySQL.
On both, the cache mode was Write through.
Disk type: virtio.
 
