can storage migration result in bad data?

I did a storage migration from local LVM to Ceph. All went fine, but after the next reboot of the Windows server (Windows 2008) I had a lot of corrupt files.
 

Which version of Proxmox? (pveversion -v)

Which version of Ceph?

Can you post the log of the migration task?
 
proxmox-ve-2.6.32: 3.3-147 (running kernel: 3.10.0-7-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-3.10.0-7-pve: 3.10.0-27
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-3.10.0-3-pve: 3.10.0-11
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

ceph version (firefly): 0.80.8-1

LOG:

create full clone of drive ide0 (guest1:vm-174-disk-1)
2015-05-05 01:35:00.152663 7f1e43dda760 -1 did not load config file, using default settings.
transferred: 0 bytes remaining: 128849018880 bytes total: 128849018880 bytes progression: 0.00 % busy: true
transferred: 20971520 bytes remaining: 128828047360 bytes total: 128849018880 bytes progression: 0.02 % busy: true
transferred: 73400320 bytes remaining: 128775618560 bytes total: 128849018880 bytes progression: 0.06 % busy: true
transferred: 115343360 bytes remaining: 128733675520 bytes total: 128849018880 bytes progression: 0.09 % busy: true
transferred: 136314880 bytes remaining: 128712704000 bytes total: 128849018880 bytes progression: 0.11 % busy: true
transferred: 167772160 bytes remaining: 128681246720 bytes total: 128849018880 bytes progression: 0.13 % busy: true
transferred: 199229440 bytes remaining: 128649789440 bytes total: 128849018880 bytes progression: 0.15 % busy: true
transferred: 230686720 bytes remaining: 128618332160 bytes total: 128849018880 bytes progression: 0.18 % busy: true
transferred: 272629760 bytes remaining: 128576389120 bytes total: 128849018880 bytes progression: 0.21 % busy: true
......... (goes on the same way until) .......
transferred: 128604766208 bytes remaining: 244252672 bytes total: 128849018880 bytes progression: 99.81 % busy: true
transferred: 128642449408 bytes remaining: 206569472 bytes total: 128849018880 bytes progression: 99.84 % busy: true
transferred: 128730857472 bytes remaining: 118161408 bytes total: 128849018880 bytes progression: 99.91 % busy: true
transferred: 128811401216 bytes remaining: 37617664 bytes total: 128849018880 bytes progression: 99.97 % busy: true
transferred: 128838533120 bytes remaining: 10485760 bytes total: 128849018880 bytes progression: 99.99 % busy: true
transferred: 128849018880 bytes remaining: 0 bytes total: 128849018880 bytes progression: 100.00 % busy: false
Logical volume "vm-174-disk-1" successfully removed
TASK OK

Both VMs where I did the disk move have big inconsistencies and needed a filesystem check...
 
But does this mean that with any cache mode other than writethrough, a storage migration could result in data loss / corruption?
If I have any kind of caching, an unexpected shutdown can logically lead to data loss... but a storage migration as well?
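
(For context: the cache mode is a per-disk option in the VM config. The lines below are only an illustration of where it is set, with the storage/volume name taken from the migration log further up; they are not the actual config of the affected VM.)

Code:
# /etc/pve/qemu-server/174.conf (illustration only)
ide0: guest1:vm-174-disk-1,cache=writethrough
# other possible values include writeback, none, directsync, unsafe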
 
I never had problems with storage migration to an NFS target,

but users report corruption with a Ceph/RBD target.

I wonder if it's not a bug in the qemu rbd driver...

I'll soon migrate a lot of VMs to Ceph (Hammer); I'll try to post a report on the forum.
 
I wonder if this qemu (2.3) commit could fix it
http://git.qemu.org/?p=qemu.git;a=commit;h=b21c76529d55bf7bb02ac736b312f5f8bf033ea2

"block/mirror: Improve progress report

Instead of taking the total length of the block device as the block
job's length, use the number of dirty sectors. The progress is now the
number of sectors mirrored to the target block device. Note that this
may result in the job's length increasing during operation, which is
however in fact desirable.
"


Code:
+        /* s->common.offset contains the number of bytes already processed so
+         * far, cnt is the number of dirty sectors remaining and
+         * s->sectors_in_flight is the number of sectors currently being
+         * processed; together those are the current total operation length */
+        s->common.len = s->common.offset +
+                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;

as currently we finish the mirror when

$stat->{len} == $stat->{offset} && $busy eq 'false'
 
Hi,

I have made a patch to finish the migration when the flag "ready" = true.
This is a new flag since qemu 2.1, and I think it should be better than $stat->{len} == $stat->{offset} && $busy eq 'false'.
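
Roughly, the idea looks like this (a minimal sketch, not the actual qemu-server patch; wait_for_mirror_ready and $run_qmp are assumed names used only for illustration):

Code:
# Poll QMP 'query-block-jobs' and treat the mirror as finished once the
# job reports ready => true, instead of the old len/offset/busy heuristic.
sub wait_for_mirror_ready {
    my ($run_qmp, $device) = @_;   # $run_qmp: assumed QMP helper (code ref)

    while (1) {
        my $jobs = $run_qmp->('query-block-jobs');
        my ($job) = grep { $_->{device} eq $device } @$jobs;
        die "mirror job for $device disappeared\n" if !$job;

        # old check: $job->{len} == $job->{offset} && !$job->{busy}
        # new check: qemu itself reports the target as in sync
        return if $job->{ready};

        printf "transferred: %d bytes remaining: %d bytes\n",
            $job->{offset}, $job->{len} - $job->{offset};
        sleep 1;
    }
}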


Can you try to install this file
http://odisoweb1.odiso.net/qemu-server_3.4-4_amd64.deb

and restart
/etc/init.d/pvedaemon.


Then try to do a drive migration again.
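
(For example, roughly:)

Code:
# one possible way to install the test package and restart the daemon
wget http://odisoweb1.odiso.net/qemu-server_3.4-4_amd64.deb
dpkg -i qemu-server_3.4-4_amd64.deb
/etc/init.d/pvedaemon restart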
 
Thank you.
I will install some machines in my test environment and try to install the new qemu version. But first I have to reproduce the bug from the storage migration on the production cluster in the test environment. I am not 100% sure it happened every time I transferred a machine to the cluster; before migrating the 2 Windows machines I migrated 2 Linux machines, and they seem to work normally and did not want to run a filesystem check on reboot, etc...
I will start to test...
 

I forgot to say that it needs qemu 2.2, not 2.1.
So you need to update Proxmox to the latest version.
 
My tests in the test environment worked well. Still, I would like to know if this bug is fixed now; there are even newer qemu-server packages in the repository.
I am still afraid to do migrations because the last ones were a disaster.
 
I think it should be ok now.

The only thing that could break the migration is using the discard/trimming feature of qemu (if your storage supports it).
This will be fixed in qemu 2.4 (Proxmox 4).

qemu 2.4 has a lot of improvements in drive-mirror, so if you want to be really safe, you can just wait a little longer for Proxmox 4.0.
 
Hmmm, ok. Ceph has the discard/trimming feature of qemu?
I think with NAS storage over NFS it shouldn't be a problem, but I have to move around 10 VMs to Ceph, and until Proxmox 4 is stable some time will go by...
Could this also happen when restoring backups (data corruption with Ceph RBD)? That would also be a way to transfer the machines... (but not live)
 
If you enable discard in the VM disk config, yes (+virtio-scsi).
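
(For illustration, such a disk config could look like this; the storage and volume names are just examples:)

Code:
# example: discard enabled on a virtio-scsi disk, as mentioned above
scsihw: virtio-scsi-pci
scsi0: ceph-storage:vm-174-disk-1,discard=on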

As for restoring backups: backup/restore should work fine, or moving the disk with the VM shut down.


Also, drive-mirror currently allocates all blocks (including zero blocks), so the target image is not sparse. This will be fixed in qemu 2.4.
 
