storage migration virtio failed

liska_

Hi,
I am now in the process of migrating some machines from Proxmox 3.2 to Proxmox 3.3. With one machine I get the following error after reaching 100%:
qmp command 'block-job-complete' failed - The active block job for device 'drive-virtio0' cannot be completed
I have successfully done live migration to second server but I am not able to finish moving image from nfs storage using virtio to another nfs storage.
With another VM I solved it by using the command qm rescan --vmid 108, but that was a slightly different case.
Do you have any hint, how to solve this?
Thanks a lot in advance
 
"move disk" does not always works live. power off your VM and then you should be able to move the disks.
 
Hi,

RBD --> RBD


I get around a 60% success rate.

They usually end with "TASK ERROR: storage migration failed: mirroring error: VM 101 qmp command 'block-job-complete' failed - The active block job for device 'drive-scsi1' cannot be completed" or similar for virtio devices.
 
Thanks,

It is Ceph Firefly 0.80.5

And yes, this is with cache=writeback
 
Hi spirit,

It may be relevant: I have tested this on around 20 VMs on our test cluster and could not repeat the problem. But... VMs in our test cluster are much less busy than VMs on our production cluster.

The guests that seem to fail migration are all high-IO machines, e.g. file servers, Exchange servers, SQL servers. They are running Windows Server 2008 with either virtio-blk or virtio-scsi disks.
 
Hi,
I am now in the process of migrating some machines from Proxmox 3.2 to Proxmox 3.3. With one machine I get the following error after reaching 100%:
qmp command 'block-job-complete' failed - The active block job for device 'drive-virtio0' cannot be completed


This happens to us at least 50% of the time as well. There seem to be 2-3 different failure modes that vary in the details, but the general pattern is always the same: it goes through the entire process and then fails at the very end. Here is one example:


transferred: 8589934592 bytes remaining: 0 bytes total: 8589934592 bytes progression: 100.00 %
2014-10-01 11:20:57.270423 7f89cc03a760 -1 did not load config file, using default settings.
Removing all snapshots: 100% complete...done.
2014-10-01 11:20:57.311938 7fa55d8be760 -1 did not load config file, using default settings.
image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
TASK ERROR: storage migration failed: mirroring error: VM 101 qmp command 'block-job-complete' failed - The active block job for device 'drive-virtio0' cannot be completed

Sometimes it also fails but with much less detail, only:


TASK ERROR: storage migration failed: mirroring error: VM 140 qmp command 'block-job-complete' failed - The active block job for device 'drive-virtio0' cannot be completed

(Exactly as reported by the original poster.)


This is also with Ceph Firefly. It is particularly frustrating for large images, as they take a very long time to move and only fail at the very end. This is definitely something that should be addressed, as it is a terrible waste of time and network bandwidth.

That message about "did not load config file, using default settings" often appears at the beginning of the storage migration as well, but that does not appear to be a predictor of success/failure.


It does seem like the busier a virtual disk is, the more likely the problem is to occur. But since we value our data, we do not use writeback caching on any volume that cannot be trivially rebuilt in the event of a crash, so it definitely also occurs with cache set to "none." (There are also some pretty dire warnings floating around out there to the effect of "live migration + writeback cache = data corruption.")
 
Speaking only for myself, we do not.


My understanding of the problem is that the command "block-job-complete" hangs, or doesn't return an ACK,
but has correctly switched the disks.

So Proxmox sees that as an error and tries to delete the new disk.

But as the new disk is open in qemu, Ceph throws the error "image has watchers - not removing".


I'll try to reproduce, maybe it's a timeout problem with block-job-complete.
 
I'll try to reproduce, maybe it's a timeout problem with block-job-complete.

That could be. Timeouts and UI errors do often pop up when deleting large, unused Ceph volumes via the UI. (But they do eventually get deleted behind the scenes.)

However, we have tried separating the migration and deleting the old volume, and while the success rate seems slightly higher, it definitely does not prevent the migration-only step from failing.

Is there a way to do storage migration from the command line?
 
>>Is there a way to do storage migration from the command line?
yes (# qm move), but it uses the same APIs. It's not a GUI timeout problem.


if you can reproduce the problem,

could you try to edit
/usr/share/perl5/PVE/QemuServer.pm

search:

vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");

and replace it by

vm_mon_cmd($vmid, "block-job-complete", timeout => 10, device => "drive-$drive");


then restart

/etc/init.d/pvedaemon restart
/etc/init.d/pveproxy restart


and try the storage migration again.
 
could you try to edit
/usr/share/perl5/PVE/QemuServer.pm

On which Proxmox server shall I perform these steps? The one I'm connected to for the UI, the one where the VM is located, or all Proxmox servers in the cluster?
 
On which Proxmox server shall I perform these steps? The one I'm connected to for the UI, the one where the VM is located, or all Proxmox servers in the cluster?

the node where the VM with the RBD disk is running.

Note that I'm not sure it'll help.

I have checked the qemu code.

It's hanging here:
Code:
static void mirror_complete(BlockJob *job, Error **errp)
{
    MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
    Error *local_err = NULL;
    int ret;


    ret = bdrv_open_backing_file(s->target, NULL, &local_err);
    if (ret < 0) {
        error_propagate(errp, local_err);
        return;
    }


    if (!s->synced) {

       ------>>>>>>>>>>HERE

        error_set(errp, QERR_BLOCK_JOB_NOT_READY, job->bs->device_name);
        return;
    }


    /* check the target bs is not blocked and block all operations on it */
    if (s->replaces) {
        s->to_replace = check_to_replace_node(s->replaces, &local_err);
        if (!s->to_replace) {
            error_propagate(errp, local_err);
            return;
        }


        error_setg(&s->replace_blocker,
                   "block device is in use by block-job-complete");
        bdrv_op_block_all(s->to_replace, s->replace_blocker);
        bdrv_ref(s->to_replace);
    }


    s->should_complete = true;
    block_job_resume(job);
}

According to the qemu docs

http://wiki.qemu.org/Features/BlockJob

We can only call block-job-complete when we receive the event "MIRROR_STATE_CHANGE" with state (synced: true/false).

This is to be sure that the disks are synced before doing the switch.

Currently in Proxmox we don't check events, so after drive-mirror is completed we assume that we can finish the job.

Do you use writeback with your rbd volume?

I'll see if we can improve that.
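
Just to illustrate the idea (a rough sketch only, not an actual patch): instead of calling block-job-complete blindly, we could poll the job state first and only complete once the mirror reports that everything has been transferred. This assumes the existing vm_mon_cmd wrapper and QEMU's query-block-jobs QMP command; the loop bound and error handling are made up for illustration.

Code:
# sketch only: wait until the mirror job has transferred everything
# (offset == len) before asking qemu to complete it.
# loop bound and error message are arbitrary, this is not the real Proxmox code.
my $synced = 0;
for (my $i = 0; $i < 30 && !$synced; $i++) {
    my $jobs = vm_mon_cmd($vmid, "query-block-jobs");
    foreach my $job (@$jobs) {
        next if $job->{device} ne "drive-$drive";
        $synced = 1 if $job->{len} && $job->{offset} == $job->{len};
    }
    sleep(1) if !$synced;
}
die "mirror job for drive-$drive never reached the synced state\n" if !$synced;
vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");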
 
I experienced this when moving a disk from NFS to another NFS storage. Turning off the machine, moving the disk, and turning it on again solved the problem. Luckily this was not an important VM.
 
Another thing to try:

edit

/usr/share/perl5/PVE/QemuServer.pm

search

Code:
sub qemu_drive_mirror {
    my ($vmid, $drive, $dst_volid, $vmiddst, $maxwait) = @_;


    my $count = 1;
    my $old_len = 0;
    my $frozen = undef;

and add

Code:
sub qemu_drive_mirror {
    my ($vmid, $drive, $dst_volid, $vmiddst, $maxwait) = @_;


    my $count = 1;
    my $old_len = 0;
    my $frozen = undef;
    $maxwait = 10;   ####ADD THIS



This should force a VM freeze/resume if too many writes occur at the end of the migration.
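
For what it's worth, here is a rough sketch of the freeze/resume idea, assuming the vm_suspend/vm_resume helpers from QemuServer.pm. This is only an illustration of the mechanism, not the exact code in qemu_drive_mirror.

Code:
# illustration only: if the mirror has not converged after $maxwait iterations,
# pause the guest so no new writes arrive, complete the job, then resume.
if ($count > $maxwait && !$frozen) {
    vm_suspend($vmid, 1);    # stop the guest CPUs so the last blocks can drain
    $frozen = 1;
}
# ... once the transfer has caught up ...
vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");
vm_resume($vmid, 1) if $frozen;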

(restart pvedaemon and pveproxy after)
 
Added both of those changes. It actually made things worse. The transfer bailed at 11.37% done.

Code:
transferred: 4884463616 bytes remaining: 38065209344 bytes total: 42949672960 bytes progression: 11.37 %
trying to aquire lock...Removing all snapshots: 100% complete...done.
image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
rbd rm 'vm-1343-disk-2' error: rbd: error: image still has watchers
storage migration failed: mirroring error: can't lock file '/var/lock/qemu-server/lock-1343.conf' - got timeout
 
