storage migration virtio failed

The chance of migration failure definitely seems proportional to the I/O activity of the image being migrated.
 
Is it possibly related that storage migration performance is pretty awful? It seems to peak at 20MB/sec and often stalls entirely for several seconds at a time. This is on a very lightly utilized Proxmox node with bonded gigabit links to the storage clusters, and the two Ceph clusters are 10Gbit. Seems like pegging at least the gigabit LAN interface should be pretty easy, but it's not even close. And the complete stalls are really surprising.

Possibly if the migrations finished more quickly, there would not be timeout issues?
 
Possibly if the migrations finished more quickly, there would not be timeout issues?

Yes sure.
The point is to do the block-job-complete when the disks are in sync.
Normally that's fine, but with a lot of writes it's possible that new writes come in between the end of the mirror and the block-job-complete.
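
Roughly, the wait loop does something like this (a simplified sketch using the vm_mon_cmd helper and QMP's query-block-jobs; not the exact QemuServer.pm code):

Code:
            # simplified sketch of the mirror wait loop, not the real QemuServer.pm code
            while (1) {
                my $stats = vm_mon_cmd($vmid, "query-block-jobs");
                my ($job) = grep { $_->{device} eq "drive-$drive" } @$stats;
                die "storage migration failed: mirror job disappeared\n" if !$job;

                if ($job->{len} == $job->{offset}) {
                    # the disks look in sync here, but the guest can still submit
                    # new writes before the block-job-complete below is processed
                    vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");
                    last;
                }
                sleep 1;
            }

If new writes land in that window, qemu can refuse the block-job-complete.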

What is your Ceph storage config? (Do you use SSDs for journals?)
 
Yes sure.
What is your Ceph storage config? (Do you use SSDs for journals?)

One of the clusters uses SSDs for journals. The other is 100% SSD. Write operations from inside VMs on the same Proxmox node that gets 20MB/sec during migration can easily exceed 60MB/sec. That's still lower than I'd like, but it'd be very nice to get at least that much from storage migration.
 
One of the clusters uses SSDs for journals. The other is 100% SSD. Write operations from inside VMs on the same Proxmox node that gets 20MB/sec during migration can easily exceed 60MB/sec. That's still lower than I'd like, but it'd be very nice to get at least that much from storage migration.

Ok, last idea to try:

add a sleep before the block-job-complete

Code:
            if ($vmiddst == $vmid) {
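                # workaround: give late guest writes a moment to settle before completing the job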
                sleep 10;
                # switch the disk if source and destination are on the same guest
                vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");
            }
 
Ok, last idea to try:

add a sleep before the block-job-complete

This appears to be helping. So far I have done a handful of migrations that failed, then applied this change, and since then no migrations on that server have failed. The failure rate isn't high enough to rule out coincidence, but it's leaning toward unlikely.

Note also that this morning I got a failure (on an unpatched Proxmox node) while moving a 6GB image from local SSD to the SSD-journalled ceph cluster. That just shouldn't ever happen. (It also took almost 3 minutes to move 6GB.)

All is not well in storage migration land, and although I'm very grateful for this patch (Thank you!), it does seem like a workaround that's really only hiding some deep, ugly underlying problem.

There are not many more images to move in our current batch, but if I encounter any unexpected results with those that remain, I will report back. We also have plenty of non-critical VMs we can do test migrations with if needed.
 
Hi,

I reviewed the qemu code and the proxmox code, and we are doing things right.

But maybe the block-job-complete is launched too fast (maybe 1s too fast; a race condition).

If it works fine with the sleep 10, could you try 5, 2, or 1s to compare? (I think 1s should be enough.)
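
If you want to compare delays without editing the file each time, a hypothetical variant of the patch could read the delay from an environment variable (PVE_MIRROR_SETTLE_SECS is made up here, just for testing):

Code:
            if ($vmiddst == $vmid) {
                # hypothetical: make the settle delay tunable for testing
                # (PVE_MIRROR_SETTLE_SECS is not a real PVE setting)
                my $settle = $ENV{PVE_MIRROR_SETTLE_SECS} // 10;
                sleep $settle if $settle > 0;
                # switch the disk if source and destination are on the same guest
                vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");
            }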
 
Hi spirit,

Initial testing looks good. I am just about to test it on our production cluster.


It would be really helpful if the output was more informative.

e.g. on the "sleep 10;" it would output something like "Pausing for sync"

and when it migrates the active disk it outputs something like "Migrating active disk Source --> Destination"

and before deleting the Source disk it outputs "Removing source disk"
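
Roughly what I have in mind, as a sketch against the patch above (the messages are placeholders, and I'm assuming plain print output ends up in the task log the way the mirror progress lines do):

Code:
            if ($vmiddst == $vmid) {
                print "pausing for sync before switching disks\n";
                sleep 10;
                print "migrating active disk to destination (drive-$drive)\n";
                # switch the disk if source and destination are on the same guest
                vm_mon_cmd($vmid, "block-job-complete", device => "drive-$drive");
            }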
 
Note that it's just a workaround; I need to implement something cleaner that doesn't need to wait at all.

Also, we have noticed that some block-job bugs related to aio have been fixed in qemu 2.1.1.

Can you run the following in the monitor (in the GUI, Monitor tab):

# info version


on the working or failing VMs.
 
SLAs mean that upgrades requiring outages need a lot of "change-control" and other such enterprise BS :(

Unfortunately it is not possible to live-migrate between PVE 3.0 and PVE 3.3, or we would have already migrated.

We have an in-place upgrade from PVE 3.0 to PVE 3.3 scheduled in 2 weeks; hopefully after that the problem will be solved.
 
Unfortunately it is not possible to live-migrate between PVE 3.0 and PVE3.3 or we would have already migrated.
Live migration is possible from one point release to the next point release, but since you have kept your 3.0 for too long you do not have this option.
 
Live migration is possible from one point release to the next point release, but since you have kept your 3.0 for too long you do not have this option.

Thanks mir,

We have extensively tested doing an in-place upgrade from 3.0 --> 3.3 with all VMs running, migrating them to another 3.3 node and then rebooting the original one. We upgraded our test cluster with about 60-70 Windows VMs this way and it went well; we are just lucky our PVE 3.0 cluster has always been running a 3.10 kernel and that we use KMS for all Windows licensing.

pvedaemon and pveproxy need to be restarted a few times as they lock up, but once all nodes are on PVE 3.3 it hums along really well.
 
OK. I just upgraded our production cluster to PVE 3.3 and still get the same failures.

TASK ERROR: storage migration failed: mirroring error: VM 188 qmp command 'block-job-complete' failed - The active block job for device 'drive-virtio0' cannot be completed

I will put the wait in and see what happens.
 
In my experience there are four major parts which influence whether live storage migration fails or not:
1) Network performance and especially whether network sees congestion while storage is migrated
2) Storage performance on both source and destination
3) I/O inside the guest while storage migration is ongoing
4) Size of the disk which is migrated

To sum it up:
- Bad network performance or congestion will likely cause failure
- Slow storage on either the source or the destination will likely cause failure
- High I/O inside the guest will likely cause failure
- Big disks will likely cause failure
 
Hi mir,

1. We have 2x 10Gbit to each host, performance is good.
2. We have ceph cluster with 84 disks and 14x SSD journals. Performance is good
3. This is reasonably high
4. It is more likely to fail on larger disks.

If we implement the "sleep 10;" fix above, it does on the surface seem to fix the issue. Disks that have previously failed seem to complete.

We are quite concerned as migrations that have failed in the past have caused guest file system corruption.
 
Hi mir,

1. We have 2x 10Gbit to each host, performance is good.
2. We have ceph cluster with 84 disks and 14x SSD journals. Performance is good
3. This is reasonably high
4. It is more likely to fail on larger disks.

If we implement the "sleep 10;" fix above, it does on the surface seem to fix the issue. Disks that have previously failed seem to complete.

We are quite concerned as migrations that have failed in the past have caused guest file system corruption.


Hi, can you run "info version" in the VM monitor on the VM where the drive mirror is failing?

I would like to know which qemu version is running, because some fixes have been made in the latest qemu 2.1.1.
 
In my experience there are four major parts which influence whether live storage migration fails or not:
1) Network performance and especially whether network sees congestion while storage is migrated
2) Storage performance on both source and destination
3) I/O inside the guest while storage migration is ongoing
4) Size of the disk which is migrated

To sum it up:
- Bad network performance or congestion will likely cause failure
- Slow storage on either the source or the destination will likely cause failure
- High I/O inside the guest will likely cause failure
- Big disks will likely cause failure

Points 1 & 2 are reasonable where they apply. Points 3 & 4 are, IMO, not. Likewise, the risk of storage corruption is not reasonable either. Those are all things that need to be addressed on the software side rather than blaming the user for using too much space or I/O. The average size of disk images is growing pretty fast; we've created our first 1TB+ images recently. That's only going to get more common, and likewise disk I/O is going up, not down.
 
