Windows VM Migration Issue

Greatsamps · Jan 17, 2021

Hi Gang,

So this weekend's problem.

We did some maintenance on one of our nodes this weekend. We have a cluster with Ceph running. It appeared that everything went OK, but we have had reports today from customers that their applications crashed around the time of the migration.

One VM that we have looked into was as follows:

14:23:37 <-- migration started
14:23:49 <-- migration completed
14:24:44 <-- application crashed

There could be a bit of a difference with clock times, however.
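As a quick sanity check on that timeline, here is a minimal Python sketch that works out the gaps between the three timestamps above. The function and variable names are just for illustration, and it assumes all three times come from the same day and the same clock, which may not hold exactly given the clock differences mentioned.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from start to end (same day assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# Timestamps taken from the timeline above.
migration_started = "14:23:37"
migration_completed = "14:23:49"
application_crashed = "14:24:44"

migration_duration = seconds_between(migration_started, migration_completed)
crash_delay = seconds_between(migration_completed, application_crashed)

print(f"migration took {migration_duration:.0f}s")
print(f"crash came {crash_delay:.0f}s after the migration completed")
```

So on these numbers the migration itself took 12 seconds, and the crash came 55 seconds after it finished.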

In the Windows event log there was an entry as follows:

Log Name: Application
Source: Application Error
Date: 1/16/2021 2:24:22 PM
Event ID: 1005
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: xxxx
Description:
Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program MetaTrader because of this error.

Program: xxxx
File:

The error value is listed in the Additional Data section.
User Action
1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.
2. If the file still cannot be accessed and
- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.
- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.
3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.
4. If the problem persists, restore the file from a backup copy.
5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

Additional Data
Error value: 00000000
Disk type: 0

As far as we can tell, this has affected every single VM we moved, but some crashed at different times.

For example, I have another one that crashed 30 seconds or so after it was moved back onto its original host.

If we can't live-migrate VMs without this particular application crashing (something that worked perfectly well under Hyper-V), then it does entirely defeat the objective of having the cluster.

My guess here is that there was some sort of high latency on the storage, enough to make the disk appear to be offline? Bit weird though, considering that it was the C drive and nothing else on the OS complained.

Any thoughts?

Greatsamps · Jan 18, 2021

Just to add to this.

So we have a dedicated 1 Gb/s network for corosync. We then have 2 LACP-bonded 10 Gb/s network cards in each host, which have been split into VLANs. We have a couple for Ceph and 1 for public, which at the time in question would have had zero traffic on it. We have defined the storage VLAN to be used for migrations.

My guess here is that the migration traffic was what upset things, as it was on the same NIC(s) as the Ceph public network?

The way the VMs were migrated was slightly different from how we have done it before as well.

We have an HA group containing a number of servers with different priorities. We simply reduced the priority of the one we wanted to take out of service below the others. Perhaps too many were migrating at once and flooded the storage network?
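To put the "too many at once" idea into rough numbers, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the per-migration throughput is made up, not measured on this cluster, and it treats the migration traffic as landing on a single 10 Gb/s member of the bond, which may or may not match how the bond actually distributes flows.

```python
def link_share(concurrent_migrations: int,
               per_migration_gbps: float,
               link_gbps: float) -> float:
    """Fraction of a link consumed by migration traffic (can exceed 1.0)."""
    return concurrent_migrations * per_migration_gbps / link_gbps

# Hypothetical numbers for illustration only, not measurements.
LINK_GBPS = 10.0          # one 10 Gb/s member of the bond
PER_MIGRATION_GBPS = 4.0  # assumed throughput of a single live migration

for n in (1, 2, 4):
    share = link_share(n, PER_MIGRATION_GBPS, LINK_GBPS)
    print(f"{n} concurrent migration(s): {share:.0%} of the link")
```

If the real figures are anywhere near these, a handful of simultaneous migrations would be enough to leave very little headroom for Ceph traffic sharing the same NICs, which would fit the storage latency theory.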
