Hi Gang,
So this weekend's problem.
We did some maintenance on one of our nodes this weekend. We have a cluster with Ceph running. It appeared that everything went OK, but we have had reports today from customers that their applications crashed around the time of the migration.
One VM that we have looked into was as follows:
14:23:37 <-- migration started
14:23:49 <-- migration completed
14:24:44 <-- application crashed
There could be a bit of a difference with clock times, however.
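To put numbers on that timeline, here is a quick Python sketch. The helper name `seconds_between` is just something made up for illustration, and it assumes all three timestamps are from the same day and the same clock, which (as noted) may not be exactly true.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return whole seconds from start to end (same day, HH:MM:SS strings)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the timeline above.
migration_started = "14:23:37"
migration_completed = "14:23:49"
application_crashed = "14:24:44"

migration_duration = seconds_between(migration_started, migration_completed)
crash_delay = seconds_between(migration_completed, application_crashed)

print(f"migration took {migration_duration}s")
print(f"crash came {crash_delay}s after migration completed")
```

So the migration itself took 12 seconds, and the crash came roughly 55 seconds after it finished, give or take whatever the clock skew is.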
In the Windows event log there was an entry as follows:
Log Name: Application
Source: Application Error
Date: 1/16/2021 2:24:22 PM
Event ID: 1005
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: xxxx
Description:
Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program MetaTrader because of this error.
Program: xxxx
File:
The error value is listed in the Additional Data section.
User Action
1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.
2. If the file still cannot be accessed and
- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.
- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.
3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.
4. If the problem persists, restore the file from a backup copy.
5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.
Additional Data
Error value: 00000000
Disk type: 0
As far as we can tell, this has affected every single VM we moved, but some crashed at different times.
For example, I have another one that crashed 30 seconds or so after it was moved back onto its original host.
If we can't live-migrate VMs without this particular application crashing (something that worked perfectly well under Hyper-V), then it entirely defeats the objective of having the cluster.
My guess here is that there was some sort of high latency on the storage, enough to make the disk appear to be offline? Bit weird, though, considering that it was the C: drive and nothing else on the OS complained.
Any thoughts?