Virtual Machine for W2K3 stops responding

nccng2001

New Member
Nov 19, 2008
I am facing a lockup/hang problem on a fully virtualized VM running Plesk for Windows. It happened 3 times within 2 days.

We have installed the latest Proxmox VE 1.0 with the latest patch: 2.6.24-1-pve. The physical machine has 32 GB of RAM and 2 x quad-core processors.

The configuration for the VM is as follows:

a) OS: Windows Server 2003 Enterprise, 64-bit
b) CPU allocation: 4
c) RAM allocation: 8GB
d) Two HDDs running on IDE
- HDD IDE 0:0 (200 GB)
- HDD IDE 0:1 (500 GB), where the Plesk and MailEnable software and data are located
e) Ethernet Devices
- Model: e1000 Bridge: vmbr0
- Model: e1000 Bridge: vmbr1

Proxmox VE is running in cluster mode, and this is the only VM running on this physical machine.

Before the operational deployment, we ran the W2K3 VM for one week and it ran perfectly, so we decided to put it into production.

We were migrating the data just before going live on Monday, 1st December at 9am. The first hang occurred at about 5am, before the 9am start when everyone begins working. The whole physical server could not be accessed and, needless to say, neither could the VM. Over SSH we could get as far as the password prompt, but nothing happened after we entered the password. After rebooting the physical machine, both the VM and the physical machine worked fine.

The VM runs solely Plesk for Windows with MailEnable Enterprise under full load, with over a hundred domains on it. The second hang was on 2nd Dec at 04:32 and the third lockup was on 2nd Dec at 16:20.

The symptoms are always the same: first, the IP address of the VM cannot be pinged. Second, we cannot even access the VM through the KVM console. Yet Proxmox VE still shows the VM as running with 6.63 GB of RAM and no CPU activity.

The event logs on the W2K3 guest show nothing much.

Is there any way to trace logs, whether Proxmox VE logs or Windows 2K3 logs, that could help solve this problem?
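One way to at least pin down when the host stalls is a small heartbeat logger on the Proxmox host. The following is only a minimal sketch under assumptions: the function names and the idea of writing one line per interval are illustrative and not part of Proxmox VE. After a forced reboot, the last line in the file shows roughly when the host stopped responding and what the load was just before.

```python
import os
from datetime import datetime


def heartbeat_line(now=None, load=None):
    """Build one log line: timestamp plus 1/5/15-minute load averages."""
    now = now or datetime.now()
    # os.getloadavg() is available on Unix-like hosts only.
    load = load if load is not None else os.getloadavg()
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} load={load[0]:.2f},{load[1]:.2f},{load[2]:.2f}"


def append_heartbeat(path, line):
    """Append the line and force it to disk so it survives a hard hang."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())
```

Calling `append_heartbeat(path, heartbeat_line())` from a scheduler once a minute, with a log path of your choice, would be one way to use it.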

Thanks in advance
 
You should post this on the KVM mailing list. Just one question: why don't you use the virtio network drivers?

I personally have never tested Win2003 64-bit under such a high load. I use OpenVZ for high-load applications.
 
Are you sure that OpenVZ can run W2K3 64 bit?

Sorry, OpenVZ can only run Linux. I meant that I prefer running such applications on Linux.
 
Just for information, in addition to the problems described, we encountered two other problems amid the instability.

The first and most obvious one is the clock going out of sync. Every minute there is a drift of about 15 to 30 seconds, so the time is off by quite a bit after 15 minutes.
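To put numbers on that, here is a quick back-of-the-envelope calculation. It assumes the drift is roughly constant and accumulates linearly, which is a simplification; the function name is just for illustration.

```python
def accumulated_drift(seconds_per_minute, minutes):
    """Total clock error in seconds, assuming a constant drift rate."""
    return seconds_per_minute * minutes


low = accumulated_drift(15, 15)   # best case: 15 s of drift per minute
high = accumulated_drift(30, 15)  # worst case: 30 s of drift per minute
print(f"Drift after 15 minutes: {low} to {high} seconds")
```

So after 15 minutes the guest clock would be off by somewhere between 225 and 450 seconds, that is, roughly 4 to 7.5 minutes.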

I tried playing around with the real-time clock and CPU unit options in the VM settings, but to no avail.

I noticed the time problem only on 1st December and corrected it using a script that runs two commands: NET STOP W32TIME and NET START W32TIME. It runs every 10 minutes, around the clock.

Is there any fix for this on the KVM side?

The second problem is another instability, separate from the hangs described above: the MailEnable MTA suddenly stopped working at around 12:59 on 1st December, and restarting all the Plesk services corrected the problem.
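For anyone who wants to reproduce the workaround, this is roughly what the script does, written here as a Python sketch rather than the batch file actually used. The function names and the injectable `run` parameter are assumptions made for illustration; only the two NET commands come from the setup described above. The 10-minute schedule is left to the Windows task scheduler.

```python
import subprocess

SERVICE = "W32TIME"  # Windows Time service, as named in the commands above


def restart_commands(service=SERVICE):
    """The two commands the workaround runs, in order."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_time_service(run=subprocess.run, service=SERVICE):
    """Stop and then start the service; `run` is injectable for dry runs."""
    return [run(cmd) for cmd in restart_commands(service)]
```

Passing a function such as `print` as `run` shows the commands without executing them, which is handy for checking the script on a machine other than the guest.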

These are two additional observations.

Any comments on that?

Thanks
 
Regarding the clock problem, I would always suggest using W2K3 Server's built-in time synchronization in conjunction with NTP, as in this example from Microsoft:

http://support.microsoft.com/kb/816042

That should remove just about any clock issue you will ever have.
 
Thanks for the advice and assistance.

I tried configuring the clock as suggested on the Microsoft website. The option I tried was configuring the Windows Time service to use the internal clock. In fact, the registry is now configured to use the internal clock. However, if I don't run my scheduled task with the script mentioned earlier, the clock still drifts by 10 to 30 seconds per minute.

This is just an observation.
 
My experiences with 2003 64-bit on Proxmox

Food for thought: I am running several 64-bit SMP Windows 2003 guests in Proxmox and they are quite stable. It is a mix of SQL servers and Exchange, so it is quite a heavy load as well. No network issues or clock sync issues. I am using the e1000 emulated network cards. It was necessary to download the latest e1000 64-bit Windows drivers from Intel. Since that update, I have had no issues with the network.

I know that doesn't specifically help, but it may be useful to know that it is possible.

No special configuration was done to the guest or to the Proxmox host, other than updating both Proxmox and 2003 Server Standard to the latest versions and installing the updated network drivers from Intel.
 
Verdict

We have investigated the problems extensively, using different modern hardware with Supermicro motherboards.

The VM clock did not synchronize, no matter which of the steps mentioned earlier we tried. This could be because the motherboard BIOS had been updated to the latest version and KVM could not sync with it.

However, on machines without the BIOS update, it seems to work very well.

Next, we have this setup: 2 Windows servers (both W2K3) and 1 Linux server running on KVM on the same physical machine.

When we try to copy or migrate one of the Windows KVM guests to another physical machine, the source physical machine stalls and the entire Proxmox server hangs. Needless to say, all the other VMs on that Proxmox server also halt and cannot be reached via ping, SSH, the web interface, or RDP.

The physical Proxmox server itself also cannot be reached via SSH or the web interface.

The test was repeated on different physical machines (modern high-end Supermicro hardware) with the same outcome.

Our investigation shows that a high I/O load from transferring data between machines stalls the physical machine, so the rest of the VMs stop running. Copying any large files from the Proxmox machine to another machine may trigger it.

We conclude that if Proxmox claims to perform live migration between machines, it may have been tested only on certain hardware. It would be useful if they could tell us what kind, brand, model, and configuration were used for their tests.

Thanks

Ng Cher Choon
 
