Some Windows guests hanging every night

Pegasus
Active Member · Aug 29, 2013 · California, USA
Hey everyone.

Over the past week, many (but not all) of my Windows guests have been freezing/hanging almost every night. (In one case it happened at 8 AM.) They're mostly Windows 2008 R2 x64, but one is Windows 2003 R2 x64. What's really weird is that the 2008 ones still respond to CTRL-ALT-DEL at their consoles, but hang completely before displaying the logon screen. Once in this state, they don't respond to a Shutdown command and I have to use Stop.

Does anyone have any idea why this might be happening? The System event logs show nothing related (other than "the system shutdown at xxx was unexpected" after booting back up). I installed the latest VirtIO drivers for the SCSI controller, NIC, and Balloon driver, but that hasn't helped either.

The only coincidence is that I've started to roll out IPv6 on the LAN: I set up a firewall to do Router Advertisements for the network prefix and have been adding machines' IPv6 addresses to the internal DNS. I see that Proxmox doesn't have provisions for IPv6 addresses in the GUI, but the host NIC has successfully obtained one and can ping6 other LAN hosts.
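For the record, the host-side check amounts to something like this (vmbr0 is the usual PVE bridge name, and the ping6 target is just a placeholder address):

```shell
# Global-scope IPv6 addresses picked up from router advertisements
# (vmbr0 is the default PVE bridge -- adjust to your setup)
ip -6 addr show dev vmbr0 scope global

# Basic reachability to another LAN host (2001:db8::10 is a placeholder)
ping6 -c 3 2001:db8::10
```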

Thanks for any information anyone can provide!
 

fireon
Famous Member · Oct 25, 2010 · Austria/Graz · iteas.at
From these errors it is very hard to say what is going on. Do you have the possibility to move the VMs to another host, so you can rule out a problem with the local PVE host? We have been running IPv6 for a long time too; it should not be a problem.
 

Pegasus
Some additional information: The problems started shortly after installing the following updates:
Start-Date: 2014-10-05 01:30:05
Commandline: apt-get dist-upgrade
Install: pve-kernel-2.6.32-33-pve:amd64 (2.6.32-138, automatic)
Upgrade: pve-qemu-kvm:amd64 (2.1-8, 2.1-9), librbd1:amd64 (0.80.5-1~bpo70+1, 0.80.6-1~bpo70+1), ceph-common:amd64 (0.80.5-1~bpo70+1, 0.80.6-1~bpo70+1), proxmox-ve-2.6.32:amd64 (3.2-136, 3.3-138), qemu-server:amd64 (3.1-34, 3.1-35), librados2:amd64 (0.80.5-1~bpo70+1, 0.80.6-1~bpo70+1), pve-manager:amd64 (3.3-1, 3.3-2), python-ceph:amd64 (0.80.5-1~bpo70+1, 0.80.6-1~bpo70+1), rsyslog:amd64 (5.8.11-3, 5.8.11-3+deb7u1)
End-Date: 2014-10-05 01:30:31

What changed in qemu?
 

fireon
The only difference I see between the versions is the kernel. Have you tested it with the 2.6.32-32 one? Qemu is the same. I have virtio 0.1-81 installed.
Can you post this output please?

pveversion -v


proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-29-pve: 2.6.32-126
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-9
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

What storage do you use? What image format (qcow2, raw, ...)? What driver (virtio)?
 

Pegasus
The only difference I see between the versions is the kernel. Have you tested it with the 2.6.32-32 one?

That's the only version I have. (I'm not clear on how to get the 3.2 one.)

proxmox-ve-2.6.32: 3.3-138 (running kernel: 2.6.32-33-pve)
pve-manager: 3.3-2 (running version: 3.3-2/995e687e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-33-pve: 2.6.32-138
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-35
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-9
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

What storage do you use? What image format (qcow2, raw, ...)? What driver (virtio)?

DRBD-backed local storage, qcow2 disks, and virtio 0.81 drivers for NIC, SCSI, and Balloon.
 

fireon
That's the only version I have. (I'm not clear on how to get the 3.2 one.)
OK, you are using packages from the testing repo; my packages come only from the enterprise repo. Maybe there is a little bug in one of them. On our test server we also use the testing repo. I will update to your versions and let some Windows VMs run.

Regards
 

Pegasus
The really strange part is that none of my other machines in the cluster has the problem, and they're all on the same version (and hardware). However, this host is the only one with four Windows VMs running, in case that matters. They use 7 vCPU cores in total, though, and the physical machine has 8, so there shouldn't be any contention there.
 

fireon
Does one of those VMs also hang when you move it to another machine? You say they are all the same, so if the Windows VM works on another host in your cluster, then it is a problem with your local hardware or firmware.
 

giner
Member · Oct 14, 2009 · Tokyo
DRBD-backed local storage, qcow2 disks, and virtio 0.81 drivers for NIC, SCSI, and Balloon.

If you use DRBD, you must use either write-through or direct-sync cache mode for a KVM drive.
Be careful with Balloon. When you set, for example, 1024/4096, this doesn't mean that the VM can take 2048 if needed; it means that other machines can take all but 1024 from this machine, and this machine will have to kill processes, use swap, or whatever else to survive.
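On the cache mode: a sketch of how it could be changed from the CLI (VM ID 101 and the volume name below are placeholders; the same option is available in the GUI's disk edit dialog):

```shell
# Check the current disk line first
qm config 101

# Switch the virtio disk to write-through caching
# (101 and the volume name are examples -- use your own from "qm config")
qm set 101 --virtio0 local:101/vm-101-disk-1.qcow2,cache=writethrough
```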
 

Pegasus
If you use DRBD, you must use either write-through or direct-sync cache mode for a KVM drive.

I've just been using Default (no cache) on all my KVM VMs on all hosts. Is that not a good idea?

Be careful with Balloon. When you set, for example, 1024/4096, this doesn't mean that the VM can take 2048 if needed; it means that other machines can take all but 1024 from this machine, and this machine will have to kill processes, use swap, or whatever else to survive.

Oh wow, that's not intuitive at all! (I need to read the docs again, I guess.) Still, one of the affected VMs is set to a fixed 4 GB, and the host has enough RAM for all of them at their maximum settings anyway, so that's not what's causing this problem.

I will move one of the VMs to another host as Fireon suggested and see if the problem reoccurs, either on the moved VM or on the ones left behind.
 

giner
I've just been using Default (no cache) on all my KVM VMs on all hosts. Is that not a good idea?

Yes, it is not a good idea. Read more in the Proxmox wiki.

Oh wow, that's not intuitive at all!
Right, I did some tests with ballooning before moving it to production and found it a bad idea. No ballooning anymore, ever :)

(I need to read the docs again I guess.) Still, one of the affected VMs is set to a fixed 4GB, and the host has enough RAM for all of them at their max set points anyway, so that's not what's causing this problem.
I will move one of the VMs to another host as Fireon suggested and see if the problem reoccurs, either on the moved VM or on the ones left behind.

I suggest not migrating between DRBD hosts before changing the cache mode and checking DRBD consistency; otherwise you can end up with corrupted VM images. If you migrate to other storage (not another DRBD node), there's no problem.
 

giner
Pegasus, you say DRBD and at the same time qcow2, so I probably misunderstood your configuration. I was talking about LVM on top of DRBD, but it seems your configuration is different. Please provide more information.
 

Pegasus
Okay, here's the whole storage hierarchy:
- Physical disks
- Hardware RAID (with BBU cache)
- Partitions (LVM) - system, DRBD-backing1 (this host's VMs), DRBD-backing2 (other host's VMs)
- DRBD devices
- XFS File systems on DRBD devices
- qcow2 disk image files, set to "Default (no cache)"

Right now, I'm operating DRBD in Pri/Sec mode until I get fencing figured out, so migrating a VM still involves manual backup & restore.
 

giner
Okay, here's the whole storage hierarchy:
- Physical disks
- Hardware RAID (with BBU cache)
- Partitions (LVM) - system, DRBD-backing1 (this host's VMs), DRBD-backing2 (other host's VMs)
- DRBD devices
- XFS File systems on DRBD devices
- qcow2 disk image files, set to "Default (no cache)"

Right now, I'm operating DRBD in Pri/Sec mode until I get fencing figured out, so migrating a VM still involves manual backup & restore.

I recommend trying drbdadm verify before doing anything else.
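For example, assuming a resource named r0 (resource name, node roles, and the cron schedule below are all examples to adapt):

```shell
# Start an online verify of resource r0 -- it only needs to be started
# on one node of the pair; it runs over the replication link
drbdadm verify r0

# Progress is visible in /proc/drbd; out-of-sync sectors are reported
# in the kernel log when the pass finishes
cat /proc/drbd
dmesg | grep -i "out of sync"

# Verify only *marks* bad blocks. To resynchronise them, disconnect and
# reconnect on the node whose copy should be overwritten (the secondary here)
drbdadm disconnect r0
drbdadm connect r0

# Optional: schedule a weekly verify from one node, e.g. in /etc/cron.d/drbd-verify:
#   42 2 * * 0  root  /sbin/drbdadm verify all
```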
 

Pegasus
I recommend trying drbdadm verify before doing anything else.

Done. Since the other host was secondary anyway, it wasn't a big deal, but there was indeed 340K of out-of-sync data, which has now been corrected.

(BTW, do I need to set up automatic DRBD resource verification on all nodes or just one of each pair?)
 

Pegasus
So far the moved VM is still operating correctly while one still on the problem host just experienced the problem again.

I was asking for guidance in ##windows, and some people mentioned that storage performance problems can cause weird issues like this in guests. So I checked my RAID controller and found that it apparently no longer thinks it has a battery attached, and has therefore reverted to read-only caching. That could obviously cause a performance problem, so it might be the root cause.
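(In case it helps anyone else checking the same thing: the exact tool depends on the controller vendor; on a MegaRAID-based card it's roughly the following. Treat the flags as a sketch, since they vary by megacli version.)

```shell
# MegaRAID example only; HP/Adaptec controllers ship their own tools.
# Battery state (charge, temperature, "Battery Replacement required" flag):
megacli -AdpBbuCmd -GetBbuStatus -aALL

# Current cache policy per logical drive -- look for WriteThrough vs WriteBack:
megacli -LDGetProp -Cache -LAll -aAll
```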

All of my Windows guests are currently set to "Default (no cache)", which I thought was the safest option. Is there any mode I can set them to that would allow them to work properly (if slowly) while the RAID controller can't do write caching?
 

giner
> Okay, but https://pve.proxmox.com/wiki/Performance_Tweaks says to "Avoid to use cache=directsync and writethrough with qcow2 files." but it doesn't say why.

Because this disables the write cache completely. However, hardware RAID with BBU usually ignores such requests and reports data as committed as soon as it reaches the RAID cache.

> Also, since I'm using XFS to store the qcow2 files, it has write barrier support enabled by default, FWIW.

Unfortunately, I've found that barriers don't always help. I know of at least one case where they don't: graylog2. I don't know why, but I had out-of-sync blocks on an ext4 data partition when I tried graylog2 with "cache=none". My assumption is that graylog2 (a Java application) uses caching somehow and doesn't sync when needed, but this is just a guess, because I have no idea about the internals of this part.
 
