Hi,
We had a similar problem, but it is really difficult to put together a precise bug report!
Yes, the problems started after an upgrade, from PVE 3.1 to the latest version at that time (2014-09-29).
Before that, I upgraded a first Proxmox cluster as soon as v3.3 was out. No problems. Mainly Linux machines.
A week later, a second Proxmox cluster, five nodes, was upgraded. All at once, via cssh:
apt-get update ; apt-get dist-upgrade
No problems. Mainly Linux VMs. "Mainly" because the Windows machines are under very low load.
A week later again, a 3-node cluster was updated. One node at a time: moving the VMs off, updating, restarting the node, moving the VMs back, and so on.
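For reference, the per-node procedure was roughly the following (just a sketch; the VM ID and node names are placeholders, and every VM on the node gets the same treatment):

# live-migrate each VM off the node that will be updated
qm migrate 101 node2 --online
# update and reboot the now-empty node
apt-get update ; apt-get dist-upgrade
reboot
# once the updated node is back up, migrate the VMs home again (run from the node they were parked on)
qm migrate 101 node1 --online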
Mainly BIG Windows machines: 2012R2, RDS farms (with hundreds of user sessions...), brokers, RemoteApp, App-V and web gateways. 20 GB of RAM on each RDS server. All those Windows VMs use virtio. No ballooning service enabled (RDS doesn't like it!).
Bad luck: VMs randomly became unresponsive, sometimes 2-3 times a day, sometimes not at all for 3 days.
Nothing special about those machines. Nearly identical ones, namely the DFS servers, never crashed or became unresponsive.
A special note: ALL those unresponsive machines log event ID 129, viostor reset. Not on their own disk, which became unreachable, but on a Kibana / Logstash / Elasticsearch log centralizer on the network.
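(If anyone wants to check their own logs: we just query the centralizer for that event ID, something like the line below. The field names are only how our Logstash pipeline happens to map the Windows event log, so treat them as an assumption.)

curl 'http://localhost:9200/logstash-*/_search?q=EventID:129+AND+SourceName:viostor&pretty'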
We decided to move back from virtio to IDE for the disks and e1000 for the NICs. No more problems after that! Not one. It is not the best solution from a performance point of view, but after 2 weeks without a single crash, it seems to be a stable one.
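(For anyone wanting to do the same: the change is only the bus/model in the VM config, e.g. in /etc/pve/qemu-server/<vmid>.conf, roughly as sketched below. Storage name, disk size and MAC address are placeholders, and the guest needs a full stop/start for the change to take effect.)

# before (virtio disk and NIC):
virtio0: local:100/vm-100-disk-1.raw,size=60G
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
bootdisk: virtio0
# after (IDE disk and e1000 NIC):
ide0: local:100/vm-100-disk-1.raw,size=60G
net0: e1000=DE:AD:BE:EF:00:01,bridge=vmbr0
bootdisk: ide0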
I have dpkg.log if needed (Monday).
The virtio driver version doesn't seem to play a role: everything from the latest one down to two versions earlier triggered the problem.
What changed in QEMU?
Christophe.