Windows crashes - share your experience

jcheger

Dec 20, 2011
I've tried to migrate several Windows production servers to KVM. Two of them were so unstable that I had to move them back to physical machines.

Windows already has a long list of problems on every virtualization system (even Hyper-V); this is not new. I'm not expecting KVM or Proxmox to solve Windows issues. But we lack feedback on working solutions or workarounds.

Here is a starting point, based on my experience so far.

Windows Server Std 2003 SP2 32-bit
  • Virtio system disk
    • no cache / writeback: triggers a BSOD under high load and may cause fatal system corruption
    • crash test (1): FAIL
  • IDE system disk
    • no cache: damn slow
    • writeback: acceptable performance, works quite well
  • Network
    • (virtio nic) IP stack dies on heavy load
    • (rtl8139) loses some services under heavy load (no errors are logged, but the fileserver and printserver stop responding, while the server can still connect to the internet). The symptom is a very heavy CPU, network and I/O load on the guest for a while, followed by almost no network activity. Most likely related to SAP or the print service running on the machine, but it does not happen on the physical machine.
  • Memory balloon
    • Triggers a BSOD on heavy load

Windows Server Std 2008 R2 64-bit

  • Virtio system disk
    • no cache / writeback: triggers a BSOD under high load (a bit better than 2003)
  • IDE system disk
    • no cache: damn slow (8 hours for installation)
    • writeback: acceptable performance (~8x faster), works quite well
  • Network
    • (virtio nic) BSOD on heavy load (file transfer)
    • (rtl8139) quite slow for file transfer

Windows 7 Pro 32-bit

  • (virtio system disk, network and balloon) often triggered a BSOD while saving a 20MB LibreOffice file on the fileserver, but not always. This does not happen when saving locally. I tried again 10 times while writing this message, and every test was successful.
  • (virtio system disk, network and balloon) crash test (1): OK
(1) For crash tests, I've used http://www.jam-software.com/heavyload/ (limited to 10 minutes).
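To make the configuration space of these tests explicit, here is a small Python sketch that enumerates the combinations as plain QEMU command-line fragments. It is an illustration only, not what Proxmox generates: the image path and the `net0` netdev id are hypothetical, only the disk and NIC options are shown, and a matching `-netdev ...,id=net0` is assumed elsewhere on the command line. e1000 is added as a third NIC model for comparison.

```python
from itertools import product

# Hypothetical image path and netdev id; adjust to the real VM.
IMAGE = "/var/lib/vz/images/100/vm-100-disk-1.raw"
NETDEV_ID = "net0"

DISK_BUSES = ["virtio", "ide"]                       # -drive if= values
CACHE_OPTIONS = ["none", "writeback"]                # -drive cache= values
NIC_MODELS = ["virtio-net-pci", "rtl8139", "e1000"]  # -device model names


def qemu_args(bus: str, cache: str, nic: str, image: str = IMAGE) -> list[str]:
    """Disk and NIC fragment of a QEMU command line for one test case."""
    return [
        "-drive", f"file={image},if={bus},cache={cache}",
        "-device", f"{nic},netdev={NETDEV_ID}",
    ]


def build_matrix():
    """Yield (bus, cache, nic, args) for every combination to test."""
    for bus, cache, nic in product(DISK_BUSES, CACHE_OPTIONS, NIC_MODELS):
        yield bus, cache, nic, qemu_args(bus, cache, nic)


for bus, cache, nic, args in build_matrix():
    print(f"{bus:6} {cache:9} {nic:14}", " ".join(args))
```

Listing every combination this way makes it easier to see which cells were actually tested and which are still missing.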

Conclusions
  • Those tests are related to my use. I could not test all scenarios.
  • These tests were made with many versions of KVM (from about 2 years ago until now). The latest KVM versions shipped with Proxmox seem a bit more stable than previous ones.
  • Virtio drivers seem to be involved in many of the problems (i.e. BSODs).
  • I'm also running low-load guests with all virtio drivers, without any problem for more than one year of use.
  • Isolating errors when there is no log is very hard. We could still do many tests. Any suggestion for a reliable crash test is welcome.
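A minimal sketch of a repeatable test in the spirit of the 20MB-file failure above, assuming Python is available in the guest: write a 20MB file, fsync it, read it back and compare SHA-256 digests, in a loop limited to 10 minutes like the HeavyLoad runs. The function names and any target path are hypothetical. A BSOD shows up as the script simply dying; a checksum mismatch points at corruption somewhere in the write path. The read-back may be served from the guest's cache, so this is not a complete check of what actually reached the disk.

```python
import hashlib
import os
import time

# Assumed test parameters: 20MB like the LibreOffice file that triggered
# BSODs, and a 10-minute limit like the HeavyLoad runs.
FILE_SIZE = 20 * 1024 * 1024
DURATION_S = 10 * 60


def write_and_verify(path: str, size: int = FILE_SIZE) -> bool:
    """Write random data, fsync it, read it back and compare SHA-256 digests."""
    data = os.urandom(size)
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the (virtual) disk
    # Note: this read-back may be served from the guest's cache.
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected


def stress(path: str, duration_s: float = DURATION_S,
           size: int = FILE_SIZE) -> tuple[int, int]:
    """Repeat write_and_verify until the time limit; return (rounds, failures)."""
    rounds = failures = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        rounds += 1
        if not write_and_verify(path, size):
            failures += 1
            print(f"round {rounds}: checksum mismatch on {path}")
    return rounds, failures
```

For example, pointing `stress()` at a file on a (hypothetical) fileserver share would exercise the network path that crashed, and pointing it at a local file would exercise the system disk.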
For now, I'm not confident enough to migrate heavy-load Windows machines to KVM (no problem for Linux guests with kernel >= 2.6.32; older ones may also have problems). If you have had good experiences migrating Windows (fileserver, printserver, heavy applications), I'd like to know which settings you used.

Jean-Christophe Heger
 
I do not have any crashes on my Windows boxes, but if you tell me how to crash them, I will fire up some test VMs to validate your results.

So please give all the details of your test setup and we will see if we can validate it in our test lab.
 
I am running 17 servers spread over 2 hosts in full production use, with a mix of 64-bit 2008 R2 (which loooves KVM) and mostly 32-bit 2003. They see fairly heavy use, serving 100 client devices, printers and thin clients. We use 1TB of iSCSI-backed storage off a Nexenta Community Edition head end and regularly see 60MB reads across multiple simultaneous servers. Backups of the 5GB database run in about 2-8 minutes and do not interrupt users. Also, as of RC1 we no longer have the migrate-and-crash problem, and online migrations are quick, at about 1-2 minutes.

We do use virtio network and storage. So far we are good. We use "no cache" on all LVM storage and do write caching on the SAN. Our hardware is one Dell 1950 with 22GB and one Dell R410 with Xeon 5620s and 32GB RAM. The R410 is amazingly better than the older 1950: it can run all of our VMs while barely exceeding a load of 2, using 20GB of RAM with KSM.

The only major issues I have run into involve difficulties managing multipath and hot add/remove of devices and ISOs.
 
(Quoting jcheger's first post: Windows Server Std 2003 SP2 32-bit results with virtio and IDE system disks.)
Hi,
it looks like your underlying system has I/O problems? What does pveperf show?
I use virtio or IDE for Windows guests and the speed is good.
I have tried the virtio-net driver a few times but always had trouble with it. I use e1000 with the original Intel driver, and everything works fine and is very stable.

On PVE 2 I have only one Windows system (2008 64-bit), which runs quite well. On PVE 1.9 I have several Windows systems, some under high load. Normally no problems.

Udo
 
I've been using virtio for two years now with many Win 2008 R2 servers and they have worked flawlessly the whole time.

The few times I have had any issues the problem was related to hardware.
Bad RAM and buggy firmware in RAID cards and disks were the only things causing crashes for me in the past.

I always use cache=none on my LVM disks; writeback does fail with Windows under heavy load, but cache=none is fine.
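For reference, QEMU's -drive documentation describes the cache modes discussed in this thread in terms of underlying settings: whether a write is acknowledged before it is durable, whether the host page cache is bypassed (O_DIRECT), and whether guest flushes are ignored. The Python sketch below encodes that summary for the three modes used here (QEMU also documents `directsync` and `unsafe`). It explains what differs between the modes, not why a given Windows guest crashed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheMode:
    writeback: bool  # write is acknowledged once it sits in a cache, before it is durable
    direct: bool     # host page cache is bypassed (O_DIRECT)
    no_flush: bool   # guest flush requests are ignored


# Summary of QEMU's -drive cache= modes, following the QEMU documentation.
CACHE_MODES = {
    "writethrough": CacheMode(writeback=False, direct=False, no_flush=False),
    "writeback": CacheMode(writeback=True, direct=False, no_flush=False),
    "none": CacheMode(writeback=True, direct=True, no_flush=False),
}


def uses_host_page_cache(mode: str) -> bool:
    """True if guest I/O goes through the host's page cache."""
    return not CACHE_MODES[mode].direct


def acked_before_durable(mode: str) -> bool:
    """True if the guest must send flushes to make its writes durable."""
    return CACHE_MODES[mode].writeback


for name in CACHE_MODES:
    print(f"{name:12} host page cache: {uses_host_page_cache(name)!s:5}  "
          f"acked before durable: {acked_before_durable(name)}")
```

In other words, cache=none and cache=writeback both rely on the guest's storage driver issuing flushes correctly, while cache=writethrough makes every write synchronous at the cost of speed.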

You can see some of my benchmarks on 2008 R2 here: http://forum.proxmox.com/threads/8486-Windows-disk-IO-Performance

8 hour install using IDE?
Maybe your Disk storage system is just too slow.

What is the output of pveperf?
 
Thanks to all of you for sharing your experience. I should point out that I started using Proxmox at the start of the 2.x series, last year. Previously, I was using libvirt and virt-manager, with backported current KVM packages. By the way, thanks to the Proxmox team for your excellent work.

The 8-hour Windows 2008 installation (IDE) happened in June 2011. I did it twice, and spent some time figuring out what was happening. I found many people complaining about this issue in forums, and the best workaround I could find was to set the cache to writeback (local image or iSCSI, doesn't matter). I'm installing Win2008 right now, while writing this message, and it's already finished (amazingly fast on an old machine). So never mind, this issue belongs to the past.

Anyway, my bad experiences are very recent. One is with Win 2008 (running on libvirt / KVM 0.14), and the other is a migration of Win 2003 with SAP to a brand new Proxmox v2 server. On both I get a BSOD almost once per day if I use virtio drivers for the system disk (no cache). IDE is better, but still unstable (writeback is still enabled, but I will disable it following your remark). No visible errors, only the blue screen and a log entry saying the server restarted abnormally. With IDE, the Win 2003 + SAP server overloads twice a day and loses all network services, but it still answers pings and can browse the internet. We've migrated it back to a physical machine, and this doesn't happen anymore.

After Tom's answer, I tried to find a way to make a Win 2003 test guest crash. It happened 10 times this morning (BSOD), with fatal corruption at the end, but now there is no way to make it crash anymore. This is quite embarrassing and frustrating.

Here is the pveperf output for the server hosting Win 2003 + SAP, with 2 VMs running (max disk speed is ~120 MB/sec):

CPU BOGOMIPS: 24797.14
REGEX/SECOND: 1357749
HD SIZE: 1815.45 GB (/dev/md1)
BUFFERED READS: 103.08 MB/sec
AVERAGE SEEK TIME: 19.07 ms
FSYNCS/SECOND: 32.90
DNS EXT: 36.54 ms
DNS INT: 4.14 ms (...)

Thanks for your help, Jean-Christophe
 

The fsyncs/sec is very low and likely a large part of your problem.
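The gap is easy to quantify: an fsyncs/second figure is roughly the inverse of the average time one fsync takes. The small Python sketch below parses the pveperf lines posted above and converts them: 32.90 fsyncs/s is about 30 ms per fsync, while the 2400 fsyncs/s reported for a CCISS card in this thread is about 0.4 ms. The parser only assumes the `KEY: value` line format shown in this thread; it is not an official pveperf output parser.

```python
# pveperf lines as posted above (the DNS lines are left out).
PVEPERF_OUTPUT = """\
CPU BOGOMIPS: 24797.14
REGEX/SECOND: 1357749
HD SIZE: 1815.45 GB (/dev/md1)
BUFFERED READS: 103.08 MB/sec
AVERAGE SEEK TIME: 19.07 ms
FSYNCS/SECOND: 32.90
"""


def parse_pveperf(text: str) -> dict[str, str]:
    """Split 'KEY: value' lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


def ms_per_fsync(fsyncs_per_second: float) -> float:
    """Average time of one fsync implied by an fsyncs/second figure."""
    return 1000.0 / fsyncs_per_second


stats = parse_pveperf(PVEPERF_OUTPUT)
soft_raid = float(stats["FSYNCS/SECOND"])
cciss = 2400.0  # figure reported for the CCISS card in this thread
print(f"software RAID (md): {ms_per_fsync(soft_raid):.1f} ms per fsync")
print(f"CCISS card:         {ms_per_fsync(cciss):.2f} ms per fsync")
```

Every synchronous write the guest issues waits roughly that long, which is one plausible way a low figure turns into I/O wait and high load.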
 
Hi, we are using iSCSI direct LUN storage with virtio disks, and have had a lot of bluescreens with cache=none (Win2003/2008 R2).

(Just launching crystalmark always crashes the VM.)

Since we rolled back to cache=writethrough, we haven't had any BSOD.

UPDATE:
I have tested the latest virtio-win
http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/bin/virtio-win-0.1-22.iso

and I didn't get any BSOD with cache=none.
 
Hello Dietmar,

I'm a bit surprised by what I've found on your wiki. You say you have had trouble with software RAID. Software RAID is a bit more complex to configure than hardware RAID, and I understand your position on supporting it. But I don't agree that software RAID is harder to recover than hardware RAID. In more than 10 years, I have always been able to recover data from software RAID, but I have had several huge and fatal crashes with hardware RAID. Anyway, I have a spare 3ware card and will follow your suggestion.

By the way, the first machine (Win2008 fileserver) was on a CCISS card that reached 2400 fsyncs/s.

May I ask how reliable the fsyncs/s test is? It always seems very low on software RAID. I've also tried iSCSI: if I format the test partition as ext3, the result is twice as high as with ext4. Does this mean we should use ext3 and not ext4?
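On how an fsyncs/s figure is obtained: the idea is to rewrite a small block and fsync it as many times as possible for a few seconds. The Python function below is a rough sketch of that idea, not pveperf's actual code (block size and duration are arbitrary assumptions). Running it once in a directory on the ext3 partition and once on the ext4 one, on the same iSCSI volume, gives a like-for-like comparison; the result depends on the filesystem and its mount options, so a higher figure alone does not say which setup is safer.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory: str, duration_s: float = 3.0,
                      block: int = 4096) -> float:
    """Rewrite one small block and fsync it in a loop; return fsyncs per second.

    A rough analogue of the FSYNCS/SECOND idea, not pveperf's actual code.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    payload = b"\0" * block
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        count = 0
        start = time.monotonic()
        deadline = start + duration_s
        while time.monotonic() < deadline:
            os.lseek(fd, 0, os.SEEK_SET)  # overwrite the same block each time
            os.write(fd, payload)
            os.fsync(fd)                  # wait until the OS reports it durable
            count += 1
        return count / (time.monotonic() - start)
    finally:
        os.close(fd)
        os.remove(path)
```

For example, `fsyncs_per_second("/mnt/test-ext3")` and `fsyncs_per_second("/mnt/test-ext4")`, with hypothetical mount points.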

If I understand correctly, a low fsyncs/s value will cause very high loads because of I/O wait states. We have many Linux guests running on KVM, but we rarely see a load above 1 per CPU, and performance is excellent even on software RAID. Better fsyncs/s means better performance, fine. But could it be related to crashes in the guest?
 
Yes, I found it yesterday and upgraded my test machines. It seems to be much more stable. I'm still testing some heavy loads, but I haven't had any BSOD with this version yet.

I'm working on more tests with crystalmark to see if I can reproduce your symptom.
 
Spirit,

I've spent the last hour trying to trigger a BSOD with crystalmark 3.0.1, but no luck. It's irritating not being able to reproduce what happened many times yesterday morning. Everything works as if I had never had any problem, whatever drivers or settings I use.

Here are the tests I've run:

- Windows 2003 - IDE - no-cache: OK
- Windows 2003 - IDE - writeback: OK
- Windows 2003 - Virtio 0.1-15 - no-cache: OK
- Windows 2003 - Virtio 0.1-22 - no-cache: OK
- Windows 2008 - Virtio 0.1-22 - no-cache: OK

Both are fresh installs (less than 2 weeks, updated). Proxmox v2.0-30.
 
(Quoting spirit's post above: BSODs with cache=none on iSCSI direct LUN storage with virtio disks; none with cache=writethrough, and none with cache=none after upgrading to virtio-win 0.1-22.)

I find it interesting that you had BSODs with cache=none; for me cache=none has worked flawlessly and cache=writeback resulted in BSODs.
The latest virtio drivers are a big improvement in performance, I hope they do not introduce any stability issues.
 
May I ask how reliable the fsyncs/s test is? It always seems very low on software RAID. I've also tried iSCSI: if I format the test partition as ext3, the result is twice as high as with ext4. Does this mean we should use ext3 and not ext4?

At least, we use ext3 by default (as do many other distributions).
 
