Windows 2008 SP1 guest crashes -> corosync/TOTEM failures in host's syslog

r4pt0x · Nov 19, 2012

One of our Windows 2008 SP1 (64-bit) servers occasionally crashes under heavy load. It just happened again while compressing an old database backup (~5 GB) on the guest.

The (useless...) Windows event log records two failures AFTER the crash time logged by EventLog ("system has been rebooted at <date> <time>"); both were written after the reboot.

This entry is logged twice, ~5 and ~15 sec after the recorded crash time:

viostor / A reset to device "\Device\RaidPort1" was issued (I don't know the exact English wording; the German entry reads: "Ein Zuruecksetzen auf Geraet "\Device\RaidPort1" wurde ausgegeben")

~1-2 min after these events, I get the following log entries on the host system in /var/log/syslog:

Code:

Nov 19 14:45:36 proxmox corosync[1652]:   [TOTEM ] A processor failed, forming new configuration.
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] New Configuration:
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] #011r(0) ip(10.18.89.100) 
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] Members Left:
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] Members Joined:
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] New Configuration:
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] #011r(0) ip(10.18.89.100) 
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] Members Left:
Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] Members Joined:
Nov 19 14:45:36 proxmox corosync[1652]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 19 14:45:36 proxmox corosync[1652]:   [CPG   ] chosen downlist: sender r(0) ip(10.18.89.100) ; members(old:1 left:0)
Nov 19 14:45:36 proxmox corosync[1652]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 19 14:45:40 proxmox kernel: usb 3-2: reset low speed USB device number 3 using uhci_hcd

(The USB device is connected to the crashed VM.)
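
To line the host entries up against the guest's recorded crash time, the TOTEM failures can be pulled out of the syslog with a few lines of Python. This is only a minimal sketch: the function names, the explicit year argument (classic syslog timestamps carry no year) and the crash time in the example are assumptions for illustration, not values from the system.

```python
import re
from datetime import datetime

# Matches classic syslog lines written by corosync, e.g.
# "Nov 19 14:45:36 proxmox corosync[1652]:   [TOTEM ] A processor failed, ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"corosync\[\d+\]:\s+\[(?P<tag>[A-Z]+)\s*\]\s+(?P<msg>.*)$"
)


def totem_failures(lines, year):
    """Return the timestamps of '[TOTEM ] A processor failed' entries."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("tag") == "TOTEM" and m.group("msg").startswith("A processor failed"):
            hits.append(datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S"))
    return hits


def seconds_after(crash_time, events):
    """Return how many seconds each event happened after the guest crash."""
    return [(event - crash_time).total_seconds() for event in events]


sample = [
    "Nov 19 14:45:36 proxmox corosync[1652]:   [TOTEM ] A processor failed, forming new configuration.",
    "Nov 19 14:45:36 proxmox corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE",
    "Nov 19 14:45:40 proxmox kernel: usb 3-2: reset low speed USB device number 3 using uhci_hcd",
]

events = totem_failures(sample, 2012)
# Hypothetical guest crash time, only to show the calculation.
assumed_crash = datetime(2012, 11, 19, 14, 44, 0)
print(seconds_after(assumed_crash, events))
```

In practice the lines would come from /var/log/syslog instead of the inline sample.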

The guest uses virtio drivers for HDD and NIC. However, in the hardware properties of the hard drives ("RedHat VirtIO SCSI Disk Device"), Windows claims to use some Microsoft driver from 2006. Is this normal behavior for the RedHat VirtIO driver?

The other guests (1x W2K8 SP1 32-bit, 3x Debian container, 1x Debian VM, 1x SUSE VM) on this host are running fine - no errors in their logs or unusual behaviour when the W2K8 64-bit guest crashes. So I don't think it is host-related, but rather a problem with the W2K8 64-bit guest.

I'm not actively using clustering with this host - it was planned and preconfigured, but due to the low-bandwidth connection of the second node, that node was never added to the configuration. So the only node in the cluster is the 10.18.89.100 machine itself.
Is it safe to just stop the clustering service to ensure it isn't responsible for these odd crashes?

tom · Nov 19, 2012

Post the output of 'pveversion -v' and the VMID.conf file of the Windows guest.

Which virtio drivers do you use? (Provide the version number.)

r4pt0x · Nov 20, 2012

Sorry, totally forgot that.

Code:

# pveversion -v
pve-manager: 2.2-26 (pve-manager/2.2/c1614c8c)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-80
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-80
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-6-pve: 2.6.32-55
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-1
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-28
qemu-server: 2.0-64
pve-firmware: 1.0-21
libpve-common-perl: 1.0-37
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-34
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1

Code:

boot: cdn
bootdisk: virtio0
cores: 4
cpu: kvm64
cpuunits: 2500
ide2: NAS1:iso/mmc-a.iso,media=cdrom
memory: 16380
name: w2k8-64
net0: virtio=76:70:2D:D4:C8:E1,bridge=vmbr0
onboot: 1
ostype: w2k8
sockets: 1
startup: up=30
usb0: host=0529:0001
virtio0: local:105/vm-105-disk-1.qcow2
virtio1: local:105/vm-105-disk-2.qcow2

The VirtIO driver for the block devices is shown as 6.0.6002.18005 / date 21.06.2006. The driver vendor is Microsoft. The drives are shown as "Red Hat VirtIO SCSI Disk Device".
The Win2k8 R2 machine on another node shows version 6.1.7600.16385 but with the same date and vendor - so I assume this behaviour is normal? (This R2 machine runs perfectly fine, BTW...)

I'm currently planning to convert the disk images to .raw and connect them to the VM using IDE. The NIC will also be changed to the e1000 device to eliminate the VirtIO drivers as a variable.
The host system uses an LSI MegaRAID 8708 RAID controller. We once had trouble with an LSI driver bug that caused the controller to reset under heavy I/O, but that affected the whole system and generated lots of errors on the host. The controller has worked fine for the last 10 months, since the patched packages have been in use, so I don't think the problem is related to it. (The system was completely rebuilt with a clean new host installation and guest backups.)
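
The planned config change (virtio disks to IDE, virtio NIC to e1000) can be sketched as a small Python helper that rewrites the VM config text. This is an illustration under assumptions, not a Proxmox tool: the function name is made up, it only touches the keys shown in the config above, and it leaves the image file names alone, because converting the images themselves (for example with qemu-img convert) is a separate step.

```python
def switch_off_virtio(conf_text):
    """Return the VM config text with virtio disks and NICs replaced.

    virtioN disks are moved to free IDE slots, virtio NICs become e1000,
    and bootdisk follows the renamed disk. IDE slots already in use
    (for example a CD-ROM on ide2) are skipped.
    """
    lines = conf_text.splitlines()
    used_ide = {line.partition(":")[0] for line in lines if line.startswith("ide")}
    free_ide = [f"ide{i}" for i in range(4) if f"ide{i}" not in used_ide]

    renamed = {}
    out = []
    for line in lines:
        key, _, value = line.partition(":")
        value = value.strip()
        if key.startswith("virtio") and key[len("virtio"):].isdigit():
            if not free_ide:
                raise ValueError("no free IDE slot left for " + key)
            renamed[key] = free_ide.pop(0)
            out.append(f"{renamed[key]}: {value}")
        elif key.startswith("net") and value.startswith("virtio="):
            out.append(f"{key}: e1000={value[len('virtio='):]}")
        else:
            out.append(line)

    # bootdisk may appear before the disk it names, so fix it in a second pass.
    for i, line in enumerate(out):
        key, _, value = line.partition(":")
        if key == "bootdisk":
            out[i] = f"bootdisk: {renamed.get(value.strip(), value.strip())}"
    return "\n".join(out)


sample = """bootdisk: virtio0
ide2: NAS1:iso/mmc-a.iso,media=cdrom
net0: virtio=76:70:2D:D4:C8:E1,bridge=vmbr0
virtio0: local:105/vm-105-disk-1.qcow2
virtio1: local:105/vm-105-disk-2.qcow2"""

print(switch_off_virtio(sample))
```

The VM has to be shut down before its config is changed, and Windows needs the IDE driver enabled at boot, otherwise the guest may not start from the moved disk.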

Edit:
This is the error we were facing plus the solution:
http://www.anchor.com.au/blog/2012/...rnel-megaraid_sas-driver-from-crash-to-patch/

r4pt0x · Nov 30, 2012

No suggestions? Yesterday the Windows 2008 SP1 machine rebooted again.

/var/log/syslog:

Code:

Nov 29 09:02:37 proxmox corosync[3097]:   [TOTEM ] A processor failed, forming new configuration.
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] CLM CONFIGURATION CHANGE
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] New Configuration:
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] #011r(0) ip(10.18.89.100) 
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] Members Left:
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] Members Joined:
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] CLM CONFIGURATION CHANGE
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] New Configuration:
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] #011r(0) ip(10.18.89.100) 
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] Members Left:
Nov 29 09:02:37 proxmox corosync[3097]:   [CLM   ] Members Joined:
Nov 29 09:02:37 proxmox corosync[3097]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 29 09:02:37 proxmox corosync[3097]:   [CPG   ] chosen downlist: sender r(0) ip(10.18.89.100) ; members(old:1 left:0)
Nov 29 09:02:37 proxmox corosync[3097]:   [MAIN  ] Completed service synchronization, ready to provide service.

The guest again shows the "device reset on \Device\RaidPort1" error.

The issue isn't reproducible, except that these crashes appear MOSTLY when the load on the MSSQL server running on this guest increases, but sometimes they occur under normal load, just out of nowhere...

The drive images will be converted to .raw this weekend, but I don't think this will solve the problem. As said: on another host the Windows 2008 R2 machine is running on qcow2 images using virtio drivers without any problems, even when torturing the machine with I/O tests...

r4pt0x · Jan 14, 2013

I changed the HDD images to raw, which brought a small performance increase, but the random crashes under higher load remained. After connecting them to the VM via IDE, there has not been a single crash. So virtio seems to be the problem...

Changing the virtio network device back to e1000 decreased the connection speed, but it got rid of the lost network connections at 1 out of 3 reboots - and again, the virtio driver was to blame...

It seems the virtio drivers are far from stable for Windows 2008?

mmenaz · Jan 14, 2013

For the benefit of the whole community, what version of virtio were you using? I think they are under active development and bug fixing, and often incompatible with some kvm versions. You don't seem to have the most up-to-date version of Proxmox and kvm; have you tried the latest virtio?
http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/bin/virtio-win-0.1-49.iso
Have you tried to upgrade Proxmox? (I don't know whether you will run into regressions, though; it's just a question, not a suggestion.)

r4pt0x · Jan 14, 2013

The Proxmox system, along with all guests and drivers, is updated every Monday (the only day I can shut down the system in the evening if necessary). The virtio drivers were taken from the Fedora page (linked from linux-kvm.org). I found 3 versions of the .iso in the images directory, so at least those 3 (including the latest, 0.1-49) were tested on the Windows guest.

Today's package versions on the host:

Code:

# pveversion -v
pve-manager: 2.2-32 (pve-manager/2.2/3089a616)
running kernel: 2.6.32-17-pve
proxmox-ve-2.6.32: 2.2-83
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-6-pve: 2.6.32-55
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-34
qemu-server: 2.0-72
pve-firmware: 1.0-21
libpve-common-perl: 1.0-41
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-36
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.3-10
ksm-control-daemon: 1.1-1
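
To see at a glance what changed on the host between two `pveversion -v` outputs, the listings can be compared with a short Python sketch. The function names are made up for illustration, and the two inputs below are excerpts of the November and January outputs posted in this thread.

```python
def parse_pveversion(text):
    """Turn 'name: version' lines of `pveversion -v` output into a dict."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(":")
        # Skip the echoed command line ("# pveversion -v") and empty lines.
        if sep and not line.startswith("#"):
            versions[name.strip()] = version.strip()
    return versions


def changed_packages(old_text, new_text):
    """Return {name: (old_version, new_version)} for entries that differ."""
    old, new = parse_pveversion(old_text), parse_pveversion(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


november = """# pveversion -v
pve-manager: 2.2-26 (pve-manager/2.2/c1614c8c)
running kernel: 2.6.32-16-pve
qemu-server: 2.0-64
pve-qemu-kvm: 1.2-7
vncterm: 1.0-3"""

january = """# pveversion -v
pve-manager: 2.2-32 (pve-manager/2.2/3089a616)
running kernel: 2.6.32-17-pve
qemu-server: 2.0-72
pve-qemu-kvm: 1.3-10
vncterm: 1.0-3"""

for name, (old, new) in changed_packages(november, january).items():
    print(f"{name}: {old} -> {new}")
```

Packages that appear in only one of the two listings show up with `None` on the missing side.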

We are also still running a 32-bit version of Windows 2008 (the licence is valid for 64-bit and 32-bit). As soon as the last software depending on the 32-bit OS gets updated to run on 64-bit (should have happened in August 2012...), I can use this licence for a second 64-bit system for testing purposes and investigate this problem. For now I'm glad the production system is running rock-stable as it should, so I won't touch the virtio drivers on this system again until I can evaluate them on a test server...

However, the 32-bit 2008 server is also using the virtio drivers, and it never showed these crashes, not even back when it was hosting an MSSQL 2005 server. (The software using the MSSQL server was updated to 64-bit/MSSQL 2008 R2, so it was newly installed on the 64-bit machine.)
Maybe something changed in the virtio drivers - I will apply some random disk I/O on the 32-bit system with the newest virtio drivers and see what happens.
If it's fine, the problem seems to affect only 64-bit, and/or the drivers don't like the kind of load MSSQL 2008 R2 generates?

r4pt0x · Jan 25, 2013

Small update:

Since last week's updates I'm having trouble with high CPU load (interrupts) on the guest, causing 50-70% load and making the MSSQL database utterly slow.
I traced it down to the storage drivers - so I will try changing to the latest virtio-SCSI drivers today.

Regarding the virtio block device drivers: I have meanwhile tested with another 2008 non-R2 guest on the second Proxmox system, on which the R2 machine runs. Seems like 2008 "R1" doesn't load the correct drivers. I always get the error "newest driver already installed" and it stays with the decades-old crap from Microsoft. When replacing the driver files manually, the system fails to boot. I tried several driver versions - always the same problem, so it must be some f***ed up driver management in Windows 2008 preventing it from accepting newer and correct drivers for virtio devices. The R2 machine accepts the new drivers with no problems.

Any idea how to sneak the drivers into 2008 "R1"?

udo · Jan 26, 2013

r4pt0x said:
...
Seems like 2008 "R1" doesn't load the correct drivers. I always get the error "newest driver already installed"
...
Any idea how to sneak the drivers into 2008 "R1"?

+1

If anybody has a hint, I'm happy too.

Udo

snowman66 · Jan 28, 2013

udo said:
+1

If anybody has a hint, I'm happy too.

Udo

The VirtIO controller driver can be updated. I'm not sure about the disk drive driver - I guess it is the same for all 2008 versions?

Edit: the install ISO that I use (trial version, x86): http://www.microsoft.com/en-us/download/details.aspx?id=8371

r4pt0x · Jan 28, 2013

snowman66 said:
The VirtIO controller driver can be updated. I'm not sure about the disk drive driver - I guess it is the same for all 2008 versions?

Thanks for this post - although Windows again refused to update the driver, you pointed me in the right direction:

With one disk attached via VirtIO, remove the VirtIO SCSI CONTROLLER instead of the disk drive(s) in Device Manager.
Remove the old drivers manually.
Rescan the hardware (or reboot) - when the new hardware gets recognized, abort the automated driver installation (balloon message), choose the driver for the controller manually, and it will finally accept the driver!

Update:

Unfortunately it didn't solve the high CPU load caused by interrupts - but I was finally able to trace it down to USBPORT.SYS.

Of course Windows can't find any driver newer than the Microsoft driver from 2006, and trying to install the latest ICH9 USB driver from Intel (2010) fails, as expected: "newest driver already installed". I pray for the day this crappy DOS-GUI called "Operating System" finally vanishes, at least from servers...

Issues with USB and high load are documented for every Windows version since XP (!!), even at support.microsoft.com, but the suggested fixes were either irrelevant to this setup, completely useless ("disable onboard USB and buy a USB controller card"), or did not work, so I also wrote to MS Support. The reply came quicker than it took me to write the very detailed description of the error and system configuration - of course blaming the Linux virtualization for the problem and not even trying to give any helpful support... Thanks Microsoft, you reminded me why I haven't used any of your products at home for 9 years... Oh, BTW: the problem occurred after the Windows server installed updates, not the Proxmox system...

Back on topic:
I'm trying to reproduce the problem on the 2008 test VM by installing updates step by step - I bet they messed something up in one of their updates ~2 weeks ago...
The next question would be how to roll back the specific update on the production server...

spirit · Jan 30, 2013

Can you try adding

tablet: 0

to your VM config file and restarting the VM?

It will disable the USB tablet device (used to get accurate mouse pointer positioning), which can generate a lot of interrupts.

spirit · Jan 30, 2013

Hi,
Can you try adding

tablet: 0

to your VM config file?

It'll disable the USB tablet mouse pointer (used to get accurate mouse positioning in the console). It can send a lot of interrupts.

r4pt0x · Feb 4, 2013

I've already tried tablet: 0. The problem seems to be the USB HASP key we need on this VM.
When unplugging it, CPU load immediately drops below 10%; after plugging it back in, the interrupts skyrocket within ~30 min and the system becomes unresponsive again.

After being sick last week, I'll try to replace USBPORT.SYS with the one from a Win7 client. I'll report back tomorrow after the machine has restarted (scheduled for tonight).

Update:
The problem is definitely the HASP key - even with the replaced USBPORT.SYS. The tablet is not connected (I have to change that back, because the mouse is completely useless via VNC) and interrupts are still using up to 50% CPU...
As it worked fine for over 18 months and is still working fine on the 2008 R2 machine at our second branch, which is also running on Proxmox, Win 2008 is to blame...

As there are Linux drivers available for the SafeNet/Aladdin HASP HL dongles, I'll try to get the license manager to access the dongle remotely on the Proxmox host.
