Discussion in 'Proxmox VE: Installation and configuration' started by cybermcm, Oct 24, 2017.
Oh that was fast.
Thank you very much!
pve-kernel-4.10.17-5 works well.
FWIW, the problem was MOST pronounced on ZFS drives with the 4.13 kernel. As in, Windows was seldom able to boot without BSODs.
The error occurred on my XFS drive as well, but only a handful of times, and performance seemed to be the same. On ZFS drives with the new kernel, there was a severe degradation in VM I/O performance.
Unfortunately I am a Linux noob. As you can see here:
"GRUB_CMDLINE_LINUX_DEFAULT="quiet" put scsi_mod.use_blk_mq=n
/usr/sbin/grub-mkconfig: 9: /etc/default/grub: put: not found"
What am I missing?
it should be:
but it didn't work, at least not for me; the bug still exists with the 4.13 kernel
and don't forget to initiate
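To spell out the fix for the error quoted above: the parameter has to go inside the quotes of the existing GRUB_CMDLINE_LINUX_DEFAULT variable, not after it as a separate word. A sketch (the rest of your /etc/default/grub may differ):

```shell
# /etc/default/grub -- append the parameter inside the existing quotes:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=n"

# Then regenerate the GRUB config and reboot:
# update-grub
```

The `put: not found` error happens because /etc/default/grub is sourced as shell code, so the stray word `put` outside the quotes gets executed as a command.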
Does someone have a good test case to reproduce the Windows blue screen? On my machine it can take up to 24 hours for Windows to crash. Debugging is hard under these conditions.
I am also experiencing the CPU flag related issue, and not the VirtIO related one with kernel 4.13.
Two VMs, both 2012 R2 fully updated as of 11/9/17, no VirtIO drivers installed in either; they are more or less identical, as they are AD DCs. I have ensured that the VM configurations are identical, but that doesn't matter anyway, because it is always the one on the Xeons that crashes. I've tried the host, kvm64, and qemu64 CPU types, with no difference between them related to crashes.
I have two hosts, one with 2x Opteron 6220, the other with 2x Xeon L5420. On the Xeon system, either VM will BSOD with Critical_Structure_Corruption after a few minutes up to a few hours. On the Opteron system, both VMs are stable. Both Proxmox systems were fully updated on 10/24 and are running:
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
I have just updated the Xeon system to the pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb package (and everything zfs related to 0.7.3) and am about to reboot it to see if that resolves the issue. I have not tried updating the microcode, and would like more details about that before I try it.
@wolfgang: I still have the issue with the 4.13 kernel and normally Windows crashes within one hour. How can I assist you to track this down?
Thanks. For apples-to-apples comparison, I have done the following items:
on the Xeon system, installed intel-microcode, and all updates except the kernel
on the Opteron system, installed amd64-microcode, and all updates including the kernel
Xeon L5420 microcode was 0xa0b, and the Opteron 6220 microcode was 0x600063d; neither changed after the install and reboot. However, per Blue screen with 5.1, I don't expect this to make a significant difference even if there had been an update. Also, the BIOS is the latest for each board, so maybe that's why the microcode was already up to date. At least now I can offer a direct comparison between the 4.10.17-5 and 4.13.4-1 kernels. If I don't post again, you can assume that the VM hasn't crashed with the Critical_Structure_Corruption BSOD; otherwise, if it does, I'll report it. I'll be keeping an eye on this thread either way.
Edit: I confirmed with dmesg that the microcode update driver did indeed run during boot on both systems.
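For anyone wanting to do the same check: a quick way to see the currently loaded microcode revision, and whether the kernel's loader ran at boot (commands assume a Linux/Proxmox host; the fallback message is just for machines that don't expose the field):

```shell
# Revision the CPU is currently running (one core's line is enough):
grep -m1 '^microcode' /proc/cpuinfo || echo "microcode line not exposed here"

# Whether the kernel's microcode loader did anything at boot (may need root):
# dmesg | grep -i microcode
```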
I'll try to find a system, if we have enough spare parts around. I'll check that out on Monday and give you feedback.
@wolfgang: is there some command we could run that inventories the CPU features and would help determine the least common denominator for this issue?
Have you tried this?
I’m aware of that, but I don’t know if that would specifically help @wolfgang see the least common denominator for CPU features.
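Not an official tool, but a plain-shell sketch of such an inventory: dump the flags on each host, one per line, then intersect the two lists (the file names below are placeholders):

```shell
# On each host: extract the flag list from /proc/cpuinfo, one flag per
# line, sorted, into a per-host file.
grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2- | tr ' ' '\n' \
    | grep -v '^$' | sort -u > "flags-$(hostname).txt"

# Copy both files to one machine, then print only the flags present in
# BOTH sorted lists (the common denominator):
# comm -12 flags-xeon.txt flags-opteron.txt
```

`comm -12` suppresses the lines unique to each file, leaving only the features both CPUs share.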
Based on certain things I've done on my W10 VM and one of the Proxmox hosts, here is a short report from my side:
1. upgraded to virtio-win-0.1.141 --> blue screen appears again
2. upgraded the Intel microcode to 0x20 --> blue screen appears again
3. downloaded/installed and ran the 4.10.17-5-pve kernel --> blue screen does not appear again... fingers crossed.
Upgrading Windows from 1703 to 1709 was not possible after steps 1 and 2 because of the repeated blue screens. Only after step 3 was I able to upgrade Windows from version 1703 to 1709, and it is still stable.
@wolfgang: regarding the provided special kernel, do you think a solution can be expected (maybe in the near future)? Then we would be able to use the standard apt-get upgrade process again, with all the standard components from the pve-no-subscription repository.
Many thanks for your effort.
I migrated a physical Windows machine to a VM last week and tracked my BSOD down to the qxl driver.
Interesting. I've upgraded two hosts so far with no issues but I don't use the qxl driver in Windows at all.
For me, running that 4.10 kernel is the only solution with proven stability. I've achieved a week of uptime now, instead of 2 or 3 blue screens every day.
@morph027: I'm not using the qxl driver either and still get blue screens...
@wolfgang: any news, can we assist to help you track this down?
@cybermcm It looks like a problem in the MMU of KVM.
I encountered this on a fresh 5.1 install with a Windows Server 2012 R2 VM, and on a Windows Server 2012 R2 VM on an upgraded system. I am currently downgrading both to kernel 4.10.
New system:
Dual Xeon E5-2620V4s
RAID backed storage on an Adaptec 8805 HBA
SuperMicro X10-DRW-i mainboard
Old system, upgraded:
Dell Poweredge R520
RAID backed storage on a PERC H710 HBA
Dual Xeon E5-2430 V0s
Fortunately the SMC system isn't in production yet. It definitely BSOD'd at least once under heavy I/O load. If there's any more information I can provide, please let me know.
Edit: Looking through the dmesg output, I noticed a ton of messages regarding linux_edac scrolled by that don't appear on 4.10: a bunch of PCI IDs, then a complaint about not being able to find a Broadcom device. I don't have a full output, unfortunately.