pve-kernel-5.0.21-4-pve causes Debian guests to reboot-loop on older Intel CPUs

Here are the tests I was planning to run, unless they're no longer necessary:

  • Debian 10 guests
  • The GRUB argument list you provided
  • Setting the processor type to particular values (I had already tried kvm64 and host, IIRC)

I have reduced the regression to three commits introduced by this kernel update.
I'm currently checking which one of those is the problematic one and how best to solve it (just revert, or a follow-up fix); that may depend on discussion with upstream, if it's not already fixed there too, but AFAICT the 4.19 stable releases are affected as well.
So no, currently we're all set with testing. But once we have a proposed fix, we would naturally appreciate anybody testing it. :)

I'll post in this thread once we do (not sure if I'll get around to uploading something today).
 
Perfect. Thanks for looking into the issue, even if it only affects older hardware. I should be able to help test kernels too, if that's ever needed.

Have a good week,
 
I'll test the fix.
 
I've got a bad commit, commented on the kernel.org bug report [0], will start a discussion with the Ubuntu kernel devs (as they quite probably have this issue too), and then go to sleep :) Tomorrow we'll hopefully have a clear mind and a correct solution.

[0]: https://bugzilla.kernel.org/show_bug.cgi?id=205441#c1

IMO, just reverting that commit may not be the best way, as the newer kernel (5.3.7) doesn't show this issue even though it contains that commit too. So there may just be another fix missing.
 
A CPU from back then with a 90 W TDP can be replaced with one with a ~6 W TDP nowadays, just saying..

Yeah, it doesn't add up so easily. I also switched from a new very-low-voltage 10 W Atom to a 10-year-old dual Xeon at 60 W (L-series), just for more I/O options and much more RAM (576G), and yes, it is at least 12x more power hungry. The low-power box was already maxed out and cost more than the used dual Xeon, mobo, cooling, and extra RAM combined. According to tests, the Xeon is "only" 2.5x faster in single-thread performance, but of course outperforms it in multi-threaded applications.
 

I was probably simplifying too much, and naturally one needs to watch out for the production (resource) cost of new hardware if one wants to keep things environment-friendly too, not only watch the power and thus cost savings..

Xeon vs. Atom is naturally also a completely different thing regarding the I/O interfaces available, sure. IMO, Intel's more recent policy of splitting every single feature into an extra CPU model doesn't help either. But there are also modern alternatives using less power for the same (albeit likely not <10 W) while still having plenty of I/O options, e.g., some AMD ones :)

But anyway, that's quite a bit off-topic and the thread is already huge as-is, so maybe let's keep the pros and cons of old hardware out of here from now on :)
 
There's no way that Intel has shipped microcode updates for your specific model in over 5 years, IMO.

But there are two things you could try temporarily:
  1. Install the 5.3-based kernel we're currently evaluating for the next Proxmox VE release, see here (that kernel is now also available on pve-no-subscription)
  2. Add the following to your kernel boot command line: noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off, e.g., in the /etc/default/grub file in the GRUB_CMDLINE_LINUX="<flags here>" variable, then run update-grub and reboot
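Step 2 can be sketched as a small script; this runs against a stand-in file (/tmp/grub.demo is hypothetical, created here just for the demo), while on a real host the same sed edit would target /etc/default/grub itself, followed by update-grub and a reboot.

```shell
#!/bin/sh
# Sketch: append the mitigation-disabling flags to GRUB_CMDLINE_LINUX,
# preserving whatever flags are already set between the quotes.
FLAGS='noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off'
demo=/tmp/grub.demo
printf 'GRUB_CMDLINE_LINUX="quiet"\n' > "$demo"   # stand-in for /etc/default/grub
# Insert the flags inside the existing quotes of GRUB_CMDLINE_LINUX
sed -i "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 $FLAGS\"/" "$demo"
grep '^GRUB_CMDLINE_LINUX=' "$demo"
# on the real file, follow up with: update-grub && reboot
```

Note that these flags disable the Spectre/Meltdown-class mitigations, so this is only meant as a temporary diagnostic step, not a permanent configuration.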

Hi!

Another one here with the same problem: the -4 kernel shows the same bug. Old hardware with a boot loop in Win10 and Ubuntu VMs; the OpenVZ containers work fine. I've tested the newest kernel and it works like a charm.

The specs are an IBM System x3650 - Xeon E5420, 24 GB FB-DIMM RAM
 
Another one with an ancient CPU ;-) The new 5.3.7 kernel works, thanks for the hint. The current 5.0.21-4 caused blue screens with Windows Server 2019.
root@host04:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
Stepping: 7
CPU MHz: 2019.532
CPU max MHz: 2499.0000
CPU min MHz: 2003.0000
BogoMIPS: 4999.46
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti tpr_shadow vnmi flexpriority dtherm
 
We have just uploaded a new kernel package to pvetest which should resolve those issues; at least I could not reproduce them with it any more. The package is pve-kernel-5.0.21-4-pve in version 5.0.21-9.

Feedback would be appreciated.
 
Hello. I would like to add that I also continue to use an old CPU in my home environment; it still does what I need it to. In my case, I added an NVMe PCIe storage expansion card, which has really given new life to this old system and kept it out of the trash heap.

Anyway, all that to say, I appreciate all that you guys are doing and the continued consideration/support of this older hardware.
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 30
Model name: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz


Best.
 
Problem solved!

The test kernel has been removed, and the command-line workaround in GRUB removed as well. Working fine!

root@Proxmox-x3650:~# uname -a
Linux Proxmox-x3650 5.0.21-4-pve #1 SMP PVE 5.0.21-9 (Mon, 11 Nov 2019 14:12:37 +0100) x86_64 GNU/Linux
root@Proxmox-x3650:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 38 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Stepping: 10
CPU MHz: 2493.920
CPU max MHz: 2490.0000
CPU min MHz: 1992.0000
BogoMIPS: 4987.84
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority dtherm
 

Great work bisecting the commits, Thomas!
Next time we meet, I hope we have time and I can buy you a beer.
 
Let me share part of my current pain in the ass with one of our Dedicated servers at Hetzner (PX92).

Scenario: I am trying to run a stable Debian 10 OS under Proxmox 6. What differs from others involved in this thread is that my CPU is quite new (Intel® Xeon® W-2145 Octa-Core, Skylake-W).

I am installing fresh Debian 10 from netinst, as I cannot boot the Proxmox ISO directly at Hetzner. Fresh Debian 10 with kernel 4.19.0-4 is so far the only kernel which survives a host reboot.

In the first iteration I followed the manual install guide and deployed the buggy kernel 5.0.21-4-pve. With this one, the host doesn't survive the reboot and ends with a blinking cursor. Then I went through this thread, and in the second iteration I manually installed 5.3.7-1-pve, removed 5.0.21-4, ran update-grub, and it still doesn't survive the reboot, but the behavior is different: no cursor at all. :)

Hetzner Support's only help regarding this issue is the recommendation to use the 4.19.x kernel branch, which in my view means a downgrade from Proxmox 6 to Proxmox 5, and that is not a solution for us ATM.

Many thanks for any fruitful hints!
 
I'd argue that your situation is a bit different: people in this thread could always reboot the host, but some guest VMs did not work. So it may be worth opening another thread.

Can you first try removing any "quiet" from /etc/default/grub (the GRUB_CMDLINE_LINUX variable) and running update-grub? That should hopefully give you a bit more than a blinking cursor.
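The "quiet" removal could be scripted along these lines; /tmp/grub.quiet.demo is a made-up stand-in created for the demo, whereas on the real host the edit would go to /etc/default/grub, followed by update-grub.

```shell
#!/bin/sh
# Sketch: drop the word "quiet" from the kernel command line so the kernel
# prints its boot messages to the console instead of suppressing them.
demo=/tmp/grub.quiet.demo
printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > "$demo"  # stand-in contents
sed -i 's/\bquiet\b *//' "$demo"
grep '^GRUB_CMDLINE_LINUX=' "$demo"
# on the real file, follow up with: update-grub
```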

Then, there's a newer 5.3.10 based kernel available, also worth a try.
 
Sorry, my bad. You are right that my host's behavior is different, so I have opened a new thread here.
 
