[SOLVED] Random Crashes/Reboots with Proxmox VE 6.1 on EX62-NVME (Hetzner)

chotaire · Feb 10, 2020

@Licht I had to have the server replaced three times; only since then have I had no more crashes (so far). They don't make a fuss about it either. Apparently enough people have problems that they will readily replace the whole server twice within two days without asking many questions. Someone else in this thread also needed three or four replacements before things finally settled down. So my clear recommendation: keep replacing until it runs, or switch to a different hardware model entirely.

KingArthur98 · Feb 11, 2020

Hello,

I'm reporting back. Sadly, my system did not hold. After a week of uptime it kernel-panicked, and after that it was only stable for 2-3 hours at a time.
They replaced my server again (4th time), and that one only lasted 5-10 minutes until the kernel hung.

I managed to dig deeper into the logs, and everything pointed to some Intel idle state hang. I searched for this online and came across an active topic on Intel Bay Trail CPUs that have the exact same problem and crash messages.

This is weird because the Bay Trail architecture is completely different from the 9900K Coffee Lake. Because I was so desperate and angry, I tried their fix anyway. https://bugzilla.kernel.org/show_bug.cgi?id=109051

"cstates: intel_idle.max_cstate=1 required to prevent crashes - Baytrail "
I don't know if it's a coincidence, but my server has been up for almost 4 days now without a single issue. It could be coincidence so I will keep reporting back.

Greetings,
Merlin

chotaire · Feb 12, 2020

Hi Merlin,

I've read something similar: people were recommending disabling C-states in the BIOS. I looked into this, but the BIOS firmware on this machine gives me no option to disable or configure any C-states. I've debugged the machine (which is still stable after its third replacement) and found that it makes extensive use of C3 and deeper C-states. So, just to be on the safe side, I applied the kernel parameter, as I don't think there is any side effect other than higher power consumption and possibly a higher idle CPU temperature. For people wondering how to actually make this change on Proxmox:

Edit the file /etc/default/grub and modify GRUB_CMDLINE_LINUX_DEFAULT. I also use consoleblank=0 so that, if the system crashes and becomes unresponsive, I can ask Hetzner to connect a KVM and still see the console.

Code:

GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"
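
For anyone who prefers to script the edit rather than open an editor, the following is only a rough sketch. It assumes GNU sed (as shipped with Debian/Proxmox), a single GRUB_CMDLINE_LINUX_DEFAULT="..." line in the file, and parameters that contain no | or & characters. The function name add_kernel_param is made up for illustration. It keeps a .bak copy and does nothing if the parameter is already present:

```shell
# add_kernel_param FILE PARAM   (hypothetical helper, not part of any package)
# Appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already listed.
add_kernel_param() {
    file=$1
    param=$2
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    # Already present as a whole word: leave the file untouched.
    case " $current " in
        *" $param "*) return 0 ;;
    esac
    # Keep a backup before modifying the file in place.
    cp "$file" "$file.bak"
    # Join old value and new parameter, dropping the leading space if the old value was empty.
    new=$(printf '%s %s' "$current" "$param" | sed 's/^ //')
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file"
}

# Example call (run as root, then continue with update-grub):
# add_kernel_param /etc/default/grub intel_idle.max_cstate=1
```

Calling it twice with the same parameter leaves the line unchanged, so it is safe to rerun.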

Next apply the grub configuration:

Code:

# update-grub

Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.3.18-1-pve
Found initrd image: /boot/initrd.img-5.3.18-1-pve
Found linux image: /boot/vmlinuz-5.3.13-3-pve
Found initrd image: /boot/initrd.img-5.3.13-3-pve
done

Reboot the machine when done. To double-check that no C-states other than C0 and C1 are used after the reboot, run the following command (turbostat is part of the linux-cpupower package):

Code:

# turbostat -S --debug sleep 10
<...>
usec    Time_Of_Day_Seconds     APIC    X2APIC  Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  GFX%rc6 Totl%C0 Any%C0  GFX%C0  CPUGFX%      Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_%   RAM_%
  535   1581501460.693717       -       -       76      1.63    4663    3600    48083   0       1037    77067   0.00    98.34   98.37   0.00    0.00    0.00    46      47      99.59   25.44   18.26   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.00    0.00    27.37   26.19   0.00    0.00    0.00    0.00

As can be seen, all idle time (everything outside C0, which is shown as Busy%) is now accounted for in C1%.
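
Besides turbostat, a quick sanity check after the reboot is to confirm that the parameter actually reached the kernel. This is a minimal sketch, assuming /proc/cmdline holds the boot command line and that the intel_idle driver exposes its limit at /sys/module/intel_idle/parameters/max_cstate. The function name check_cstate_limit is invented for this example, and both paths can be overridden as arguments:

```shell
# check_cstate_limit [CMDLINE_FILE] [PARAM_FILE]   (hypothetical helper)
# Reports whether intel_idle.max_cstate was passed at boot and what the driver shows.
check_cstate_limit() {
    cmdline_file=${1:-/proc/cmdline}
    param_file=${2:-/sys/module/intel_idle/parameters/max_cstate}
    # The parameter must appear on the command line the running kernel was booted with.
    if ! grep -q 'intel_idle\.max_cstate=' "$cmdline_file"; then
        echo "intel_idle.max_cstate is not on the kernel command line"
        return 1
    fi
    # Print the value the driver itself reports.
    echo "intel_idle.max_cstate reported by the driver: $(cat "$param_file")"
}

# Example call on the live system:
# check_cstate_limit
```

If the first check fails, the usual causes are a forgotten update-grub or a reboot into a different boot entry.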

Cheers
Marc

chotaire · Feb 12, 2020

@Licht Can you test whether this kernel configuration change works for you? Please report back.

Licht · Feb 12, 2020


# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping        : 9
microcode       : 0x21
cpu MHz         : 2998.120
cache size      : 8192 KB

# cat /sys/module/intel_idle/parameters/max_cstate
9

# grep intel_idle /var/log/kern.log.1 /var/log/kern.log
/var/log/kern.log.1:Feb  4 14:50:46 avmgmt-ve kernel: [    0.815523] intel_idle: MWAIT substates: 0x1120
/var/log/kern.log.1:Feb  4 14:50:46 avmgmt-ve kernel: [    0.815524] intel_idle: v0.4.1 model 0x3A
/var/log/kern.log.1:Feb  4 14:50:46 avmgmt-ve kernel: [    0.815733] intel_idle: lapic_timer_reliable_states 0xffffffff
/var/log/kern.log.1:Feb  4 15:33:30 avmgmt-ve kernel: [    0.799643] intel_idle: MWAIT substates: 0x1120
/var/log/kern.log.1:Feb  4 15:33:30 avmgmt-ve kernel: [    0.799643] intel_idle: v0.4.1 model 0x3A
/var/log/kern.log.1:Feb  4 15:33:30 avmgmt-ve kernel: [    0.799847] intel_idle: lapic_timer_reliable_states 0xffffffff
# #  On Feb 11 a VM crashed; dist-upgrade, reboot:
/var/log/kern.log:Feb 11 14:38:51 avmgmt-ve kernel: [    0.823167] intel_idle: MWAIT substates: 0x1120
/var/log/kern.log:Feb 11 14:38:51 avmgmt-ve kernel: [    0.823168] intel_idle: v0.4.1 model 0x3A
/var/log/kern.log:Feb 11 14:38:51 avmgmt-ve kernel: [    0.823371] intel_idle: lapic_timer_reliable_states 0xffffffff

As suggested above, I replaced
GRUB_CMDLINE_LINUX_DEFAULT="nomodeset"
with
GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"
then ran update-grub and rebooted. I will report back. Many thanks!

But is this approach relevant for the Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz?
Is there a C-state between 1 and 9 that could also be tested? I would like to keep power consumption as low as possible.
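
To see which idle states the kernel actually exposes on a given machine, the cpuidle directory in sysfs can be listed. This is only a sketch, assuming the standard /sys/devices/system/cpu/cpu0/cpuidle/stateN/name layout. The function name list_idle_states is made up here, and the listing says nothing about which of these states is safe:

```shell
# list_idle_states [CPUIDLE_DIR]   (hypothetical helper)
# Prints one line per idle state: the sysfs directory name and the state name.
list_idle_states() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for dir in "$base"/state*; do
        # Skip anything that is not a readable state directory.
        [ -r "$dir/name" ] || continue
        printf '%s %s\n' "${dir##*/}" "$(cat "$dir/name")"
    done
}

# Example call on the live system:
# list_idle_states
```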

Code:

# turbostat -S --debug sleep 10
turbostat version 18.07.27 - Len Brown <lenb@kernel.org>
cpu 0 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 1 pkg 0 node 0 lnode 0 core 1 thread 0
cpu 2 pkg 0 node 0 lnode 0 core 2 thread 0
cpu 3 pkg 0 node 0 lnode 0 core 3 thread 0
cpu 4 pkg 0 node 0 lnode 0 core 0 thread 1
cpu 5 pkg 0 node 0 lnode 0 core 1 thread 1
cpu 6 pkg 0 node 0 lnode 0 core 2 thread 1
cpu 7 pkg 0 node 0 lnode 0 core 3 thread 1
cpu 8 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 9 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 10 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 11 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 12 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 13 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 14 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 15 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 16 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 17 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 18 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 19 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 20 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 21 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 22 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 23 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 24 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 25 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 26 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 27 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 28 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 29 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 30 pkg 0 node 0 lnode 0 core 0 thread 0
cpu 31 pkg 0 node 0 lnode 0 core 0 thread 0
CPUID(0): GenuineIntel 0xd CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:3a:9 (6:58:9)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu4: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu4: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 851 sec. Joule Counter Range, at 77 Watts
cpu4: MSR_PLATFORM_INFO: 0x81010e0012200
16 * 100.0 = 1600.0 MHz max efficiency frequency
34 * 100.0 = 3400.0 MHz base frequency
cpu4: MSR_IA32_POWER_CTL: 0x0014005d (C1E auto-promotion: DISabled)
cpu4: MSR_TURBO_RATIO_LIMIT: 0x25262727
37 * 100.0 = 3700.0 MHz max turbo 4 active cores
38 * 100.0 = 3800.0 MHz max turbo 3 active cores
39 * 100.0 = 3900.0 MHz max turbo 2 active cores
39 * 100.0 = 3900.0 MHz max turbo 1 active cores
cpu4: MSR_CONFIG_TDP_NOMINAL: 0x00000022 (base_ratio=34)
cpu4: MSR_CONFIG_TDP_LEVEL_1: 0x1e0000000000000 (PKG_MIN_PWR_LVL1=480 PKG_MAX_PWR_LVL1=0 LVL1_RATIO=0 PKG_TDP_LVL1=0)
cpu4: MSR_CONFIG_TDP_LEVEL_2: 0x1e0000000000000 (PKG_MIN_PWR_LVL2=480 PKG_MAX_PWR_LVL2=0 LVL2_RATIO=0 PKG_TDP_LVL2=0)
cpu4: MSR_CONFIG_TDP_CONTROL: 0x80000000 ( lock=1)
cpu4: MSR_TURBO_ACTIVATION_RATIO: 0x00000000 (MAX_NON_TURBO_RATIO=0 lock=0)
cpu4: MSR_PKG_CST_CONFIG_CONTROL: 0x1e008400 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, locked, pkg-cstate-limit=0 (pc0))
cpu4: POLL: CPUIDLE CORE POLL IDLE
cpu4: C1: MWAIT 0x00
cpu4: C1E: MWAIT 0x01
cpu4: C3: MWAIT 0x10
cpu4: C6: MWAIT 0x20
cpu4: cpufreq driver: intel_pstate
cpu4: cpufreq governor: performance
cpufreq intel_pstate no_turbo: 0
cpu4: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0xd000001e00268 (77 W TDP, RAPL 60 - 0 W, 0.012695 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x8000830200148268 (locked)
cpu0: PKG Limit #1: ENabled (77.000000 Watts, 1.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: ENabled (96.250000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00691400 (105 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88420000 (39 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (105 C, 105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88420000 (39 C +/- 1)
cpu0: MSR_IA32_THERM_INTERRUPT: 0x00000013 (105 C, 105 C)
cpu1: MSR_IA32_THERM_STATUS: 0x88440000 (37 C +/- 1)
cpu1: MSR_IA32_THERM_INTERRUPT: 0x00000013 (105 C, 105 C)
cpu2: MSR_IA32_THERM_STATUS: 0x88460000 (35 C +/- 1)
cpu2: MSR_IA32_THERM_INTERRUPT: 0x00000013 (105 C, 105 C)
cpu3: MSR_IA32_THERM_STATUS: 0x88430000 (38 C +/- 1)
cpu3: MSR_IA32_THERM_INTERRUPT: 0x00000013 (105 C, 105 C)
cpu4: MSR_PKGC3_IRTL: 0x0000883b (valid, 60416 ns)
cpu4: MSR_PKGC6_IRTL: 0x00008850 (valid, 81920 ns)
cpu4: MSR_PKGC7_IRTL: 0x00008857 (valid, 89088 ns)
10.001611 sec
usec    Time_Of_Day_Seconds     APIC    X2APIC  Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     POLL    C1      C1E     C3      C6      POLL%   C1%     C1E%    C3%     C6%     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  PkgWatt CorWatt      GFXWatt
  898   1581514666.539025       -       -       75      3.42    2192    3400    53940   0       85      799     4453    4275    47653   0.00    0.04    0.57    1.45    94.41   8.42    7.16    81.00   0.00    38      39      25.06   6.79    0.00

KingArthur98 · Feb 12, 2020

The interesting thing is that if I run ethtool -K <interface> gso off tso off rx off tx off, the system lasts a lot longer before it hangs: 1-2 days (at least for me).
But with C-state limiting it hasn't crashed at all (yet).

My guess is that there may be a problem with the e1000e driver or the Intel NIC that makes it hang when entering or leaving a C-state.

chotaire · Feb 13, 2020

@Licht According to your post, max_cstate is not active. It is clearly visible that C3-C6 are in use.

Licht said:
usec Time_Of_Day_Seconds APIC X2APIC Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL% C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt CorWatt GFXWatt
898 1581514666.539025 - - 75 3.42 2192 3400 53940 0 85 799 4453 4275 47653 0.00 0.04 0.57 1.45 94.41 8.42 7.16 81.00 0.00 38 39 25.06 6.79 0.00
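
Wide turbostat summary rows like the one quoted above are easier to read when each residency column is paired with its value. The following is a small sketch using awk. The function name show_cstate_residency is invented for this example, and it expects exactly two whitespace-separated lines on stdin, the header followed by one data row:

```shell
# show_cstate_residency   (hypothetical helper)
# Reads a turbostat header line and one data line from stdin and prints
# name=value for the software C-state residency columns (POLL%, C1%, C1E%, C3%, ...).
show_cstate_residency() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            for (i = 1; i <= NF; i++)
                if (name[i] ~ /^(POLL|C[0-9]+E?)%$/)
                    print name[i] "=" $i
        }'
}

# Example: paste the header and the data row into a file and run
# show_cstate_residency < turbostat-row.txt
```

With intel_idle.max_cstate=1 in effect, only POLL% and C1% should show up. Any C3% or C6% column means deeper states are still enabled.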

Licht · Feb 16, 2020

All of those logs were from before the reboot. Up 4 days now, no crash so far, thanks! Still open:

Licht said:
Is there a C-state between 1 and 9 that could also be tested? I would like to keep power consumption as low as possible.

Code:

10.001212 sec
usec    Time_Of_Day_Seconds     APIC    X2APIC  Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  GFX%rc6 PkgWatt CorWatt GFXWatt
  444   1581866247.329351       -       -       102     2.75    3700    3400    78565   0       201     107752  0.00    97.24   97.25   0.00    0.00    0.00    50      50      0.00    36.50   18.15   0.00

chotaire · Feb 18, 2020

@Licht I read some weeks ago that it is already C3 that causes problems. So there is probably nothing to optimize there.

chotaire · Feb 19, 2020

Can anyone else confirm that the aforementioned kernel configuration change fixes their crash issues with Hetzner servers?

Since this happens with a multitude of different hardware configurations, not only the Hetzner EX line, it cannot simply be dismissed by pointing fingers at Hetzner. It needs to be addressed upstream (likely in the Linux kernel), and Proxmox staff and Hetzner staff should assist. At a minimum, add it to the documentation, but ideally get hold of some hardware on which the engineering team can reproduce it, so that the appropriate asses can be kicked. As it turns out, this happens with kernel 5.* only, not with previous kernels, not with Windows, etc. This may very well be a critical kernel bug; it has been addressed a few times and obviously never really fixed.

PS. I have just contacted Hetzner and requested that their engineering team take a look at this thread. I've also requested that Hetzner finally approve the recent January BIOS firmware for the EX62 and suggested that they contact Proxmox about possible hardware certification.

KingArthur98 · Feb 19, 2020

Mine is still up with no problems after 13 days.

chotaire · Feb 19, 2020

Hetzner has just now confirmed that they are aware of stability issues with some EX62-NVME servers. They are still testing and they're in touch with the manufacturers.

Ruflex · Feb 19, 2020

chotaire said:
Hetzner has just now confirmed that they are aware of stability issues with some EX62-NVME servers. They are still testing and they're in touch with the manufacturers.

They use a 240 W PSU!
In my opinion this is not enough for an i9, and when the CPU load is high, the server crashes.
I reported this to them back in October, but in response they said that all servers had been checked and that other clients were not complaining.

chotaire · Feb 20, 2020

The servers also crash when idling at 2% CPU all the time, and they do not crash any faster when compiling kernels for hours. They also don't crash when running Windows or kernels older than 5.*. I had already tested that, so I don't think that's the issue. But if you scroll up a bit, you might find another issue to be the cause.

Ruflex · Feb 20, 2020

chotaire said:
The servers also crash when idling at 2% CPU all the time, and they do not crash any faster when compiling kernels for hours. They also don't crash when running Windows or kernels older than 5.*. I had already tested that, so I don't think that's the issue. But if you scroll up a bit, you might find another issue to be the cause.

I tried installing Debian9 and Proxmox5 and the issue was the same.

chotaire · Feb 20, 2020

Ruflex said:
I tried installing Debian9 and Proxmox5 and the issue was the same.

Oh, that's not kernel 5.*, on PVE that's the 4.15 LTS Linux Kernel (using the Ubuntu 18.04 LTS Bionic Kernel as a base). It may include a lot of backports by now. Check out #294285. If you still run Debian9 or Proxmox5 on EX62-NVME, please test if this fixes the issue and report back.

Ruflex · Feb 20, 2020

chotaire said:
Oh, that's not kernel 5.*, on PVE that's the 4.15 LTS Linux Kernel (using the Ubuntu 18.04 LTS Bionic Kernel as a base). It may include a lot of backports by now. Check out #294285. If you still run Debian9 or Proxmox5 on EX62-NVME, please test if this fixes the issue and report back.

I just checked.

root@node2 /boot # uname -a
Linux node2 4.15.18-23-pve #1 SMP PVE 4.15.18-51 (Wed, 13 Nov 2019 11:20:34 +0100) x86_64 GNU/Linux

rukh · Feb 26, 2020

Licht said:
GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"

2x EX62-NVME with i9-9900K @Debian 10 & PVE 5.3.18-1

After applying this advice - works stably, 6d uptime

Ruflex · Feb 27, 2020

rukh said:
2x EX62-NVME with i9-9900K @Debian 10 & PVE 5.3.18-1

After applying this advice - works stably, 6d uptime

Do you have a Windows VM there?

KingArthur98 · Feb 27, 2020

Ruflex said:
Do you have a Windows VM there?

I sure have.
