Proxmox Kernel 6.8.12-2 Freezes (again)

I'm having the same problem. It first appeared on my new 7950X homelab server with an ASRock Rack board.
The distributor swapped the board, but the problem persisted.
Yesterday I had to move an entire rack at a customer site and took the opportunity to update the Proxmox server. This morning the system had rebooted without any warning. Same behavior as with my 7950X.
It seems that on systems without a watchdog the host freezes. With a watchdog, it just reboots.
The customer's EPYC system was working perfectly with 6.5; now, with 6.8, it has the reboot issue. I will switch back to 6.5 this evening and see what happens...

On the other hand, my 5950x system is stable as a rock with 6.8...
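
A quick way to tell the two failure modes apart is to check whether the host exposes a watchdog device at all, and then look at how the previous boot ended. The snippet below is only a sketch: `has_watchdog` is a made-up helper name, and a device node under /dev does not prove that anything is actually arming the watchdog. The `journalctl -b -1` line assumes a persistent journal.

```shell
# has_watchdog [DIR]
# Succeeds if DIR (default: /dev) contains at least one watchdog device node.
# Hypothetical helper for illustration; it only checks that the node exists.
has_watchdog() {
    for node in "${1:-/dev}"/watchdog*; do
        # If the glob matched nothing, $node is the literal pattern and -e fails.
        [ -e "$node" ] && return 0
    done
    return 1
}

# Usage on the host:
#   has_watchdog && echo "watchdog device present" || echo "no watchdog device"
#   journalctl --list-boots     # one line per recorded boot
#   journalctl -b -1 -e         # last messages logged before the previous boot ended
```

If the previous boot's log simply stops without any shutdown messages, that fits both a hard freeze and a watchdog reset, so the watchdog check is what separates the two.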
 
Unfortunately, I also had servers with the watchdog enabled, and they froze too. But fewer freezes than without, haha.

I also had issues with 6.5, so be careful about downgrading.
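
If you do go back to an older kernel, pinning it keeps the next update from silently booting 6.8 again. A rough sketch, assuming a Proxmox VE release that ships `proxmox-boot-tool kernel pin` and that `proxmox-boot-tool kernel list` prints one installed version per line. `newest_matching` is a hypothetical helper, not part of any Proxmox tool.

```shell
# newest_matching PREFIX
# Reads version strings on stdin (one per line) and prints the highest one
# that starts with "PREFIX." (for example PREFIX=6.5). Lines that are not
# versions, such as headings, are ignored. Relies on GNU sort -V.
newest_matching() {
    awk -v p="$1" 'index($0, p ".") == 1' | sort -V | tail -n 1
}

# Usage on the host (run as root):
#   v=$(proxmox-boot-tool kernel list | newest_matching 6.5)
#   proxmox-boot-tool kernel pin "$v"     # boot this version by default
#   proxmox-boot-tool kernel unpin        # go back to the newest installed kernel
```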
 
Unfortunately, the upgrade to 6.11.0-8-generic didn't help either. The machine stays accessible for 4-5 hours, then it goes dead. Logs are sent to my NAS for debugging, and they show nothing out of the ordinary. The only thing that helps is a restart.

uname -a
Code:
Linux proxmox 6.11.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep 16 13:41:20 UTC 2024 x86_64 GNU/Linux

lscpu
Code:
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) N100
    BIOS Model name:      Intel(R) N100 To Be Filled By O.E.M. CPU @ 2.8GHz
    BIOS CPU family:      1
    CPU family:           6
    Model:                190
    Thread(s) per core:   1
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             0
    CPU(s) scaling MHz:   78%
    CPU max MHz:          3400.0000
    CPU min MHz:          700.0000
    BogoMIPS:             1612.80
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr ss
                          e sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nop
                          l xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est
                          tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
                           f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhan
                          ced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt
                          _a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect
                          user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi u
                          mip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr
                          ibt flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:         VT-x
Caches (sum of all):     
  L1d:                    128 KiB (4 instances)
  L1i:                    256 KiB (4 instances)
  L2:                     2 MiB (1 instance)
  L3:                     6 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:         
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Mitigation; Clear Register File
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI
                          BHI_DIS_S
  Srbds:                  Not affected
  Tsx async abort:        Not affected
 
Did you add some parameters to the kernel cmdline?
 
Yes, I edited these in /etc/default/grub:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
GRUB_CMDLINE_LINUX=""

I didn't touch /etc/kernel/cmdline; that file doesn't exist.
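
Worth double-checking that the edit actually reached the running kernel: changes in /etc/default/grub only apply after `update-grub` and a reboot, and /etc/kernel/cmdline is only read on installs that boot through systemd-boot via proxmox-boot-tool. A small sketch for verifying it; `cmdline_has` is a hypothetical helper name.

```shell
# cmdline_has CMDLINE PARAM
# Succeeds if PARAM appears as a whole, space-separated word in CMDLINE.
# PARAM can be a bare flag ("quiet") or a name=value pair ("intel_iommu=on").
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on the host (checks what the kernel was really booted with):
#   cmdline_has "$(cat /proc/cmdline)" "intel_iommu=on" && echo "active"
```

Matching whole words avoids false positives, for example `iommu=on` does not match `intel_iommu=on`.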
 
I was able to minimize freezes with these params:

Code:
kernel.softlockup_panic=0 pcie_port_pm=off pcie_aspm.policy=performance libata.force=noncq

I am currently using these params with 6.11 as well. But since it has only been running for a few hours, I have to wait and see whether the freezes or reboots come back.
 
These are params for /etc/kernel/cmdline, correct?

Could you perhaps elaborate a bit on what they do?
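
For reference, going by the kernel's parameter documentation: `pcie_port_pm=off` turns off power management for PCIe ports, `pcie_aspm.policy=performance` selects the ASPM policy that disables link power saving, and `libata.force=noncq` disables NCQ on SATA devices. The softlockup_panic setting controls whether a detected soft lockup triggers a panic. Where they go depends on the bootloader: GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by `update-grub`, or /etc/kernel/cmdline followed by `proxmox-boot-tool refresh` on systemd-boot installs. The helper below is a hypothetical sketch for merging extra params into an existing line without duplicating any.

```shell
# merge_params CURRENT EXTRA
# Prints CURRENT followed by every word of EXTRA that is not already in it.
# Note: uses the plain shell variables "merged" and "p" (not local).
merge_params() {
    merged=$1
    for p in $2; do
        case " $merged " in
            *" $p "*) ;;                              # already present, skip it
            *) merged="${merged:+$merged }$p" ;;      # append, space-separated
        esac
    done
    printf '%s\n' "$merged"
}

# Usage:
#   merge_params "quiet intel_iommu=on" "pcie_port_pm=off libata.force=noncq"
# Paste the printed line into the config file for your bootloader, regenerate
# the boot config as described above, then reboot.
```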
 
Update!

I’ve updated my servers to kernel 6.11 for testing. Unfortunately, after a few hours one server rebooted again.

I have the bad feeling that just upgrading the kernel won’t help…
 
At least yours reboot. Mine becomes 100% unresponsive but is still powered on. Only a hard reboot helps.
 
Unfortunately, no success. I added a cron job that pings healthchecks.io every five minutes. The server stayed responsive for only 1h 30min, which is a new low.
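
A heartbeat like that can be sketched roughly as below. Everything named here is a placeholder: the ping URL, the log path and the script location are assumptions, not the poster's actual setup. Writing a local timestamp alongside the remote ping also leaves a record on disk of the last moment the host was alive, which helps when the logs themselves show nothing.

```shell
# Placeholder values: substitute your own check URL.
HC_URL="https://hc-ping.com/your-uuid-here"

# heartbeat FILE: append the current UTC time to FILE and flush it to disk,
# so the last entry survives a hard freeze.
heartbeat() {
    date -u '+%Y-%m-%dT%H:%M:%SZ' >> "$1" && sync
}

# last_heartbeat FILE: print the most recent timestamp recorded in FILE.
last_heartbeat() {
    tail -n 1 "$1"
}

# ping_remote: report to the monitoring service; fails quietly on HTTP errors
# and gives up after 10 seconds per attempt.
ping_remote() {
    curl -fsS -m 10 --retry 3 -o /dev/null "$HC_URL"
}

# Example crontab entry (every five minutes), assuming this file is saved as
# /usr/local/bin/heartbeat.sh and ends with the two calls
#   heartbeat /var/log/heartbeat.log; ping_remote
#
#   */5 * * * * /usr/local/bin/heartbeat.sh
```

After a freeze, `last_heartbeat /var/log/heartbeat.log` narrows the time of death down to a five-minute window.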
 
We use
Code:
processor.max_cstate=0 idle=nomwait rcu_nocbs=0-39
as kernel parameters. The rcu_nocbs range depends on the CPU model and its number of cores.
We also saw some temporary hangs with an increased vm.watermark_scale_factor sysctl setting, but it works fine with the default value of 10.

So maybe other settings, such as ZFS parameters, memory management etc., could also influence the success of the kernel upgrade.
It is difficult to publish all settings from our side. Some things we do not want to publish. ;)

Since the upgrade to 6.11 we have had far fewer outages. But ...
Last night 2 servers (out of hundreds running kernel 6.11) also needed a hard reboot. We are still investigating whether this was the same kernel issue or something else. Over the last few months we identified many reasons for kernel crashes on our PVE systems, and most of them had other causes (bugs in the UKSM patch, an AMD-Vi bug (only with 5.x kernels), misconfigured zram/swap, and so on).
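
Since the rcu_nocbs range has to match the machine, it can be derived from the CPU count instead of copied. A minimal sketch, assuming the intent is to cover every logical CPU (as 0-39 would on a 40-thread host); `rcu_nocbs_range` is a hypothetical helper, and the parameter only has an effect on kernels built with RCU callback offloading.

```shell
# rcu_nocbs_range N
# Prints the rcu_nocbs= parameter covering logical CPUs 0..N-1.
# Rejects anything that is not a positive integer.
rcu_nocbs_range() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;    # empty or non-numeric input
    esac
    [ "$1" -ge 1 ] || return 1
    if [ "$1" -eq 1 ]; then
        printf 'rcu_nocbs=0\n'
    else
        printf 'rcu_nocbs=0-%d\n' "$(( $1 - 1 ))"
    fi
}

# Usage on the host:
#   rcu_nocbs_range "$(nproc --all)"
#   sysctl -n vm.watermark_scale_factor    # prints the current value
```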
 
Intel or AMD?
 
As I wrote before, all of these hosts have AMD CPUs, like most of the users who complain about freezes with 6.8 kernels.
It's possible that yours is not the same issue/bug if your Intel systems freeze too.

We also have a few Intel hosts, with no outages so far, but their share of the total is negligible.
 
Damn it. It turns out the cause on my customer's EPYC system was the RAM and not the kernel.
I had to downclock the RAM to 1866MHz to keep it stable until I have time to replace the defective stick :-(

Hooray for the MCE event log!
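
For anyone who wants to rule out hardware before blaming the kernel, a rough sketch of how to look for machine check reports. `filter_mce` is a hypothetical helper and the pattern list is a guess at common wording, not an exhaustive match. `journalctl -b -1` assumes a persistent journal, and `ras-mc-ctl` is only available if rasdaemon is installed.

```shell
# filter_mce
# Reads kernel log lines on stdin and prints those that look like hardware
# error reports (machine checks, EDAC memory errors). Case-insensitive.
# Exit status is non-zero when nothing matched.
filter_mce() {
    grep -i -E 'mce:|machine check|hardware error|edac'
}

# Usage on the host:
#   dmesg | filter_mce                              # current boot
#   journalctl -k -b -1 --no-pager | filter_mce     # previous boot
#   ras-mc-ctl --errors                             # rasdaemon's error database
```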
 
