Random SIGSEGV in different Proxmox services on fresh PVE 9 installation (ThinkStation Tiny P3 / i9-14900)

Darkrys

Member
Apr 3, 2024
Hello,

I'm facing a strange issue on a brand new Proxmox VE 9 installation.

Hardware
========

- Lenovo ThinkStation Tiny P3
- Intel Core i9-14900
- 96 GB RAM (2 x 48 GB DDR5)
- NVMe SSD
- Intel integrated graphics (no discrete GPU currently installed)
- Latest BIOS available from Lenovo
- Latest Intel microcode (0x12B)

Software
========

Fresh installation from the latest Proxmox VE 9 ISO.

After installation:

apt update
apt full-upgrade

Everything installs successfully.

Symptoms
========

The system boots correctly.

The following services work:
- pveproxy
- pvedaemon
- pvestatd
- Web UI
- SSH

However, after each boot, one Proxmox service randomly crashes with SIGSEGV.

Examples:

First boot:
- pve-guests.service

Second boot:
- pvescheduler.service

Example log:
pvescheduler.service:
Main process exited, code=killed, status=11/SEGV

or

pve-guests.service:
ExecStart=/usr/bin/pvesh --nooutput create /nodes/localhost/startall
(code=killed, signal=SEGV)

Interestingly, executing the exact same command manually afterwards succeeds:
/usr/bin/pvesh --nooutput create /nodes/localhost/startall
returns exit code 0.

There are currently NO virtual machines or containers configured.
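To see which processes are crashing across a boot, the kernel log can be tallied by process name. This is only a minimal sketch: it assumes the kernel logs unhandled segfaults in the usual `name[pid]: segfault at ...` form, and the helper name `segv_by_process` is made up for illustration.

```shell
# Hypothetical helper: tally kernel segfault messages per process name.
# Reads log lines on stdin and prints "<name> <count>", sorted by name.
segv_by_process() {
  awk '
    match($0, /[^ ]+\[[0-9]+\]: segfault at/) {
      name = substr($0, RSTART, RLENGTH)
      sub(/\[.*/, "", name)   # drop the "[pid]: segfault at" suffix
      count[name]++
    }
    END { for (n in count) print n, count[n] }
  ' | sort
}

# Usage on the affected host (kernel messages of the current boot):
#   journalctl -k -b -o cat | segv_by_process
#   systemctl --failed
```

Running it after each reboot should show whether the crashes cluster on one service or really move between different Perl processes.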

Kernel
======

Linux 7.0.12-1-pve
No kernel panic.
No ZFS corruption.
No filesystem errors.

Already tested
==============

- Fresh installation several times
- Latest Proxmox ISO
- Latest BIOS
- Latest Intel microcode
- Removed discrete NVIDIA GPU
- MemTest86 (>24h): PASS
- Ubuntu Server works perfectly
- XCP-ng works perfectly
- No overclocking
- BIOS defaults

Question
========

Has anyone already seen random SIGSEGV affecting different Proxmox Perl services on recent Intel 14th generation platforms?

Could this be a known regression in PVE 9 or is there anything specific I should investigate?

I can provide additional logs if needed.

Thank you.
 
Sounds like a hardware problem. I listed some possible problems with 13th/14th gen Intel CPUs here: https://forum.proxmox.com/threads/p...issue-with-intel-i9-14900k.184284/post-858532 . Such problems can be hard to pin down. Since MemTest86 passed, I guess it's not a memory issue, although those can also cause very strange and random problems. PVE stresses your hardware differently from a mostly idle Ubuntu server. Do you have spare parts to swap in for testing?
 
Thanks for your reply.

I also have some news.

I narrowed the issue down significantly.

After 10 consecutive reboots with maxcpus=30, I did not observe a single failed Proxmox service or any Perl SIGSEGV.

lscpu -e shows that the only offline logical CPUs are 13 and 15, which correspond to the second SMT threads of P-Core 6 and P-Core 7.
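The offline check can be reduced to a one-line filter. This is a rough sketch, assuming the `CPU`, `CORE` and `ONLINE` columns of `lscpu -e` and a hypothetical helper name `offline_cpus`:

```shell
# Hypothetical helper: print the logical CPU numbers that lscpu reports
# as offline. Expects `lscpu -e=CPU,CORE,ONLINE` output on stdin
# (one header line, then one row per logical CPU).
offline_cpus() {
  awk 'NR > 1 && $3 == "no" { print $1 }'
}

# Usage on the affected host:
#   lscpu -e=CPU,CORE,ONLINE | offline_cpus
# Cross-check against the kernel's own view:
#   cat /sys/devices/system/cpu/offline
```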

maxcpus=31 is still unstable.

maxcpus=32 consistently produces random SIGSEGVs in different Proxmox Perl processes.
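For anyone who wants to repeat the test, this is roughly how the parameter can be made persistent and verified. It is a sketch only: it assumes a stock PVE boot setup (GRUB, or systemd-boot managed by proxmox-boot-tool), and the helper name `maxcpus_from_cmdline` is made up for illustration.

```shell
# Setting the parameter (run as root, pick the variant matching the bootloader):
#   GRUB:         add maxcpus=30 to GRUB_CMDLINE_LINUX_DEFAULT in
#                 /etc/default/grub, then run: update-grub
#   systemd-boot: append maxcpus=30 to the single line in
#                 /etc/kernel/cmdline, then run: proxmox-boot-tool refresh
# Reboot afterwards.

# Hypothetical helper: extract the maxcpus value from a kernel command
# line read on stdin; prints nothing if the parameter is absent.
maxcpus_from_cmdline() {
  tr ' ' '\n' | sed -n 's/^maxcpus=//p'
}

# Usage after the reboot:
#   maxcpus_from_cmdline < /proc/cmdline
```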