Kernel Panic after Upgrade to 8.3 (Kernel stack is corrupted in: kmem_cache_alloc+0x37b/0x380)

logics · Nov 27, 2024

Today I tried to upgrade a server (Hetzner EX101, with Intel i9-13900) with PVE 8.2 to 8.3 via the Web GUI. The server is not part of a cluster. I can provide you the PVE license number if needed.

Hardware configuration: 2x 2 TB NVME in ZFS RAID1 (root), 2x 8 TB NVME in ZFS RAID1 (for PBS datastores etc.), 4x 32 GB memory.

As usual, first I downloaded the current packages ("Refresh" button), then clicked the "Upgrade" button which opened a shell. The package were upgraded quickly and without any errors. Then I did a

Code:

apt-get autoremove

as usual to get rid of old packages (I think it removed some very old kernel versions). After clicking the "Reboot" button on my node, the connection was lost. After an hour of no response (no ping, nothing), I ordered a Hetzner remote console.

The remote konsole showed "no video" output.

After online execution of the reset button, the system tried to boot up, but showed the following error:

The system is completely stuck here, and does not react to a simple CTRL ALT DEL (via remote KVM), but only reboots upon execution of reset via Hetzner Robot GUI.
Then I tried booting the old kernel versions:

No 1 is the most current version, which is auto selected if I do nothing. I've tried it several times. Sometimes the systems hangs at this screen:

Here is the output for no 2:

No 3 is stuck at this screen:

No 4 is stuck here:

Furthermore I've tried to shut the system completely off and start it up again, but unfortunately this changed nothing for me.

Hint: I think the server was running PVE 8.2 before trying to upgrade to the most current version. I have installed package upgrades a few times in the last months, but the last server reboot (which loads any newly installed kernel versions!) - excluding today - had been done 5 months ago, so maybe that's the root of the problems here.

I've tried mounting the system with the Hetzner Rescue Linux, but apparently its kernel version is too new and OpenZFS cannot be compiled there, so I can't load ZFS and therefore can't mount my disks. I will notify Hetzner about that ZFS issue.

Fortunately this is only a backup server, and I will try to do a complete new install of PVE 8.3 now.

logics · Nov 28, 2024

Hint: I think the server was running PVE 8.2 before trying to upgrade to the most current version. I have installed package upgrades a few times in the last months, but the last server reboot (which loads any newly installed kernel versions!) - excluding today - had been done 5 months ago, so maybe that's the root of the problems here.

That's not the problem.

I've just installed PVE https://enterprise.proxmox.com/iso/proxmox-ve_8.3-1.iso on the same server freshly:

After installation I directly encountered the same problem again:

Then after a hardware reset, the system booted surprisingly without a problem. I've entered SSH and tried to find information about the last boot. Unfortunately journalctl --list-boots only listed a single boot, without information about the previous crash.

After a reboot, the system booted successfully several times, before it crashed again:

Unfortunately the crash happens so fast, I can't grab any output in between because of the low refresh rate of the Hetzner KVM console output via HTML5 client.

In comparison, that's the sequence of images I get during a startup without crash:

some information from journalctl:

Here is another screenshot from a later startup that has encountered a crash:

So the lines that I didn't encounter in a crash-startup are the following, marked red (as far as I can grab the frames of the video stream from remote KVM, and as far as I was able to compare all the lines):

logics · Nov 28, 2024

Here another screenshot from another crash:

Here another one:

Another one:

Another one:

I have some more but I think they all look pretty similar.

Is the server hardware somehow faulty? Or is it a Proxmox problem, like PVE 8.3 being not compatible with this system?

In case of a hardware problem, switching to another Intel 13900 system (EX101) might help. In case of compability, switching to a whole different platform (AMD EPYC™ 9454P, AX162) should be better?

Please help.

fireon · Nov 28, 2024

It is possible that the errors coincided with the update to 8.3.x. Ok, let's go through things one by one:

GPT alternate Gpt header not at the end of the disk Use GNU Parted to correct GPT errors.

I've looked through my boot logs, and I don't have the message on any of my servers. So please have a look at your boot disks to see if they are OK.

Code:

smartclt -a /dev/disk/by-id/one_of_your_drives

To start a long smart test, use the following command:

Code:

smartctl -t long /dev/disk/by-id/one_of_your_drives

Please also check your RAM. You might find something there. Because you were able to boot sometimes. Another possibility would be to install version 8.2, because you said that it still worked normally with this version.

unable to access opcode bytes at XXX

This may or may not mean something.

Firmwareupgrade of the server
Microcode upgrade

dakralex · Nov 28, 2024

I also wonder if this could be caused by a disk or more likely memory corruption, as the kernel stacktraces point to page faults or kernel stack corruptions for some syscalls (unlinkat, mmap here), especially because it happens for multiple kernels. It would be great if you could check for disk health, firmware and microcode upgrades as mentioned by @fireon and run an additional memtest for at least of a couple of hours looking at the size of your memory.

logics · Nov 28, 2024

Thanks a lot for your replies!

fireon said:
This may or may not mean something.

Firmwareupgrade of the server

Microcode upgrade

Before doing further testing, I had already asked Hetzner for a BIOS/UEFI upgrade, and apparently they did something, see Release Date from 2024 (I rent the server since 3/2023, and it was never down for any maintanance from Hetzner, so they must've done it):

Code:

# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.6.0 present.
# SMBIOS implementations newer than version 3.3.0 are not
# fully supported by this version of dmidecode.
Table at 0x76B9C000.


Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: American Megatrends International, LLC.
        Version: 10.35
        Release Date: 10/08/2024
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 32 MB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                ACPI is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 5.32


Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Hetzner
        Product Name:
        Version: 1.0
        Serial Number:
        UUID: eceaeae8-7a0e-44a7-8b6d-a8a159c28833
        Wake-up Type: Power Switch
        SKU Number:
        Family:


Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: ASRockRack
        Product Name: W680D4U-1L
        Version:
        Serial Number:
        Asset Tag:
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis:
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

I have already checked https://www.asrockrack.com/general/productdetail.de.asp?Model=W680D4U#Download for BIOS Upgrades, but the newest Version on that website is "20.03 11/30/2023 BIOS" which has an older date but higher version number than the one on Hetzner's server. At least my server's version seems to be newer than "10.03 7/13/2023 BIOS". However Hetzner seems to use a custom BIOS version.

Unfortunately I did all of the tests above (except for the initial PVE Upgrade 8.2->8.3) with the version listed above (Version: 10.35, Release Date: 10/08/2024).

fireon said:
So please have a look at your boot disks to see if they are OK.

Code:

smartclt -a /dev/disk/by-id/one_of_your_drives

Thanks for the idea. I've checked all 4 drives, especially the boot drives (2x 2TB), and can't find any problematic values. According to https://askubuntu.com/a/1460952 "Media and Data Integrity Errors" is the most important one and it is 0 everywhere. Temps are 40-52°C. Available Spare is 100% and Percentage Used 2-3%. All 4 disks show "SMART overall-health self-assessment test result: PASSED". See smartinfo.txt attached for full output.

fireon said:
To start a long smart test, use the following command:

Code:

smartctl -t long /dev/disk/by-id/one_of_your_drives

Please also check your RAM. You might find something there. Because you were able to boot sometimes.

Today I asked Hetzner for a "Full Hardware Check" and am looking forward to their results. If they find nothing, I will do myself some more checks like long smartctl tests, memcheck etc.

dakralex said:
could be caused by a disk or more likely memory corruption, as the kernel stacktraces point to page faults or kernel stack corruptions for some syscalls (unlinkat, mmap here), especially because it happens for multiple kernels.

Thanks for the idea! I now hope we will quickly find some memory errors.

fireon said:
Another possibility would be to install version 8.2, because you said that it still worked normally with this version.

At first I thought "No, I don't want to run PVE 8.2 forever on my backup server, because I need a stable 8.3 before upgrading my prod servers", but you are right!, I should really just try it out again on the same server, because any crashes on 8.2 now would rule out any software problems, because the same server ran fine for months with 8.2 (and crash-free since 3/2023 with previous PVE versions).
I will certainly do it after Hetzner is done with their Full Hardware Check.

logics · Dec 5, 2024

logics said:
Today I asked Hetzner for a "Full Hardware Check" and am looking forward to their results. If they find nothing, I will do myself some more checks like long smartctl tests, memcheck etc.

This is the result of Hetzner's Full Hardware Check:

Code:

Dear Client,

The hardware check was completed without any errors.

Please see the results below:
-----------------%<-----------------
Hardware Check report for IP xx.xx.xx.xx

CPU check: OK
  CPU 1: OK
    Temperature: OK
    Clock speed: OK
Memory module check: OK
  DIMM 1 `03E24096`: OK
  DIMM 2 `03E23FA7`: OK
  DIMM 3 `03E234A4`: OK
  DIMM 4 `03E24AA9`: OK
Disk check: OK
  NVMe SSD `S64GNN0TB00470`: OK
    S.M.A.R.T Tests: OK
    Error counters: OK
  NVMe SSD `S64GNN0TB00462`: OK
    S.M.A.R.T Tests: OK
    Error counters: OK
  NVMe SSD `13D0A0AYTLZ9`: OK
    S.M.A.R.T Tests: OK
    Error counters: OK
  NVMe SSD `13D0A0ALTLZ9`: OK
    S.M.A.R.T Tests: OK
    Error counters: OK
NIC check: OK
  PCI-E NIC `a8:a1:59:c2:88:33`: OK
    Negotiated speed: OK
    Error counters: OK
    PCI error counters: OK
Stresstest: OK
System log check: OK
----------------->%-----------------

The server is now rebooting.

So according to Hetzner the server's hardware is fine.

Still I did some testing myself:

Testing the memory
I've used this version for testing: https://www.memtest.org/download/v7.20/mt86plus_7.20_64.iso.zip from https://www.memtest.org/

Doing memory tests on Hetzner's dedicated root servers is challenging, because you can only get a KVM console for 3 hours for free, and need to pay 8.40€ for 3 more hours. I don't feel like giving Hetzner money so I can hardware check their servers, so I tried to connect a KVM console, run the test and connect another console a few hours later, a day later etc. (Only the expensive Dell servers of Hetzner include a KVM console without time limits)

Unfortunately sometimes the KVM console showed "no video" after reconnection. I never saw an error in Memtest86+ though.

The longest test I was able to make lasted for 17 hours 30 minutes:

I am not sure if the server crashed thereafter, because I had no video output a day later.

Sometimes I got no video output a few hours after starting the test.

Today Hetzner apparently booted the server (since I asked them to fix the "No Video" problem from KVM console and they told me the only way to fix it is to reboot the server) and I was greeted with the same error as always, since the system tried to start PVE 8.2 and crashed as usual:

Testing the drives

Since I have NVME drives it was impossible to use smartctl for testing. I've used the nvme tool instead.

Code:

   CPU1: 13th Gen Intel(R) Core(TM) i9-13900 (Cores 32)
   Memory:  128596 MB
   Disk /dev/nvme0n1: 1920 GB (=> 1788 GiB)
   Disk /dev/nvme1n1: 7681 GB (=> 7153 GiB)
   Disk /dev/nvme2n1: 7681 GB (=> 7153 GiB)
   Disk /dev/nvme3n1: 1920 GB (=> 1788 GiB)
   Total capacity 17 TiB with 4 Disks

Here is how to use the tool:

Code:

# short tests:
nvme device-self-test /dev/nvme0n1 -n 1 -s 1
nvme device-self-test /dev/nvme1n1 -n 1 -s 1
nvme device-self-test /dev/nvme2n1 -n 1 -s 1
nvme device-self-test /dev/nvme3n1 -n 1 -s 1

# extended tests:
nvme device-self-test /dev/nvme0n1 -n 1 -s 2
nvme device-self-test /dev/nvme1n1 -n 1 -s 2
nvme device-self-test /dev/nvme2n1 -n 1 -s 2
nvme device-self-test /dev/nvme3n1 -n 1 -s 2

# test results:
nvme self-test-log /dev/nvme0n1
nvme self-test-log /dev/nvme1n1
nvme self-test-log /dev/nvme2n1
nvme self-test-log /dev/nvme3n1

edit: Accidentally I've submitted my comment before it was finished. Here is some more information:

So the test results show zero errors (according to https://forums.gentoo.org/viewtopic-p-8755019.html?sid=44d288c0515f687fd1928e723fab13d0#8755019 The "Operating Result" field: 0h Operation completed without error - and I got Operation Result : 0 in all 4 drives and all 8 tests)

Full test results: nvme-results.txt
Additional nvme information: nvme-info.txt

So what's next:

I will install PVE 8.2 (edit: or even 8.1?) and check if this one crashes too. If yes then certainly the hardware got a problem (Intel CPU bug affecting its 13900 CPU?), so I need to switch the server
If PVE 8.1/8.2 does not crash, then we have a problem with PVE 8.3 (edit: fixed version numbers)

logics · Dec 6, 2024

before testing PVE 8.1/8.2 again, I did some further CPU tests overnight:

Code:

# 1 hour test CPU
root@rescue ~ # stress --cpu 32 --timeout 3600
stress: info: [33029] dispatching hogs: 32 cpu, 0 io, 0 vm, 0 hdd
stress: info: [33029] successful run completed in 3600s

# 15 hours test CPU
root@rescue ~ # stress --cpu 32 --timeout 54000
stress: info: [33892] dispatching hogs: 32 cpu, 0 io, 0 vm, 0 hdd
stress: info: [33892] successful run completed in 54000s
root@rescue ~ #

# edit: some more done:
# 1 hour test CPU with 32 workers, I/O with 4 workers, and 32*3900MB VIRT memory workers
root@rescue ~ # stress --cpu 32 --io 4 --vm 32 --vm-bytes 3900M --timeout 3600
stress: info: [62366] dispatching hogs: 32 cpu, 4 io, 32 vm, 0 hdd
stress: info: [62366] successful run completed in 3602s

# edit one more:
# 1 hour test CPU with 32 workers, 32*3900MB VIRT memory workers, and verify the results
root@rescue ~ # apt-get install -y stress-ng
root@rescue ~ # stress-ng --cpu 32 --vm 32 --vm-bytes 3900M --verify --timeout 3600
stress-ng: info:  [4435] dispatching hogs: 32 cpu, 32 vm
stress-ng: info:  [4435] successful run completed in 3600.10s (1 hour, 0.10 secs)

# edit one more:
# 47 minutes: 47 tests for 1 minute each, memory only, 32 processes for each test
root@rescue ~ # stress-ng --class memory --sequential 32 --verify --timeout 60
stress-ng: info:  [82859] dispatching hogs: 32 atomic, 32 bad-altstack, 32 bsearch, 32 context, 32 full, 32 heapsort, 32 hsearch, 32 judy, 32 lockbus, 32 lsearch, 32 malloc, 32 matrix, 32 matrix-3d, 32 mcontend, 32 membarrier, 32 memcpy, 32 memfd, 32 memrate, 32 memthrash, 32 mergesort, 32 mincore, 32 null, 32 numa, 32 pipe, 32 pipeherd, 32 qsort, 32 radixsort, 32 remap, 32 resources, 32 rmap, 32 shellsort, 32 skiplist, 32 stack, 32 stackmmap, 32 str, 32 stream, 32 tlb-shootdown, 32 tmpfs, 32 tree, 32 tsearch, 32 vm, 32 vm-addr, 32 vm-rw, 32 vm-segv, 32 wcs, 32 zero, 32 zlib
stress-ng: info:  [4138982] stress-ng-memrate: write128:      4862.80 MB/sec
stress-ng: info:  [4138982] stress-ng-memrate:  read128:      1477.42 MB/sec
stress-ng: info:  [4138982] stress-ng-memrate:  write64:      3685.30 MB/sec
[...]
stress-ng: info:  [4138982] stress-ng-memrate:    read8:      1169.49 MB/sec
stress-ng: info:  [4138986] stress-ng-memrate: write128:      4608.82 MB/sec
[...]
stress-ng: info:  [4139074] stress-ng-memthrash: starting 1 thread on each of the 32 stressors on a 32 CPU system
stress-ng: info:  [4139316] stress-ng-numa: system has 1 of a maximum 64 memory NUMA nodes
stress-ng: info:  [4139466] stress-ng-pipeherd: 0.28 context switches per bogo operation (188694.58 per second)
stress-ng: info:  [4139651] stress-ng-pipeherd: 0.22 context switches per bogo operation (169550.53 per second)
[...]
# apparently the stress-ng-stackmmap is not working at all, but it doesn't look like a CPU problem. # 32x the following messages:
stress-ng: info:  [288968] stress-ng-stackmmap: skipping stressor, cannot mmap signal stack, errno=9 (Bad file descriptor)
stress-ng: error: [82859] process [288968] (stress-ng-stackmmap) aborted early, out of system resources
# other messages:
stress-ng: info:  [289045] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [289045] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [289045] stress-ng-stream: Using CPU cache size of 36864K
stress-ng: info:  [289045] stress-ng-stream: memory rate: 1604.14 MB/sec, 641.66 Mflop/sec (instance 0)
stress-ng: info:  [289054] stress-ng-stream: memory rate: 1127.79 MB/sec, 451.12 Mflop/sec (instance 9)
stress-ng: info:  [289050] stress-ng-stream: memory rate: 1543.18 MB/sec, 617.27 Mflop/sec (instance 5)
stress-ng: info:  [289066] stress-ng-stream: memory rate: 1113.53 MB/sec, 445.41 Mflop/sec (instance 21)
[...]
stress-ng: info:  [2181132] stress-ng-zlib: instance 4: compression ratio: 15.57% (12.43 MB/sec)
stress-ng: info:  [2181132] stress-ng-zlib: zlib xsum values matches 19140972638/19140972638(deflate/inflate)
stress-ng: info:  [2181146] stress-ng-zlib: instance 11: compression ratio: 15.69% (12.30 MB/sec)
stress-ng: info:  [2181146] stress-ng-zlib: zlib xsum values matches 19055637410/19055637410(deflate/inflate)
stress-ng: info:  [2181151] stress-ng-zlib: instance 13: compression ratio: 15.53% (12.23 MB/sec)
[...]
stress-ng: info:  [2181136] stress-ng-zlib: instance 6: compression ratio: 15.77% (12.30 MB/sec)
stress-ng: info:  [2181136] stress-ng-zlib: zlib xsum values matches 19100700364/19100700364(deflate/inflate)
stress-ng: info:  [2181168] stress-ng-zlib: instance 21: compression ratio: 15.74% (12.45 MB/sec)
stress-ng: info:  [2181168] stress-ng-zlib: zlib xsum values matches 19233989550/19233989550(deflate/inflate)
stress-ng: info:  [2181187] stress-ng-zlib: instance 30: compression ratio: 15.55% (12.04 MB/sec)
stress-ng: info:  [2181187] stress-ng-zlib: zlib xsum values matches 18466409460/18466409460(deflate/inflate)
stress-ng: info:  [82859] successful run completed in 2767.28s (46 mins, 7.28 secs)

# edit: one more:
# 10 minutes: 1 CPU with 95% max. memory usage, verify result
root@rescue ~ # stress-ng --vm 1 --vm-bytes 95% --vm-method all --verify -t 10m -v
stress-ng: debug: [2181285] 32 processors online, 32 processors configured
stress-ng: info:  [2181285] dispatching hogs: 1 vm
stress-ng: debug: [2181285] cache allocate: default cache size: 36864K
stress-ng: debug: [2181285] starting stressors
stress-ng: debug: [2181285] 1 stressor started
stress-ng: debug: [2181290] stress-ng-vm: started [2181290] (instance 0)
stress-ng: debug: [2181290] stress-ng-vm using method 'all'
stress-ng: debug: [2181290] stress-ng-vm: exited [2181290] (instance 0)
stress-ng: debug: [2181285] process [2181290] terminated
stress-ng: info:  [2181285] successful run completed in 601.98s (10 mins, 1.98 secs)
stress-ng: debug: [2181285] metrics-check: all stressor metrics validated and sane

Apparently without any problems, I've also checked the syslog, no errors.

fireon · Dec 6, 2024

Thanks for all your tests. I'm curious to see if it works with 8.1/8.2.

dakralex · Dec 6, 2024

Thanks for all the thorough testing and documentation from my side as well, impressive work!

Taking another look at your panic stacktraces, I've noticed that the last was caused from zpool, which could mean that there might have been some kind of corruption at a filesystem / the virtual storage pool layer, e.g. during the apt autoremove, which briefly mounts the ESP partition that holds the kernel images. So if there's no apparent hardware error...

logics said:
I will install PVE 8.2 (edit: or even 8.1?) and check if this one crashes too.

...I will look forward to those reports and hope this will resolve your problem!

logics said:
If yes then certainly the hardware got a problem (Intel CPU bug affecting its 13900 CPU?), so I need to switch the server

I haven't noticed this before, but this could very well be a symptom of the recent hardware issues on Intel 13th and 14th gen 700 and 900 series Intel CPUs. There was at least one account in the Proxmox forum [0] (German), where the memory tests and CPU stress tests were all successful, but the hardware still caused the system to sporadically crash a process.

[0] https://forum.proxmox.com/threads/153793/

logics · Dec 6, 2024

PVE 8.2 has the same problem, directly after installation. So I guess my "system" (CPU? memory? I have no idea anymore, because my stress tools havn't found any problems) is somehow instable.

At first Hetzner provided me with the wrong USB stick, a PVE 8.3 again, so I've installed that one and tried it out again:

Same problem again:

The first boot after a PVE installation always seems to crash
Thereafter it usually boots fine at first, I've even checked zpool status and it seemed fine
After 2 more reboots, the system crashed again at startup:

This time I can read (maybe this helps? it was running kernel 6.8.12-4-pve):

Code:

[ 2.481057] kernel BUG at fs/inode.c:612!
[ 2.481172] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI

A wild guess in between: Maybe this happens because of the 13900's heterogenous architecture and during some startups the kernel is running on the "wrong" type of cores?

edit: sorry apparently CTRL + ENTER directly posts the comment. here's the rest of my comment:

So after a successful boot with PVE 8.3, I've flashed the USB drive myself with PVE 8.2-2:

Code:

root@hetzner-xxx:~# mkdir usb
root@hetzner-xxx:~# cd usb
root@hetzner-xxx:~/usb# wget https://enterprise.proxmox.com/iso/proxmox-ve_8.2-2.iso
root@hetzner-xxx:~/usb# ls -la /dev/disk/by-id/usb*
lrwxrwxrwx 1 root root  9 Dec  6 17:15 /dev/disk/by-id/usb-Kingston_DataTraveler_3.0_E0D55EA574E5E820E9560B83-0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Dec  6 17:15 /dev/disk/by-id/usb-Kingston_DataTraveler_3.0_E0D55EA574E5E820E9560B83-0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Dec  6 17:15 /dev/disk/by-id/usb-Kingston_DataTraveler_3.0_E0D55EA574E5E820E9560B83-0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Dec  6 17:15 /dev/disk/by-id/usb-Kingston_DataTraveler_3.0_E0D55EA574E5E820E9560B83-0:0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Dec  6 17:15 /dev/disk/by-id/usb-Kingston_DataTraveler_3.0_E0D55EA574E5E820E9560B83-0:0-part4 -> ../../sda4
root@hetzner-xxx:~/usb# dd bs=1M conv=fdatasync if=./proxmox-ve_8.2-2.iso of=/dev/sda
1332+1 records in
1332+1 records out
1396899840 bytes (1.4 GB, 1.3 GiB) copied, 115.079 s, 12.1 MB/s
root@hetzner-xxx:~/usb# reboot

and installed PVE 8.2 - and it directly crashed during the first bootup after the system finished installation (where it boots automatically):

As usual:

I can reboot it via hardware reset (Hetzner Robot) only, and then it booted up normally.
After login I directly entered reboot, and it booted up normally. I've repeated that again 2 times.
Then after the third reboot a crash again:

So I guess the hardware is faulty, although sadly I can't test it with any other tools.

Do you guys have access to a PVE iso 8.1? I only found PVE 8.2 and 8.3 here: https://enterprise.proxmox.com/iso/

Would love to try 8.1, too.

logics · Dec 6, 2024

I don't have 8.1 to try, but installed PVE 7.4-1 from https://enterprise.proxmox.com/iso/proxmox-ve_7.4-1.iso

Installation is a pain, had to use the quirks from https://forum.proxmox.com/threads/g...ts-framebuffer-mode-fails.111577/#post-541768, and for some reason the installation resulted in a problem:

But I don't give up so easily, so I've installed it again, this time without problems.

It was my first installation since ages on this machine where I wasn't greeted with a crash after the first boot!

I've rebooted various times already (about 5 times I think), and got no crashes on startup.

But the syslog doesn't look good: check out the attached syslog-pve7.txt

Code:

Dec  6 18:45:54 hetzner-xxx systemd[1]: Starting PVE API Proxy Server...
Dec  6 18:45:54 hetzner-xxx systemd[1]: pve-ha-crm.service: Control process exited, code=killed, status=11/SEGV
Dec  6 18:45:54 hetzner-xxx systemd[1]: pve-ha-crm.service: Failed with result 'signal'.
Dec  6 18:45:54 hetzner-xxx systemd[1]: Failed to start PVE Cluster HA Resource Manager Daemon.
Dec  6 18:45:54 hetzner-xxx kernel: [    5.512072] show_signal_msg: 10 callbacks suppressed
Dec  6 18:45:54 hetzner-xxx kernel: [    5.512073] pve-ha-crm[2006]: segfault at 19 ip 0000559deba0952f sp 00007ffe31f279c0 error 4 in perl[559deb93d000+185000]
Dec  6 18:45:54 hetzner-xxx kernel: [    5.512512] Code: ef 48 89 c6 e8 12 33 00 00 48 89 43 f8 e9 32 fe ff ff 66 0f 1f 84 00 00 00 00 00 a9 00 00 20 00 0f 85 bc fd ff ff 49 8b 46 10 <f6> 40 0e 10 74 10 48 8b 00 48 8b 00 f6 40 0f 10 0f 85 a2 fd ff ff
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: Generating public/private rsa key pair.
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: Your identification has been saved in /root/.ssh/id_rsa
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: Your public key has been saved in /root/.ssh/id_rsa.pub
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: The key fingerprint is:
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: SHA256:cSg5GozOE45D01HmG6l2u9PwClk2KsWkTMlXMDZgHGc root@hetzner-xxx
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: The key's randomart image is:
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: +---[RSA 2048]----+
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: |.+oEo+           |
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: |o.*oB .. .       |
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: | *o+o++ o .      |
[...]
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: |     oo          |
Dec  6 18:45:54 hetzner-xxx pvecm[2011]: +----[SHA256]-----+
Dec  6 18:45:54 hetzner-xxx pvecm[2008]: got inotify poll request in wrong process - disabling inotify
Dec  6 18:45:54 hetzner-xxx kernel: [    5.954392] pveproxy[2025]: segfault at f ip 0000558ae9a4452f sp 00007ffda3627d80 error 4 in perl[558ae9978000+185000]
Dec  6 18:45:54 hetzner-xxx kernel: [    5.954814] Code: ef 48 89 c6 e8 12 33 00 00 48 89 43 f8 e9 32 fe ff ff 66 0f 1f 84 00 00 00 00 00 a9 00 00 20 00 0f 85 bc fd ff ff 49 8b 46 10 <f6> 40 0e 10 74 10 48 8b 00 48 8b 00 f6 40 0f 10 0f 85 a2 fd ff ff
Dec  6 18:45:54 hetzner-xxx systemd[1]: pveproxy.service: Control process exited, code=killed, status=11/SEGV
Dec  6 18:45:54 hetzner-xxx systemd[1]: pveproxy.service: Failed with result 'signal'.
Dec  6 18:45:54 hetzner-xxx systemd[1]: Failed to start PVE API Proxy Server.
[...]
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: 400 internal error - unable to verify schema
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: properties.keyAlias.requires: type check ('string|object') failed
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: Compilation failed in require at /usr/share/perl5/PVE/INotify.pm line 20.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/INotify.pm line 20.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: Compilation failed in require at /usr/share/perl5/PVE/Daemon.pm line 22.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Daemon.pm line 22.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: Compilation failed in require at /usr/share/perl5/PVE/Service/pve_ha_lrm.pm line 6.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Service/pve_ha_lrm.pm line 6.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: Compilation failed in require at /usr/sbin/pve-ha-lrm line 6.
Dec  6 18:45:54 hetzner-xxx pve-ha-lrm[2026]: BEGIN failed--compilation aborted at /usr/sbin/pve-ha-lrm line 6.
Dec  6 18:45:54 hetzner-xxx systemd[1]: pve-ha-lrm.service: Control process exited, code=exited, status=255/EXCEPTION
Dec  6 18:45:54 hetzner-xxx systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
Dec  6 18:45:54 hetzner-xxx systemd[1]: Failed to start PVE Local HA Resource Manager Daemon.
Dec  6 18:45:54 hetzner-xxx spiceproxy[2029]: starting server
Dec  6 18:45:54 hetzner-xxx spiceproxy[2029]: starting 1 worker(s)
Dec  6 18:45:54 hetzner-xxx spiceproxy[2029]: worker 2030 started
Dec  6 18:45:55 hetzner-xxx systemd[1]: Started PVE SPICE Proxy Server.
Dec  6 18:45:55 hetzner-xxx systemd[1]: pveproxy.service: Scheduled restart job, restart counter is at 1.
[...]
Dec  6 18:47:24 hetzner-xxx systemd[1]: pve-ha-crm.service: Control process exited, code=killed, status=11/SEGV
Dec  6 18:47:24 hetzner-xxx systemd[1]: pve-ha-crm.service: Failed with result 'signal'.
Dec  6 18:47:24 hetzner-xxx kernel: [    5.472445] show_signal_msg: 10 callbacks suppressed
Dec  6 18:47:24 hetzner-xxx kernel: [    5.472447] pve-ha-crm[2082]: segfault at 1c ip 000055ca2040952f sp 00007ffc2f6466e0 error 4 in perl[55ca2033d000+185000]
Dec  6 18:47:24 hetzner-xxx kernel: [    5.472885] Code: ef 48 89 c6 e8 12 33 00 00 48 89 43 f8 e9 32 fe ff ff 66 0f 1f 84 00 00 00 00 00 a9 00 00 20 00 0f 85 bc fd ff ff 49 8b 46 10 <f6> 40 0e 10 74 10 48 8b 00 48 8b 00 f6 40 0f 10 0f 85 a2 fd ff ff
Dec  6 18:47:24 hetzner-xxx systemd[1]: Failed to start PVE Cluster HA Resource Manager Daemon.
Dec  6 18:47:24 hetzner-xxx pveproxy[2086]: Not an ARRAY reference at /usr/share/perl5/Convert/ASN1/parser.pm line 555, <DATA> line 960.
Dec  6 18:47:24 hetzner-xxx pveproxy[2086]: Compilation failed in require at /usr/share/perl5/Net/LDAP/Message.pm line 8, <DATA> line 960.
Dec  6 18:47:24 hetzner-xxx pveproxy[2086]: BEGIN failed--compilation aborted at /usr/share/perl5/Net/LDAP/Message.pm line 8, <DATA> line 960.
[...]
Dec  6 18:47:24 hetzner-xxx systemd[1]: Started PVE SPICE Proxy Server.
Dec  6 18:47:24 hetzner-xxx kernel: [    5.816416] traps: pvecm[2089] general protection fault ip:561df25fef3c sp:7ffe5db18a30 error:0 in perl[561df2527000+185000]
Dec  6 18:47:25 hetzner-xxx pveproxy[2093]: starting server

What do you guys think? Let's stop here and change the hardware? Maybe I'll try another 13900 first or better directly switch to AMD? (for production actually I don't think I want Intel 13/14 gen anymore)

logics · Dec 6, 2024

dakralex said:
I haven't noticed this before, but this could very well be a symptom of the recent hardware issues on Intel 13th and 14th gen 700 and 900 series Intel CPUs. There was at least one account in the Proxmox forum [0] (German), where the memory tests and CPU stress tests were all successful, but the hardware still caused the system to sporadically crash a process.

[0] https://forum.proxmox.com/threads/153793/

Wow this looks super similar to my PVE 7 test at https://forum.proxmox.com/threads/k...em_cache_alloc-0x37b-0x380.158134/post-727108

I got some general protection fault and lots and lots of segfault instances, too.

So that's it - case closed? my Hetzner EX101 with Intel i9-13900 is affected by those Intel CPU bugs?

edit: I'll let Hetzner swap out all of the server's hardware and test again, probably PVE 7 first and then PVE 8.3

logics · Dec 9, 2024

The old server hardware was faulty!

At first, Hetzner tried to keep the old hardware and just do some settings in the BIOS (which looked promising too):

Thank you for the link. We made some adjustments for your server, which should help with voltage dips for the CPU under high load or heavy load changes.

Let us know if the server is still behaving the same despite this change, in that case we can replace the server for you.

But directly after a fresh PVE 8.3 install, I got the same old error again:

Therefore I chose the option to change all hardware, including the drives.

After providing me access to the server with new hardware (and same IP addresses, thanks Hetzner!),
I installed PVE 8.3 on the new hardware, and for the first time, I had no crash directly from the reboot after the setup.

Then I've booted the new server 10 times to confirm it wasn't just "luck" - no crashes anymore. The server is indeed "PVE-stable" now...

Bash:

~# last reboot
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:58   still running
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:57 - 16:57  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:55 - 16:56  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:53 - 16:54  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:52 - 16:52  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:50 - 16:51  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:49 - 16:49  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:47 - 16:48  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:46 - 16:46  (00:00)
reboot   system boot  6.8.12-4-pve     Mon Dec  9 16:43 - 16:44  (00:01)

wtmp begins Mon Dec  9 16:43:37 2024

I've checked dmidecode - I got the same CPU and motherboard model and same BIOS version date, but different UUIDs and serial numbers everywhere in the new server.

I still don't know if the CPU was indeed the only problem in the old server (Intel 13th Gen bug / degradation), or if it was faulty memory, NVMe drives, the motherboard or anything else.

Thanks to @fireon and @dakralex for your help!

fireon · Dec 9, 2024

Well, that was quite an odyssey. I'm very happy that you managed to do it. Good job!

logics · Dec 15, 2024

Okay so we have more instances of Intel CPU bug related crashes now:

My instance of Intel 13900 CPU crashes here in this thread (motherboard: ASRockRack W680D4U-1L)
Another one @dakralex showed me here with Intel 13900K: https://forum.proxmox.com/threads/153793/ by @Crash1601
Another one I've just posted right now that we've experienced on a different server (unfortunately a production machine) with Intel 13900 and ASRockRack W680D4U-1L: https://forum.proxmox.com/threads/random-freezes-maybe-zfs-related.145695/page-3#post-729608
And countless crashes by @ksb with Intel 13900 and ASUSTeK COMPUTER INC. System Product Name/W680/MB DC reported in https://forum.proxmox.com/threads/random-freezes-maybe-zfs-related.145695/
Another user @jpiszcz with a Intel Asus Pro WS W680-ACE Board with an Intel i9-14900k apparently here: https://forum.proxmox.com/threads/random-freezes-maybe-zfs-related.145695/page-3#post-681664

We keep the Intel 13900 only on 1 backup machine now, because its hardware had been swapped completely and I've tested it extensively (with PVE install and reboots). We try to shift to AMD now (or at least do not use any affected Intel CPUs anymore) for all other servers

mishagli · Feb 17, 2025

Hi!
I have the same problem in Hetzner with the same hardware configuration. (PVE 8.2.9) I had the same server replaced a year ago. But after 6 months the problem repeated. And this happens more and more often. Hetzner refuses to replace the server again. Most likely it is a problem of processor degradation. I will buy another server on AMD.

Снимок экрана от 2025-02-17 09-31-14.png

Снимок экрана от 2025-02-17 09-31-53.png

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.6.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.
Table at 0x76B98000.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends International, LLC.
Version: 10.35
Release Date: 10/08/2024
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 32 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.32

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Hetzner
Product Name:
Version: 1.0
Serial Number:
UUID: eceaeae8-7a0e-44a7-8b6d-a8a159f6900e
Wake-up Type: Power Switch
SKU Number:
Family:

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASRockRack
Product Name: W680D4U-1L
Version:
Serial Number:
Asset Tag:
Features:
Board is a hosting board
Board is replaceable
Location In Chassis:
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

billy999 · Feb 17, 2025

mishagli said:
Hi!
I have the same problem in Hetzner with the same hardware configuration. (PVE 8.2.9) I had the same server replaced a year ago. But after 6 months the problem repeated. And this happens more and more often. Hetzner refuses to replace the server again. Most likely it is a problem of processor degradation. I will buy another server on AMD.

You should be able to get a CPU replacement from Intel via their warranty extension

Kernel Panic after Upgrade to 8.3 (Kernel stack is corrupted in: kmem_cache_alloc+0x37b/0x380)

Well-Known Member

Well-Known Member

Well-Known Member

Distinguished Member

Proxmox Staff Member

Well-Known Member

Attachments

Well-Known Member

Attachments

Well-Known Member

Distinguished Member

Proxmox Staff Member

Well-Known Member

Well-Known Member

Attachments

Well-Known Member

Well-Known Member

The old server hardware was faulty!​

Distinguished Member

Well-Known Member

Member

Member

We value your privacy

The old server hardware was faulty!