Proxmox freezes (kernel 6.8.8-1-pve)

z3t

Hi, I have the following issue with Proxmox:
After some time (e.g. 1-2 days) the system freezes and everything is down. When connecting to iLO I can see the login screen, but the server does not accept any input.
Ping, the web GUI and all the VMs on that server also stop responding.
After a reset (with no errors during POST / boot), the server works for another couple of days.
I've checked the syslog around the time of the crash and there are no obvious errors. The last message was:
Jun 23 04:56:32 srv-19 pmxcfs[1765]: [dcdb] notice: data verification successful;

Linux 6.8.8-1-pve (2024-06-10T11:42Z) / pve-manager/8.2.4/faa83925c9641325
HPE DL380 Gen9, 2x Intel(R) Xeon(R) CPU E5-2673 v4, 756 GB RAM,
local storage with 5x 600 GB on an HPE Smart Array P440ar.
VMs are on redundant NFS storage (NetApp); other servers with VMs on that storage are working well.

Does anybody have any suggestions I could try?
thx :)
 
Yep, definitely try dropping back to kernel 6.5 rather than the recently released 6.8 series, which is causing a bunch of issues for people.

The steps here under "Kernel 6.8" should point you in the right direction:

https://pve.proxmox.com/wiki/Roadmap#Known_Issues_&_Breaking_Changes

There's a bunch of other people who've had to switch back to the older kernel too, so if you hit problems getting that done then there are probably exact steps by now in a few threads.
 
Also issues here after updating all packages to the latest versions (including kernel 6.8.8-1-pve) on three machines. Any advice on how to switch back to kernel 6.5 (or an earlier 6.8 kernel)?
 
Pretty sure people have written the steps for doing it already in threads.

From memory it's something along the lines of implementing the above fix so the system no longer includes kernel 6.8 things in the list of updates you get, then you manually tell it to install 6.5. You might need to delete the left over 6.8 kernel afterwards too (not sure).

Do some poking through these forums, you should be ok to find it. :)
 
I used the following commands on the affected server:

apt install proxmox-kernel-6.5.13-5-pve-signed

apt install proxmox-headers-6.5.13-5-pve

proxmox-boot-tool kernel pin 6.5.13-5-pve


To unpin later (for future use & for my memory :D ):

proxmox-boot-tool kernel unpin



after restart i am now on
Linux 6.5.13-5-pve (2024-04-05T11:03Z)
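
For anyone repeating the steps above, here is a minimal sketch of how one might confirm after the reboot that the node really came up on the pinned kernel. The function name `check_pinned_kernel` is made up for illustration; it only compares the output of `uname -r` with the version string that was pinned.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): compare the running kernel
# release with the release that was pinned via proxmox-boot-tool.
#   $1 = expected release, e.g. 6.5.13-5-pve
#   $2 = actual release (optional, defaults to the output of `uname -r`)
check_pinned_kernel() {
    cpk_expected="$1"
    cpk_actual="${2:-$(uname -r)}"
    if [ "$cpk_actual" = "$cpk_expected" ]; then
        echo "OK: running $cpk_actual"
    else
        echo "MISMATCH: running $cpk_actual, expected $cpk_expected"
        return 1
    fi
}

# Usage on the node after the reboot:
#   check_pinned_kernel 6.5.13-5-pve
#   proxmox-boot-tool kernel list    # shows the installed kernels and the pin
```

If this reports a mismatch, the pin most likely did not take effect, and `proxmox-boot-tool kernel list` is the first place to look.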
 
Just a short update -> still having trouble with one server; it keeps freezing and the only way out is to reset the server...
 
@z3t As a general idea, are you familiar with using journalctl with the -b option?

With that you can grab the entire kernel boot log for a given system run, and output it to a plain text file. eg:

journalctl -b -1 > somefile.txt will dump the entire output for the previous run to "somefile.txt". -2 will do the system run before that (aka "two boots ago"), -3 will do "three boots ago", etc. :)

Can be a useful way to check if there's any strange kernel messages appearing around the time of the freeze. Though when a system truly freezes hard, any kernel messages that weren't yet written to disk can be lost.
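
Once the previous boot has been dumped to a file as described above, a rough first pass could be to grep the dump for typical kernel trouble keywords. This is only a sketch: the function name `suspicious_lines` and the keyword list are assumptions of mine, not an exhaustive or official list, and an empty result does not prove the hardware is fine.

```shell
#!/bin/sh
# Hypothetical helper: print lines from a saved journal dump that contain
# common kernel trouble keywords (case-insensitive, whole words only).
#   $1 = path to the dump, e.g. the somefile.txt written by
#        journalctl -b -1 > somefile.txt
suspicious_lines() {
    grep -E -i -w 'nmi|mce|hardware error|soft lockup|blocked for more than|call trace|oops|panic' "$1"
}

# Usage:
#   journalctl --list-boots             # find the boot you are interested in
#   journalctl -b -1 > somefile.txt     # dump the previous boot
#   suspicious_lines somefile.txt
#   journalctl -k -b -1                 # kernel messages only, same boot
```

grep exits non-zero when nothing matches, which is the expected result if the freeze happened before anything was written to disk.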
 
thx for your input.
Thx for your help. I've checked journalctl (I was not aware of the -b parameter - that is super helpful :D ) and unfortunately the last messages don't show anything suspicious:

Code:
Jun 28 12:54:58 srv-19 chronyd[1669]: Source 185.119.117.217 replaced with 46.102.157.67 (2.debian.pool.ntp.org)
Jun 28 12:56:57 srv-19 pveproxy[1568344]: worker exit
Jun 28 12:56:57 srv-19 pveproxy[1173149]: worker 1568344 finished
Jun 28 12:56:57 srv-19 pveproxy[1173149]: starting 1 worker(s)
Jun 28 12:56:57 srv-19 pveproxy[1173149]: worker 1594724 started
Jun 28 12:59:24 srv-19 nfsidmap[1595204]: nss_getpwnam: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 12:59:24 srv-19 nfsidmap[1595205]: nss_name_to_gid: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:04:24 srv-19 postfix/qmgr[1808]: 7AA13541011: from=<root@srv-19.***>, size=2814, nrcpt=1 (queue active)
Jun 28 13:04:25 srv-19 postfix/qmgr[1808]: AE148541014: from=<root@srv-19.***>, size=2822, nrcpt=1 (queue active)
Jun 28 13:04:55 srv-19 postfix/smtp[1596161]: connect to mailportal.***[10.49.5.20]:25: Connection timed out
Jun 28 13:04:55 srv-19 postfix/smtp[1596162]: connect to mailportal.***[10.49.5.20]:25: Connection timed out
Jun 28 13:05:25 srv-19 postfix/smtp[1596161]: connect to mailgate.***[10.49.5.10]:25: Connection timed out
Jun 28 13:05:25 srv-19 postfix/smtp[1596162]: connect to mailgate.***[10.49.5.10]:25: Connection timed out
Jun 28 13:05:25 srv-19 postfix/smtp[1596161]: 7AA13541011: to=<toc@***>, relay=none, delay=69902, delays=69842/0.03/60/0, dsn=4.4.1, status=deferred (connect to mailgate***[10.49.5.10]:25: Connection timed out)
Jun 28 13:05:25 srv-19 postfix/smtp[1596162]: AE148541014: to=<toc@***>, relay=none, delay=53101, delays=53041/0.02/60/0, dsn=4.4.1, status=deferred (connect to mailgate***[10.49.5.10]:25: Connection timed out)
Jun 28 13:05:54 srv-19 pmxcfs[1740]: [status] notice: received log
Jun 28 13:09:25 srv-19 nfsidmap[1597116]: nss_getpwnam: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:09:25 srv-19 nfsidmap[1597117]: nss_name_to_gid: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:17:01 srv-19 CRON[1598565]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 28 13:17:01 srv-19 CRON[1598566]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 28 13:17:01 srv-19 CRON[1598565]: pam_unix(cron:session): session closed for user root
Jun 28 13:19:26 srv-19 nfsidmap[1599033]: nss_getpwnam: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:19:26 srv-19 nfsidmap[1599034]: nss_name_to_gid: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:21:54 srv-19 pmxcfs[1740]: [status] notice: received log
Jun 28 13:25:30 srv-19 chronyd[1669]: Source 185.242.177.6 replaced with 162.159.200.123 (2.debian.pool.ntp.org)
Jun 28 13:26:45 srv-19 pveproxy[1583032]: worker exit
Jun 28 13:26:45 srv-19 pveproxy[1173149]: worker 1583032 finished
Jun 28 13:26:45 srv-19 pveproxy[1173149]: starting 1 worker(s)
Jun 28 13:26:45 srv-19 pveproxy[1173149]: worker 1600490 started
Jun 28 13:29:03 srv-19 pmxcfs[1740]: [dcdb] notice: data verification successful
Jun 28 13:29:25 srv-19 postfix/qmgr[1808]: A79A1541012: from=<root@srv-19.***>, size=23426, nrcpt=1 (queue active)
Jun 28 13:29:26 srv-19 nfsidmap[1600998]: nss_getpwnam: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:29:26 srv-19 nfsidmap[1600999]: nss_name_to_gid: name 'root@defaultv4iddomain.com' does not map into domain '***'
Jun 28 13:29:55 srv-19 postfix/smtp[1600997]: connect to mailportal***[10.49.5.20]:25: Connection timed out
Jun 28 13:30:25 srv-19 postfix/smtp[1600997]: connect to mailgate***[10.49.5.10]:25: Connection timed out
Jun 28 13:30:25 srv-19 postfix/smtp[1600997]: A79A1541012: to=<toc@***>, relay=none, delay=145582, delays=145522/0.03/60/0, dsn=4.4.1, status=deferred (connect to mailgate***[10.49.5.10]:25: Connection timed out)
Jun 28 13:34:25 srv-19 postfix/qmgr[1808]: 82F98540FBD: from=<root@srv-19.***>, size=1939, nrcpt=1 (queue active)
Jun 28 13:34:55 srv-19 postfix/smtp[1601950]: connect to mailgate***[10.49.5.10]:25: Connection timed out

The crash was around Jun 28 13:35 (I get alerts by SMS).
In the iLO I don't see any hardware issues.

I hope that the next crash will reveal more details.
 
Just to be sure -- when the node froze the last time, was it running on kernel 6.5.13-5? Can you provide the output of last reboot -F -n10?

When it freezes, do you see any (error) output on the console? Sometimes, messages cannot be written to disk anymore so you wouldn't find them in the syslog/journal after the next boot.
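
To answer the "which kernel was it on" question from the `last reboot` output, a small filter like the one below might help. It is a sketch that assumes the usual util-linux layout, where each boot line starts with `reboot   system boot` followed by the kernel release; the name `boot_kernels` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `last reboot` output on stdin and print the
# kernel release recorded for each boot entry (4th whitespace-separated
# field on lines of the form "reboot system boot <release> ...").
boot_kernels() {
    awk '$1 == "reboot" && $2 == "system" && $3 == "boot" { print $4 }'
}

# Usage:
#   last reboot -F -n10 | boot_kernels
#   last reboot -F -n10 | boot_kernels | sort | uniq -c   # boots per kernel
```

If a hard freeze prevented the record from being written, the affected boot can simply be missing or show no end time, so treat the list as a hint rather than proof.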

I'd agree with @justinclift regarding the memtest86+ -- faulty memory can cause all sorts of weird issues, similarly with thermal/power issues.
 
Hi, just wanted to give you an update: 1 of 3 hosts is stable now. The second one has a faulty CPU (it passed the memtest but is still crashing); I have a replacement CPU on my desk, waiting to be installed. As for the 3rd host, I forgot the status. I think the issue was some faulty memory, but right now I have stopped the setup due to lack of time. I hope that we can continue within the next few weeks.

No, I haven't received any error messages or anything unusual. When it freezes, the HTTPS frontend stops responding and, when connecting to the iLO, I am not able to enter the username.
The last reboot command did not show anything special; I think the entries were not written to disk.
One last thing: when the freeze occurs, some of the VMs are still responsive. So maybe something like a process or thread got stuck?
Another IT administrator pointed out that it may have something to do with our using NFS on a NetApp for storage, and that maybe the network got stuck.
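
If a shell is still reachable during such a partial freeze, one way to test the "stuck process / stuck NFS" theory is to look for processes in uninterruptible sleep (state D), which is where tasks blocked on an unresponsive NFS server usually end up. The filter below is only a sketch and the name `d_state_only` is made up; it expects `ps` output with the state in the second column.

```shell
#!/bin/sh
# Hypothetical helper: read `ps` output on stdin and keep the header line
# plus every process whose state column (2nd field) starts with "D"
# (uninterruptible sleep, typically waiting on I/O).
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage:
#   ps -eo pid,stat,wchan:32,comm | d_state_only
```

A few short-lived D-state entries are normal on a busy host. Many of them sitting there for minutes, especially kvm processes, would point towards storage or network rather than CPU or RAM.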

Thx for your help, I'll keep you informed regarding the CPU (2nd host) and the memory of the 3rd host.
 
Hi.
Today the PVE server crashed about 10 times. Every time it was the same behaviour: the console freezes and the HTTPS GUI is no longer responsive. Some VMs are still running, some are not responding. Only a hard reset of the server solves the problem.
On the last hang, I also received the following iLO / IML error message:

An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000008, 0x89480000)

any help appreciated :)
 
I have the same issue. I have already tried everything that was mentioned here and also replaced the hardware.

Do you use AMD Ryzen processors from the 7000 series?

And do you have VMs with the host CPU type? Just try changing the CPU type from host to something else. That solved the issue for me. Currently I am looking for a solution that lets me use host as the CPU type again.


I have already tried several settings, like Eco mode, deactivating C-states and updating boot parameters, but nothing helps. It seems that some instructions exposed by CPU type host cause the machine to freeze.
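
To try the CPU type change systematically, it can help to first list which VMs are actually configured with `host`. The sketch below scans the VM config files for a `cpu: host` line; the function name `vms_with_host_cpu` is hypothetical, and the default directory `/etc/pve/qemu-server` is the usual location on a PVE node (pass another path if yours differs). Snapshot sections inside a config file are matched too, so double-check any hit.

```shell
#!/bin/sh
# Hypothetical helper: print the VMIDs whose config contains "cpu: host"
# (optionally followed by extra flags such as "cpu: host,flags=+aes").
#   $1 = directory with <vmid>.conf files (default: /etc/pve/qemu-server)
vms_with_host_cpu() {
    vhc_dir="${1:-/etc/pve/qemu-server}"
    for vhc_conf in "$vhc_dir"/*.conf; do
        [ -f "$vhc_conf" ] || continue
        if grep -q -E '^cpu:[[:space:]]*host(,|$)' "$vhc_conf"; then
            basename "$vhc_conf" .conf
        fi
    done
}

# Usage:
#   vms_with_host_cpu
#   qm set 100 --cpu x86-64-v2-AES   # example: switch VM 100 away from host
```

The VMID 100 and the choice of `x86-64-v2-AES` are just examples; the new CPU type only takes effect after the VM has been fully stopped and started again.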
 
An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000008, 0x89480000)
Ugh that sucks. That's almost definitely a hardware problem then. :(

Don't suppose it's on a system using a 13th or 14th generation Intel cpu? Those have some extreme problems. :(
 
Ahhh. Just checked your first message and it says you're using older generation Intel Xeon gear. That won't be due to the recent Intel 13/14th gen cpu problems then.

That being said, the error message above is a bad sign. If the server is still in support (!) then it'll be a support call thing.

If it's not though, then I'd definitely be powering it down and making sure all of the connections for everything (all components, cards, disk drives, ram, etc) are pushed in fully, properly seated in their place and so on.

Saying that because on rare occasions (years apart) a PCIe card or RAM stick turns out not to be properly seated in a server and causes problems. Those all magically go away once it has a solid electrical contact.

If you're not that lucky, then it could indeed be a permanently dead server, or at least whatever component is failing. :(
 
I got a new crash today:


System Error 08/11/2024 20:56 08/11/2024 20:56 1 An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000008, 0x89480000)
214
System Error 08/08/2024 18:39 08/08/2024 18:39 1 An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000008, 0x89480000)

And now (12/08) I got another alert that the VMs are not reachable any more. This time I was able to get some info and the CLI is still responsive: 1723451329601.png

This time I got more logging information -> see attachment.

Unfortunately I do not have any hardware support...
 

I face similar issues with a Ryzen 9 7950X. Sometimes I get this output, sometimes nothing.
 
Hi,
please try with the latest 6.8 kernel.

I face similar issues with a Ryzen 9 7950X. Sometimes I get this output, sometimes nothing.
Please clarify what you mean by "this output". The NMI message? Or the TCP stack trace? What kernel are you using? Your issue sounds like this: https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-655004 and unfortunately no fix has been found yet.

For both, please make sure you have the latest BIOS updates and CPU microcode installed: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
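
A quick way to see which microcode revision the running kernel reports, for example to compare before and after installing the microcode package and rebooting, is to read it from `/proc/cpuinfo`. This is a minimal sketch for x86 systems; the function name `microcode_revision` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the microcode revision of the first CPU listed
# in a cpuinfo-style file.
#   $1 = file to read (default: /proc/cpuinfo)
microcode_revision() {
    awk -F': ' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Usage:
#   microcode_revision              # e.g. note it down before the update
#   journalctl -k -b | grep -i microcode   # kernel messages about microcode
```

If the revision is the same before and after, the update was either not applied or the BIOS already shipped that revision.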
 
I already tried the newest kernel; it did not solve the issue. I have also already installed the latest microcode and BIOS updates. That was the first thing I did.

Unfortunately my last freeze did not come with any error; I will post it here once I get an error again. The issue you linked is not the issue I have. The whole system freezes with all VMs, and a restart is necessary to bring it back.
 
We had issues on 3 servers from 3 different clusters (one test cluster, 2 production), with EPYC2 and Xeon CPUs, all running kernel 6.8.12-1. There were also NFS server-side issues (we share stuff between servers via NFS). No errors besides NFS client errors. So there were 2 kinds of issues:
- NFS issues, where the NFS server stopped sharing and the NFS service could not be restarted. The server needed to be restarted to bring NFS sharing back.
- Partial system freeze, matching the reports above: no error messages besides NFS client ones, postfix complaining about something unreachable (on the local file system), some pmxcfs errors about accessing /var/lib/something (config.db?), ssh not working, login from console not working. Some VMs responded to ping but not much else.