Proxmox VE 8.1.4 Troubleshooting

mtucker299

New Member
Feb 22, 2024
I have a single PVE setup that has been running for about 6 months now. Recently, the server has started to become unresponsive about once a week. When this happens, the GUI and ssh are unavailable. I am also unable to ping the server IP, and the local console is unresponsive, requiring a hard reset, which I am NOT a fan of.

Attached is the console output from the latest lock up. Any thoughts are greatly appreciated.

Thanks.
 

Attachments

  • signal-2024-02-21-213121_002.jpeg

Can you post kernel logs from the period, e.g.:

Code:
journalctl -k -b all -o short-precise --since 2024-02-15
 
I recommend you apt install rasdaemon to further monitor those:

Code:
Feb 19 19:31:40.195075 pve kernel: mce: [Hardware Error]: Machine check events logged
Feb 19 19:31:40.195079 pve kernel: mce: [Hardware Error]: Machine check events logged
Feb 19 20:47:45.189671 pve kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 01:58:59.421483 pve kernel: mce: [Hardware Error]: Machine check events logged

And report back after a freeze has occurred again.

EDIT: Confirm with lsmod | grep edac that your EDAC drivers got loaded.
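To see how the machine check events are spread over time, the kernel log can be condensed to a per-day count. This is only a sketch: the helper name `mce_per_day` is made up for illustration, and it assumes the `short-precise` output format shown above (month and day in the first two fields).

```shell
#!/bin/sh
# mce_per_day (hypothetical helper): read journalctl output on stdin and
# print how many "mce: [Hardware Error]" lines were logged on each day.
mce_per_day() {
    grep -F 'mce: [Hardware Error]' |
        awk '{ count[$1 " " $2]++ }               # key: "Feb 19"
             END { for (d in count) print d, count[d] }' |
        sort    # plain text sort, good enough within a single month
}

# Usage on the host (not run here):
#   journalctl -k -b all -o short-precise --since 2024-02-15 | mce_per_day
```

A cluster of events shortly before each freeze would point at the MCEs; an even spread would suggest they may be unrelated.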
 
Thanks for the reply.

I did get rasdaemon installed, but the EDAC drivers did not load. I am working through what the issue with this is, but meanwhile my machine dropped again. I will report back once I have the EDAC drivers working.

Thanks again.

1708869590014.png
 
Anything recorded according to journalctl -u rasdaemon?
 
I don’t know, I am not able to stay connected via ssh. I’ll have to get to the local console and see what’s up.
 
Here is the output. The boot at the end was a hard reset.

Code:
Feb 22 20:48:44 pve systemd[1]: Starting rasdaemon.service - RAS daemon to log the RAS events...
Feb 22 20:48:44 pve rasdaemon[128659]: ras:mc_event event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: rasdaemon: ras:mc_event event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: ras:aer_event event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: rasdaemon: ras:aer_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: ras:mc_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Enabled event ras:mc_event
Feb 22 20:48:44 pve rasdaemon[128658]: ras:mc_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: Enabled event ras:mc_event
Feb 22 20:48:44 pve rasdaemon[128659]: mce:mce_record event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: rasdaemon: mce:mce_record event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: ras:extlog_mem_event event enabled
Feb 22 20:48:44 pve rasdaemon[128659]: rasdaemon: ras:extlog_mem_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: ras:aer_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Enabled event ras:aer_event
Feb 22 20:48:44 pve rasdaemon[128658]: ras:aer_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: Enabled event ras:aer_event
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Family 6 Model 9e CPU: only decoding architectu>
Feb 22 20:48:44 pve rasdaemon[128658]: Family 6 Model 9e CPU: only decoding architectural errors
Feb 22 20:48:44 pve rasdaemon[128658]: mce:mce_record event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: mce:mce_record event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Enabled event mce:mce_record
Feb 22 20:48:44 pve rasdaemon[128658]: Enabled event mce:mce_record
Feb 22 20:48:44 pve rasdaemon[128658]: ras:extlog_mem_event event enabled
Feb 22 20:48:44 pve systemd[1]: Started rasdaemon.service - RAS daemon to log the RAS events.
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: ras:extlog_mem_event event enabled
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Enabled event ras:extlog_mem_event
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Listening to events for cpus 0 to 7
Feb 22 20:48:44 pve rasdaemon[128658]: Enabled event ras:extlog_mem_event
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Recording mc_event events
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Recording aer_event events
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Recording extlog_event events
Feb 22 20:48:44 pve rasdaemon[128658]: rasdaemon: Recording mce_record events
Feb 22 21:37:04 pve systemd[1]: Stopping rasdaemon.service - RAS daemon to log the RAS events...
Feb 22 21:37:04 pve rasdaemon[136518]: rasdaemon: ras:mc_event event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: ras:mc_event event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: rasdaemon: ras:aer_event event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: rasdaemon: mce:mce_record event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: ras:aer_event event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: rasdaemon: ras:extlog_mem_event event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: mce:mce_record event disabled
Feb 22 21:37:04 pve rasdaemon[136518]: ras:extlog_mem_event event disabled
Feb 22 21:37:04 pve rasdaemon[128658]: rasdaemon: Recevied signal=15
Feb 22 21:37:04 pve rasdaemon[128658]: overriding event (1468) ras:mc_event with new print handler
Feb 22 21:37:04 pve rasdaemon[128658]: overriding event (1465) ras:aer_event with new print handl>
Feb 22 21:37:04 pve rasdaemon[128658]: overriding event (113) mce:mce_record with new print handl>
Feb 22 21:37:04 pve rasdaemon[128658]: overriding event (1469) ras:extlog_mem_event with new prin>
Feb 22 21:37:04 pve rasdaemon[128658]: Calling ras_mc_event_opendb()
Feb 22 21:37:04 pve rasdaemon[128658]: Calling ras_mc_event_closedb()
Feb 22 21:37:04 pve rasdaemon[128658]: Huh! something got wrong. Aborting.
Feb 22 21:37:04 pve systemd[1]: rasdaemon.service: Deactivated successfully.
Feb 22 21:37:04 pve systemd[1]: Stopped rasdaemon.service - RAS daemon to log the RAS events.
-- Boot efed11dbaeeb4fa4b61eda29f8656acd --
Feb 22 21:37:53 pve systemd[1]: Starting rasdaemon.service - RAS daemon to log the RAS events...
Feb 22 21:37:53 pve rasdaemon[817]: ras:mc_event event enabled
Feb 22 21:37:53 pve rasdaemon[817]: rasdaemon: ras:mc_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: ras:mc_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: ras:mc_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Enabled event ras:mc_event
Feb 22 21:37:53 pve rasdaemon[817]: rasdaemon: ras:aer_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: Enabled event ras:mc_event
Feb 22 21:37:53 pve rasdaemon[817]: ras:aer_event event enabled
Feb 22 21:37:53 pve rasdaemon[817]: mce:mce_record event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: ras:aer_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Enabled event ras:aer_event
Feb 22 21:37:53 pve rasdaemon[817]: rasdaemon: mce:mce_record event enabled
Feb 22 21:37:53 pve rasdaemon[817]: rasdaemon: ras:extlog_mem_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: ras:aer_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Family 6 Model 9e CPU: only decoding architectural>
Feb 22 21:37:53 pve rasdaemon[816]: Enabled event ras:aer_event
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: mce:mce_record event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Enabled event mce:mce_record
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: ras:extlog_mem_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Enabled event ras:extlog_mem_event
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Listening to events for cpus 0 to 7
Feb 22 21:37:53 pve rasdaemon[817]: ras:extlog_mem_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: Family 6 Model 9e CPU: only decoding architectural errors
Feb 22 21:37:53 pve rasdaemon[816]: mce:mce_record event enabled
Feb 22 21:37:53 pve systemd[1]: Started rasdaemon.service - RAS daemon to log the RAS events.
Feb 22 21:37:53 pve rasdaemon[816]: Enabled event mce:mce_record
Feb 22 21:37:53 pve rasdaemon[816]: ras:extlog_mem_event event enabled
Feb 22 21:37:53 pve rasdaemon[816]: Enabled event ras:extlog_mem_event
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Recording mc_event events
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Recording aer_event events
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Recording extlog_event events
Feb 22 21:37:53 pve rasdaemon[816]: rasdaemon: Recording mce_record events
-- Boot 635e2f2f1d9e46c8b66b432e86bb847b --
Feb 25 07:45:09 pve systemd[1]: Starting rasdaemon.service - RAS daemon to log the RAS events...
 
Code:
Feb 22 21:37:04 pve rasdaemon[128658]: Calling ras_mc_event_opendb()
Feb 22 21:37:04 pve rasdaemon[128658]: Calling ras_mc_event_closedb()
Feb 22 21:37:04 pve rasdaemon[128658]: Huh! something got wrong. Aborting.

This is frustrating.

Code:
Feb 22 21:37:04 pve systemd[1]: rasdaemon.service: Deactivated successfully.
Feb 22 21:37:04 pve systemd[1]: Stopped rasdaemon.service - RAS daemon to log the RAS events.
-- Boot efed11dbaeeb4fa4b61eda29f8656acd --

Was this a manual reboot?

I suppose ras-mc-ctl --summary shows nothing now?
 
Yes, the boot at 21:37:04 was a manual reboot. I was confirming hardware BIOS settings.

Output of ras-mc-ctl…

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.
root@pve:~#

Still researching why EDAC drivers are not loading.

Thanks for continuing to help/respond.
 
Do not worry about the EDAC drivers. Originally I did not pay enough attention and did not realise this is an ASRock Z390 Phantom Gaming 4S system, so there is no ECC memory there to begin with. But the MCE errors will be interesting (potentially impossible) to track. Beyond saying this is likely a "hardware / compatibility" issue, I can't guess more for now.

For now, only the obligatory "please update your BIOS": yours is 1.31 and there is a newer one:

https://pg.asrock.com/mb/Intel/Z390 Phantom Gaming 4S/index.asp#BIOS

I will have a look at why rasdaemon prints that error message on SIGTERM, but tracking down MCEs is sometimes worse than trying different combinations of hardware components.
 
I think this is the reality of the issue. This is/was a gaming system my son no longer wanted, so I turned it into a server for testing. The BIOS is locked to the manufacturer of the PC (iBuyPower), not the board, so it won't let me update using the standard BIOS.

I did a reset of the BIOS settings yesterday, but I don't have much hope for that. I guess my real solution is to find a more compatible motherboard as the system runs great when it runs.

Thanks for sticking with me on this. It was a learning experience for sure.
 
Oh, I would still give it a try if you have the patience, but it might be something you can't influence. For non-server hardware, I always turn off everything possible in BIOS/EFI that I would not need (e.g. the audio card and such). You might also check whether taking out a RAM module resolves it or, if possible, change the RAM timings in, well ... BIOS/EFI. Sometimes it's just incompatible RAM and it would work with another.

I would probably start humble by first installing Ubuntu Desktop on this system (PVE uses a tweaked Ubuntu kernel anyhow) and see if that works without issue. If that works, there are fewer things to wonder about (possibly something KVM-related). If that does not work, I would go tweaking kernel parameters such as intel_idle.max_cstate=1. I would always lock the system down in terms of features (including things like disabling hyperthreading) and test from such limp mode. Then go enabling more and see when I have no luck anymore.
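A minimal sketch of how such a kernel parameter could be added on a host that boots via GRUB. The function name `add_kernel_param` is made up for illustration, it relies on GNU sed (`sed -i`), and it uses intel_idle.max_cstate=1 as the example value, which is an assumption about which C-state limit is meant (processor.max_cstate=1 is the other common spelling). On hosts booted through systemd-boot, Proxmox reads /etc/kernel/cmdline instead and needs `proxmox-boot-tool refresh`, which this sketch does not cover.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM (hypothetical helper)
# Append PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# unless that line already contains it.
add_kernel_param() {
    file=$1
    param=$2
    # Already present on the command line? Then leave the file alone.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}

# Usage as root (not run here; keep a backup of the file first):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=1
#   update-grub
#   reboot
#   cat /proc/cmdline    # confirm the parameter is active
```

Changing one parameter at a time and noting the date makes it easier to match a later freeze, or the absence of one, to a specific change.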

The only issue is that MCE tracing is rather impractical in terms of return on time invested if it's the sole system you have. But! You can also search for your motherboard model together with "Ubuntu" (that will yield more results); if there are known hardware issues with the distro/kernel, they will pop up.
 
Just want to add something, in case someone else traces a similar issue in the future: the reason I did not pay attention to the kernel messages in your original screenshot was two-fold:

1) These messages did not make it into the logs, so something was already going on at that moment that prevented them from being flushed to the drive. That something was the real problem.

2) It is quite normal for a system that is attempting to reboot itself to disable VMX [1], so this was just a sign, at least in my view, that something had been amiss and the poor thing wanted to reboot (instead of freezing).

Other than the MCEs there was literally nothing suspicious in the logs.

[1] https://github.com/torvalds/linux/b...d585497cdcbe663/arch/x86/kernel/reboot.c#L578
 
These messages did not make it into the logs, so something was already going on at that moment that prevented them from being flushed to the drive. That something was the real problem.
Do I understand you correctly in that this could be OS or hardware (or any number of other things)? Just trying to fully grasp.

Thanks for responding.
 
I would always lock the system down in terms of features (including things like disabling hyperthreading) and test from such limp mode. Then go enabling more and see when I have no luck anymore.
This was what I was attempting last and the system finally stopped booting. I then had to do the full BIOS/UEFI reset to get it to boot.
 
I did a reset of the BIOS settings yesterday, but I don't have much hope for that
Just a quick update to say that it has been 5 days since I factory reset the BIOS/UEFI settings and I have had no unresponsive events. I will continue to monitor but for now all is good. Thanks to everyone who responded.
 
Do I understand you correctly in that this could be OS or hardware (or any number of other things)? Just trying to fully grasp.

The messages you originally posted from the screen are kernel ring buffer messages that have been configured to show on the console. You can tweak what shows on the console (in terms of priority) via a kernel setting, the kernel.printk sysctl: when you run sysctl -a | grep printk you will see the defaults, and you can change them by editing /etc/sysctl.conf [1]. You can also change the level at runtime with dmesg -n $level [2], or just remain logged in and have the messages printed in real time, e.g. with dmesg -ew.
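The kernel.printk setting holds four numbers, which are easier to read with labels attached. A minimal sketch follows; the helper name `describe_printk` is made up for illustration, and the field names are the ones used in the kernel's sysctl documentation.

```shell
#!/bin/sh
# describe_printk (hypothetical helper): label the four values of
# kernel.printk, e.g. the string "4 4 1 7".
describe_printk() {
    # Intentional word splitting: the values are separated by tabs or spaces.
    set -- $1
    printf 'console_loglevel=%s\n' "$1"
    printf 'default_message_loglevel=%s\n' "$2"
    printf 'minimum_console_loglevel=%s\n' "$3"
    printf 'default_console_loglevel=%s\n' "$4"
}

# Show the live values when the file is readable.
if [ -r /proc/sys/kernel/printk ]; then
    describe_printk "$(cat /proc/sys/kernel/printk)"
fi

# To make a change persistent, a line such as the following could go into
# /etc/sysctl.conf (the numbers are example values, not a recommendation):
#   kernel.printk = 7 4 1 7
# and be applied with: sysctl -p
```

Messages with a priority number lower than console_loglevel are the ones that reach the console.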

Anyhow, you saw some messages printed on the console before your machine froze. They are also logged via systemd-journald and should have been retained in the logs, but none of the messages from the screen were there (judging by the journalctl -k -b all output you shared). Given that disabling VMX is routine during an emergency reboot, I think the kernel was complaining about issues during that step (see also my earlier post and the comments within the code). And because the machine was already at that stage of rebooting, it was not possible to flush those messages to the logs on disk. Instead, the machine froze, as you reported. So that alone does not really help in analysing the issue, because all one knows is that there was (likely) an emergency reboot attempt that itself failed.

There was nothing in the logs that could point to why the emergency reboot had even been triggered. The logs just ended abruptly, so one can assume that writing to the disk was no longer possible after a certain point. So the only thing left to focus on was the MCEs. Those are Machine Check Exceptions (sorry, I could not find a satisfactory source that explains them nicely and consistently, but the usual forums will give you an idea), and basically the cause could be anything hardware-related, or simply the kernel having trouble interacting with the hardware. We did not quite manage to find out exactly what they pertained to. They could also be a red herring (in relation to your freezes).
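One way to check where the logs stop before each reboot is to print the last line recorded ahead of every boot marker. This is a sketch under assumptions: the helper name `last_lines_before_reboot` is made up, and it expects the "-- Boot <id> --" separator lines seen in the journalctl output earlier in this thread.

```shell
#!/bin/sh
# last_lines_before_reboot (hypothetical helper): read journalctl output on
# stdin and print the final log line that precedes each boot marker.
last_lines_before_reboot() {
    awk '/^-- Boot [0-9a-f]+ --$/ {
             if (prev != "") print prev   # last line of the previous boot
             next
         }
         { prev = $0 }'
}

# Usage on the host (not run here):
#   journalctl -k -b all -o short-precise | last_lines_before_reboot
```

After a clean shutdown the last line is usually a shutdown-related message; after a freeze followed by a hard reset it tends to be an ordinary line with nothing special about it, which matches the "abruptly ended" observation above.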

[1] https://www.kernel.org/doc/html/next/core-api/printk-basics.html
[2] https://manpages.debian.org/bookworm/util-linux/dmesg.1.en.html
 
This was what I was attempting last and the system finally stopped booting. I then had to do the full BIOS/UEFI reset to get it to boot.
If the system failed to boot after changing EFI defaults (other than boot order), that would be kind of odd.
 
