Fresh installation, restore, and full server freeze overnight

thj

Member
Jan 15, 2022
Yesterday I replaced the system NVMe drive with a new one. From USB I installed the latest Proxmox 9 ISO. The restore went without any problems. All my LXCs and the Home Assistant VM were restored and up and running. At around 2:30 at night the server completely froze. I had to push the power button to power it off and start it again this morning.

It is the same server as before (T.Book MN48H - AMD Ryzen 7 4800H), just with a fresh installation of the OS and a new NVMe. Before that I had no problems with it. Uptime was for sure over one year (since the last power loss).

I cannot find anything in the logs. Same in Grafana (I send Proxmox stats to InfluxDB and have a Grafana panel to display graphs).

What else can I do to find the reason for the freeze and to prevent it from happening again?
 
Hi
The journalctl command can help you from the command line.
It will show you the log from the previous boot:
Code:
journalctl -b -1  -p warning  -e
Thanks to Udob8 for this tip (in one of my posts).
You can find explanations of the journalctl flags on the web (lots of material out there).
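A couple of other variants of the same idea that can help narrow things down (previous boot only, kernel messages or errors only):
Code:
journalctl -k -b -1 -e          # kernel messages from the previous boot, jump to the end
journalctl -b -1 -p err -e      # errors and above from the previous boot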
 
Yet another crash today :(. I am open to any suggestions on what I can do to fix this problem. Are there any kernel GRUB parameters that might solve it? On the HDMI output all I can see is a green image across the screen.
 
I am currently on:

Code:
Linux proxmox 6.14.11-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-4 (2025-10-10T08:04Z) x86_64 GNU/Linux

Which version do you suggest?

I have added this to GRUB and rebooted the server:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

We'll see if it helps.
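For completeness, what I did to make the change take effect (assuming a standard GRUB install like mine; on a ZFS-booted system it would be proxmox-boot-tool refresh instead of update-grub):

Code:
update-grub            # regenerate the GRUB config after editing /etc/default/grub
reboot
cat /proc/cmdline      # after the reboot, confirm the new parameters are active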
 
Another crash today. But now there is no green screen on HDMI. Instead I can see this message. What else can I try?

[attached screenshot: purple screen with a kernel panic message]
 
For future me reading this a year or two from now: I have added:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.runpm=0 amdgpu.aspm=0 processor.max_cstate=1 idle=nomwait"

And upgraded the kernel to 6.17.2-1-pve
using this command:

Code:
apt install proxmox-kernel-6.17.2-1-pve

Now for the waiting game...
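In case it helps anyone: after installing the opt-in kernel I checked which kernels are registered and pinned the new one so the bootloader keeps booting it (the pin step should be optional, since the newest installed kernel is used by default as far as I understand):

Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.17.2-1-pve
uname -r               # after the reboot, confirm the running kernel version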
 
And how do I downgrade? I only get 6.14 or 6.17 as options to install.
Proxmox 9.0 only ships kernel 6.14.* and the opt-in kernel 6.17.*.
You could try one of the following tests:
1. (In case you do not use ZFS as the boot disk) Try an older mainline Ubuntu kernel, something like kernel 6.5.*, and see if that helps.
2. Try a fresh install with a Proxmox 8.4 ISO, which still ships a range of older proxmox-kernel versions starting from 6.2.*.
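Roughly, the second option would look something like this on a PVE 8.4 install (package and version names from memory, so check what apt actually offers first):
Code:
apt search proxmox-kernel                    # list the available kernel series
apt install proxmox-kernel-6.2               # e.g. install the oldest 6.2 series
proxmox-boot-tool kernel list                # show the registered kernel versions
proxmox-boot-tool kernel pin 6.2.16-XX-pve   # pin the exact version shown by 'kernel list'
reboot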

Not to mention, a memtest86+ wouldn't hurt the machine as this may perhaps just be an underlying hardware issue.
 
But now there is no green screen on HDMI. Instead I can see this message.
That is a very odd PSOD (Purple Screen Of Death) to receive in Proxmox during runtime. This would more likely be received in ESXi or some other OS.
Do you have some GPU passthrough going on? What is normally shown on that screen/monitor? Normal green screen - what is that?

I think you will have to more clearly describe your setup - running VMs & LXCs, storage, NW, passthroughs etc.

It is very difficult to know the last kernel that worked for you, since you state:
Uptime was for sure over one year
So it is going to be something historic, as even if you did regular updates (you haven't told us), you never actually rebooted to load the newer kernels.
The issue with your system may not actually be linked to the newer PVE version or kernel(s) - it could just as easily be linked to something you set up in the last year but never actually rebooted & tested.

You may also want to test that new NVMe - since that is also one of the changes you have made. What model is it?
 
Looking more closely at that screen (top-center bar), it looks like some RDP/VNC/HDMI capture going on. Could you describe it accurately?
 
I am using a NanoKVM to see the HDMI output and to have remote mouse and keyboard control (but no ATX power control, as I was unable to connect it correctly to the motherboard).

I have used Proxmox since version 7. Complete freezes happened before with some specific kernel versions, but that was over one year ago. I have since upgraded to v8 and then to v9. So I was on the latest v9 version before I decided to swap the main OS NVMe for a brand new one and do a fresh USB Proxmox v9 installation on it. The restore went without any issues and all my LXCs are up and running. This is how it looks:

[attached screenshot: overview of the restored guests]
I have 8 LXC servers. Two are standalone, for MariaDB and the UniFi controller; the others are running Docker inside. I also have one VM running Home Assistant. I do pass the GPU to the video-server LXC, which runs Frigate and Emby in Docker. But the GPU is passed only to the Frigate container.

I only saw that purple screen once. Other times, when I check via the KVM, all I can see is a plain full-screen green background.

Proxmox 9.0 only ships kernel 6.14.* and the opt-in kernel 6.17.*.
You could try one of the following tests:
1. (In case you do not use ZFS as the boot disk) Try an older mainline Ubuntu kernel, something like kernel 6.5.*, and see if that helps.
2. Try a fresh install with a Proxmox 8.4 ISO, which still ships a range of older proxmox-kernel versions starting from 6.2.*.

Not to mention, a memtest86+ wouldn't hurt the machine as this may perhaps just be an underlying hardware issue.
I did run memtest86+ and it found no issues. As I said, before replacing the NVMe and doing the fresh installation I had no problems with the server. Can I even use an older kernel?

After my last update to v9 I of course also rebooted the server. So the uptime was not that long, and if the kernel version also changed during the upgrade to v9, then I was already on a newer version.

My current versions are:

Code:
proxmox-ve: 9.0.0 (running kernel: 6.17.2-1-pve)
pve-manager: 9.0.11 (running version: 9.0.11/3bf5476b8a4699e2)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.2-1-pve: 6.17.2-1
proxmox-kernel-6.14.11-4-pve-signed: 6.14.11-4
proxmox-kernel-6.14: 6.14.11-4
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
amd64-microcode: 3.20250311.1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.11
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.1.8
libpve-rs-perl: 0.10.10
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-1
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.16-1
proxmox-backup-file-restore: 4.0.16-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.0
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.0.6
pve-cluster: 9.0.6
pve-container: 6.0.13
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.17-2
pve-ha-manager: 5.0.5
pve-i18n: 3.6.1
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.23
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve2
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1
 
But the GPU is passed only to the Frigate container
So that screenshot above is not produced by the Proxmox OS but by the video-server LXC.

How do you know the host froze? Do you have any host logs post-freezing of that LXC container? Can you still GUI/SSH in etc.?

So I was on the latest v9 version before I decided to swap the main OS NVMe
Did it ever freeze on PVE 9, before you swapped out that NVMe?
Still unanswered:
You may also want to test that new NVMe - since that is also one of the changes you have made. What model is it?

As a test, to see what is causing the "freeze", start by running the host without any VMs/LXCs running & see if it still freezes. Then add them back one by one until you get a freeze again. Lengthy, but possible.

One other thought: is that NanoKVM always connected to the host? You may want to try running the host without it. (I've seen issues before where a KVM's USB & HDMI connectors caused host problems.)
 
So that screenshot above is not produced by the Proxmox OS but by the video-server LXC.
No. That is from the Proxmox host directly. When it is working I can see the login prompt and can work via the browser in the Proxmox console (even if the network is down). I pass the GPU to the LXC (and also a Google Coral on USB) using these 102.conf entries:

Code:
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 29:0 rwm
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: a
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file 0, 0
lxc.mount.entry: /dev/bus/usb/ dev/bus/usb/ none bind,optional,create=dir 0,0
lxc.cap.drop:
lxc.mount.auto: cgroup:rw
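
As far as I understand it, the major numbers in those allow lines correspond to device nodes on the host (226 = DRM/GPU render nodes, 29 = framebuffer, 189 = USB devices for the Coral); they can be double-checked on the host with something like:

Code:
ls -l /dev/dri            # card0 = 226:0, renderD128 = 226:128
ls -l /dev/fb0            # framebuffer = 29:0
ls -l /dev/bus/usb/001    # USB devices (Coral) = major 189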

Frigate uses it for decoding the camera video:

[attached screenshot: Frigate showing GPU hardware decoding in use]

How do you know the host froze? Do you have any host logs post-freezing of that LXC container? Can you still GUI/SSH in etc.?

Nothing in the logs. Keyboard is not working. Only a hard power off helps (CTRL+ALT+Delete does nothing). Network is down. On the HDMI output I see that green screen, and one time that purple one with the kernel panic message.

Did it ever freeze on PVE 9, before you swapped out that NVMe?
Still unanswered:

No. It was working very stably. No issues whatsoever. That is what I mean by uptime (not the actual server uptime). I think the last reboot was when I upgraded to Proxmox 9 using the dedicated script from the official manual (around two months ago). I did run apt update and upgrade after that, to get the latest GUI version and other packages, but I do not think I have rebooted since to apply a possibly newer kernel.

As a test, to see what is causing the "freeze", start by running the host without any VMs/LXCs running & see if it still freezes. Then add them back one by one until you get a freeze again. Lengthy, but possible.

I need most of the services running, Home Assistant and the cameras especially, so this would be hard to achieve. I have updated the packages to the latest versions in each LXC (mostly Debian 11), so they are all up to date via apt.

One other thought: is that NanoKVM always connected to the host? You may want to try running the host without it. (I've seen issues before where a KVM's USB & HDMI connectors caused host problems.)

I have had it for over one year now. I doubt it can cause issues. But shouldn't something be in the logs before the crash/freeze? At least some clue?

Currently I am thinking about putting the old NVMe back in. Or maybe dd-ing the data over from the old one to the new one. I am not sure how I can connect both to the same server to do it, as it has only one NVMe slot. But I would prefer to keep this fresh OS, just stable.
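If I go the dd route I would do it with the old disk in a USB enclosure and boot from a live USB so that neither disk is mounted; roughly something like this (the device names are only examples and would have to be verified with lsblk first):

Code:
lsblk                                                     # identify the old (USB) and new (internal) disks
dd if=/dev/sda of=/dev/nvme0n1 bs=4M status=progress conv=fsync
# /dev/sda     = old 512GB NVMe in the USB enclosure (example name only)
# /dev/nvme0n1 = new internal NVMe (example name only)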
 
No. That is from the Proxmox host directly. When it is working I can see the login prompt and can work via the browser in the Proxmox console (even if the network is down). I pass the GPU to the LXC (and also a Google Coral on USB) using these 102.conf entries:
As I thought, that is from a GUI (unspecified) added by yourself to PVE.

Keyboard is not working.
The GUI itself may have crashed. The NanoKVM may have lost its host connection (in parallel with the GUI crash). How is the NanoKVM powered? Does it reboot/need rebooting after repowering the host? I'd try disconnecting that NanoKVM at least for testing/stability purposes.

Network is down.
What do you mean? Have you tried ping/SSH from another client to the host? Maybe that UniFi server is misbehaving?

You wrote earlier that you have your GUI/browser on the host, so you can access PVE when the NW is down. Does this happen often? To use PVE it is assumed that you have a stable enough NW.

No. It was working very stably. No issues whatsoever.
So again, the important change is the NVMe drive swap. You again fail to disclose which model you are using. Also, what prompted you to change the NVMe - maybe try putting the original back in? (I now see at the end of your post that you are considering this.)

I did run apt update and upgrade after that, to get the latest GUI version and other packages
I'm hoping you don't actually mean apt-get upgrade. That must never be run on Proxmox; it should only ever be dist-upgrade. Concerning the GUI part - see my point further on.
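For reference, the upgrade sequence the Proxmox documentation describes on the host is along these lines:
Code:
apt update
apt dist-upgrade       # or the equivalent: apt full-upgrade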

But shouldn't something be in the logs before the crash/freeze? At least some clue?
I don't have direct experience with this, but I can only theorize that a voltage surge fed back across the USB hub etc. could potentially have that effect.

Not sure how can I connect both on same server to do it, as it has only one NVME slot.
NVMe to USB adapter.


I would now like to comment on your general setup. It appears you have a highly unconventional (& unsupported) setup:

  • You have the iGPU passed through to the LXC, yet you still use (actively) the HDMI (of the same iGPU?) on the host. I guess the LXC is only using the renderer. This is probably already a point-of-failure.
  • You have installed a GUI/Browser on the host. See here, which clearly states this is for development purposes only & is not supported. Add that to the above passthrough, & I'd be amazed it ever worked on that Mini PC.
  • You are running docker in an LXC instead of a VM. See here.
  • You have unconfined the apparmor profile for this LXC. This is highly insecure - as you probably already know.
  • You also appear to be running this LXC privileged; couple that with the unconfined apparmor profile & the Docker running inside, & your system is extremely vulnerable.
  • You also have lxc.cap.drop: without any specification; AFAIK that will give all capabilities (which would otherwise be dropped) to the root user of that LXC. That is a recipe for havoc on your host.
  • You should probably consider trying to use VMs for the unconventional stuff (where possible).
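If you do want to keep that GPU-sharing LXC, a much less invasive variant would be the built-in device passthrough of recent PVE releases instead of the raw lxc.* / apparmor / cap.drop lines - a rough sketch only, and the gid should match the render group inside the container (check with getent group render):
Code:
pct set 102 -dev0 /dev/dri/renderD128,gid=104   # pass only the render node into the container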
Good luck.
 
I had another crash overnight. Same as before. I powered it off and on to reboot it. Overnight I had stopped that video LXC with the cameras and GPU "sharing" (not true passthrough). Since it still crashed, that LXC is clearly not the problem. I have now disconnected the KVM from the server's HDMI and USB before I left for work.

As I thought, that is from a GUI (unspecified) added by yourself to PVE.

GUI? After the server reboots it shows on HDMI (if I connect a monitor to it) a plain text terminal screen with a login prompt. No GUI or anything custom is added to the Proxmox host. As I said, this is a fresh Proxmox installation on the new NVMe disk. Nothing beyond stock is installed on the host.

The GUI itself may have crashed. The NanoKVM may have lost its host connection (in parallel with the GUI crash). How is the NanoKVM powered? Does it reboot/need rebooting after repowering the host? I'd try disconnecting that NanoKVM at least for testing/stability purposes.

The KVM is just that: K(eyboard) V(ideo) M(ouse). It is connected to the server with two cables, one HDMI and one USB (but not for power). The KVM has a separate USB power input connected to an external 5V power supply; it doesn't get its power from the server. It adds no GUI, but since it has an Ethernet port I can connect to the KVM via browser at its IP and it will display the HDMI input it receives from the server. So there is no GUI (like KDE, GNOME, ...) on the Proxmox host.

What do you mean? Have you tried ping/SSH from another client to the host? Maybe that UniFi server is misbehaving?

The Proxmox server has, for example, IP 192.168.28.70. I cannot ping it from anywhere in my network (wireless or wired) or even directly from the router. The network port on the router is still up when the server has crashed. It is not related to the UniFi wireless network; the UniFi software just handles the three access points I have around the house and their connected clients.

You wrote earlier that you have your GUI/browser on the host, so you can access PVE when the NW is down. Does this happen often? To use PVE it is assumed that you have a stable enough NW.

My home network is stable. What I meant was that even if the network is down (router upgrades, for example), I can still connect remotely over the KVM to the server and work in its terminal. The thing is that the server is in a server cabinet where I do not have a monitor. That is why this small KVM is perfect for me.

So again, the important change is the NVMe drive swap. You again fail to disclose which model you are using. Also, what prompted you to change the NVMe - maybe try putting the original back in? (I now see at the end of your post that you are considering this.)

Sorry, I did not know that mattered. The stock NVMe drive was an ASRock 512GB. I replaced it with a WD Blue NVMe 1GB of size.

I'm hoping you don't actually mean apt-get upgrade. That must never be run on Proxmox; it should only ever be dist-upgrade. Concerning the GUI part - see my point further on.

On the LXC servers I do run apt update from time to time to upgrade the packages. There is no dist-upgrade command that I can use there.

I don't have direct experience with this, but I can only theorize that a voltage surge fed back across the USB hub etc. could potentially have that effect.

On USB I have my external storage, a TerraMaster D2-320, on the server's USB-C port, and on the other USB ports I have an APC UPS (for monitoring), a Google Coral (for Frigate) and a Zigbee controller (passed through to the Home Assistant VM). None of these devices pulls much USB power from the server. Maybe only the Google Coral on the USB3 port.

NVMe to USB adapter.

Just ordered one, so I will be able to connect my old NVMe disk. Maybe I can find something on it that will remind me of a change I made years ago that made the server more stable. Low chance, but still.

I would now like to comment on your general setup. It appears you have a highly unconventional (& unsupported) setup:

I appreciate it, but keep in mind that the server was 100 % stable with this setup before the fresh installation. I did not have any reliability issues with the existing setup.

  • You have the iGPU passed through to the LXC, yet you still use (actively) the HDMI (of the same iGPU?) on the host. I guess the LXC is only using the renderer. This is probably already a point-of-failure.

I only use the hardware decoding of the GPU (AMD VAAPI). That has nothing to do with the GPU's HDMI output. It only offloads the CPU by letting the GPU decode the video. All this happens at the software level, but utilizing the GPU hardware. It is exactly the same as if I were using hardware decoding at the Proxmox host level or inside a VM.
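For what it is worth, as far as I know you can verify that VAAPI decoding actually works on the render node (on the host or inside the container) with the vainfo tool from Debian's vainfo package, roughly:

Code:
apt install vainfo
vainfo --display drm --device /dev/dri/renderD128   # should list the supported VAAPI profiles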

  • You have installed a GUI/Browser on the host. See here, which clearly states this is for development purposes only & is not supported. Add that to the above passthrough, & I'd be amazed it ever worked on that Mini PC.

I explained that above. There is no GUI/browser on the host. It is a stock Proxmox v9 installation. I just restored the virtual servers from the backups I made before replacing the NVMe.

  • You are running docker in an LXC instead of a VM. See here.

Again, there were no issues with that before, and even if it is suggested to move them to a VM, it is still not forbidden or impossible. I just have to enable the unprivileged container option and set nesting to 1. But if you think this is 100 % the cause of these new crashes, I can set up a VM instead of an LXC for all the LXC servers I now have with Docker inside. Which VM OS would you suggest?

  • You have unconfined the apparmor profile for this LXC. This is highly insecure - as you probably already know.

No, I did not know. That config came from the official Frigate documentation and other boards that explained how to share the GPU for HW decoding, or any other USB device, with an LXC container. If you are familiar with this custom LXC config, can you suggest what part of it is not needed and can be removed?

  • You also appear to be running this LXC privileged; couple that with the unconfined apparmor profile & the Docker running inside, & your system is extremely vulnerable.

Vulnerable to attacks from the outside? My LXC ports are not opened to the outside world. I have a VPN directly to my router and can access them only by their local IP.

  • You also have lxc.cap.drop: without any specification; AFAIK that will give all capabilities (which would otherwise be dropped) to the root user of that LXC. That is a recipe for havoc on your host.

Is that a stability problem or, again, just a possible security issue?

While I wait for the USB NVMe enclosure to arrive, maybe I can try to downgrade the kernel to the version that was used in Proxmox v8. Would it even work on Proxmox v9? If it would, do you know how I can install it?

Thanks a lot for the help.
 
> Sorry, I did not know that mattered. The stock NVMe drive was an ASRock 512GB. I replaced it with a WD Blue NVMe 1GB of size.

I think you mean 1TB, but as Adam from Mythbusters might say: "Well, there's your problem"...

Nobody uses WD Blue for serious use. It's a budget desktop-level drive, maybe a ~600TBW rating, not suitable for sustained loads or a 24/7 hypervisor server. Do your research; Proxmox specifically recommends enterprise-level SSDs for endurance.

I would RMA that silly thing and maybe invest in something like a Lexar NM790, if you don't want to go with used enterprise drives off eBay or something. I have 3x of those for Proxmox, with minimal wear after about 1.75 years of near-continuous deployed use. (The original 1TB in my Qotom server is at ~3% wear since going 24/7 in Feb 2024; it's 100% allocated as ZFS.) The 2TB drives are still at 0% wear.
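You can check the wear on yours yourself; smartmontools is already installed on a PVE host, and for an NVMe the output includes the "Percentage Used" and "Data Units Written" counters:

Code:
smartctl -a /dev/nvme0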

https://search.brave.com/search?q=p...summary=1&conversation=3ece5f6a75bb75c81691fb

https://www.google.com/search?q=pro...bNiP-QsA&csuir=1&mtid=nAELafqmGvzJp84P8NWMuAM

Ignore the recommendation for the Crucial MX500; there are bad experiences reported on the forum(s) with it.
 