VM freezes irregularly

No, not sid!

Simply add non-free-firmware to all of your existing Debian bookworm repositories (there should be three by default [1]), then run apt update followed by apt install intel-microcode, and reboot afterwards.

[1] https://pve.proxmox.com/wiki/Package_Repositories -> "Sources.list"

[0] https://packages.debian.org/bookworm/intel-microcode
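
Put together, the whole procedure could look roughly like this (a minimal sketch; it assumes the three stock Debian entries all live in /etc/apt/sources.list):

Code:
# append the non-free-firmware component to the Debian bookworm entries (run once)
sed -i '/debian.*bookworm.*main/ s/$/ non-free-firmware/' /etc/apt/sources.list
apt update
apt install intel-microcode
reboot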

Ah, okay. I get it now.

So:
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib non-free-firmware

deb http://ftp.us.debian.org/debian bookworm-updates main contrib non-free-firmware

# security updates
deb http://security.debian.org bookworm-security main contrib non-free-firmware

deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

But now I get this:
image.png

Seems I did something wrong...
 

Please provide the full output, in code-tags, of each of:
  • pveversion -v
  • grep -r '' /etc/apt/sources.list*
  • apt update
 
Code:
root@pve:~# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-4-bpo11-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.4
pve-kernel-5.15: 7.4-4
pve-kernel-5.19: 7.2-15
pve-kernel-6.2.16-5-pve: 6.2.16-6
pve-kernel-6.2.16-4-bpo11-pve: 6.2.16-4~bpo11+1
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.18-edge: 5.18.19-1
pve-kernel-5.18.19-edge: 5.18.19-1
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.6
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.4
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

Code:
root@pve:~# grep -r '' /etc/apt/sources.list*
/etc/apt/sources.list:deb http://ftp.us.debian.org/debian bookworm main contrib non-free-firmware
/etc/apt/sources.list:
/etc/apt/sources.list:deb http://ftp.us.debian.org/debian bookworm-updates main contrib non-free-firmware
/etc/apt/sources.list:
/etc/apt/sources.list:
/etc/apt/sources.list:# security updates
/etc/apt/sources.list:deb http://security.debian.org bookworm-security main contrib non-free-firmware
/etc/apt/sources.list:
/etc/apt/sources.list:deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
/etc/apt/sources.list:
/etc/apt/sources.list.d/pve-enterprise.list:# deb https://enterprise.proxmox.com/debian/pve bullseye pve-enterprise
/etc/apt/sources.list.d/pve-enterprise.list:
/etc/apt/sources.list.d/pve-enterprise.list.dpkg-dist:deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise
/etc/apt/sources.list.d/pve-edge-kernel.list:deb [signed-by=/usr/share/keyrings/pve-edge-kernel.gpg] https://dl.cloudsmith.io/public/pve-edge/kernel/deb/debian bullseye main

Code:
root@pve:~# apt update
Hit:1 http://security.debian.org bookworm-security InRelease
Get:2 https://dl.cloudsmith.io/public/pve-edge/kernel/deb/debian bullseye InRelease [5,182 B]
Hit:3 http://ftp.us.debian.org/debian bookworm InRelease
Hit:4 http://ftp.us.debian.org/debian bookworm-updates InRelease
Hit:5 http://download.proxmox.com/debian/pve bookworm InRelease
Fetched 5,182 B in 3s (1,877 B/s)                       
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
 

Oops, I had to hit reply to notify you.

Anyway, it seems I messed up on past occasions. I see Bullseye entries there. I've no idea how to remove or replace any of this.
 
proxmox-ve: 8.0.1 (running kernel: 6.2.16-4-bpo11-pve)

This is a PVE 7 kernel. Did you not reboot after the upgrade or did you pin that kernel on purpose? If the former, reboot the PVE-host and afterwards check the running kernel again. If the latter, what is the reason?
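
A quick check after that reboot:

Code:
uname -r   # should then report one of the installed 6.2.16-x-pve kernels instead of the -bpo11 one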

/etc/apt/sources.list.d/pve-enterprise.list:# deb https://enterprise.proxmox.com/debian/pve bullseye pve-enterprise

It is (correctly, if you do not have a subscription) disabled anyway, but for completeness, you could/should change: bullseye to: bookworm.

/etc/apt/sources.list.d/pve-edge-kernel.list:deb [signed-by=/usr/share/keyrings/pve-edge-kernel.gpg] https://dl.cloudsmith.io/public/pve-edge/kernel/deb/debian bullseye main

This is for the third-party and therefore (officially) unsupported pve-edge-kernel and it is/was for PVE 7. I would recommend to remove it completely.
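
Removing it could look something like this (a sketch; it assumes nothing else lives in that file):

Code:
rm /etc/apt/sources.list.d/pve-edge-kernel.list
# optionally also drop the matching keyring referenced in that entry:
rm /usr/share/keyrings/pve-edge-kernel.gpg
apt update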


What is the output of?: cat /etc/os-release
 

Ah, I pinned it because I was using kernel 5.19 before on Proxmox 7. I was getting a warning when doing 'pve7to8', looked around for how to fix it, and came to that conclusion. So now that you mention it, I just ran:
Code:
proxmox-boot-tool kernel unpin
proxmox-boot-tool refresh
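
To double-check the unpin (assuming proxmox-boot-tool manages the boot entries here):

Code:
proxmox-boot-tool kernel list   # any pinned kernel should no longer be listed
uname -r                        # after the next reboot, should show the newest PVE kernel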

"It is (correctly, if you do not have a subscription) disabled anyway, but for completeness, you could/should change: bullseye to: bookworm." All right, I fixed that now.

"This is for the third-party and therefore (officially) unsupported pve-edge-kernel and it is/was for PVE 7. I would recommend to remove it completely." All right. I nano'd in and remove that line. Or should I delete the file instead?

I rebooted but it didn't fix the problem.

"What is the output of?: cat /etc/os-release"


Code:
root@pve:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Damn, you know how I asked if I should switch to "sid" and you said no? I went ahead and did it, thinking nobody was going to answer me, so it seems I messed things up even further. I'm very sorry. I really don't know what I'm doing. I think I should just reinstall everything tomorrow or so...
 
Neobin, thanks for the easy way to fix the problem. I did a fresh reinstall of Proxmox 8 and followed your instructions from scratch.

I'm extremely happy to report that kernel 6.2.16-5-pve has been working fantastically. My main VM ran for almost 9 days straight. The only reason I had to stop it was because I needed to add more storage to this mini PC.

Kernel 6.2.16-7 is out but I'm honestly afraid to update.
 
It took me weeks before I found this thread while trying to figure out this same issue of a VM freezing while using 100% of 1 CPU core. The Proxmox host(s) continue to operate without noticeable issues.

I have a three-host Proxmox cluster of identical Protectli VP2420 systems with the Intel Celeron J6412 CPU, running Proxmox 8.0.4 and the latest kernel as of today. All of these have had VMs go into the frozen state talked about in this thread, so this does not appear to be specific to the N5105 CPU class. Side note: I have another cluster of Protectli systems running Intel Core i7-10810U CPUs, with the same versions of Proxmox and Ubuntu VMs, and I used the same scripts to spin up a k0s cluster on those VMs; none of them have frozen.

There are two k0s clusters on these J6412 systems where I have been testing different installation methods, each with 3 controller VMs and 3 worker VMs spread across the three Proxmox hosts, all running fully patched Ubuntu 22.04. I continue to see the controller VMs freeze in the same way described here, where the VM shows 100% of a single core in use. When I give them 2 cores, the Proxmox UI will show a steady 50% CPU use; when given 4 cores, it shows a steady 25% CPU use. At that point you cannot ping or SSH into the VM, and the serial console (the VMs are built using a Cloud-Init image) is unresponsive or dead. I have not had the worker nodes (same Ubuntu and k0s versions) freeze. I spent far too much time thinking this was a k0s issue before I found this thread and had a completely different VM go into this state.

I attempted to "migrate" to another host as others have reported that allowed the VM to recover but this did not work.
I also attempted a hibernate (suspend to disk) and resume, which also did not get the VM to recover.
I also attempted setting the scaling_governor to powersave which did not help.

My last attempt is installing the latest microcode as described by others. In this case it is 0x17.
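
A quick before/after check could look like this (a sketch, using the intel-microcode package route described earlier in the thread):

Code:
grep -m1 microcode /proc/cpuinfo   # currently reports 0x16 on these hosts
apt install intel-microcode && reboot
# after the reboot:
dmesg | grep -i microcode          # should now report the newer revision (0x17)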

Code:
root@gadget1:~# pveversion
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-19-pve)

root@gadget1:~# uname -a
Linux gadget1 6.2.16-19-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-19 (2023-10-24T12:07Z) x86_64 GNU/Linux

A snippet of CPU info for the first core, as the other 3 cores are the same.

Code:
root@gadget3:~# cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 150
model name    : Intel(R) Celeron(R) J6412 @ 2.00GHz
stepping    : 1
microcode    : 0x16
cpu MHz        : 2000.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 27
wp        : yes
flags        : *REDACTED for space*
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips    : 3993.60
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:

UPDATE (2023 Nov. 15):
Previously one or more of the VMs on this cluster would usually freeze within 24 hours. Since installing microcode 0x17 on these systems, I have not had any VM freeze. This appears to also be the fix for this class of CPU.
 
Kernel 6.2.16-7 is out but I'm honestly afraid to update.
There is nothing to worry about; it's a CPU erratum, and once the microcode patches it, it will be stable on any kernel version.
Patch the CPU microcode, use it, and pretend that it is a fully working, stable PC.
 

I came back here just to say that after applying the microcode my freezing stopped. Before, it happened every day, and it's been two weeks and everything is ok.
Environment:
Mini PC with Intel(R) Celeron(R) N5105
Proxmox: pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

Procedure (in the Proxmox shell):

Edit /etc/apt/sources.list and add "non-free", as above:

Code:
deb http://ftp.debian.org/debian bullseye main contrib non-free
deb http://ftp.debian.org/debian bullseye-updates main contrib non-free
deb http://security.debian.org bullseye-security main contrib non-free

Execute the commands:

Code:
apt update
apt install intel-microcode

Reboot, then check if the microcode changed:

Code:
root@pve:~# cat /proc/cpuinfo | grep microcode
microcode : 0x24000024
microcode : 0x24000024
microcode : 0x24000024
microcode : 0x24000024

Then just roll back the changes: edit /etc/apt/sources.list and remove "non-free" again (a sketch of that follows below).

Be happy!!!
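
The rollback in that last step could look something like this (just a sketch; it assumes "non-free" was appended at the end of each of the three lines, as shown above):

Code:
# remove the temporarily added non-free component again
sed -i 's/ non-free$//' /etc/apt/sources.list
apt update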
 
Are those microcode updates just applied at the driver level in the kernel, or will they flash the firmware of the CPU?
And would it be better to stick with the microcode that comes with a BIOS update from the manufacturer?
I could imagine this might cause some problems when the microcode is newer than what the BIOS was tested/programmed for?

Some of my servers are EoL and won't receive BIOS/BMC updates anymore, so I was wondering if it might be a good idea to install the intel-microcode package.
 
Hi guys,
I'm getting hopeless...

- I have this NUC: https://www.aliexpress.us/item/3256804662238664.html with N5105
- CPU(s) 4 x Intel(R) Celeron(R) N5105 @ 2.00GHz (1 Socket)
- Kernel Version Linux 6.5.13-1-pve (2024-02-05T13:50Z)
- Manager Version pve-manager/8.1.4/ec5affc9e41f1d79

On the Proxmox host I installed the latest intel-microcode:

root@proxmox:~# cat /proc/cpuinfo | grep microcode
microcode : 0x24000024
microcode : 0x24000024
microcode : 0x24000024
microcode : 0x24000024

My pfSense VM is stable AF, because pfSense is not Linux-based but FreeBSD-based.

But my Ubuntu Frigate VM crashes after a few minutes...
I have a mini-PCIe Google Coral in this NUC, and I pass this Coral through to the Frigate VM.

I read the whole thread and tried a couple of different things, but nothing helped. Not even taking the Coral out of the picture (switching to the CPU detector instead of the Coral and removing the Coral PCIe device from the VM); it still crashes like this... (picture)

Any help would be greatly appreciated. I can provide the hardware pages of Proxmox/Frigate, the Docker Compose config for Frigate, Frigate's config, etc., but I don't know if it is necessary for this problem...

(attached screenshots: 1709152862186.png, 1709153201454.png)
 
Is it just the VM that crashes, or the whole host? Anything in the logs?
 
Just the VM is crashing. The host is super stable.

The only action on the VM I'm able to do from Proxmox is "Stop". Then it is startable again. In that io_error state nothing else works there, even though it says it's using something like 0.2% of CPU and a few hundred MB of RAM. I'm also able to open the console, but it is not writable (it seems frozen).

Now which logs? Proxmox -> System -> Syslog?

Code:
Mar 04 09:00:44 proxmox pvedaemon[348457]: <root@pam> successful auth for user 'root@pam'
Mar 04 09:00:57 proxmox pvedaemon[348457]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Mar 04 09:01:02 proxmox pvedaemon[348438]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Mar 04 09:02:16 proxmox pvedaemon[348438]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Mar 04 09:02:22 proxmox pvedaemon[886319]: starting vnc proxy UPID:proxmox:000D862F:024FD803:65E5800E:vncproxy:101:root@pam:
Mar 04 09:02:22 proxmox pvedaemon[348457]: <root@pam> starting task UPID:proxmox:000D862F:024FD803:65E5800E:vncproxy:101:root@pam:
Mar 04 09:02:34 proxmox pvedaemon[348457]: <root@pam> end task UPID:proxmox:000D862F:024FD803:65E5800E:vncproxy:101:root@pam: OK
Mar 04 09:04:14 proxmox pvedaemon[886606]: starting vnc proxy UPID:proxmox:000D874E:0250038C:65E5807E:vncproxy:101:root@pam:
Mar 04 09:04:14 proxmox pvedaemon[348438]: <root@pam> starting task UPID:proxmox:000D874E:0250038C:65E5807E:vncproxy:101:root@pam:
Mar 04 09:04:21 proxmox pvedaemon[886630]: stop VM 101: UPID:proxmox:000D8766:0250065A:65E58085:qmstop:101:root@pam:
Mar 04 09:04:21 proxmox pvedaemon[348388]: <root@pam> starting task UPID:proxmox:000D8766:0250065A:65E58085:qmstop:101:root@pam:
Mar 04 09:04:21 proxmox kernel: tap101i0: left allmulticast mode
Mar 04 09:04:21 proxmox kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Mar 04 09:04:21 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Mar 04 09:04:21 proxmox kernel: vmbr1: port 3(fwpr101p0) entered disabled state
Mar 04 09:04:21 proxmox kernel: fwln101i0 (unregistering): left allmulticast mode
Mar 04 09:04:21 proxmox kernel: fwln101i0 (unregistering): left promiscuous mode
Mar 04 09:04:21 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Mar 04 09:04:22 proxmox kernel: fwpr101p0 (unregistering): left allmulticast mode
Mar 04 09:04:22 proxmox kernel: fwpr101p0 (unregistering): left promiscuous mode
Mar 04 09:04:22 proxmox kernel: vmbr1: port 3(fwpr101p0) entered disabled state
Mar 04 09:04:22 proxmox qmeventd[607]: read: Connection reset by peer
Mar 04 09:04:22 proxmox pvedaemon[348438]: <root@pam> end task UPID:proxmox:000D874E:0250038C:65E5807E:vncproxy:101:root@pam: OK
Mar 04 09:04:22 proxmox systemd[1]: 101.scope: Deactivated successfully.
Mar 04 09:04:22 proxmox systemd[1]: 101.scope: Consumed 14min 2.690s CPU time.
Mar 04 09:04:23 proxmox pvedaemon[348388]: <root@pam> end task UPID:proxmox:000D8766:0250065A:65E58085:qmstop:101:root@pam: OK
Mar 04 09:04:23 proxmox qmeventd[886649]: Starting cleanup for 101
Mar 04 09:04:23 proxmox qmeventd[886649]: Finished cleanup for 101
Mar 04 09:06:17 proxmox pvedaemon[348457]: <root@pam> starting task UPID:proxmox:000D8892:0250339A:65E580F9:qmstart:101:root@pam:
Mar 04 09:06:17 proxmox pvedaemon[886930]: start VM 101: UPID:proxmox:000D8892:0250339A:65E580F9:qmstart:101:root@pam:
Mar 04 09:06:18 proxmox systemd[1]: Started 101.scope.
Mar 04 09:06:19 proxmox kernel: tap101i0: entered promiscuous mode
Mar 04 09:06:19 proxmox kernel: vmbr1: port 3(fwpr101p0) entered blocking state
Mar 04 09:06:19 proxmox kernel: vmbr1: port 3(fwpr101p0) entered disabled state
Mar 04 09:06:19 proxmox kernel: fwpr101p0: entered allmulticast mode
Mar 04 09:06:19 proxmox kernel: fwpr101p0: entered promiscuous mode
Mar 04 09:06:19 proxmox kernel: vmbr1: port 3(fwpr101p0) entered blocking state
Mar 04 09:06:19 proxmox kernel: vmbr1: port 3(fwpr101p0) entered forwarding state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Mar 04 09:06:19 proxmox kernel: fwln101i0: entered allmulticast mode
Mar 04 09:06:19 proxmox kernel: fwln101i0: entered promiscuous mode
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 1(fwln101i0) entered forwarding state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Mar 04 09:06:19 proxmox kernel: tap101i0: entered allmulticast mode
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Mar 04 09:06:19 proxmox kernel: fwbr101i0: port 2(tap101i0) entered forwarding state
Mar 04 09:06:21 proxmox pvedaemon[348457]: <root@pam> end task UPID:proxmox:000D8892:0250339A:65E580F9:qmstart:101:root@pam: OK
Mar 04 09:06:21 proxmox pvedaemon[348438]: <root@pam> starting task UPID:proxmox:000D88D8:02503542:65E580FD:vncproxy:101:root@pam:
Mar 04 09:06:21 proxmox pvedaemon[887000]: starting vnc proxy UPID:proxmox:000D88D8:02503542:65E580FD:vncproxy:101:root@pam:
Mar 04 09:06:59 proxmox pvedaemon[348438]: <root@pam> end task UPID:proxmox:000D88D8:02503542:65E580FD:vncproxy:101:root@pam:OK

There seems to be nothing there for the io_error on that VM.
I didn't find any other obvious logs in the Proxmox UI.
If you need other logs (which I presume you do), just tell me which and I will try to get them.

Btw, the repro (stop, then start, then wait for the io_error state) takes basically <1 min, so the debug loop can be fast.
Also, within this <1 min the VM starts and works normally; even the Frigate Docker container inside the VM starts and shows the cameras, and the TPU (Google Coral) also loads there and seems to work during that <1 min.
 
Please set up a serial console for the VM and then dump both the system logs and the VM logs, starting with the VM startup until it hangs with io-error.
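
Setting that up could look roughly like this (a sketch, assuming VM ID 101 and an Ubuntu guest; the guest-side step depends on the distro):

Code:
# on the Proxmox host: give the VM a serial port
qm set 101 --serial0 socket

# inside the guest: have the kernel log to the serial port, e.g. add
#   console=ttyS0,115200 console=tty0
# to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run:
update-grub
# and reboot the guest

# back on the host: attach to the serial console and capture its output
qm terminal 101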
 
