Hi,
There is no SSH access, and the VNC console does not respond to input; the screen just shows a frozen picture. The log on the machine itself breaks off abruptly and only resumes after the restart:
Code:
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,285] INFO [RaftManager nodeId=2] Vote request VoteRequestDa>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,368] INFO [RaftManager nodeId=2] Completed transition to Fo>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,370] INFO [BrokerToControllerChannelManager broker=2 name=h>
-- Boot fbd87f21176742cb8ab0717732d2b6bc --
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Linux version 5.15.0-73-generic (buildd@bos03-amd64-060) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0,>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Command line: BOOT_IMAGE=/vmlinuz-5.15.0-73-generic root=/dev/mapper/ap--vg-ap--lv--root ro ip>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: KERNEL supported cpus:

On the Proxmox node itself I see this in the syslog:
Code:
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:08:14 petr-stor4 pvestatd[2860]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - got timeout
Jun 16 20:18:49 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:08 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:28 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:24 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:43 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:22:05 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
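For reference, the full SMART details of those two disks behind the MegaRAID controller can be pulled like this (a sketch; the device path and ids are taken from the smartd lines above, adjust them if your controller is addressed differently):

Code:
# physical disk id 18 (the one with pending / uncorrectable sectors)
smartctl -a -d megaraid,18 /dev/bus/0
# physical disk id 24 (the one reporting FAILURE PREDICTION THRESHOLD EXCEEDED)
smartctl -a -d megaraid,24 /dev/bus/0
# attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and
# 198 (Offline_Uncorrectable) are the ones to watch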
Maybe my osd.54 is slowly dying and that is why the machines freeze? But I have a replication factor of 2/3 in my Ceph... Yesterday I had a Ceph warning in the PVE UI which said "1 daemons have recently crashed: osd.54 crashed on host *****". For now, though, the osd.54 service works fine. Backtrace of the osd.54 crash:
Code:{ "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f69cb483140]", "(BlueStore::Extent::~Extent()+0x27) [0x55d6e1ebb8e7]", "(BlueStore::Onode::put()+0x2c5) [0x55d6e1e32f25]", "(std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x67) [0x55d6e1ebc2c7]", "(LruOnodeCacheShard::_trim_to(unsigned long)+0xca) [0x55d6e1ebfb5a]", "(BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x15d) [0x55d6e1e3371d]", "(BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x399) [0x55d6e1e3a309]", "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55d6e1e814dd]", "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55d6e1e82430]", "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x52) [0x55d6e1aa8412]", "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7b4) [0x55d6e1cb8ef4]", "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x53d) [0x55d6e1a2418d]", "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd46) [0x55d6e1a80326]", "(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x334a) [0x55d6e1a87c6a]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bc) [0x55d6e18f789c]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x65) [0x55d6e1b77505]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55d6e1924367]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55d6e1fcd3da]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6e1fcf9b0]", "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f69cb477ea7]", "clone()" ], "ceph_version": "16.2.9", "crash_id": "2023-06-15T20:43:02.275025Z_aad0cf01-3839-41a3-b8bd-d516080722b1", "entity_name": "osd.54", "os_id": "11", "os_name": "Debian GNU/Linux 11 (bullseye)", "os_version": "11 (bullseye)", "os_version_id": "11", "process_name": "ceph-osd", "stack_sig": "f33237076f54d8500909a0c8c279f6639d4e914520f35b288af4429eebfd958e", "timestamp": "2023-06-15T20:43:02.275025Z", "utsname_hostname": "petr-stor4", "utsname_machine": "x86_64", "utsname_release": "5.15.35-2-pve", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PVE 5.15.35-5 (Wed, 08 Jun 2022 15:02:51 +0200)" }
You have two disks with issues!
Please replace megaraid_disk_18 first, and quickly - this disk has unrecoverable read errors, which is not a good sign! If you can't replace it right away, set the OSD down/out so that its data is moved to other OSDs (see the sketch below).
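A minimal sketch of that, assuming the failing disk backs one of the OSDs on petr-stor4 (first check which OSD id actually sits on megaraid_disk_18; the id 54 below is only an example):

Code:
# on petr-stor4: map OSD ids to their physical devices
ceph-volume lvm list
ceph osd tree
# mark the affected OSD out so Ceph rebalances its data to the remaining OSDs
ceph osd out 54            # replace 54 with the OSD id on megaraid_disk_18
ceph -w                    # watch the recovery/backfill progress
# once recovery is done, stop the daemon before pulling the disk
systemctl stop ceph-osd@54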
And after the replacement (and rebuild), replace megaraid_disk_24 as well.
BTW: Ceph should access the disks directly. "megaraid_disk_18" sounds a bit like a single-disk RAID-0 per drive to simulate an HBA?
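If storcli (or the vendor's perccli equivalent) is installed, this is one way to check how the controller presents the disks (a sketch, assuming controller 0; the binary name depends on the vendor package):

Code:
# list the virtual drives - per-disk RAID-0 VDs would show up here,
# a real HBA / JBOD pass-through setup would not
storcli64 /c0/vall show
# list the physical drives and their state
storcli64 /c0/eall/sall show
# how the block devices finally appear to Linux / Ceph
lsblk -o NAME,TYPE,SIZE,MODEL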
Udo