[SOLVED] VMs freeze with 100% CPU

There is no SSH access and the VNC console does not respond; it just shows a frozen, static picture. The guest's log stops abruptly and only resumes after a restart:

Code:
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,285] INFO [RaftManager nodeId=2] Vote request VoteRequestDa>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,368] INFO [RaftManager nodeId=2] Completed transition to Fo>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,370] INFO [BrokerToControllerChannelManager broker=2 name=h>
-- Boot fbd87f21176742cb8ab0717732d2b6bc --
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Linux version 5.15.0-73-generic (buildd@bos03-amd64-060) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0,>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Command line: BOOT_IMAGE=/vmlinuz-5.15.0-73-generic root=/dev/mapper/ap--vg-ap--lv--root ro ip>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: KERNEL supported cpus:
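(For reference, that excerpt is from the guest's journal; roughly the following shows the tail of the interrupted boot and the start of the next one, assuming journald keeps persistent logs:)

Code:
# tail of the previous (interrupted) boot and start of the current one
journalctl -b -1 -n 50 --no-pager
journalctl -b 0 -n 50 --no-pager
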
On the Proxmox node itself I see this in the syslog:

Code:
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:08:14 petr-stor4 pvestatd[2860]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - got timeout
Jun 16 20:18:49 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:08 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:28 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:24 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:43 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:22:05 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
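The full SMART data for the two drives flagged above can be pulled through the MegaRAID controller roughly like this (a sketch; the ,18/,24 device numbers are taken from the smartd lines, and depending on the controller you may need the underlying block device instead of /dev/bus/0):

Code:
smartctl -a -d megaraid,18 /dev/bus/0
smartctl -a -d megaraid,24 /dev/bus/0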

Maybe my osd.54 is slowly dying and that's why the machines freeze? But I have replication 2/3 in my Ceph... Yesterday I got a Ceph warning in the PVE UI saying "1 daemons have recently crashed: osd.54 crashed on host *****". For now, though, the osd.54 service is running fine. Backtrace of the osd.54 crash:

Code:
{
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f69cb483140]",
        "(BlueStore::Extent::~Extent()+0x27) [0x55d6e1ebb8e7]",
        "(BlueStore::Onode::put()+0x2c5) [0x55d6e1e32f25]",
        "(std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x67) [0x55d6e1ebc2c7]",
        "(LruOnodeCacheShard::_trim_to(unsigned long)+0xca) [0x55d6e1ebfb5a]",
        "(BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x15d) [0x55d6e1e3371d]",
        "(BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x399) [0x55d6e1e3a309]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55d6e1e814dd]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55d6e1e82430]",
        "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x52) [0x55d6e1aa8412]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7b4) [0x55d6e1cb8ef4]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x53d) [0x55d6e1a2418d]",
        "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd46) [0x55d6e1a80326]",
        "(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x334a) [0x55d6e1a87c6a]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bc) [0x55d6e18f789c]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x65) [0x55d6e1b77505]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55d6e1924367]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55d6e1fcd3da]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6e1fcf9b0]",
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f69cb477ea7]",
        "clone()"
    ],
    "ceph_version": "16.2.9",
    "crash_id": "2023-06-15T20:43:02.275025Z_aad0cf01-3839-41a3-b8bd-d516080722b1",
    "entity_name": "osd.54",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": "f33237076f54d8500909a0c8c279f6639d4e914520f35b288af4429eebfd958e",
    "timestamp": "2023-06-15T20:43:02.275025Z",
    "utsname_hostname": "petr-stor4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.35-2-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.35-5 (Wed, 08 Jun 2022 15:02:51 +0200)"
}
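For anyone who wants to pull the same details: the crash report above comes from Ceph's crash module, roughly via (a sketch; substitute the crash ID from `ceph crash ls`):

Code:
ceph crash ls                  # list recent crashes (this is what triggers the health warning)
ceph crash info <crash_id>     # full metadata and backtrace, as pasted above
ceph crash archive <crash_id>  # clear the warning once the crash has been looked at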
Hi,
you have two disks with issues!
Please replace megaraid_disk_18 first (quickly) - this disk has unrecoverable read errors, not a good sign! If you can't replace it right away, set its OSD down/out so that the data is moved to other OSDs.
After the replacement (and rebuild), replace megaraid_disk_24 too.
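Roughly like this (only a sketch; <OSD_ID> is the OSD that sits on megaraid_disk_18):

Code:
ceph osd out <OSD_ID>              # mark the OSD out so Ceph rebalances its data away
ceph -s                            # wait until recovery/backfill has finished
systemctl stop ceph-osd@<OSD_ID>   # then stop the daemon before pulling the disk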

BTW: Ceph accessing megaraid_disk_18 directly sounds a bit like a single-disk RAID-0 per drive to simulate an HBA?

Udo
 
So this morning I caught another one. Here is the output of the logs, but I don't think it looks anything different from the last time.

strace:
Code:
strace -c -p $(cat /var/run/qemu-server/115.pid)
strace: Process 195052 attached
^Cstrace: Process 195052 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.39   38.310114       16267      2355           ppoll
  1.12    0.434809          48      8936           write
  0.32    0.124000          53      2298           read
  0.17    0.066618          30      2187           recvmsg
  0.00    0.000126           2        43           sendmsg
  0.00    0.000039           4         9           accept4
  0.00    0.000035           3         9           close
  0.00    0.000014           0        18           fcntl
  0.00    0.000010           1         9           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   38.935765        2454     15864           total


gdb output attached as file.

VM config:
Code:
qm config 115
balloon: 0
boot: order=scsi0;ide2;net0
cores: 4
cpu: Broadwell
description: OS%3A Debian 11.2 bullseye%0AUpdated packages 10-2-2022%0AContains default HESI user account
ide2: none,media=cdrom
memory: 32768
name: GEIS-beta-16
net0: virtio=1E:71:E6:AA:5D:7E,bridge=vmbr0,tag=37
numa: 0
onboot: 1
ostype: l26
scsi0: hesi-storage:vm-115-disk-0,size=1000G
scsihw: virtio-scsi-pci
smbios1: uuid=5a766abc-9464-4995-ba6c-2c5c0772115d
sockets: 2

pveversion:
Code:
proxmox-ve: 7.4-1 (running kernel: 6.2.11-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-5.13: 7.1-9
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.2.11-1-pve: 6.2.11-1
pve-kernel-6.2.6-1-pve: 6.2.6-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.0
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

And now also the file descriptors:
Code:
for pid in $(pidof kvm); do prlimit -p $pid | grep NOFILE; ls -1 /proc/$pid/fd/ | wc -l; done
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
80
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
79
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
89
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
89
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                  1024     1048576 files
91
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
32
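(A small variation of the loop above that also prints which VM each KVM process belongs to, by going through the PID files instead of pidof - just a sketch:)

Code:
for pidfile in /var/run/qemu-server/*.pid; do
    vmid=$(basename "$pidfile" .pid)
    pid=$(cat "$pidfile")
    echo "== VM $vmid (pid $pid) =="
    prlimit -p "$pid" | grep NOFILE
    ls -1 "/proc/$pid/fd/" | wc -l
done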

So, even though the `qm config` output says `balloon: 0`, that is not the running configuration but a pending change I made after the last recommendations. I will reboot the machine now, and also update the disk to use `aio=threads`.
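Concretely, something along these lines (a sketch; with `qm set` the whole drive definition has to be repeated, and the change only becomes active after a full stop/start of the VM):

Code:
qm set 115 --scsi0 hesi-storage:vm-115-disk-0,size=1000G,aio=threads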

So far the machine that crashed last week, in which I did update the settings, hasn't crashed again, but then again I would not have expected it to yet. I get a freezing VM every week or so, alternating between 4 VMs out of a total of 50.
 

Attachments

  • hanging_vm_debug_2.txt (22.7 KB)
So this morning I caught another one. Here is the output of the logs, but I don't think it looks anything different from the last time.
Yes, again it's only sitting in ppoll (likely some event never arrives), but it doesn't look like the PLT corruption.
So, even though the `qm config` output says `balloon: 0`, that is not the running configuration but a pending change I made after the last recommendations. I will reboot the machine now, and also update the disk to use `aio=threads`.

So far the machine that crashed last week, in which I did update the settings, hasn't crashed again, but then again I would not have expected it to yet. I get a freezing VM every week or so, alternating between 4 VMs out of a total of 50.
Let's hope for the best. According to @udo, we know that neither ballooning nor io_uring is causing the PLT corruption issue, but we don't know that yet for your issue.
 
Hi, I'm currently experiencing the same issue, always with Windows 10 machines.

I've tried many of the possible fixes in this thread, but none solved the issue.

It happens on both Intel (Xeon) and AMD Ryzen CPUs.

Always using Virtio for SCSI and net

Kernel Version

Linux 5.15.107-2-pve #1 SMP PVE 5.15.107-2 (2023-05-10T09:10Z)
PVE Manager Version

pve-manager/7.4-13/46c37d9c

Machine i440fx latest (7.2)

The only solution so far has been to increase the SPICE memory to 128. Before this change I had freezes every hour or so; now I've been running for a couple of days with no issue.
 
@machana I'm not using SPICE at all. Did you gather the information proposed in this thread (gdb, strace, open files) when the freeze happened? It could be a different issue with similar symptoms. Also, you are using kernel 5.15, which, afaik, should not be affected by this issue.
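For the gdb part, something like this is usually enough (a sketch; attaching briefly pauses the process and detaches again afterwards):

Code:
gdb --batch -ex 'thread apply all bt' -p "$(cat /var/run/qemu-server/<ID>.pid)" > /tmp/vm-<ID>-gdb.txt 2>&1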
 
It just happened again on my Debian VM. 20 days of uptime without any problem, and then suddenly, right after the PBS backup job ran over this VM, the CPU was at 100%.

Could there be a connection with the PBS backup tasks, or is it just a coincidence?
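One way to check would be to compare the backup job's time window with the moment the guest froze, e.g. on the PVE host (just a sketch; adjust the time range):

Code:
journalctl --since "12 hours ago" | grep -Ei 'vzdump|backup job|qmp'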
 
I also have those issues.
Mostly a Win2022 server hangs. I will check `strace -c -p $(cat /var/run/qemu-server/<ID>.pid)` next time
 
Just my 2 cents. I have this post; we're having similar issues.
I wanted to comment that we have 5 different clusters, and the one that is still on PVE 7.1 is not having any problems.
Our problems started with 7.2.
Again, a migration unfreezes the VM instantly, as if nothing had happened.

edit: typo
 
Yes, it seems that more people are facing similar issues in various (slightly differently named) threads. I can also confirm that on older versions of Proxmox, which I have been running for about 4 years, I never saw this happen. Has anybody already switched to version 8 and can share their experience?

I just got another freeze this morning, this time on a VM with ballooning disabled, so unfortunately that does not seem to be the cause. I have now updated the disk setting to use `aio=threads`.
 
I am running kernel version 6.2 (see for example my post from June 20th). I never experienced this problem with kernel 5.15. In fact, the problem started occurring when I began using the opt-in kernel 5.19, and continued when the opt-in version changed to 6.2.

The reason I started using the opt-in kernel is that my VMs would freeze after live migration, but only on specific host combinations. That problem is discussed here: https://forum.proxmox.com/threads/vm-stuck-freeze-after-live-migration. Long story short: it was resolved by using the opt-in kernel.

So yes, going back to kernel 5.15 would probably be a solution, but it would bring back the other problem. I have considered going all the way back to Proxmox 6.4 with kernel 5.4, which was super stable, but I am not really happy about that idea.
 
Going back to PVE 6.4 is not an option for me (the base distro is EOL and 6.4 does not support PBS namespaces). I ran kernel 6.x because of the same live migration problems on heterogeneous hardware. At the moment I'd rather avoid live migration than suffer these freezes.

I have yet to test PVE 8, but I'm worried because it runs on kernel 6.2, which suffers from these freezes on PVE 7. PVE 8 uses QEMU 8, though, so it might work.
 
Happened again yesterday.

Configs/strace output from this morning.

strace
Code:
strace: Process 2936050 attached
^Cstrace: Process 2936050 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.44   31.016833       27842      1114           ppoll
  0.30    0.092252          41      2249           clock_gettime
  0.09    0.029191           6      4208           write
  0.06    0.019527          18      1040        10 recvmsg
  0.06    0.017580          16      1082           read
  0.05    0.015723         786        20           sendmsg
  0.00    0.000014           3         4           close
  0.00    0.000012           1         8           fcntl
  0.00    0.000012           3         4           accept4
  0.00    0.000007           1         4           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   31.191151        3204      9733        10 total

fds
Code:
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
59
NOFILE     max number of open files                1024      4096 files
0
NOFILE     max number of open files                1024    524288 files
50

cfg
Code:
balloon: 0
boot: order=scsi0;net0
cores: 4
hostpci0: 0000:01:00.0,rombar=0
ide2: none,media=cdrom
memory: 20480
meta: creation-qemu=7.1.0,ctime=1675864909
name: truenas
net0: virtio=EA:5B:87:BB:03:74,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: s20230213
scsi0: local-zfs:vm-108-disk-0,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=8d1ca9cb-3a25-4aac-853a-7bc6aa901c17
sockets: 1
tags: backups;data
usb0: host=1058:264d

versions
Code:
proxmox-ve: 7.4-1 (running kernel: 6.1.2-1-pve)
pve-manager: 7.4-4 (running version: 7.4-4/4a8501a8)
pve-kernel-5.15: 7.4-3
pve-kernel-6.1: 7.3-6
pve-kernel-6.1.15-1-pve: 6.1.15-1
pve-kernel-6.1.2-1-pve: 6.1.2-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.0
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1

Last signal from VM:
Code:
2023-06-27T14:10:01-0400,4098758,root,20,0,9076,2904,2220,R,6.7,0.0,0:00.01,top
2023-06-27T14:10:01-0400,1,root,20,0,168136,11224,5916,S,0.0,0.1,1:56.14,systemd
2023-06-27T14:10:01-0400,2,root,20,0,0,0,0,S,0.0,0.0,25:22.23,kthreadd
2023-06-27T14:10:01-0400,3,root,0,-20,0,0,0,I,0.0,0.0,0:00.00,rcu_gp
2023-06-27T14:10:01-0400,4,root,0,-20,0,0,0,I,0.0,0.0,0:00.00,rcu_par+

This doesn't happen on Linux 5.15.74-1-pve #1 SMP PVE 5.15.74-1
 
I am running kernel version 6.2 (see for example my post from June 20th). I never experienced this problem with kernel 5.15. In fact, the problem started occurring when I began using the opt-in kernel 5.19, and continued when the opt-in version changed to 6.2.

The reason I started using the opt-in kernel is that my VMs would freeze after live migration, but only on specific host combinations. That problem is discussed here: https://forum.proxmox.com/threads/vm-stuck-freeze-after-live-migration. Long story short: it was resolved by using the opt-in kernel.

So yes, going back to kernel 5.15 would probably be a solution, but it would bring back the other problem. I have considered going all the way back to Proxmox 6.4 with kernel 5.4, which was super stable, but I am not really happy about that idea.
I am pretty sure that I had this problem with 5.15 as well and switched to the opt-in kernels hoping for a fix...

FYI, I've upgraded to 8 and will monitor to see what happens.
 
On our setup we use the opt-in kernels, so when we moved some of the clusters from 7.1 to 7.2, we also switched to 5.19.
That combination, PVE 7.2 and kernel 5.19, had these freezes. Then we upgraded all the way to the current PVE 7 and kernel 6.2, and the freezes continued.
On the other hand, one of the clusters remained on PVE 7.1 and kernel 5.13, and it has never had a frozen VM.
 
