Snapshot causes VM to become unresponsive

dvb91

Hi,
I am running Proxmox VE v8.2.4, and I often take snapshots (with memory) of Debian 11 VMs.
For several weeks (I'm not sure whether it started with QEMU v9) I have been hitting a critical issue systematically:
  • Every time I take a snapshot, the VM becomes very slow / unresponsive.
  • The only way to return to normal is to stop and restart the VM.
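
For reference, a snapshot with memory corresponds to something like this on the CLI (the VM ID and snapshot name here are just examples):

Code:
qm snapshot 102 mysnap --vmstate 1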

State of the VM at the end of the snapshot:
[screenshot]


Here is my configuration:
[screenshots]


I'm not sure, but it seems to be related to I/O:
-> Did I miss something in the configuration?
-> Is there a patch to fix this critical issue?

Don't hesitate to ask me for logs.

Regards.
 
Hi,
please share the output of pveversion -v and qm config 102 as well as the part of the system log/journal around the snapshot operation (both guest and host could be interesting). Are you using krbd for the Ceph storage or not?
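
For the journal, an excerpt around the time of the snapshot is enough, e.g. (the timestamps here are placeholders, adjust them to when you took the snapshot):

Code:
journalctl --since "2024-09-05 10:00" --until "2024-09-05 10:30" > snapshot-journal.txt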
 
I've done a new snapshot.

pveversion -v
Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 18.2.2-pve1
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
root@pve1:~#

qm config 102
Code:
root@pve1:~# qm config 102
agent: 1
boot: order=scsi0;net0
cores: 12
cpu: x86-64-v2-AES
description: Ici zone de commentaires pour VM Jeedom.
memory: 16384
meta: creation-qemu=8.0.2,ctime=1702231538
name: 02-JEEDOM-deb11
net0: virtio=BC:24:11:47:10:94,bridge=vmbr0,tag=50
numa: 0
onboot: 1
ostype: l26
scsi0: ceph-ssd:vm-102-disk-0,iothread=1,size=250G
scsihw: virtio-scsi-single
smbios1: uuid=0b167d18-b564-466f-b08d-f84576ae82a7
sockets: 1
startup: order=2
tags: PROD
vmgenid: dbe9d964-bd6f-4185-9c41-8f45ea65acec
root@pve1:~#

I don't use krbd for the Ceph storage:
[screenshot of the storage configuration]

Snapshot output
Code:
saving VM state and RAM using storage 'ceph-ssd'
14.06 MiB in 0s
316.38 MiB in 1s
657.03 MiB in 2s
982.24 MiB in 3s
1.29 GiB in 4s
1.61 GiB in 5s
1.96 GiB in 6s
2.29 GiB in 7s
2.61 GiB in 8s
2.95 GiB in 9s
3.29 GiB in 10s
3.63 GiB in 11s
3.96 GiB in 12s
4.29 GiB in 13s
4.62 GiB in 14s
4.96 GiB in 15s
5.30 GiB in 16s
5.63 GiB in 17s
5.94 GiB in 18s
6.28 GiB in 19s
6.60 GiB in 20s
6.94 GiB in 21s
7.29 GiB in 22s
7.61 GiB in 23s
7.95 GiB in 24s
8.28 GiB in 25s
8.62 GiB in 26s
8.93 GiB in 27s
9.29 GiB in 28s
9.63 GiB in 29s
9.97 GiB in 30s
10.31 GiB in 31s
10.65 GiB in 32s
10.97 GiB in 33s
11.31 GiB in 34s
11.62 GiB in 35s
11.94 GiB in 36s
12.27 GiB in 37s
12.60 GiB in 38s
12.92 GiB in 39s
13.25 GiB in 40s
13.58 GiB in 41s
13.89 GiB in 42s
14.20 GiB in 43s
14.51 GiB in 44s
14.83 GiB in 45s
15.16 GiB in 46s
15.48 GiB in 47s
15.83 GiB in 48s
16.16 GiB in 49s
16.49 GiB in 50s
16.84 GiB in 51s
17.18 GiB in 52s
17.53 GiB in 53s
17.85 GiB in 54s
18.15 GiB in 55s
18.43 GiB in 56s
18.68 GiB in 57s
18.93 GiB in 58s
19.19 GiB in 59s
reducing reporting rate to every 10s
22.08 GiB in 1m 9s
completed saving the VM state in 1m 13s, saved 23.42 GiB
snapshotting 'drive-scsi0' (ceph-ssd:vm-102-disk-0)
TASK OK

Could you please tell me the exact name and path of the logs you need?
Thank you

[EDIT]
Please find the system log from pve1 and the syslog from the VM attached.
I hope it helps.
 

I've done a new snapshot.
Is it always the same kind of processes that consume the CPU inside the VM afterwards?

Code:
scsi0: ceph-ssd:vm-102-disk-0,iothread=1,size=250G
I don't use krbd for the Ceph storage:
Can you check whether turning krbd on and turning iothread off for the disk makes a difference?
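
Both changes can also be made from the CLI; a sketch using the storage and disk names from your config (note that the iothread change only takes effect after a full stop and start of the VM):

Code:
# enable krbd for the storage (images get mapped via the kernel RBD driver)
pvesm set ceph-ssd --krbd 1
# re-add the disk with iothread disabled
qm set 102 --scsi0 ceph-ssd:vm-102-disk-0,iothread=0,size=250G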

[EDIT]
Please find the system log from pve1 and the syslog from the VM attached.
I hope it helps.
Unfortunately, there is no qmsnapshot task mentioned in the host's system log. Are you sure this is the correct node?
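
You can grep for it on the host yourself, e.g.:

Code:
# search the host journal for snapshot task entries (exact wording may vary)
journalctl | grep -i qmsnapshot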
 
Is it always the same kind of processes that consume the CPU inside the VM afterwards?
Here is the htop output from the last two snapshots:

September 6:
[htop screenshot]

Today:
[htop screenshot]

-> The top process seems to be mariadb.

Can you check whether turning krbd on and turning iothread off for the disk makes a difference?

I tried today with these settings; unfortunately, same problem:
[screenshots of the updated storage and disk settings]


There is no qmsnapshot task mentioned in the host's system log. Are you sure this is the correct node?
In case of a mismatch, please find the latest log from today attached -> "crashlog pve1 focus.txt"
 


What physical CPU model do you have (lscpu)?
I am using an Intel 13th-generation CPU:
[lscpu screenshot]

The motherboard's BIOS is up to date (Intel microcode 0x129):
[BIOS screenshot]

But I have never installed the intel-microcode package, and previously had no problem with snapshots.
If the BIOS is up to date, is the package mandatory?


Does the issue also occur if you pause and then resume the VM after a while?
I don't understand how and when to suspend (immediately after the VM becomes unresponsive?).
Could you please give more details?

What kernel version is running inside the guest?
uname -sr
[screenshot of the uname -sr output]


This VM reads several RTSP video streams. I stopped all these streams and took a new snapshot -> the operation completed without issue and very quickly. Perhaps this dozen RTSP video streams (a lot of I/O?) could disrupt the end of the snapshot and the resume of the VM?
 
The motherboard's BIOS is up to date (Intel microcode 0x129):

But I have never installed the intel-microcode package, and previously had no problem with snapshots.
If the BIOS is up to date, is the package mandatory?
No, if you are using persistent CPU microcode via BIOS update, you don't need the package (it is used for early OS microcode updates).
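
For reference, the microcode revision actually in use can be checked from within the OS with something like:

Code:
# currently loaded microcode revision (one line per CPU, the first is enough)
grep microcode /proc/cpuinfo | head -n 1
# kernel messages about (early) microcode updates, if any
dmesg | grep -i microcode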

I don't understand how and when to suspend (immediately after the VM becomes unresponsive?).
Could you please give more details?
I mean instead of taking a snapshot, pause the VM, wait a few seconds and resume again. It would be interesting to know whether the issue also happens then.
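
From the CLI this would be something like (using your VM ID):

Code:
qm suspend 102
sleep 20
qm resume 102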

This VM reads several RTSP video streams. I stopped all these streams and took a new snapshot -> the operation completed without issue and very quickly. Perhaps this dozen RTSP video streams (a lot of I/O?) could disrupt the end of the snapshot and the resume of the VM?
That sounds plausible. It seems like the vCPU handling in QEMU or the guest kernel gets confused for some reason.

Is the issue also there when you switch the VM to use the host CPU type?
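
Switching the CPU type can be done from the CLI as well, e.g. (takes effect after the VM is restarted):

Code:
qm set 102 --cpu host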
 
I mean instead of taking a snapshot, pause the VM, wait a few seconds and resume again. It would be interesting to know whether the issue also happens then.
I paused the VM for approximately 20 seconds and resumed it
-> OK, no issue.

I switched from x86-64-v2-AES to host and took a snapshot
-> OK, no issue.

Code:
/dev/rbd1
saving VM state and RAM using storage 'ceph-ssd'
3.01 MiB in 0s
1.31 GiB in 1s
2.45 GiB in 2s
completed saving the VM state in 3s, saved 3.24 GiB
snapshotting 'drive-scsi0' (ceph-ssd:vm-102-disk-0)
Creating snap: 10% complete...
Creating snap: 100% complete...done.
TASK OK

If I understood correctly:
  • There is an issue with x86-64-v2-AES; I need to use host temporarily.
  • What about the performance loss?
  • A VM with the host CPU type cannot be migrated to a host with another processor family (e.g. AMD).
Do you think it will take long to fix x86-64-v2-AES?

Regards.
 
  • What about the performance loss?
The host CPU model does provide better performance.

  • A VM with the host CPU type cannot be migrated to a host with another processor family (e.g. AMD).
No matter what virtual CPU model you are using, live-migration between host CPUs from different vendors can never be guaranteed to work: https://pve.proxmox.com/pve-docs/chapter-qm.html#_online_migration

Do you think it will take long to fix x86-64-v2-AES?
I was not able to reproduce the issue and haven't seen other reports about this. So it's not even clear what the issue is.
 
OK, then I will keep the host setting.

Last question: for better performance, what's your advice for these settings:
- krbd -> off or on?
- iothread -> off or on?

Thank you
 
