Snapshot causes VM to become unresponsive

dvb91

Hi,
I am running Proxmox VE v8.2.4, and I often take snapshots (with memory) of Debian 11 VMs.
For several weeks (I'm not sure whether it started with QEMU v9) I have been hitting a critical issue systematically:
  • Every time I take a snapshot, the VM becomes very slow / unresponsive.
  • The only way to return to normal is to stop and restart the VM.
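
For reference, a snapshot with memory corresponds to something like this on the CLI (the VM ID and snapshot name here are just examples):

Code:
qm snapshot 102 mysnap --vmstate 1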

State of the VM at the end of the snapshot:
[screenshot]


Here is my configuration:
[screenshots]


I'm not sure, but it seems to be related to I/O:
-> Did I miss something in the configuration?
-> Is there a patch to fix this critical issue?

Don't hesitate to ask me for logs.

Regards.
 
Hi,
please share the output of pveversion -v and qm config 102 as well as the part of the system log/journal around the snapshot operation (both guest and host could be interesting). Are you using krbd for the Ceph storage or not?
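
For the journal, an excerpt around the time of the snapshot is enough, e.g. (the timestamps here are placeholders, adjust them to when you took the snapshot):

Code:
journalctl --since "2024-09-05 10:00" --until "2024-09-05 10:30" > snapshot-journal.txt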
 
I've done a new snapshot.

pveversion -v
Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 18.2.2-pve1
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
root@pve1:~#

qm config 102
Code:
root@pve1:~# qm config 102
agent: 1
boot: order=scsi0;net0
cores: 12
cpu: x86-64-v2-AES
description: Ici zone de commentaires pour VM Jeedom.
memory: 16384
meta: creation-qemu=8.0.2,ctime=1702231538
name: 02-JEEDOM-deb11
net0: virtio=BC:24:11:47:10:94,bridge=vmbr0,tag=50
numa: 0
onboot: 1
ostype: l26
scsi0: ceph-ssd:vm-102-disk-0,iothread=1,size=250G
scsihw: virtio-scsi-single
smbios1: uuid=0b167d18-b564-466f-b08d-f84576ae82a7
sockets: 1
startup: order=2
tags: PROD
vmgenid: dbe9d964-bd6f-4185-9c41-8f45ea65acec
root@pve1:~#

I don't use krbd for the Ceph storage:
[screenshot of the storage configuration]

Snapshot output
Code:
saving VM state and RAM using storage 'ceph-ssd'
14.06 MiB in 0s
316.38 MiB in 1s
657.03 MiB in 2s
982.24 MiB in 3s
1.29 GiB in 4s
1.61 GiB in 5s
1.96 GiB in 6s
2.29 GiB in 7s
2.61 GiB in 8s
2.95 GiB in 9s
3.29 GiB in 10s
3.63 GiB in 11s
3.96 GiB in 12s
4.29 GiB in 13s
4.62 GiB in 14s
4.96 GiB in 15s
5.30 GiB in 16s
5.63 GiB in 17s
5.94 GiB in 18s
6.28 GiB in 19s
6.60 GiB in 20s
6.94 GiB in 21s
7.29 GiB in 22s
7.61 GiB in 23s
7.95 GiB in 24s
8.28 GiB in 25s
8.62 GiB in 26s
8.93 GiB in 27s
9.29 GiB in 28s
9.63 GiB in 29s
9.97 GiB in 30s
10.31 GiB in 31s
10.65 GiB in 32s
10.97 GiB in 33s
11.31 GiB in 34s
11.62 GiB in 35s
11.94 GiB in 36s
12.27 GiB in 37s
12.60 GiB in 38s
12.92 GiB in 39s
13.25 GiB in 40s
13.58 GiB in 41s
13.89 GiB in 42s
14.20 GiB in 43s
14.51 GiB in 44s
14.83 GiB in 45s
15.16 GiB in 46s
15.48 GiB in 47s
15.83 GiB in 48s
16.16 GiB in 49s
16.49 GiB in 50s
16.84 GiB in 51s
17.18 GiB in 52s
17.53 GiB in 53s
17.85 GiB in 54s
18.15 GiB in 55s
18.43 GiB in 56s
18.68 GiB in 57s
18.93 GiB in 58s
19.19 GiB in 59s
reducing reporting rate to every 10s
22.08 GiB in 1m 9s
completed saving the VM state in 1m 13s, saved 23.42 GiB
snapshotting 'drive-scsi0' (ceph-ssd:vm-102-disk-0)
TASK OK

Could you please tell me the exact name and path of the logs you need?
Thank you

[EDIT]
Please find the system log from pve1 and the syslog from the VM attached.
I hope it helps.
 

I've done a new snapshot.
Is it always the same kind of processes that consume the CPU inside the VM afterwards?

Code:
scsi0: ceph-ssd:vm-102-disk-0,iothread=1,size=250G
I don't use krbd for the Ceph storage:
Can you check whether turning krbd on and turning iothread off for the disk makes a difference?
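
Both changes can also be made from the CLI; a sketch using the storage and disk names from your config (note that the iothread change only takes effect after a full stop and start of the VM):

Code:
# enable krbd for the storage (images get mapped via the kernel RBD driver)
pvesm set ceph-ssd --krbd 1
# re-add the disk with iothread disabled
qm set 102 --scsi0 ceph-ssd:vm-102-disk-0,iothread=0,size=250G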

[EDIT]
Please find the system log from pve1 and the syslog from the VM attached.
I hope it helps.
Unfortunately, there is no qmsnapshot task mentioned in the host's system log. Are you sure this is the correct node?
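
You can grep for it on the host yourself, e.g.:

Code:
# search the host journal for snapshot task entries (exact wording may vary)
journalctl | grep -i qmsnapshot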
 
Is it always the same kind of processes that consume the CPU inside the VM afterwards?
Here is the htop output from the last two snapshots:

September 6:
[htop screenshot]

Today:
[htop screenshot]

-> The top process seems to be mariadb.

Can you check whether turning krbd on and turning iothread off for the disk makes a difference?

I tried today with these settings; unfortunately, same problem:
[screenshots of the updated storage and disk settings]


There is no qmsnapshot task mentioned in the host's system log. Are you sure this is the correct node?
In case of a mismatch, please find the latest log from today attached -> "crashlog pve1 focus.txt"
 


What physical CPU model do you have (lscpu)?
I am using an Intel 13th-generation CPU:
[lscpu screenshot]

The motherboard's BIOS is up to date (Intel microcode 0x129):
[BIOS screenshot]

But I have never installed the intel-microcode package, and previously had no problem with snapshots.
If the BIOS is up to date, is the package mandatory?


Does the issue also occur if you pause and then resume the VM after a while?
I don't understand how and when to suspend (immediately after the VM becomes unresponsive?).
Could you please give more details?

What kernel version is running inside the guest?
uname -sr
[screenshot of the uname -sr output]


This VM reads several RTSP video streams. I stopped all these streams and took a new snapshot -> the operation completed without issue and very quickly. Perhaps this dozen RTSP video streams (a lot of I/O?) could disrupt the end of the snapshot and the resume of the VM?
 
The motherboard's BIOS is up to date (Intel microcode 0x129):

But I have never installed the intel-microcode package, and previously had no problem with snapshots.
If the BIOS is up to date, is the package mandatory?
No, if you are using persistent CPU microcode via BIOS update, you don't need the package (it is used for early OS microcode updates).
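
For reference, the microcode revision actually in use can be checked from within the OS with something like:

Code:
# currently loaded microcode revision (one line per CPU, the first is enough)
grep microcode /proc/cpuinfo | head -n 1
# kernel messages about (early) microcode updates, if any
dmesg | grep -i microcode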

I don't understand how and when to suspend (immediately after the VM becomes unresponsive?).
Could you please give more details?
I mean instead of taking a snapshot, pause the VM, wait a few seconds and resume again. It would be interesting to know whether the issue also happens then.
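
From the CLI this would be something like (using your VM ID):

Code:
qm suspend 102
sleep 20
qm resume 102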

This VM reads several RTSP video streams. I stopped all these streams and took a new snapshot -> the operation completed without issue and very quickly. Perhaps this dozen RTSP video streams (a lot of I/O?) could disrupt the end of the snapshot and the resume of the VM?
That sounds plausible. It seems like the vCPU handling in QEMU or the guest kernel gets confused for some reason.

Is the issue also there when you switch the VM to use the host CPU type?
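
Switching the CPU type can be done from the CLI as well, e.g. (takes effect after the VM is restarted):

Code:
qm set 102 --cpu host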
 
I mean instead of taking a snapshot, pause the VM, wait a few seconds and resume again. It would be interesting to know whether the issue also happens then.
I paused the VM for approximately 20 seconds and resumed it
-> OK, no issue.

I switched from x86-64-v2-AES to host and took a snapshot
-> OK, no issue.

Code:
/dev/rbd1
saving VM state and RAM using storage 'ceph-ssd'
3.01 MiB in 0s
1.31 GiB in 1s
2.45 GiB in 2s
completed saving the VM state in 3s, saved 3.24 GiB
snapshotting 'drive-scsi0' (ceph-ssd:vm-102-disk-0)
Creating snap: 10% complete...
Creating snap: 100% complete...done.
TASK OK

If I understood correctly:
  • There is an issue with x86-64-v2-AES; I need to use host temporarily.
  • What about the performance loss?
  • A VM with the host CPU type cannot be migrated to a host with another processor family (e.g. AMD).
Do you think it will take long to fix x86-64-v2-AES?

Regards.
 
  • What about the performance loss?
The host CPU model does provide better performance.

  • A VM with the host CPU type cannot be migrated to a host with another processor family (e.g. AMD).
No matter what virtual CPU model you are using, live-migration between host CPUs from different vendors can never be guaranteed to work: https://pve.proxmox.com/pve-docs/chapter-qm.html#_online_migration

Do you think it will take long to fix x86-64-v2-AES?
I was not able to reproduce the issue and haven't seen other reports about this. So it's not even clear what the issue is.
 
OK, then I will keep the host setting.

Last question: for better performance, what's your advice for these settings:
- krbd -> off or on?
- iothread -> off or on?

Thank you
 
