Slow Snapshots?

MoreDakka

Hey.

We have a 4-node Proxmox/Ceph cluster (Ceph on 40G NICs, Proxmox interconnects on 10G NICs, internet on dual 1G NICs).
Ceph is 8x 2TB SM863a drives.

The problem is snapshots. I don't use them much, but I wanted to test them before we allow clients on here.
This part is quick:

/dev/rbd4
saving VM state and RAM using storage 'Ceph-RBDStor'
1.51 MiB in 0s
836.81 MiB in 1s
1.69 GiB in 2s
2.57 GiB in 3s
3.49 GiB in 4s
4.33 GiB in 5s
5.13 GiB in 6s
5.93 GiB in 7s
6.88 GiB in 8s
7.83 GiB in 9s
8.76 GiB in 10s
9.67 GiB in 11s
10.59 GiB in 12s
11.50 GiB in 13s
12.37 GiB in 14s
13.23 GiB in 15s

==== Now it's been sitting here for 8m 59s (it just finished); here is the rest of the log:

completed saving the VM state in 18s, saved 13.97 GiB
snapshotting 'drive-scsi0' (Ceph-RBDStor:vm-101-disk-1)
Creating snap: 10% complete...
Creating snap: 100% complete...done.
snapshotting 'drive-efidisk0' (Ceph-RBDStor:vm-101-disk-0)
Creating snap: 10% complete...
Creating snap: 100% complete...done.
snapshotting 'drive-tpmstate0' (Ceph-RBDStor:vm-101-disk-2)
Creating snap: 10% complete...
Creating snap: 100% complete...done.
TASK OK



==== What causes it to take that long?

root@pve1-cpu1:~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-3-pve: 5.15.35-6
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-8
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.2-12
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
root@pve1-cpu1:~#
 
Hi,
how big are the disks of the VM? IIRC, taking a snapshot needs to flush the whole disk before it can finish. Maybe this can be optimized somehow, but that would need to be investigated.
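If you want to see how much data the disks actually contain (as opposed to their provisioned size), rbd du can show that; a minimal sketch, with <pool> standing in for whatever pool your Ceph-RBDStor storage points at:

rbd du <pool>/vm-101-disk-1
rbd du -p <pool>

The second form lists usage for every image in the pool.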
 
Is it really "13.23 GiB in 15s", then nothing for 9 minutes, and then "completed saving the VM state in 18s, saved 13.97 GiB"?
Because then it sounds more like a problem with the RAM dump, since it reports needing 18s while it actually takes over 9 minutes to finish.
 
Is it really "13.23 GiB in 15s", then nothing for 9 minutes, and then "completed saving the VM state in 18s, saved 13.97 GiB"?
Because then it sounds more like a problem with the RAM dump, since it reports needing 18s while it actually takes over 9 minutes to finish.
Took another quick look at the code. It's likely not the saving itself, but rather writing it to the state file and closing the state file.
The time is recorded here, so the culprit should be somewhere after that, but before the state is set to completed here. I haven't tried to debug it yet, though.
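One way to narrow it down might be to watch the state volume on the Ceph side while the snapshot task runs; a rough sketch, assuming the usual vm-<vmid>-state-<snapname> naming for RBD state volumes and with <pool> as a placeholder:

watch -n 5 'rbd du <pool>/vm-101-state-<snapname>'

If the reported usage stops growing long before the task finishes, the time really is being spent after the actual saving.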

@MoreDakka does it help if you configure a different VM state storage (in the GUI in the Options menu of your VM)?
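On the CLI this corresponds to the vmstatestorage option; a minimal example, with the storage name being just a placeholder:

qm set 101 --vmstatestorage <other-storage>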
 
@fiona - The VM's HDD is 250 GB, probably about 15% utilized.
As for the state storage, I'm not sure. It's currently broken and I'm not sure how to fix it; I have another post open for that one.
 
Hi,
how big are the disks of the VM? IIRC, taking a snapshot needs to flush the whole disk before it can finish. Maybe this can be optimized somehow, but that would need to be investigated.
Hi Fabian,
We're having an issue where a snapshot of a two-disk qcow2 VM (1 TB + 2 TB, 3 TB total) with 32 GB of RAM takes forever. The VM has about 10 GB of RAM in use, but the progress goes past the 10 GB mark. The host uses a shared GFS2 file system and has 128 GB of RAM (75% free).
So far, while writing this message, I have always killed it after 20 minutes; it moves from 1 to 11 GB in 2-3 minutes, but then stalls there and only increments very slowly.
A snapshot without the RAM state completes successfully for the 3 TB of storage in 1-2 minutes.

Current settings:

vm.dirty_background_ratio = 10
vm.min_free_kbytes = 524288
vm.swappiness = 1
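For reference, these are set persistently, roughly like this (the file name is just a convention):

# /etc/sysctl.d/99-tuning.conf
vm.dirty_background_ratio = 10
vm.min_free_kbytes = 524288
vm.swappiness = 1

and applied with sysctl --system.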

Thanks for the help.


Formatting '/mnt/storage/images/200/vm-200-state-snap.raw', fmt=raw size=69243764736 preallocation=off
saving VM state and RAM using storage 'storage
311.00 B in 0s
60.31 MiB in 1s
602.99 MiB in 2s
835.72 MiB in 3s
971.47 MiB in 4s
1.32 GiB in 5s
1.70 GiB in 6s
1.93 GiB in 7s
2.13 GiB in 8s
2.13 GiB in 9s
2.13 GiB in 10s
2.13 GiB in 11s
2.13 GiB in 12s
2.13 GiB in 13s
2.13 GiB in 14s
2.13 GiB in 15s
2.13 GiB in 16s
2.13 GiB in 17s
2.13 GiB in 18s
2.13 GiB in 19s
2.13 GiB in 20s
2.13 GiB in 21s
2.13 GiB in 22s
2.13 GiB in 23s
2.13 GiB in 24s
2.13 GiB in 25s
2.13 GiB in 26s
2.13 GiB in 27s
2.13 GiB in 28s
2.13 GiB in 29s
2.13 GiB in 30s
2.13 GiB in 31s
2.13 GiB in 32s
2.13 GiB in 33s
2.13 GiB in 34s
2.13 GiB in 35s
2.13 GiB in 36s
2.13 GiB in 37s
2.13 GiB in 38s
2.13 GiB in 39s
2.13 GiB in 40s
2.13 GiB in 41s
2.13 GiB in 42s
2.13 GiB in 43s
2.13 GiB in 44s
2.13 GiB in 45s
2.13 GiB in 46s
2.27 GiB in 47s
2.41 GiB in 48s
2.41 GiB in 49s
3.01 GiB in 50s
3.83 GiB in 51s
4.23 GiB in 52s
4.35 GiB in 53s
4.51 GiB in 54s
4.60 GiB in 55s
4.84 GiB in 56s
4.85 GiB in 57s
5.09 GiB in 58s
5.09 GiB in 59s
reducing reporting rate to every 10s
10.90 GiB in 1m 9s
10.91 GiB in 1m 19s
10.91 GiB in 1m 29s
10.91 GiB in 1m 39s
10.91 GiB in 1m 49s
10.91 GiB in 1m 59s
10.91 GiB in 2m 9s
10.91 GiB in 2m 19s
10.91 GiB in 2m 30s
10.91 GiB in 2m 40s
10.91 GiB in 2m 50s
10.91 GiB in 3m
10.92 GiB in 3m 10s
10.92 GiB in 3m 20s
10.92 GiB in 3m 30s
10.92 GiB in 3m 40s
10.93 GiB in 3m 50s
10.93 GiB in 4m
10.93 GiB in 4m 10s
10.93 GiB in 4m 20s
10.93 GiB in 4m 30s
10.94 GiB in 4m 40s
10.94 GiB in 4m 50s
10.94 GiB in 5m
10.94 GiB in 5m 11s
10.95 GiB in 5m 21s
10.95 GiB in 5m 31s
10.95 GiB in 5m 41s
10.95 GiB in 5m 51s
10.96 GiB in 6m 1s
10.96 GiB in 6m 11s
10.96 GiB in 6m 21s
10.96 GiB in 6m 31s
10.96 GiB in 6m 41s
10.97 GiB in 6m 51s
10.97 GiB in 7m 1s
10.97 GiB in 7m 11s
10.97 GiB in 7m 21s
10.98 GiB in 7m 31s
10.98 GiB in 7m 41s
10.98 GiB in 7m 52s
10.98 GiB in 8m 2s
10.98 GiB in 8m 12s
10.99 GiB in 8m 22s
10.99 GiB in 8m 32s
10.99 GiB in 8m 42s
10.99 GiB in 8m 52s
11.00 GiB in 9m 2s
11.00 GiB in 9m 12s
11.00 GiB in 9m 22s
11.00 GiB in 9m 32s
11.01 GiB in 9m 42s
11.01 GiB in 9m 52s
11.01 GiB in 10m 2s
11.01 GiB in 10m 12s
11.01 GiB in 10m 23s
11.02 GiB in 10m 33s
11.02 GiB in 10m 43s
11.02 GiB in 10m 53s
11.02 GiB in 11m 3s
11.03 GiB in 11m 13s
11.03 GiB in 11m 23s
11.03 GiB in 11m 33s
11.03 GiB in 11m 43s
11.03 GiB in 11m 53s
11.04 GiB in 12m 3s
11.04 GiB in 12m 13s
11.04 GiB in 12m 23s
11.04 GiB in 12m 33s
11.05 GiB in 12m 43s
11.05 GiB in 12m 53s
11.05 GiB in 13m 4s
11.05 GiB in 13m 14s
11.06 GiB in 13m 24s
11.06 GiB in 13m 34s
11.06 GiB in 13m 44s
11.06 GiB in 13m 54s
11.06 GiB in 14m 4s
11.07 GiB in 14m 14s
11.07 GiB in 14m 24s
11.07 GiB in 14m 34s
11.07 GiB in 14m 44s
11.08 GiB in 14m 54s
11.08 GiB in 15m 4s
11.08 GiB in 15m 14s
 
Are you taking PBS backups of the same VM? An issue where snapshots were hanging after taking a PBS backup (at least once) was fixed in pve-qemu-kvm >= 7.2.0-5. If you are already using a newer version, please share the output of pveversion -v.
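To check just that package, something like this works:

pveversion -v | grep pve-qemu-kvm
dpkg -l pve-qemu-kvm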
 
I had this problem. Snapshots, particularly for Windows VMs, are much faster if you switch the storage from qcow2 to LVM-thin. Restoring a snapshot, even after minor changes, used to take 10 minutes; now it's seconds.
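For reference, the disk can also be moved from the CLI; a minimal sketch, with the VM ID and target storage as placeholders:

qm move_disk <vmid> scsi0 <lvmthin-storage>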
 
I had this problem. Snapshots, particularly for Windows VMs, are much faster if you switch the storage from qcow2 to LVM-thin. Restoring a snapshot, even after minor changes, used to take 10 minutes; now it's seconds.
While that should make a difference for the disk part of the snapshot, I'd be surprised if it affected the RAM part much (which is what this thread is about). The RAM part is saved to the same storage the disk resides on by default, always as a separate raw file/volume, but other than that it should not depend on whether the disk is qcow2 or on LVM-thin.
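If you want to see where the state of an existing snapshot actually ended up, the snapshot section of the VM config records it; for example (VM ID as a placeholder):

grep vmstate /etc/pve/qemu-server/<vmid>.conf

which prints something like vmstate: <storage>:vm-<vmid>-state-<snapname>.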
 
