Guest stuck on snapshot delete while guest disks are on local storage

akimov
Aug 19, 2022
Hi. We have a Proxmox cluster with local storages (a simple dedicated disk, ext4, no LVM) on some nodes.
We also have several guest VMs on these storages with big (> 200 GB) disks.
In this case a guest gets stuck on snapshot delete: we can't connect to the guest via SSH/console, and if we interrupt the snapshot delete, the guest stays stuck until the VM is restarted.
(screenshot attached: 1660896744391.png)
 
Hello,

If the snapshot can't be removed, you can try to edit the VM config file and remove the snapshot section; you will find the VM config file at /etc/pve/qemu-server/<VMID>.conf.
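For example (only a sketch; replace <VMID> and the snapshot name with your own, and keep a backup of the config):

Bash:
# keep a backup of the VM config before editing it
cp /etc/pve/qemu-server/<VMID>.conf /root/<VMID>.conf.bak
# list the section headers; each snapshot has its own [snapshotname] section
grep -n '^\[' /etc/pve/qemu-server/<VMID>.conf
# then delete the whole [snapshotname] section (and the parent:/snapstate: lines
# that reference it, if present) with an editor
nano /etc/pve/qemu-server/<VMID>.conf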
 
Yes, we have already used this method in this case. But the guest VM is still stuck, and that is the main problem, not the snapshot delete.
 
Can you provide us with the output of the following commands?

Bash:
cat /etc/pve/storage.cfg
qm config <VMID>
pveversion -v
 
Done
sudo cat /etc/pve/storage.cfg
dir: local-ssd
path /mnt/pve/local-ssd
content iso,backup,rootdir,snippets,vztmpl,images
is_mountpoint 1
nodes pve-88240,pve-88239,pve-88247
shared 0

sudo qm config 274
agent: 1
boot: order=scsi0;net0;ide1
cores: 2
ide1: local-ssd:274/vm-274-cloudinit.raw,media=cdrom,size=4M
ipconfig0: ip=192.168.13.243/23,gw=192.168.12.1
memory: 12288
meta: creation-qemu=6.2.0,ctime=1652695615
name: ats-zabbix-db3-13-243-local
net0: virtio=32:1D:23:55:76:55,bridge=vmbr0,rate=50
numa: 0
ostype: l26
parent: autodaily220818001941
scsi0: local-ssd:274/vm-274-disk-0.qcow2,cache=directsync,size=10G
scsi1: local-ssd:274/vm-274-disk-1.qcow2,cache=directsync,size=250G
scsihw: virtio-scsi-pci
smbios1: uuid=af5a8aac-552f-45bf-a367-1e18f767435e
sockets: 2
vmgenid: d4b73ae0-4a72-4d90-88ab-6fcfeeb0885d

sudo pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hello,

Thank you for the output!

Did you see anything in the syslog/journalctl during the snapshot delete operation?

Do other VMs with disks smaller than 200 GB have the same issue?
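To pull the relevant messages out of the journal, something along these lines can be run on the node (the time window and filter terms are only examples):

Bash:
# messages from around the snapshot delete, filtered for QMP/snapshot-related lines
journalctl --since "1 hour ago" | grep -Ei 'qmp|snapshot|qemu'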
 
A lot of messages in the syslog like this:
Aug 18 00:20:05 pve-88247 pvestatd[3420]: VM 274 qmp command failed - VM 274 qmp command 'query-proxmox-support' failed - got timeout
Aug 18 00:20:15 pve-88247 pvestatd[3420]: VM 274 qmp command failed - VM 274 qmp command 'query-proxmox-support' failed - unable to connect to VM 274 qmp socket - timeout after 31 retries
Some messages in the syslog:
Aug 18 00:32:20 pve-88247 pvedaemon[1495874]: VM 274 qmp command failed - VM 274 qmp command 'blockdev-snapshot-delete-internal-sync' failed - got timeout
Aug 18 00:32:20 pve-88247 pvedaemon[1495874]: VM 274 qmp command 'blockdev-snapshot-delete-internal-sync' failed - got timeout

Yes, other VMs have the same issue.
 
Thank you for the logs.

Can the snapshot be rolled back?

Can you provide us with the syslog from when you create a new snapshot and remove it?

You can do this by opening a new terminal window and running journalctl -f > /tmp/snap.log, then creating and removing a snapshot from the GUI. Go back to the terminal, press Ctrl+C, and attach the snap.log here, please.
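As a sketch (testsnap is only an example name; the snapshot can also be created and removed from the CLI instead of the GUI):

Bash:
# terminal 1: follow the journal into a file
journalctl -f > /tmp/snap.log
# terminal 2 (CLI alternative to the GUI):
qm snapshot <VMID> testsnap
qm delsnapshot <VMID> testsnap
# back in terminal 1: stop with Ctrl+C and attach /tmp/snap.log here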
 
No, such snapshots can't be rolled back.

Logs will follow later.
 
Hello! Here are the logs.
 

Hello,

Thank you for the Syslog!

I saw in the syslog that some nodes lost the cluster connection via Corosync:

Code:
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] link: host: 9 link: 1 is down
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] link: host: 7 link: 1 is down
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] link: host: 6 link: 1 is down
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] link: host: 1 link: 1 is down
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] link: host: 3 link: 1 is down
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 22 14:34:05 pve-88247 corosync[3386]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)

Corosync is very latency-sensitive and will consider a link down once the latency gets too high. It is possible that the backup running at the same time, or something similar, caused this. We recommend separating services onto different physical networks to reduce such effects [0].
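The link state can be checked on each node with the standard tools, for example:

Bash:
# show whether each knet link to the other nodes is connected
corosync-cfgtool -s
# overall cluster membership and quorum state
pvecm status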

Does deleting a snapshot on another node cause the same issue, or does it only happen on the pve-88247 node?
 
Thank you. We'll check our network for these latencies.

Deleting a snapshot on another node causes the same issue.
Upd.: the guest VM is not totally stuck. I assumed it was because I was unable to connect to the guest console via the web interface or SPICE. But the InnoDB cluster (VM 274 is a read-only member of it) still has established sockets with the 'stuck' guest, see the screenshot.
I can also capture traffic on these sockets, and it is two-way: the 'stuck' guest (VM 274) sends and receives TCP ACKs and data packets (see dump.zip).
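For reference, such a capture can be taken on the Proxmox host with something like this (the bridge and guest IP are from the VM config above; the MySQL port is only an example):

Bash:
# capture traffic between the 'stuck' guest and the rest of the InnoDB cluster
tcpdump -ni vmbr0 -w /tmp/vm274-dump.pcap host 192.168.13.243 and port 3306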
 

Attachments

  • 1661206965066.png (32 KB)
  • dump.zip (891 KB)
Hi,

I would check whether there is high I/O during the snapshot delete, or test a snapshot on another storage (a test storage) to narrow down the issue.
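For example, to watch the disk load while the snapshot delete is running (assuming the sysstat and iotop packages are installed):

Bash:
# extended per-device I/O statistics, refreshed every 2 seconds
iostat -xm 2
# per-process I/O, only showing processes that are actually doing I/O
iotop -o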
 
What do you mean by "another storage"? An RBD device, a block device, iSCSI storage?
 
Storage local-ssd is on ... an SSD

sdd 8:48 0 1.7T 0 disk
└─sdd1 8:49 0 1.7T 0 part /mnt/pve/local-ssd

SSD model:

smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.35-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Nytro SATA SSD
Device Model: XA1920LE10063

But there's still no answer to my question: "Why does the guest VM get stuck on snapshot delete and how do we fix it?"
 
Hi,
The answer is that the operation took longer than the timeout Proxmox VE gives for it. What likely happens is that the operation finishes in the background afterwards (so the snapshot will be gone from the disk), but since the timeout happened, Proxmox VE can't know that and does not remove the snapshot from the configuration. After all, it might have been an actual issue with the disk operation not finishing.

To remove the lock and get rid of the (supposedly partially deleted) snapshot, you should be able to run qm unlock <ID> and qm delsnapshot <ID> <snapshot name> --force. With --force, the fact that a snapshot doesn't exist anymore on the disk will be ignored and removing the snapshot from the configuration will proceed.
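As a sketch, with placeholders (the qcow2 path is how a dir storage like local-ssd lays out its images; checking it is best done while the VM is not running):

Bash:
qm unlock <VMID>
# optionally verify whether the internal snapshot is still present in the image
qemu-img snapshot -l /mnt/pve/local-ssd/images/<VMID>/vm-<VMID>-disk-1.qcow2
# remove the snapshot from the configuration even if it is already gone on disk
qm delsnapshot <VMID> <snapname> --force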
 
If I use --force to remove it, I get downtime in this case, and that is very bad for our InnoDB cluster.
Also, during the snapshot delete I get a lot of messages like this in the guest OS:
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379454] watchdog: BUG: soft lockup - CPU#1 stuck for 466s! [kworker/1:0:13518]
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379657] Modules linked in: tls isofs input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler ramoops reed_solomon msr pstore_blk pstore_zone mtd efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea hid_generic sysfillrect sysimgblt fb_sys_fops virtio_net usbhid cec rc_core net_failover psmouse hid drm i2c_piix4 virtio_scsi failover pata_acpi floppy
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379720] CPU: 1 PID: 13518 Comm: kworker/1:0 Tainted: G L 5.15.0-43-generic #46-Ubuntu
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379727] Workqueue: rcu_gp wait_rcu_exp_gp
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379749] RIP: 0010:smp_call_function_single+0x94/0x120
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379755] Code: cd e9 48 a9 00 01 ff 00 0f 85 93 00 00 00 85 c9 75 42 48 c7 c6 c0 1a 03 00 65 48 03 35 25 67 e9 48 8b 46 08 a8 01 74 09 f3 90 <8b> 46 08 a8 01 75 f7 83 4e 08 01 4c 89 46 10 48 89 56 18 e8 54 fe
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379757] RSP: 0018:ffffb941c0617d80 EFLAGS: 00000202
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379759] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000000
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379761] RDX: 0000000000000000 RSI: ffff8dd6b3cb1ac0 RDI: 0000000000000003
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379762] RBP: ffffb941c0617dd0 R08: ffffffffb7153670 R09: 0000000000000282
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379763] R10: 0000000000000007 R11: 0000000000000000 R12: ffffffffb8f7f8f8
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379764] R13: ffff8dd6b3db1900 R14: 0000000000000003 R15: 0000000000000008
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379766] FS: 0000000000000000(0000) GS:ffff8dd6b3c80000(0000) knlGS:0000000000000000
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379769] CR2: 00007f34ce7bf240 CR3: 0000000116410000 CR4: 00000000000006e0
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379773] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379774] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379775] Call Trace:
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379777] <TASK>
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379783] ? _raw_spin_unlock_irqrestore+0xe/0x30
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379791] sync_rcu_exp_select_node_cpus+0x23a/0x330
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379794] sync_rcu_exp_select_cpus+0x1e9/0x480
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379797] wait_rcu_exp_gp+0x14/0x30
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379800] process_one_work+0x22b/0x3d0
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379805] worker_thread+0x53/0x410
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379823] ? process_one_work+0x3d0/0x3d0
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379826] kthread+0x12a/0x150
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379830] ? set_kthread_struct+0x50/0x50
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379833] ret_from_fork+0x22/0x30
Sep 8 00:29:10 ats-zabbix-db3-13-243 kernel: [106758.379839] </TASK>

Sep 8 00:32:56 ats-zabbix-db3-13-243 systemd-networkd[15660]: eth0: Link UP
Sep 8 00:32:56 ats-zabbix-db3-13-243 systemd-networkd[15660]: eth0: Gained carrier

VMs with disks on RBD storage (Ceph) are not affected by this issue (I think because Ceph has a built-in snapshot feature).
 