Snapshots and backups deplete thin-pool LUN space

zhoid

Member
Sep 4, 2019
Hi Team,

We had an anomaly last night.

Backups were running on a particular thin-pool LUN. During this process a staff member took a snapshot of a VM because he was doing maintenance; the snapshot failed and the VM froze.
We rebooted the VM, but it was unable to boot and reported FS corruption errors. We attempted to delete the snapshot, with no luck.

Another VM that was being backed up also froze and locked up. We rebooted that VM and hit the same issue.

Both of these VMs run CentOS 7 with XFS as the filesystem.

We soon realised that the thin-pool had run out of space during this process.

Output of lvs -a PVE03:

WARNING: /dev/PVE03/vm-137-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-154-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-156-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-157-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-158-disk-0: Thin's thin-pool needs inspection.
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748299264: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748356608: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 4096: Input/output error
LV                         VG    Attr       LSize   Pool  Origin        Data%  Meta%  Move Log Cpy%Sync Convert
PVE03                      PVE03 twi-cotzM- 3.64t                       94.95  48.94
[PVE03_tdata]              PVE03 Twi-ao---- 3.64t
[PVE03_tmeta]              PVE03 ewi-ao---- 120.00m
[lvol0_pmspare]            PVE03 ewi------- 120.00m
snap_vm-137-disk-0_upgrade PVE03 Vri---tz-k 700.00g PVE03 vm-137-disk-0
vm-137-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               85.39
vm-154-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               74.90
vm-156-disk-0              PVE03 Vwi-aotz-- 850.00g PVE03               77.37
vm-157-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               96.60
vm-158-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               85.28
vm-158-disk-1              PVE03 Vwi---tz-- 50.00g  PVE03
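
For what it's worth, the listing above already suggests the pool is over-provisioned. Here is a minimal sketch of the arithmetic, using only the sizes and percentages from the lvs output. It assumes lvs prints binary units (1t = 1024g), and the function names are made up for illustration:

```python
# Figures copied from the lvs listing above; sizes are in "g" as printed by lvs.
POOL_SIZE_G = 3.64 * 1024      # pool LSize 3.64t, assuming 1t = 1024g
POOL_DATA_PERCENT = 94.95      # pool Data%

# (name, virtual size in g) for every thin volume except the snapshot
THIN_VOLUMES = [
    ("vm-137-disk-0", 700.0),
    ("vm-154-disk-0", 700.0),
    ("vm-156-disk-0", 850.0),
    ("vm-157-disk-0", 950.0),
    ("vm-158-disk-0", 950.0),
    ("vm-158-disk-1", 50.0),
]


def provisioned_g(volumes):
    """Total virtual size promised to the guests."""
    return sum(size for _name, size in volumes)


def overcommit_ratio(volumes, pool_size_g):
    """Values above 1.0 mean the pool cannot back every volume in full."""
    return provisioned_g(volumes) / pool_size_g


def free_g(pool_size_g, data_percent):
    """Space still unallocated in the pool."""
    return pool_size_g * (1 - data_percent / 100)


print(f"provisioned: {provisioned_g(THIN_VOLUMES):.0f}g")
print(f"overcommit:  {overcommit_ratio(THIN_VOLUMES, POOL_SIZE_G):.2f}")
print(f"free:        {free_g(POOL_SIZE_G, POOL_DATA_PERCENT):.0f}g")
```

With these numbers the volumes add up to 4200g against a pool of roughly 3727g, and less than 200g is still free. A snapshot of a busy 700g volume can need more than that as the origin is rewritten.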

Notice there is still a 700g snapshot that we are unable to delete.

A few things:
How and why did the VMs get into this state? We are still unable to recover or repair these file systems and boot the VMs, so we are busy recovering from backup.
Even after space was cleared, the VMs still cannot boot.
Why do backups and snapshots continue even if there is not enough space on the disk or pool? Surely this can be calculated or anticipated so that the operation fails without causing issues. Searching the net, this seems to be a common problem if the pool is not monitored closely.
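
As an illustration of the kind of monitoring mentioned above, here is a rough sketch that flags thin pools whose data or metadata usage crosses a threshold. It assumes lvs is called with --noheadings, --separator and -o lv_name,lv_attr,data_percent,metadata_percent. The 80% limits and all function names are arbitrary choices for the example, not anything Proxmox ships:

```python
import subprocess

# Report fields requested from lvs, in this order.
LVS_FIELDS = "lv_name,lv_attr,data_percent,metadata_percent"


def read_lvs(vg, sep="|"):
    """Run lvs for one volume group and return its raw report text."""
    cmd = ["lvs", "--noheadings", "--separator", sep, "-o", LVS_FIELDS, vg]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_thin_pools(text, sep="|"):
    """Return (name, data_percent, metadata_percent) for each thin pool."""
    pools = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) != 4:
            continue
        name, attr, data, meta = fields
        # The first lv_attr character is 't' for a thin pool.
        if not attr.startswith("t") or not data or not meta:
            continue
        pools.append((name, float(data), float(meta)))
    return pools


def pools_over_threshold(pools, data_limit=80.0, meta_limit=80.0):
    """Keep pools whose data or metadata usage has reached a limit."""
    return [p for p in pools if p[1] >= data_limit or p[2] >= meta_limit]
```

Something like pools_over_threshold(parse_thin_pools(read_lvs("PVE03"))) could then be run from cron and wired to whatever alerting is already in place.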

Please advise how we can delete the snapshot above?

Thanks

Zaid
 
Hi!

What version are you using (pveversion -v)?

Backups were running on a particular thin-pool LUN. During this process a staff member took a snapshot of a VM because he was doing maintenance; the snapshot failed and the VM froze.
"Real" failures like this should actually not be possible. VMs are locked during a backup, so the snapshot should "fail" with "VM is locked (backup)".
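
To make the lock behaviour concrete, here is a small sketch of such a check. It assumes the VM config is a plain "key: value" text file in which a "lock" entry is present while a backup runs, and in which snapshot sections start with a "[name]" header. The helper names are hypothetical and this is not the actual Proxmox code:

```python
def current_lock(config_text):
    """Return the 'lock' value from the main config section, or None."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break  # snapshot sections follow; only the main section counts
        key, sep, value = line.partition(":")
        if sep and key.strip() == "lock":
            return value.strip()
    return None


def snapshot_allowed(config_text):
    """Return (allowed, message) for a snapshot request."""
    lock = current_lock(config_text)
    if lock is not None:
        return False, f"VM is locked ({lock})"
    return True, ""
```

With "lock: backup" in the config, snapshot_allowed returns the same "VM is locked (backup)" message quoted above, so the snapshot request is refused before it touches the storage.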

Notice there is still a 700g snapshot that we are unable to delete.
How did you try to remove it? PVE GUI? CLI? Do you have the exact output?

Why do backups and snapshots continue even if there is not enough space on the disk or pool?
When I try it here, backups fail like this:
Code:
INFO: started backup task 'c0e01ff6-dcf6-4de1-9988-9cdebce6153c'
INFO: status: 22% (2873622528/12884901888), sparse 16% (2131132416), duration 3, read/write 957/247 MB/s
ERROR: vma_queue_write: write error - No space left on device
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 102 failed - vma_queue_write: write error - No space left on device
INFO: Failed at 2020-06-18 09:48:14
INFO: Backup job finished with errors
job errors

Surely this can be calculated or anticipated and fail without causing issues
Compression actually makes this difficult.
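
A quick way to see why: two inputs of identical size can compress to very different sizes, so the final backup size is not known until the data has been read. The sketch below uses Python's zlib purely as a stand-in for whatever compressor the backup job is configured with:

```python
import os
import zlib

SIZE = 1024 * 1024  # 1 MiB of input in both cases

zeros = bytes(SIZE)            # like an empty, never-written disk region
random_data = os.urandom(SIZE) # like encrypted or already-compressed data


def compressed_size(data):
    """Size in bytes after zlib compression at the default level."""
    return len(zlib.compress(data))


print("zeros :", compressed_size(zeros), "bytes")
print("random:", compressed_size(random_data), "bytes")
```

The zero-filled buffer shrinks to a tiny fraction of its size, while the random one does not shrink at all, even though both start at 1 MiB.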
 
Hi!

What version are you using (pveversion -v)?

Output of pveversion -v:

Code:
root@pve-206:~# pveversion -v
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LC_TERMINAL = "iTerm2",
	LC_TERMINAL_VERSION = "3.3.11beta1",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

How did you try to remove it? PVE GUI? CLI? Do you have the exact output?

I tried from both the GUI and the CLI. This is the output from the CLI:
Code:
Do you really want to remove and DISCARD logical volume PVE03/snap_vm-137-disk-0_upgrade? [y/n]: y
  device-mapper: message ioctl on (253:116) failed: Operation not supported
  Failed to process thin pool message "delete 9".
  Failed to suspend PVE03/PVE03 with queued messages.
  Failed to update pool PVE03/PVE03.
It seems that an I/O error on "/dev/sdt" is failing all changes to volume group PVE03.
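
One possible clue is the pool's attribute string in the earlier lvs listing, twi-cotzM-. Going by the lvs(8) man page as I understand it, the fifth character reports state and the ninth reports volume health. Here is a small sketch that decodes those two positions for thin pools only (thin_pool_flags is a hypothetical helper, and the tables cover just the thin-pool values):

```python
# lv_attr character 5 (state), thin-pool specific values only
THIN_POOL_STATE = {
    "c": "thin-pool check needed",
    "C": "suspended thin-pool check needed",
}

# lv_attr character 9 (volume health), thin-pool specific values only
THIN_POOL_HEALTH = {
    "F": "failed",
    "D": "out of data space",
    "M": "metadata read only",
}


def thin_pool_flags(lv_attr):
    """Return the problem flags encoded in a thin pool's lv_attr string."""
    if len(lv_attr) < 9 or lv_attr[0] != "t":
        return []  # not a thin pool, or attribute string too short
    flags = []
    if lv_attr[4] in THIN_POOL_STATE:
        flags.append(THIN_POOL_STATE[lv_attr[4]])
    if lv_attr[8] in THIN_POOL_HEALTH:
        flags.append(THIN_POOL_HEALTH[lv_attr[8]])
    return flags


print(thin_pool_flags("twi-cotzM-"))
```

If that reading is right, the pool is flagged as needing a check and its metadata is read-only. That would be consistent with the "needs inspection" warnings and with the pool rejecting the snapshot delete.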

 
