Snapshot and backups deplete Thin-Pool LUN space

zhoid · Jun 12, 2020

Hi Team,

We had an anomaly last night.

Backups were running on a particular thin-pool LUN. During this process a staff member took a snapshot of a VM as he was doing maintenance; the snapshot failed and the VM froze.
We rebooted the VM, but it was unable to boot and reported FS corruption errors. We attempted to delete the snapshot, with no luck.

Another VM that was busy backing up also froze and locked up. We rebooted that VM and hit the same issue.

Both of these VMs are CentOS 7 with XFS as the filesystem.

We soon realised that the thin-pool had run out of space during this process.

Running lvs -a PVE03 gives:

WARNING: /dev/PVE03/vm-137-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-154-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-156-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-157-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-158-disk-0: Thin's thin-pool needs inspection.
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748299264: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748356608: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 4096: Input/output error
LV                         VG    Attr       LSize   Pool  Origin        Data% Meta% Move Log Cpy%Sync Convert
PVE03                      PVE03 twi-cotzM-   3.64t                     94.95 48.94
[PVE03_tdata]              PVE03 Twi-ao----   3.64t
[PVE03_tmeta]              PVE03 ewi-ao---- 120.00m
[lvol0_pmspare]            PVE03 ewi------- 120.00m
snap_vm-137-disk-0_upgrade PVE03 Vri---tz-k 700.00g PVE03 vm-137-disk-0
vm-137-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               85.39
vm-154-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               74.90
vm-156-disk-0              PVE03 Vwi-aotz-- 850.00g PVE03               77.37
vm-157-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               96.60
vm-158-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               85.28
vm-158-disk-1              PVE03 Vwi---tz--  50.00g PVE03
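
For reference, the sizes in the listing above suggest the pool is overcommitted: the thin volumes together promise more space than the pool physically has. Below is a minimal Python sketch of that arithmetic. The sizes are copied by hand from the output, the names to_gib, thin_volumes and overcommit are made up for illustration, and the snapshot is left out on the assumption that it still shares most of its blocks with its origin.

```python
# Sizes copied by hand from the lvs listing above.
# Lowercase LVM size suffixes are binary units (g = GiB, t = TiB).
UNIT_TO_GIB = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}


def to_gib(size):
    """Convert an lvs size string such as '700.00g' or '3.64t' to GiB."""
    return float(size[:-1]) * UNIT_TO_GIB[size[-1]]


pool_size = to_gib("3.64t")

# Virtual (provisioned) sizes of the thin volumes, snapshot excluded.
thin_volumes = {
    "vm-137-disk-0": "700.00g",
    "vm-154-disk-0": "700.00g",
    "vm-156-disk-0": "850.00g",
    "vm-157-disk-0": "950.00g",
    "vm-158-disk-0": "950.00g",
    "vm-158-disk-1": "50.00g",
}

provisioned = sum(to_gib(size) for size in thin_volumes.values())
overcommit = provisioned / pool_size

print(f"pool: {pool_size:.0f} GiB, provisioned: {provisioned:.0f} GiB, "
      f"overcommit: {overcommit:.2f}x")
# prints: pool: 3727 GiB, provisioned: 4200 GiB, overcommit: 1.13x
```

Any ratio above 1.0 means the guests can, in principle, write more data than the pool can hold, which is consistent with the pool filling up here.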

Notice there is still a 700g snapshot that we are unable to delete.

A few things:
How/why did the VMs get into this state? We are still unable to recover or fix these filesystems and boot the VMs; we are busy recovering from backup.
Even after space was cleared, the VMs still cannot boot.
How come backups and snapshots continue even if there is not enough space on the disk or pool? Surely this can be calculated or anticipated and made to fail without causing issues. Searching the net, this seems to be a common problem if the pool is not monitored closely.
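
On the monitoring point, one possible approach is to check the pool's Data% and Meta% regularly and raise an alarm well before either reaches 100. The Python sketch below only parses a comma-separated report passed in as text. The function name pools_over_threshold and the 80% limit are arbitrary choices for illustration, and the lvs invocation in the comment is an assumption about how such a report could be produced, not something taken from this setup.

```python
# Assumed source of the report text (not run here):
#   lvs --noheadings --separator , -o lv_name,data_percent,metadata_percent PVE03
# Thin volumes report no metadata percentage, so only pool lines carry both values.


def pools_over_threshold(report, limit=80.0):
    """Return (name, data%, meta%) for each line whose data or metadata usage exceeds limit."""
    flagged = []
    for line in report.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3 or not fields[1] or not fields[2]:
            continue  # skip blank lines and LVs without both percentages
        name, data, meta = fields[0], float(fields[1]), float(fields[2])
        if data > limit or meta > limit:
            flagged.append((name, data, meta))
    return flagged


# Sample input built from the figures in the listing above.
sample = """\
  PVE03,94.95,48.94
  vm-137-disk-0,85.39,
  vm-158-disk-1,,
"""

print(pools_over_threshold(sample))
# prints: [('PVE03', 94.95, 48.94)]
```

A check like this could be run from cron and wired to mail or any alerting system; it does not stop a pool from filling, it only gives earlier warning.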

Please advise how we can delete the snapshot above?

Thanks

Zaid

Dominic · Jun 18, 2020

Hi!

Which version are you using (pveversion -v)? "Real" failures like

Backups were running on a particular thin-pool LUN. During this process a staff member took a snapshot of a VM as he was doing maintenance; the snapshot failed and the VM froze.

should actually not be possible. VMs have a lock during a backup and the snapshot should "fail" with "VM is locked (backup)".

Notice there is still a 700g snapshot that we are unable to delete.

How did you try to remove it? PVE GUI? CLI? Do you have the exact output?

How come backups and snapshots continue even if there is not enough space on the disk or pool?

When I try it here, backups fail like this:

Code:

INFO: started backup task 'c0e01ff6-dcf6-4de1-9988-9cdebce6153c'
INFO: status: 22% (2873622528/12884901888), sparse 16% (2131132416), duration 3, read/write 957/247 MB/s
ERROR: vma_queue_write: write error - No space left on device
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 102 failed - vma_queue_write: write error - No space left on device
INFO: Failed at 2020-06-18 09:48:14
INFO: Backup job finished with errors
job errors

Surely this can be calculated or anticipated and fail without causing issues

Compression actually makes this difficult.

zhoid · Jun 18, 2020

Dominic said:

Hi!

Which version are you using (pveversion -v)?

Output of pveversion -v:

Code:

root@pve-206:~# pveversion -v
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LC_TERMINAL = "iTerm2",
	LC_TERMINAL_VERSION = "3.3.11beta1",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

should actually not be possible. VMs have a lock during a backup and the snapshot should "fail" with "VM is locked (backup)".

How did you try to remove it? PVE GUI? CLI? Do you have the exact output?

I tried from both the GUI and the CLI. From the CLI, this is the output:

Code:

Do you really want to remove and DISCARD logical volume PVE03/snap_vm-137-disk-0_upgrade? [y/n]: y
  device-mapper: message ioctl on (253:116) failed: Operation not supported
  Failed to process thin pool message "delete 9".
  Failed to suspend PVE03/PVE03 with queued messages.
  Failed to update pool PVE03/PVE03.

It seems that an I/O error on "/dev/sdt" is causing all changes to volume group PVE03 to fail.

