Hi Team,
We had an anomaly last night.
Backups were running on a particular thin-pool LUN. While this was in progress, a staff member took a snapshot of a VM as part of maintenance; the snapshot failed and the VM froze.
We rebooted the VM, but it would not boot and gave filesystem corruption errors. We attempted to delete the snapshot, with no luck.
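For reference, the delete attempt was roughly the following (invocation from memory, using the snapshot name shown in the lvs output below):

lvremove PVE03/snap_vm-137-disk-0_upgrade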
Another VM that was in the middle of a backup also froze and locked up; after a reboot it showed the same issue.
Both of these VMs are CentOS 7 with XFS as the filesystem.
We soon realised that the thin-pool had run out of space during this process.
Running lvs -a PVE03 shows:
WARNING: /dev/PVE03/vm-137-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-154-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-156-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-157-disk-0: Thin's thin-pool needs inspection.
WARNING: /dev/PVE03/vm-158-disk-0: Thin's thin-pool needs inspection.
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 0: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748299264: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 214748356608: Input/output error
/dev/coraid-ssd-2/vm-135-disk-0: read failed after 0 of 4096 at 4096: Input/output error
  LV                         VG    Attr       LSize   Pool  Origin        Data%  Meta%  Move Log Cpy%Sync Convert
  PVE03                      PVE03 twi-cotzM-   3.64t                     94.95  48.94
  [PVE03_tdata]              PVE03 Twi-ao----   3.64t
  [PVE03_tmeta]              PVE03 ewi-ao---- 120.00m
  [lvol0_pmspare]            PVE03 ewi------- 120.00m
  snap_vm-137-disk-0_upgrade PVE03 Vri---tz-k 700.00g PVE03 vm-137-disk-0
  vm-137-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               85.39
  vm-154-disk-0              PVE03 Vwi-aotz-- 700.00g PVE03               74.90
  vm-156-disk-0              PVE03 Vwi-aotz-- 850.00g PVE03               77.37
  vm-157-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               96.60
  vm-158-disk-0              PVE03 Vwi-aotz-- 950.00g PVE03               85.28
  vm-158-disk-1              PVE03 Vwi---tz--  50.00g PVE03
Notice there is still a 700 GB snapshot that we are unable to delete.
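From what we have read so far, the pool metadata may need a repair pass before the snapshot can be removed; the suggested steps look something like this, but we have not run them yet because we are unsure they are safe while VMs live on the pool:

lvchange -an PVE03/PVE03        # deactivate the pool (all thin LVs on it must be inactive first)
lvconvert --repair PVE03/PVE03  # rebuilds the thin metadata via thin_repair into the spare

Can anyone confirm whether that is the right approach in this state?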
A few things:
How/why did the VMs go into this state? We are still unable to recover or fix these filesystems and boot the VMs, so we are busy restoring from backup.
Even after space was cleared on the pool, the VMs still cannot boot.
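For context, the repair attempts on the guests were roughly the following, run from a rescue ISO (the device path is an example; the actual root device differs per VM):

xfs_repair -n /dev/mapper/centos-root   # dry run, report problems only
xfs_repair -L /dev/mapper/centos-root   # last resort: zero the XFS log before repairing

Neither brought the filesystems back to a bootable state.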
How come the backups and snapshots continue even when there is not enough space on the disk or in the pool? Surely the required space can be calculated or anticipated so that the operation fails cleanly without causing damage. Searching the net, this seems to be a common problem if the pool is not monitored closely.
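While searching we came across the thin pool autoextend settings in /etc/lvm/lvm.conf (ours appear to still be at the default threshold of 100, which disables autoextension). Would something along these lines have prevented the pool from filling, assuming free extents in the VG?

activation {
    thin_pool_autoextend_threshold = 80   # extend the pool once it reaches 80% full
    thin_pool_autoextend_percent = 20     # grow it by 20% of its size each time
}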
Please advise how we can delete the snapshot above.
Thanks
Zaid