Need to recover a file system on a QEMU harddisk stored on a ZFS pool

rdarioc

Jul 29, 2022
So, this disk was used by a VM running Ubuntu 20.04.3 server. It was not the primary disk, but a secondary one.

It ran full and, ever since, the VM has shown as suspended on the Proxmox console, with a Status: io-error label when hovering over the VM icon.

Proxmox is 7.1-7.

From the console of the PVE node, I can confirm that the ZFS pool is healthy:

root@pve2:~# zpool status
  pool: zblock01
 state: ONLINE
  scan: scrub repaired 0B in 01:04:27 with 0 errors on Sun Jul 10 01:28:28 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zblock01    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0

errors: No known data errors

I am also able to list the device:

root@pve2:~# zfs list
NAME                     USED  AVAIL     REFER  MOUNTPOINT
zblock01                2.81T     0B      307K  /zblock01
zblock01/vm-104-disk-0  2.81T     0B     2.81T  -

But I am unable to mount it to recover the data.

I am able to stop the VM (qm stop <vm-id>), boot it from a mounted CD ISO, and then list the SCSI devices.
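Roughly, the steps are the following (104 is the VM from this thread; the ISO path is just an example):

Code:
qm stop 104                                    # stop the stuck VM
qm set 104 --ide2 local:iso/ubuntu-20.04.3-live-server-amd64.iso,media=cdrom   # attach a rescue ISO (example path)
qm set 104 --boot order=ide2                   # boot from the CD-ROM first
qm start 104                                   # start and work from the live environment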

Unfortunately I made the mistake of increasing the disk size from the Proxmox GUI, so now, when I list the corresponding SCSI device, I get:

root@ubuntu-server:/# fdisk /dev/sdb -l
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sdb: 3.31 TiB, 3633542332416 bytes, 7096762368 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 2ADFAF5E-2D1C-4C80-BAEC-38355D586CFA

Device     Start        End    Sectors Size Type
/dev/sdb1   2048 6291453951 6291451904   3T Linux filesystem


Now, when I try to mount it with just mount /dev/sdb1 /mnt, the system hangs.

Any help on how I can recover those 2.8TB of data?

Thanks
 
First: please use CODE tags to post output from the console; it makes it much more readable.

Could you please add the output of zpool list? It seems that your pool is completely full. Is that right?
 
Code:
root@pve2:~# zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zblock01  3.62T  3.51T   115G        -         -     4%    96%  1.00x    ONLINE  -

That is correct, it is almost full.
 
Your pool is way too full. A ZFS pool shouldn't be filled to more than 80%.
Fair observation.

Any way to move forward?

My idea is to move the QEMU disk to another location, but I don't know if that is even possible.
 
First, make sure you don't add more data. As soon as the pool gets completely full, you won't be able to delete anything: the pool becomes effectively read-only, and since ZFS is copy-on-write, deleting data itself requires writing new data.
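If a pool does hit 100%, the usual ways out are operations that free space without needing much new allocation, for example destroying an old snapshot or dropping a reservation. A rough sketch (the snapshot name is only a placeholder, and whether this helps depends on what is actually holding the space):

Code:
zfs list -r -t snapshot -o name,used zblock01          # are any snapshots holding space?
zfs destroy zblock01/vm-104-disk-0@some-old-snapshot   # placeholder snapshot name
zfs get refreservation zblock01/vm-104-disk-0          # is space reserved for the zvol?
zfs set refreservation=none zblock01/vm-104-disk-0     # frees reserved-but-unwritten space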
 
First, make sure you don't add more data. As soon as the pool gets completely full, you won't be able to delete anything: the pool becomes effectively read-only, and since ZFS is copy-on-write, deleting data itself requires writing new data.
Do you think that moving the QEMU hard disk out of the ZFS pool using the Move Disk function available in Proxmox might be a way out?
 
If the guest is able to start, you could do a vzdump backup to an external disk or NAS, destroy the guest, and restore it onto a new storage later.
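A minimal sketch of that, assuming the external disk or NAS is mounted at /mnt/backup and a new target storage has been created (both names are examples):

Code:
vzdump 104 --dumpdir /mnt/backup --mode stop --compress zstd
# later, after destroying the guest and adding the new storage:
qmrestore /mnt/backup/vzdump-qemu-104-<timestamp>.vma.zst 104 --storage <new-storage>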
 
The guest doesn't complete its boot.

Can't I do a dump/backup/clone/whatever without starting the VM?
 
The guest doesn't complete its boot.

Can't I do a dump/backup/clone/whatever without starting the VM?
The guest doesn't need to be able to boot; the VM just needs to be able to start. A backup needs QEMU running to back up the disks, but that happens before the guest even tries to boot.
 
Then it should work. The VM starts; I could even boot it with a live ISO from the CD. Tomorrow I should have 4TB of free storage on a NAS and I'll do the move. It might take some time to copy 3+ TB.
What if, instead of cloning, I just move the QEMU HDD?
 
That should work too. But a real backup might still be useful: that way you can always restore a working copy in case you screw something up while fixing the boot errors of the VM.
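For reference, the CLI equivalent of the GUI's Move Disk would be something like this (the disk key and target storage name are assumptions):

Code:
qm move_disk 104 scsi1 <target-storage> --format qcow2   # add --delete to remove the source copy afterwards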
 
It still isn't clear to me why the VM stopped working and doesn't boot anymore.
I mean, OK, the QEMU HDD is on a device that has become read-only. So? Does the QEMU HDD need to write to the underlying media even if there is only a mount attempt?
 
It still isn't clear to me why the VM stopped working and doesn't boot anymore.
I mean, OK, the QEMU HDD is on a device that has become read-only. So?
You got I/O errors, maybe already a few days earlier, and so the filesystem got corrupted without you noticing it. A full disk in a virtualized environment is VERY, VERY strange for the guest OS and yields very weird problems like yours. It's hard to estimate how much is corrupt.

Does the QEMU HDD need to write to the underlying media even if there is only a mount attempt?
This is guest-OS specific. Imagine you're the OS and you try to mount a filesystem with a journal on it. While mounting, you read the journal, compare it with the on-disk data, and see that the on-disk data is older. The filesystem then tries to write the journal back to the disk, and this fails due to an I/O error, and that already happens in the recovery phase ... that is not good.
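That is also why a plain mount from a rescue system can hang or fail: the mount itself wants to replay the journal and therefore write. A read-only mount that skips journal recovery avoids that write and may get further, e.g. for ext4 (assuming the partition is ext4, which I don't know here):

Code:
mount -o ro,noload /dev/sdb1 /mnt   # ro = read-only, noload = do not replay the ext4 journal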

The goal is to avoid this kind of corruption at all costs, really!
 
All's well that ends well.

After about 26 hours I managed to clone the VM onto another volume: 2 disks, a primary of 64GB and a secondary of 3.4TB.

Remember that, as soon as I got the I/O error, I had increased the size of the second disk to 3.4TB, without even looking at the size of the underlying ZFS volume.

Migrating just the second disk failed almost immediately, even though the target volume had 4TB available and was formatted with XFS. It might be worth the Proxmox team investigating why.

While the source disk on the ZFS pool had a size of about 2.81TB,

Code:
root@pve2:~# zfs list
NAME                     USED  AVAIL     REFER  MOUNTPOINT
zblock01                2.81T     0B      307K  /zblock01
zblock01/vm-104-disk-0  2.81T     0B     2.81T  -

the used space on the target volume was barely 2.2TB:

Code:
root@pve2:/# df -h /dev/sdj1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdj1       3.7T  2.2T  1.5T  61% /mnt/pve/fatass

It is interesting to note, though, that the size reported by ls for the disk image was 3.4TB, which matches the final size I had set up yesterday:

Code:
root@pve2:/mnt/pve/fatass/images/107# ls -lh
total 2.2T
-rw-r----- 1 root root  65G Jul 31 15:45 vm-107-disk-0.qcow2
-rw-r----- 1 root root 3.4T Jul 31 15:38 vm-107-disk-1.qcow2

The cloned machine booted up and was able to recover the file system of the full volume practically in its entirety.

Now, it all happened because, with a ZFS volume of 3.3TB, a current occupancy of 1.5TB, and a virtual disk of 3TB, I launched an rsync to move about 900GB onto this VM. My calculation was the following (rough arithmetic after the list):

1. the virtual disk is set at 3TB, which should leave about 10% free on the ZFS
2. the current occupation of this virtual disk is 1.5TB; plus 900GB, I should get up to 2.4TB, leaving about 20% free on the virtual disk
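In round numbers, with the figures above (nothing new assumed here):

Code:
# guest-side expectation:
#   1.5 TB already on the virtual disk
# + 0.9 TB to be moved with rsync
# = 2.4 TB on a 3 TB virtual disk   -> about 20% free inside the guest
#
# what actually happened: the ZFS volume hit ~100% after only ~380 GB of that rsync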

What I got was a full ZFS volume, when only about 380GB had been copied by this rsync.

In other words, most probably a good part of this copy was only temporarily cached on the ZFS side and not yet flushed to the virtual disk, so that, after about 380GB had been consolidated onto the virtual disk (bringing it up to 1.9TB), the space used by the virtual disk kept growing (with no further data copied into it) while the additional data was cached on the ZFS side, bringing the ZFS volume to near 100% occupancy. The rest is history.

Lessons learned: it is not just about keeping the ZFS below 80% of capacity, it is also about properly configuring its cache. Is there a GUI to configure the cache of ZFS volumes built with Proxmox, or shall I go with the CLI?

Any thoughts from anyone else?

Could another block storage, such as Ceph for example, have avoided this issue?
 
Is there a GUI to configure the cache of ZFS volumes built with Proxmox, or shall I go with the CLI?
What cache do you mean? Each zpool has a level 1 cache (ARC, adaptive replacement cache) in memory and, if you configured one, a level 2 ARC (L2ARC) on at least one additional disk. There is no cache for volumes or datasets.
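If you want to inspect or cap the ARC, that is done on the host with the CLI, roughly like this (the 8 GiB limit is only an example value):

Code:
arc_summary | head -n 30                                   # current ARC size, target and hit rates
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max   # runtime cap, 8 GiB (example)
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf   # persist the cap
update-initramfs -u                                        # apply the module option at boot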
 
To close the topic for posterity: the data was recovered in its entirety. Actually, if I had had a way to increase the size of the underlying ZFS pool, the VM would have continued where it stopped.

Lesson learned: the ZFS pool, with RAIDZ1, was built on top of 8 x 500GB SSDs. So, one disk's worth of space goes to parity, and about another 15% goes to formatting and overhead. After that I was not able to create a virtual disk on top bigger than 2TB, of which, again, another 10% went to the guest formatting, thus leaving around 1.8TB of usable space from the original 4TB raw. Clearly a miscalculation of the available space led to this inconvenience.
Indeed, a full file system backing virtual disks must be avoided at all costs.
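Rough arithmetic for that, loosely mixing decimal TB and TiB (the overhead figures are estimates, not exact ZFS accounting):

Code:
# 8 x 500 GB                 = 4.0 TB raw      (~3.6 TiB, the SIZE reported by zpool list)
# minus 1 disk of parity     = ~3.5 TB         (~3.2 TiB at the dataset level)
# minus metadata, slop space and raidz padding -> the ~2.8 TiB total that zfs list showed
# minus the ~20% that should stay free         -> roughly 2.2-2.3 TiB safely usable for virtual disks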

The only open topic: during the recovery phase, I was not able to move the single virtual disk (it failed almost immediately; maybe the move process doesn't handle the format conversion from raw to something else?), so I had to clone the entire machine, and, from the GUI at least, cloning doesn't allow a different target storage for each virtual disk.
 
Lesson learned: the ZFS pool, with RAIDZ1, was built on top of 8 x 500GB SSDs. So, one disk's worth of space goes to parity, and about another 15% goes to formatting and overhead.
RAIDZ* is complicated and the resulting capacity depends on a lot of things, so it's recommended NOT to use it at all if you want predictable free space.
 
