Messed up ZFS pool

Davidoff

Well, looks like my system is well and truly messed up, so I'm really hoping someone here can offer some suggestions.

When my system was running OK, I had two ZFS drive pools set up. The first was called rpool and was created by Proxmox when I initially installed it. It consisted of a mirror of two 500 GB SSDs:

Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h19m with 0 errors on Fri Apr  5 18:21:33 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdw2    ONLINE       0     0     0
            sdx2    ONLINE       0     0     0

Over time I started running out of room for containers and the like, so I added another pool and named it rpool2. It consisted of a mirror of two 2TB SSDs:

Code:
  pool: rpool2
 state: ONLINE
  scan: scrub repaired 0B in 0h40m with 0 errors on Fri Apr  5 18:41:31 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool2                      ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x500a0751e1d27c6e  ONLINE       0     0     0
            wwn-0x500a0751e1d241bb  ONLINE       0     0     0

errors: No known data errors

I migrated all the containers to rpool2 and everything was working just fine.

Yesterday, I decided to set up a new ZFS pool, but this time for data storage. I had a data set with a lot of small files, and I had run out of inodes on the existing data drives, which had been set up with ext4. Unfortunately, I made the mistake of following the Ubuntu tutorial (https://tutorials.ubuntu.com/tutorial/setup-zfs-storage-pool), which suggested using the following command:

Code:
sudo zpool create zdata mirror /dev/sdy /dev/sdz

Here, "zdata" is the name of the new pool, and sdy and sdz were two new 8TB hard drives. I'm sure most of you will already have seen the problem. I don't know why I didn't. Maybe I was tired or in a hurry or something. Anyway, the mistake I made here was setting up the new pool using "/dev/sdX" rather than "/dev/disk/by-id/[id of drive]", because Linux can change the /dev/sdX assignments between boots, I guess just to keep life interesting.
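For readers who want to avoid the same trap, here is a minimal sketch of how the persistent names can be looked up before creating a pool. The helper name by_id_for is hypothetical (it is not part of ZFS or Proxmox), and the disk ids in the comments are placeholders, not real values from this system.

```shell
# Sketch: print the /dev/disk/by-id link(s) that point at a kernel device.
# by_id_for is a hypothetical helper written for this example.
# Usage: by_id_for /dev/sdy [directory-of-links]
by_id_for() {
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (also covers an empty directory).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example use (output depends entirely on your hardware):
#   by_id_for /dev/sdy
#   by_id_for /dev/sdz
# Then build the mirror from those names instead of /dev/sdX:
#   zpool create zdata mirror /dev/disk/by-id/<id-of-first-disk> /dev/disk/by-id/<id-of-second-disk>
# A pool that was already created with /dev/sdX names can usually be
# switched over by exporting it and importing it from the by-id directory:
#   zpool export zdata
#   zpool import -d /dev/disk/by-id zdata
```

The export and import step only changes which device paths the pool records; it does not touch the data on the disks.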

Anyway, the new pool worked just fine. I started storing a few test files to zdata and everything worked. That is, until I rebooted this morning and things went to heck in a handbasket. Proxmox was able to boot, but none of the containers had started. There was an error message about a file not being found. So I looked in rpool2, where all my containers are stored. The individual directories for each container were still there (e.g. subvol-100-disk-0), but when I looked in each of them, to my horror, I found directories that had previously been written to zdata and none of the usual container files. There was no particular order to them. Other container directories were simply empty.

When I ran "zpool status", I noticed to my horror that the two drives for the rpool2 pool were identified as /dev/sdy and /dev/sdz (the device names that had been used for the new zdata pool). The zdata pool was nowhere to be found, though there was a /zdata directory listed when I ran "ls -al" at /.

In my panic, I decided to remove the two new HDs comprising the zdata pool and rebooted. The two SSDs were no longer assigned to /dev/sdy and /dev/sdz, but were instead listed as shown above (with the "wwn" identifiers). However, the directories for each container were still all messed up, as described above.

From the Proxmox GUI, I tried to restore a container (CT 100) from the most recent backup back to rpool2, but got this message:

Code:
gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
TASK ERROR: unable to restore CT 100 - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar xpf - -z --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' -C /var/lib/lxc/100/rootfs --skip-old-files --anchored --exclude './dev/*'' failed: exit code 2

I then tried again, but this time from the list of backups in the GUI, and selected a new container number, CT 119. It seemed to restore successfully. I was just testing, so after it had restored, I deleted the container from the GUI. But then, when I looked in rpool/data, I saw subvol-100-disk-0 there. This is perplexing, as I am positive I was restoring to rpool2, not rpool. I had also deleted the container. On top of that, in rpool/data, I also see this:

Code:
drwxr-xr-x 22 100000 100000 22 Dec 28 15:14 subvol-100-disk-1
drwxr-xr-x 23 100000 100000 23 Dec 28 15:14 subvol-109-disk-1
drwxr-xr-x 22 100000 100000 22 Mar 21 16:10 subvol-117-disk-0
drwxr-xr-x 14 100000 100000 14 Apr 13 13:22 subvol-119-disk-0

I had not yet attempted to restore CT 109 or 117. Both of them had previously been on rpool2, not rpool. I have no idea how or why they are in rpool/data.

I then tried restoring a few other containers to rpool2. But now, I get the same "failed: exit code 2" error as noted above with each of them.

Then I tried restoring a different container (CT 104) to rpool from backup, assigning a different number (CT 122), which worked. The container can run. However, when I try to destroy CT 122, I get the following error message:

Code:
umount: /rpool/data/subvol-122-disk-0: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)
TASK ERROR: zfs error: cannot unmount '/rpool/data/subvol-122-disk-0': umount failed
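As the umount message hints, the usual first step is to find out what is holding the mount. Below is a rough sketch, assuming a Linux system with /proc. The helper pids_with_mount is hypothetical and written only for this example. It covers the case where a mount is still visible in another process's mount namespace, which fuser and lsof run from the host may not show.

```shell
# Sketch: list PIDs whose mount table still contains a given mount point.
# pids_with_mount is a hypothetical helper, not a standard tool.
# Usage: pids_with_mount /rpool/data/subvol-122-disk-0 [proc-root]
pids_with_mount() {
    target=$1
    proc=${2:-/proc}
    for f in "$proc"/[0-9]*/mounts; do
        [ -r "$f" ] || continue
        # Field 2 of every line in a mounts file is the mount point.
        if awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$f"; then
            pid=${f%/mounts}
            printf '%s\n' "${pid##*/}"
        fi
    done
}

# Standard tools to try first (both are named in the umount message):
#   fuser -vm /rpool/data/subvol-122-disk-0
#   lsof /rpool/data/subvol-122-disk-0
# Then check other mount namespaces:
#   pids_with_mount /rpool/data/subvol-122-disk-0
```

On a host where the dataset is mounted, the host's own processes will be listed too, so the interesting entries are PIDs that belong to a container or a leftover task.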

In short, it seems that in addition to messing up things on rpool2, rpool is somehow also messed up.

Would anyone have any suggestions on how I might be able to clean up the existing ZFS pools so that they behave as they did previously? I'm at a loss at even where to begin. Any thoughts or suggestions would be very, very much appreciated.
 
Well, things seem to be going from bad to worse. I had tried creating a new container and restoring a container backup on rpool, which seemed to work. As they were for testing only, I destroyed them afterwards. But when I rebooted, Proxmox went into emergency mode. Looking through the output of journalctl -xb, I found the cause: the system was trying to mount a filesystem at /vdata, but reported that the mount point was not empty.

/vdata is a mount point I use to pool a bunch of data drives using mergerfs, a union filesystem. The relevant /etc/fstab entry is:

Code:
/mnt/data/* /vdata fuse.mergerfs defaults,allow_other,direct_io,use_ino,category.create=lfs,moveonenospc=true,minfreespace=20G,fsname=mergerfsPool 0 0

The system had been booting up OK a few times after the debacle above, but then this error started cropping up. Commenting out the line above allowed the system to boot. I ran df -h /vdata which reported the following:

Code:
root@fava2:~# df -h /vdata
Filesystem        Size  Used Avail Use% Mounted on
rpool/ROOT/pve-1  223G   93G  131G  42% /
root@fava2:~#

I have no idea why this is being mounted as /vdata or what program is doing this. When I look into what is stored in /vdata, all I see are references to directories that had been on the "real" /vdata mount point which I used to store templates and backups, but only directories and not any files. I suspect this is somehow related to the ZFS woes described above, but have no idea what's causing this.
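One thing worth knowing when reading that output: df on a path reports the filesystem that contains the path, so a result showing the root filesystem mounted on / is what you would see if nothing is mounted at /vdata and it is a plain directory. The sketch below is one way to check that. The helper name check_mount_dir is hypothetical, and it relies on the mountpoint utility from util-linux where available.

```shell
# Sketch: report whether a directory is an active mount point and, if not,
# whether it contains leftover entries that would block a mount.
# check_mount_dir is a hypothetical helper written for this example.
check_mount_dir() {
    dir=$1
    if mountpoint -q "$dir" 2>/dev/null; then
        echo "mounted"
    elif [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "not mounted, not empty"
    else
        echo "not mounted, empty"
    fi
}

# Example:
#   check_mount_dir /vdata
# To see which ZFS datasets claim which mount points:
#   zfs list -o name,mountpoint,mounted
```

A result of "not mounted, not empty" would be consistent with something having written directories into /vdata while the mergerfs pool was not mounted, but that is only one possible explanation.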

Might anyone out there be able to help me figure this out? Any thoughts or suggestions would be most gratefully appreciated. Help!
 
Hi,

Sorry, but this part is not so clear to me (my English is very limited):

In my panic, I decided to try to remove the two new HDs comprising the zdata pool and rebooted

.... so are you saying that you physically removed these 2 HDDs from the server, or that you destroyed the ZFS pool that used these 2 HDDs?

As far as I can guess, if you did not touch these 2 HDDs, they must still hold your data! So you can use a test system (a live CD with ZFS support) and try to import this pool and check your data. We can continue the discussion about the next steps after your feedback!
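A rough sketch of what such an import could look like from a live environment. The wrapper import_readonly, the DRY_RUN switch, and the /mnt/recovery path are assumptions made for this example. The zpool options used (-o readonly=on, -R for an alternate root, -d for the device directory) are standard ones.

```shell
# Sketch: import a pool read-only under an alternate root so nothing on it
# is modified while checking the data.
# import_readonly is a hypothetical wrapper; set DRY_RUN=1 to only print
# the command instead of running it.
import_readonly() {
    pool=$1
    altroot=${2:-/mnt/recovery}
    set -- zpool import -o readonly=on -R "$altroot" -d /dev/disk/by-id "$pool"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Typical sequence on the test system:
#   zpool import                  # list the pools that can be imported
#   import_readonly zdata         # import read-only under /mnt/recovery
#   zfs list -r zdata             # see which datasets exist
#   zpool export zdata            # release the pool when done
# If the pool was last used on another host, zpool may ask for -f.
```

Importing read-only keeps the check from making things worse, which matters when the state of the disks is uncertain.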


Good luck ^ 10 ;)
 
Hi,
.... so are you saying that you physically removed these 2 HDDs from the server, or that you destroyed the ZFS pool that used these 2 HDDs?

Thank you. No worries about your English - just happy to have some feedback. Yes, I removed them from the server. Just to clarify, the two HDDs I removed were new drives and did not have anything important on them.

When I rebooted, the system seemed to confuse the two SSDs comprising the rpool2 pool with the new drives, and as far as I can tell, overwrote or deleted all the files on the two SSDs (in addition to many, many other problems).
 
Hi,

I think you have made a lot of data modifications on your ZFS pools, and I can't imagine any solution for your case! The best option will be to try to restore your data from an external backup system!

Anyway, I like your humor ;)
 
Hey thanks Guletz. Yeah, I figured by the limited response that my system in its current state is well and truly borked. Oh well. Hopefully I don't mess up the recovery.
 


But you have learned a lesson: use by-id and snapshots. And believe me, I also learned the same lesson about by-id some time ago ;)
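For the snapshot half of that advice, a minimal sketch of taking a recursive snapshot before a risky change. The wrapper snapshot_before_change, the DRY_RUN switch, and the "pre-change" label are assumptions made for this example. Note that a snapshot lives on the same pool, so it is not a substitute for an external backup.

```shell
# Sketch: take a timestamped recursive snapshot of a pool before changing it.
# snapshot_before_change is a hypothetical wrapper; set DRY_RUN=1 to only
# print the command instead of running it.
snapshot_before_change() {
    pool=$1
    label=${2:-pre-change}
    snap="$pool@$label-$(date +%Y%m%d-%H%M%S)"
    if [ -n "${DRY_RUN:-}" ]; then
        printf 'zfs snapshot -r %s\n' "$snap"
    else
        zfs snapshot -r "$snap"
    fi
}

# Example:
#   snapshot_before_change rpool2
#   zfs list -t snapshot -r rpool2     # confirm the snapshots exist
```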
 
