Well, looks like my system is well and truly messed up, so I'm really hoping someone here can offer some suggestions.
When my system was running OK, I had two ZFS drive pools set up. The first was called rpool and was created by Proxmox when I initially installed it. It consisted of a mirror of two 500 GB SSDs:
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h19m with 0 errors on Fri Apr 5 18:21:33 2019
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sdw2      ONLINE       0     0     0
            sdx2      ONLINE       0     0     0
Over time I started running out of room for containers and the like, so I added another pool and named it rpool2. It consisted of a mirror of two 2TB SSDs:
Code:
  pool: rpool2
 state: ONLINE
  scan: scrub repaired 0B in 0h40m with 0 errors on Fri Apr 5 18:41:31 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool2                      ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x500a0751e1d27c6e  ONLINE       0     0     0
            wwn-0x500a0751e1d241bb  ONLINE       0     0     0

errors: No known data errors
I migrated all the containers to rpool2 and everything was working just fine.
Yesterday, I decided to set up a new ZFS pool, but this time for data storage. I had a dataset with a lot of small files and had run out of inodes on the existing data drives, which had been set up with ext4. Unfortunately, I made the mistake of following the Ubuntu tutorial (https://tutorials.ubuntu.com/tutorial/setup-zfs-storage-pool), which suggested using the following command:
Code:
sudo zpool create zdata mirror /dev/sdy /dev/sdz
Here "zdata" is the name of the new pool, and sdy and sdz were two new 8TB hard drives. I'm sure most of you will already have seen the problem. I don't know why I didn't; maybe I was tired or in a hurry. The mistake was creating the new pool with the /dev/sdX names rather than /dev/disk/by-id/[id of drive], because Linux likes to shuffle the /dev/sdX assignments between boots, I guess just to keep life interesting.
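For anyone following along, the command I should have run would have looked more like this. (The disk IDs below are made up purely for illustration; the real ones are whatever shows up under /dev/disk/by-id/ for the two drives.)
Code:
sudo zpool create zdata mirror \
  /dev/disk/by-id/ata-EXAMPLE_SERIAL_1 \
  /dev/disk/by-id/ata-EXAMPLE_SERIAL_2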
Anyway, the new pool worked just fine. I stored a few test files on zdata and everything worked. That is, until I rebooted this morning and things went to heck in a handbasket. Proxmox was able to boot, but none of the containers had started, and there was an error message about a file not being found. So I looked in rpool2, where all my containers are stored. The individual directories for each container were still there (e.g. subvol-100-disk-0), but when I looked inside them, to my horror, some contained directories that had previously been written to zdata, in no particular order, and none of the usual container files. Others were just empty.
When I ran "zpool status", I saw that the two drives for the rpool pool were now identified as /dev/sdy and /dev/sdz (the names that had been given to the two new zdata drives). The zdata pool was nowhere to be found, though a /zdata directory still showed up when I ran "ls -al" at /.
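(For what it's worth, my understanding is that a plain import scan is the way to check whether a pool like zdata is still recognizable anywhere without actually importing it, something along these lines, though I haven't dared to run anything yet:)
Code:
# list pools that are visible but not currently imported
zpool import
# or scan the stable device names explicitly
zpool import -d /dev/disk/by-id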
In my panic, I decided to remove the two new HDs comprising the zdata pool and rebooted. The two SSDs were no longer assigned to /dev/sdy and /dev/sdz, but instead appeared as shown above (with the "wwn" identifiers). However, the directories for each container were still all messed up, as described above.
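From what I've read, the usual way to switch an existing pool over to the stable by-id names is to export it and re-import it from /dev/disk/by-id, roughly like this (shown for rpool2 only; obviously not something you can do to the boot pool while it's in use, and I haven't tried it yet):
Code:
zpool export rpool2
zpool import -d /dev/disk/by-id rpool2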
From the Proxmox GUI, I tried to restore a container (CT 100) from the most recent backup back to rpool2, but got this message:
Code:
gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
TASK ERROR: unable to restore CT 100 - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- tar xpf - -z --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' -C /var/lib/lxc/100/rootfs --skip-old-files --anchored --exclude './dev/*'' failed: exit code 2
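From what I understand, that gzip error usually means the archive itself is damaged, so it should be possible to check the backup file directly, independent of Proxmox, with something like the following (the path is just a placeholder for wherever the vzdump file actually lives); I haven't done this yet:
Code:
# quick integrity check of the compressed archive
gzip -t /path/to/vzdump-lxc-100-backup.tar.gz
# or walk the whole tar stream without extracting anything
zcat /path/to/vzdump-lxc-100-backup.tar.gz | tar -tf - > /dev/null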
I then tried again, but this time from the list of backups in the GUI, and selected a new container number, CT 119. It seemed to restore successfully. I was just testing, so after it had restored, I deleted the container from the GUI. But then, when I go to rpool/data, I see subvol-100-disk-0 there. This is perplexing, as I am positive I was restoring to rpool2, not rpool, and I had since deleted the container. On top of that, in rpool/data I also see this:
Code:
drwxr-xr-x 22 100000 100000 22 Dec 28 15:14 subvol-100-disk-1
drwxr-xr-x 23 100000 100000 23 Dec 28 15:14 subvol-109-disk-1
drwxr-xr-x 22 100000 100000 22 Mar 21 16:10 subvol-117-disk-0
drwxr-xr-x 14 100000 100000 14 Apr 13 13:22 subvol-119-disk-0
I had not yet attempted to restore CT 109 or 117. Both of them had previously been on rpool2, not rpool. I have no idea how or why they are in rpool/data.
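What I'd like to figure out, before I touch anything else, is whether those subvol-* entries are actual ZFS datasets or just stray directories sitting on the parent filesystem, and where everything is really mounted. I'm assuming something like this would show that:
Code:
# list all datasets on both pools with their mountpoints
zfs list -r -o name,used,mountpoint rpool rpool2
# check whether the subvol datasets are actually mounted
zfs get -r mounted,mountpoint rpool/data rpool2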
I then tried restoring a few other containers to rpool2. But now, I get the same "failed: exit code 2" error as noted above with each of them.
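(Just to rule out a GUI-side mix-up about the target storage, I may also try a restore from the command line with the storage spelled out explicitly, something like the following, where the container number, archive path, and storage name are only placeholders:)
Code:
pct restore 123 /path/to/vzdump-lxc-100-backup.tar.gz --storage rpool2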
Then I tried restoring a different container (CT 104) from backup to rpool, assigning it a new number (CT 122). That worked, and the container runs. However, when I try to destroy CT 122, I get the following error message:
Code:
umount: /rpool/data/subvol-122-disk-0: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)
TASK ERROR: zfs error: cannot unmount '/rpool/data/subvol-122-disk-0': umount failed
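As the message itself suggests, I plan to check what is still holding that mountpoint open before trying the destroy again, along these lines (assuming the container really is stopped first):
Code:
pct status 122                              # confirm the container is stopped
fuser -vm /rpool/data/subvol-122-disk-0     # show processes using the mount
lsof +D /rpool/data/subvol-122-disk-0       # same idea, via lsof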
In short, it seems that in addition to messing things up on rpool2, I have somehow also messed up rpool.
Would anyone have any suggestions on how I might clean up the existing ZFS pools so that they behave as they did before? I'm at a loss as to where to even begin. Any thoughts or suggestions would be very, very much appreciated.