Here's what happened. Yesterday, all of a sudden, I couldn't access Proxmox via web or SSH, nor could I reach the Home Assistant instance running as a VM on PVE. It's actually the second time this week. Last time I force-restarted the machine and everything went back to "normal", or at least that's how it looked.
Then I connected a monitor to the PVE server: the screen was full of "Corrected error" messages about another disk, and the whole system was unresponsive. After a force reboot, the system halted while trying to mount `rpool`, with this error message:
```
Pool rpool has encountered an uncorrectable I/O failure and has been suspended.
```
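For anyone hitting the same wall: with the pool suspended the box itself was unusable, but from a live USB (or once the disk is in another machine) it's worth confirming whether the drive itself is dying before blaming ZFS. A minimal sketch, assuming the suspect disk shows up as `/dev/sdc` (adjust to your device; `smartctl` comes from the smartmontools package):
```
# The kernel log often shows the raw ATA/NVMe errors behind ZFS's messages
journalctl -k | grep -iE 'ata[0-9]|i/o error|blk_update'

# SMART health summary and error counters for the suspect drive
sudo smartctl -a /dev/sdc
```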
Interestingly, before the force restart, the other VM (NAS) was still working fine. I think that's because the NAS VM doesn't use any of PVE's disks; it has the PCIe SATA adapter passed through to it, so it wasn't affected by the disk error.
I'm going to attach the rpool disk to another Ubuntu machine and see if I can rescue the data.
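One prerequisite: the Ubuntu machine needs the ZFS userland tools, which aren't always installed by default:
```
sudo apt install zfsutils-linux
```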
---
Update 1:
I connected the faulty disk to another Ubuntu machine and imported the pool read-only:
```
sudo zpool import rpool -R /rpool -o readonly=on
```
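One thing to note: since the pool was never cleanly exported from the dead PVE host, `zpool import` may complain that the pool "was previously in use from another system". In that case `-f` forces the import, which should be safe here only because the original host is down:
```
sudo zpool import -f rpool -R /rpool -o readonly=on
```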
The root filesystem is now available under `/rpool`. While copying the entire folder somewhere else as a backup, it turned out that some of the files are corrupted:
```
root@name:/# cp -r rpool ~/rpool_full_copy
cp: error reading 'rpool/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso': Input/output error
cp: error reading 'rpool/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db': Input/output error
cp: error reading 'rpool/var/cache/apt/pkgcache.bin': Input/output error
cp: error reading 'rpool/var/log/pveproxy/access.log.1': Input/output error
cp: error reading 'rpool/var/log/journal/25cd6c0a5d3f45af997afdd5152f3be4/system@f500325b10ea4ad88aa87ebdbcaf69cd-000000000004c84b-0005e7eba978c6f5.journal': Input/output error
...
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/revision-navigation.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/terminal-links.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/dark/icon-gitlab.svg': Input/output error
cp: error reading 'rpool/usr/share/doc/libfdisk1/changelog.Debian.gz': Input/output error
cp: error reading 'rpool/usr/share/locale/mk/LC_MESSAGES/iso_3166-1.mo': Input/output error
cp: error reading 'rpool/usr/share/locale/pl/LC_MESSAGES/coreutils.mo': Input/output error
```
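In hindsight, `rsync` might have been a better tool for this copy: like `cp` it keeps going past unreadable files, but it also preserves ownership and permissions and prints a summary of the failures at the end (exit code 23 signals a partial transfer). A sketch:
```
# -a = archive mode (recursive, preserves perms/owners/timestamps);
# -H/-A/-X additionally preserve hard links, ACLs and xattrs
rsync -aHAX /rpool/ ~/rpool_full_copy/
```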
And the status of the pool is DEGRADED:
```
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME    STATE     READ WRITE CKSUM
        rpool   DEGRADED     0     0     0
          sdc3  DEGRADED     0     0   180  too many errors

errors: Permanent errors have been detected in the following files:
```
All of the corrupted files are logs or otherwise replaceable (ISO images, caches, editor assets). Is there a way to discard these files and make the pool at least usable?
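What I'm considering, going by the ZFS-8000-8A page linked above (untested, and it would mean re-importing the pool read-write, which carries its own risk on a flaky disk): delete or restore every file listed under the permanent errors, then scrub and clear the pool:
```
# After removing/restoring the files that zpool status -v lists:
sudo zpool scrub rpool      # re-verify every block in the pool
sudo zpool clear rpool      # reset the READ/WRITE/CKSUM error counters
sudo zpool status -v rpool  # should now report no known data errors
```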
---
Update 2:
After copying all the files to a safe place (except the corrupted ones), I reinstalled PVE on the same machine with the same NVMe drive. The files I copied over to the new PVE instance include:
1. The entire `/root` folder
2. The cluster database under `/var/lib/pve-cluster/`
3. The cronjobs
(How I restored items 2 and 3 is sketched after this update.)
With a bit of additional work, I got all the VMs running again. Luckily, all my VMs can be easily restored from a daily cloud backup.
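For reference, roughly how I put the cluster config and cronjobs back on the fresh install (a sketch; `config.db` is the pmxcfs database, and the crontab path is where Debian keeps root's crontab):
```
# Stop the cluster filesystem so its database isn't in use
systemctl stop pve-cluster

# Restore the config database from the rescued copy
cp ~/rpool_full_copy/var/lib/pve-cluster/config.db /var/lib/pve-cluster/

# Start it again; /etc/pve should now show the old VM/storage configs
systemctl start pve-cluster

# Reinstall root's crontab from the rescued copy
crontab ~/rpool_full_copy/var/spool/cron/crontabs/root
```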