Live: Rescue a broken Proxmox VE installation

left4pve

Member
Jul 12, 2021
Here's what happened. Yesterday, all of a sudden, I couldn't access Proxmox via web or SSH, nor could I reach the Home Assistant instance running as a VM on PVE. It's actually the second time this week. Last time I force-restarted the machine and everything went back to "normal", or at least that's how it looked.

Then I connected a monitor to the PVE server: the screen was full of "Corrected error" messages about another disk, and the whole system was unresponsive. After a force reboot, the system halts after trying to mount `rpool` with this error message:
```
Pool rpool has encountered an uncorrectable I/O failure and has been suspended.
```

Interestingly, before the force restart the other VM (a NAS) was still working fine. I think that's because the NAS VM doesn't use any of PVE's disks; instead it has the PCIe-SATA adapter passed through to it, so it wasn't affected by the disk error.

I'm going to attach the rpool disk to another Ubuntu machine and see if I can rescue the data.

---

Update 1:
I connected the faulty disk to another Ubuntu machine and imported the pool read-only.
Code:
sudo zpool import rpool -R /rpool -o readonly=on

The root filesystem is now available under `/rpool`. While copying this entire folder somewhere else as a backup, it became clear that some of the files are corrupted (an alternative copy command is sketched after the output below).
Code:
root@name:/# cp -r rpool ~/rpool_full_copy
cp: error reading 'rpool/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso': Input/output error
cp: error reading 'rpool/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db': Input/output error
cp: error reading 'rpool/var/cache/apt/pkgcache.bin': Input/output error
cp: error reading 'rpool/var/log/pveproxy/access.log.1': Input/output error
cp: error reading 'rpool/var/log/journal/25cd6c0a5d3f45af997afdd5152f3be4/system@f500325b10ea4ad88aa87ebdbcaf69cd-000000000004c84b-0005e7eba978c6f5.journal': Input/output error
...
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/revision-navigation.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/terminal-links.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/dark/icon-gitlab.svg': Input/output error
cp: error reading 'rpool/usr/share/doc/libfdisk1/changelog.Debian.gz': Input/output error
cp: error reading 'rpool/usr/share/locale/mk/LC_MESSAGES/iso_3166-1.mo': Input/output error
cp: error reading 'rpool/usr/share/locale/pl/LC_MESSAGES/coreutils.mo': Input/output error
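
For a copy that also preserves ownership, permissions, and hard links, and that leaves a log of exactly which files failed, rsync is an alternative to cp. A minimal sketch (the error-log path is just an example; rsync exits with code 23 when some files could not be read):
Code:
rsync -aH /rpool/ ~/rpool_full_copy/ 2> ~/rpool_copy_errors.log
# rsync keeps going past I/O errors; the unreadable files end up in the log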

And the status of the pool is DEGRADED
Code:
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       DEGRADED     0     0     0
      sdc3      DEGRADED     0     0   180  too many errors

errors: Permanent errors have been detected in the following files:

All of the corrupted files are logs. Is there a way to discard these files and make the pool at least usable?

---

Update 2:
After copying all the files (except the corrupted ones) to a safe place, I re-installed PVE on the same machine with the same NVMe drive. The files I copied over to the new PVE instance include:
1. The entire `/root` folder
2. The cluster database under `/var/lib/pve-cluster/` (see the sketch after this list)
3. The cron jobs
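
A minimal sketch of what restoring that database might look like on the fresh install, assuming the backed-up file ended up at `~/rpool_full_copy/var/lib/pve-cluster/config.db` on the new host (that path is just the copy destination used earlier; stop the cluster filesystem first so the database is not in use):
Code:
systemctl stop pve-cluster
cp ~/rpool_full_copy/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster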

With a little bit of additional work, I got all the VMs running again. Luckily, all my VMs can easily be restored from daily cloud backups.
 
All of the corrupted files are logs. Is there a way to discard these files and make the pool at least usable?
You need to overwrite or delete (all copies of) (the damaged parts of) those files and then run a successful scrub on the pool to make those errors clear up. As long as the damaged parts of the files exist, ZFS will keep complaining about them.
Sometimes importing a slightly earlier version of the pool might fix this also. Anyway, glad to hear you already fixed it.
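
For one of the log files listed above, that could look roughly like this, assuming the pool is imported writable (the path is taken from the error output; repeat for every affected file):
Code:
rm /rpool/var/log/pveproxy/access.log.1   # repeat for each file in the error list
zpool scrub rpool
zpool status -v rpool                     # the error list should clear after a clean scrub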
 
You need to overwrite or delete (all copies of) (the damaged parts of) those files and then run a successful scrub on the pool to make those errors clear up. As long as the damaged parts of the files exist, ZFS will keep complaining about them.
Thanks. I ended up reinstalling Proxmox VE and copying the database to the new installation. Do you know what could cause these corrected errors?
 
Do you know what could cause these corrected errors?
Maybe the metadata was also corrupted, but ZFS saves multiple copies of it and can correct it (as long as at least one copy has a valid checksum). See the `redundant_metadata` dataset property.
If you had used multiple (non-striped) drives, you would have multiple copies of everything and ZFS could even correct a whole failed drive. Except when all the writes failed, because of a power outage for example.
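
For reference, a mirrored pool like that could be created along these lines (the pool name and device names here are made up; this is for a fresh pool, not the existing degraded one):
Code:
zpool create tank mirror /dev/disk/by-id/nvme-DISK_A /dev/disk/by-id/nvme-DISK_B
zfs get redundant_metadata tank   # dataset property, defaults to "all"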
 
@leesteken Uh, the data corruption happened again today. In `zpool status -v` I saw a lot of `Permanent errors` entries, and the status of rpool became DEGRADED again. Do you think it's an indicator of a bad NVMe SSD?
Code:
config:

    NAME                               STATE     READ WRITE CKSUM
    rpool                              DEGRADED     0     0     0
      nvme-eui.0025385671b0c001-part3  DEGRADED     0     0 1.85K  too many errors
There's even one in a VM disk:
`rpool/data/vm-101-disk-0:<0x1>`
 
Checksum errors do indicate that you cannot trust the drive to return the same information that (you thought) you stored there. There are no read errors, so the drive itself does not report any problems reading the data. It's also possible that the data gets corrupted while being written to (or read from) the drive if you have bad memory (or a bad motherboard or CPU). I suggest you test both the memory and the drive. And do not power the system off unexpectedly without a proper shutdown; that can cause drive issues too.
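
Some ways such a test might look (this assumes `smartmontools` and `nvme-cli` are installed; the device name is just an example):
Code:
smartctl -a /dev/nvme0      # SMART health status and error log
nvme smart-log /dev/nvme0   # NVMe media and data integrity error counters
# For RAM: boot a memtest86+ image and let it run several full passes.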
 
After re-installing everything on a new SSD, the pool became DEGRADED again within 12 hours.
 
The built-in memory test passed without any problems, even after several runs. I then used the same SSD but in a different M.2 slot on the motherboard (previously it was M2P, now it's M2A). It's been three days now and I don't see any errors, even though I run a very frequent scrub cron job.

According to the manual, M2A is connected directly to the CPU, while M2P is connected to the southbridge and shares bandwidth with the PCIe bus. Coincidentally, I do have a PCIe-SATA adapter card, which is passed through to the NAS VM. I wonder if that's the reason. I'll keep an eye on the scrub results.
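
For reference, the "very frequent scrub cron job" mentioned above could be something along these lines (the file name and schedule are just an illustration; Debian/Proxmox already ship a monthly scrub via `/etc/cron.d/zfsutils-linux`):
Code:
# /etc/cron.d/zfs-scrub-rpool (hypothetical file): scrub rpool every 6 hours
0 */6 * * * root /usr/sbin/zpool scrub rpool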
 
Coincidentally, I do have a PCIe-SATA adapter card, which is passed through to the NAS VM. I wonder if that's the reason.
Did you break PCIe device isolation with pcie_acs_override=...? Otherwise they should not interfere (except that a device can freeze/reboot the motherboard as a whole).
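
Two quick ways to check that (standard Linux commands, nothing Proxmox-specific assumed):
Code:
cat /proc/cmdline                        # see whether pcie_acs_override= is set
find /sys/kernel/iommu_groups/ -type l   # list which devices share an IOMMU group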
 
