Live: Rescue a broken Proxmox VE installation

left4pve

Member
Jul 12, 2021
Here's what happened. Yesterday, all of a sudden, I couldn't access Proxmox via web or SSH, nor could I reach the Home Assistant instance running as a VM on PVE. It's actually the second time this week. Last time I force-restarted the machine and everything went back to "normal", or at least that's how it looked.

Then I connected a monitor to the PVE server: the screen was full of "Corrected error" messages about another disk, and the whole system was unresponsive. After a force reboot, the system halts after trying to mount `rpool` with this error message:
```
Pool rpool has encountered an uncorrectable I/O failure and has been suspended.
```

Interestingly, before the force restart the other VM (a NAS) was still working fine. I think that's because the NAS VM doesn't use any of PVE's disks; instead it has the PCIe-SATA adapter passed through to it, so it wasn't affected by the disk error.

I'm going to attach the rpool disk to another Ubuntu machine and see if I can rescue the data.

---

Update 1:
I connected the faulty disk to another Ubuntu machine and imported the pool read-only.
Code:
sudo zpool import rpool -R /rpool -o readonly=on

The root filesystem is now available under `/rpool`. While copying this entire folder somewhere else as a backup, it became clear that some of the files are corrupted (an alternative copy command is sketched after the output below).
Code:
root@name:/# cp -r rpool ~/rpool_full_copy
cp: error reading 'rpool/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso': Input/output error
cp: error reading 'rpool/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db': Input/output error
cp: error reading 'rpool/var/cache/apt/pkgcache.bin': Input/output error
cp: error reading 'rpool/var/log/pveproxy/access.log.1': Input/output error
cp: error reading 'rpool/var/log/journal/25cd6c0a5d3f45af997afdd5152f3be4/system@f500325b10ea4ad88aa87ebdbcaf69cd-000000000004c84b-0005e7eba978c6f5.journal': Input/output error
...
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/revision-navigation.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/docs/terminal-links.gif': Input/output error
cp: error reading 'rpool/root/.vscode-server/extensions/eamodio.gitlens-12.2.2/images/dark/icon-gitlab.svg': Input/output error
cp: error reading 'rpool/usr/share/doc/libfdisk1/changelog.Debian.gz': Input/output error
cp: error reading 'rpool/usr/share/locale/mk/LC_MESSAGES/iso_3166-1.mo': Input/output error
cp: error reading 'rpool/usr/share/locale/pl/LC_MESSAGES/coreutils.mo': Input/output error
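
For a copy that also preserves ownership, permissions, and hard links, and that leaves a log of exactly which files failed, rsync is an alternative to cp. A minimal sketch (the error-log path is just an example; rsync exits with code 23 when some files could not be read):
Code:
rsync -aH /rpool/ ~/rpool_full_copy/ 2> ~/rpool_copy_errors.log
# rsync keeps going past I/O errors; the unreadable files end up in the log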

And the status of the pool is DEGRADED
Code:
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       DEGRADED     0     0     0
      sdc3      DEGRADED     0     0   180  too many errors

errors: Permanent errors have been detected in the following files:

All of the corrupted files are logs. Is there a way to discard these files and make the pool at least usable?

---

Update 2:
After copying all the files (except the corrupted ones) to a safe place, I re-installed PVE on the same machine with the same NVMe drive. The files I copied over to the new PVE instance include:
1. The entire `/root` folder
2. The cluster database under `/var/lib/pve-cluster/` (see the sketch after this list)
3. The cron jobs
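
A minimal sketch of what restoring that database might look like on the fresh install, assuming the backed-up file ended up at `~/rpool_full_copy/var/lib/pve-cluster/config.db` on the new host (that path is just the copy destination used earlier; stop the cluster filesystem first so the database is not in use):
Code:
systemctl stop pve-cluster
cp ~/rpool_full_copy/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster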

With a little bit of additional work, I got all the VMs running again. Luckily, all my VMs can easily be restored from daily cloud backups.
 
All of the corrupted files are logs. Is there a way to discard these files and make the pool at least usable?
You need to overwrite or delete (all copies of) (the damaged parts of) those files and then run a successful scrub on the pool to make those errors clear up. As long as the damaged parts of the files exist, ZFS will keep complaining about them.
Sometimes importing a slightly earlier version of the pool might fix this also. Anyway, glad to hear you already fixed it.
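
For one of the log files listed above, that could look roughly like this, assuming the pool is imported writable (the path is taken from the error output; repeat for every affected file):
Code:
rm /rpool/var/log/pveproxy/access.log.1   # repeat for each file in the error list
zpool scrub rpool
zpool status -v rpool                     # the error list should clear after a clean scrub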
 
You need to overwrite or delete (all copies of) (the damaged parts of) those files and then run a successful scrub on the pool to make those errors clear up. As long as the damaged parts of the files exist, ZFS will keep complaining about them.
Thanks. I ended up reinstalling Proxmox VE and copying the database to the new installation. Do you know what could cause these corrected errors?
 
Do you know what could cause these corrected errors?
Maybe the metadata was also corrupted, but ZFS saves multiple copies of it and can correct it (as long as at least one copy has a valid checksum). See the `redundant_metadata` dataset property.
If you had used multiple (non-striped) drives, you would have multiple copies of everything and ZFS could even correct a whole failed drive. Except when all the writes failed, because of a power outage for example.
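
For reference, a mirrored pool like that could be created along these lines (the pool name and device names here are made up; this is for a fresh pool, not the existing degraded one):
Code:
zpool create tank mirror /dev/disk/by-id/nvme-DISK_A /dev/disk/by-id/nvme-DISK_B
zfs get redundant_metadata tank   # dataset property, defaults to "all"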
 
@leesteken Uh, the data corruption happened again today. In `zpool status -v` I saw a lot of `Permanent errors` entries, and the status of rpool became DEGRADED again. Do you think it's an indicator of a bad NVMe SSD?
Code:
config:

    NAME                               STATE     READ WRITE CKSUM
    rpool                              DEGRADED     0     0     0
      nvme-eui.0025385671b0c001-part3  DEGRADED     0     0 1.85K  too many errors
There's even one in a VM disk:
`rpool/data/vm-101-disk-0:<0x1>`
 
Checksum errors do indicate that you cannot trust the drive to return the same information that (you thought) you stored there. There are no read errors, so the drive itself does not report any problems reading the data. It's also possible that the data gets corrupted while being written to (or read from) the drive if you have bad memory (or a bad motherboard or CPU). I suggest you test both the memory and the drive. And do not power the system off unexpectedly without a proper shutdown; that can cause drive issues too.
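
Some ways such a test might look (this assumes `smartmontools` and `nvme-cli` are installed; the device name is just an example):
Code:
smartctl -a /dev/nvme0      # SMART health status and error log
nvme smart-log /dev/nvme0   # NVMe media and data integrity error counters
# For RAM: boot a memtest86+ image and let it run several full passes.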
 
After re-installing everything on a new SSD, the pool became DEGRADED again within 12 hours.
 
The built-in memory test passed without any problems, even after several runs. I then used the same SSD but in a different M.2 slot on the motherboard (previously it was M2P, now it's M2A). It's been three days now and I don't see any errors, even though I run a very frequent scrub cron job.

According to the manual, M2A is connected directly to the CPU, while M2P is connected to the southbridge and shares bandwidth with the PCIe bus. Coincidentally, I do have a PCIe-SATA adapter card, which is passed through to the NAS VM. I wonder if that's the reason. I'll keep an eye on the scrub results.
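
For reference, the "very frequent scrub cron job" mentioned above could be something along these lines (the file name and schedule are just an illustration; Debian/Proxmox already ship a monthly scrub via `/etc/cron.d/zfsutils-linux`):
Code:
# /etc/cron.d/zfs-scrub-rpool (hypothetical file): scrub rpool every 6 hours
0 */6 * * * root /usr/sbin/zpool scrub rpool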
 
Coincidentally, I do have a PCIe-SATA adapter card, which is passed through to the NAS VM. I wonder if that's the reason.
Did you break PCIe device isolation with pcie_acs_override=...? Otherwise they should not interfere (except that a device can freeze/reboot the motherboard as a whole).
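
Two quick ways to check that (standard Linux commands, nothing Proxmox-specific assumed):
Code:
cat /proc/cmdline                        # see whether pcie_acs_override= is set
find /sys/kernel/iommu_groups/ -type l   # list which devices share an IOMMU group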
 
