Hello
Recently, the RAID card (PERC H730P) in one of our nodes got fried. After replacing the card and importing the RAID array, we noticed that the LVM thin metadata had become corrupted.
We've tried to repair it with thin_check/thin_repair, but they don't seem able to restore it; from what we can tell, the mappings got scrambled somehow. While we keep working on that, time passes, and we've already restored the affected VMs from backups. So now I'm mainly looking for suggestions on how to prevent this kind of issue from happening again in the future.
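The attempts so far have essentially been the standard repair path, roughly like this (only a sketch, with our pool name pve/data filled in; not the exact invocations):
Code:
# deactivate the pool, then let LVM run thin_check/thin_repair against a
# fresh metadata LV and swap it in if the repair succeeds
lvchange -an pve/data
lvconvert --repair pve/data
# the old, broken metadata is normally kept as pve/data_meta0 for inspection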
One idea would be to make regular backups of the LVM thin metadata (a rough sketch of what I have in mind is below). I'm just not sure how useful such a backup would be once new data has been written after it was taken.
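Something along these lines is what I'm picturing; treat it as an untested sketch that assumes the pool is pve/data and the dumps go to /root/thin-meta-backups:
Code:
#!/bin/bash
# Sketch of a periodic thin metadata backup (untested; names assumed).
set -euo pipefail

VG=pve
POOL=data
OUT=/root/thin-meta-backups
mkdir -p "$OUT"

# back up the regular VG metadata as well
vgcfgbackup -f "$OUT/${VG}_$(date +%F).vg" "$VG"

# reserve a metadata snapshot on the live pool, dump the mappings to XML,
# then release the snapshot again
dmsetup message "${VG}-${POOL}-tpool" 0 reserve_metadata_snap
thin_dump --metadata-snap "/dev/mapper/${VG}-${POOL}_tmeta" > "$OUT/${VG}-${POOL}_$(date +%F).xml"
dmsetup message "${VG}-${POOL}-tpool" 0 release_metadata_snap

Restoring would then be a matter of thin_restore from the XML onto a metadata device, as far as I understand, but of course the dump only reflects the mappings at the moment it was taken.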
Another idea would be to disable the write cache, if the issue can be traced back to it. The cache was set to write-back and the controller had a battery. Something along the lines of the commands below is what I'm considering.
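For the cache I'd do something like this (assuming percCLI is installed and the controller is /c0; again, just a sketch):
Code:
# show the current cache settings of all virtual disks
perccli64 /c0/vall show all | grep -i cache

# switch the virtual disks from write-back to write-through
perccli64 /c0/vall set wrcache=wt

# optionally also disable the drives' own caches
perccli64 /c0/vall set pdcache=off

Obviously that would cost write performance, so I'd only want to do it if the cache really turns out to be the culprit.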
Does anyone have any thoughts on how to protect the LVM thin metadata from getting corrupted in a scenario like this?
For reference, this is the error the pool reports now:
Code:
Check of pool pve/data failed (status:1). Manual repair required!