VMs Hard Drive Failing

Sherman Ravelo

New Member
Jan 3, 2019
Hello, I am writing to ask for support with some events that have been occurring with Proxmox and several Linux VMs.

First of all, our company has 4 Proxmox servers running on IBM 3550 M3 hardware, connected to a LenovoEMC PX12-450R NAS with multiple disk arrays. The PVE versions differ from one another. I'm going to describe the two servers that have been failing:

Proxmox1: PVEVersion pve-manager/5.0-30/5ab26bc (running kernel: 4.10.17-2-pve)
Proxmox2: PVEVersion pve-manager/5.1-41/0b958203 (running kernel: 4.13.13-2-pve)

Two Linux VMs, one on each Proxmox server, have crashed at random with hard drive I/O and superblock errors on their partitions.

I don't know exactly where to dig into these random crashes, but in one case, when I restarted the VM, I completely lost the VM's hard drive.

There was no way to recover it. Any suggestions? Where exactly should I look?

If you need more information about this case, let me know.
 
First, I recommend upgrading all your nodes to the newest PVE version.

Since you have disk errors and you use external storage (NAS LenovoEMC PX12-450R), have you checked that system for faults?
 
Hello, thanks for the answer. I changed the array where the first VM got stuck, meaning I replaced its disks with new ones. Then yesterday another VM got stuck, and this one is stored on a different array... I double-checked my NAS and it is working perfectly.

The last VM was moved to local storage on Proxmox2.
 
Hi,

One of the best tools for disk-related problems is Clonezilla (advanced mode, using dd, where there is a checkbox for rescue). In this mode Clonezilla will try to read each disk block, and if a read fails it will replace the data with zeros and move on.
In some situations, with some luck, you can restore the cloned disk image. And maybe you can then restore some broken files from a previous backup.
If your data is very valuable, you can also try dd-rescue, which has an option for how many times to retry reading a bad block. In one such situation I was able to recover all the bad data blocks using 100 as the number of retries...
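The block-by-block rescue idea described above can be sketched in Python. This is only an illustration of the general approach (read each block, retry a few times, write zeros for blocks that stay unreadable), not how Clonezilla or dd-rescue are actually implemented. The names `rescue_copy` and `flaky_read_at`, the 4096-byte block size, and the simulated failure are all assumptions made up for the example.

```python
import io


def rescue_copy(read_at, total_size, out, block_size=4096, retries=3):
    """Copy total_size bytes block by block, zero-filling unreadable blocks.

    read_at(offset, size) is a hypothetical callable that returns bytes or
    raises OSError. Returns the offsets of the blocks that were zero-filled.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        size = min(block_size, total_size - offset)
        data = None
        for _ in range(retries):
            try:
                data = read_at(offset, size)
                break  # read succeeded, stop retrying
            except OSError:
                continue  # try this block again
        # A block that never read cleanly (or came back short) is zero-filled
        # so that every later block keeps its original offset in the image.
        if data is None or len(data) != size:
            data = bytes(size)
            bad_offsets.append(offset)
        out.write(data)
        offset += size
    return bad_offsets


# Simulated 4-block "disk" whose second block always fails to read.
image = bytes(range(256)) * 64  # 16384 bytes


def flaky_read_at(offset, size):
    if offset == 4096:
        raise OSError(5, "Input/output error")  # simulated bad block
    return image[offset:offset + size]


out = io.BytesIO()
bad = rescue_copy(flaky_read_at, len(image), out)
print("zero-filled block offsets:", bad)
```

On a real disk you would use the existing tools rather than a script like this. The sketch only shows why the rescued image keeps its layout while the unreadable regions become zeros.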

Good luck! You will need a lot ;)
 
Thanks for your advice, it will help me a lot.
 
- How many VMs do you run?
- Are any other VMs broken?
- Do the broken VMs all run on the same node or on different ones?
- Did you check for updates for your NAS?
- How do you use the NAS (NFS, iSCSI, SMB)?
- What are the HW specs of the nodes?
- What about metrics from your nodes, your network, and the NAS?
- Did you check all log files (in the VM, on the nodes, and on your NAS)?
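For the log check, a small filter can make it quicker to go through large log files. The following is a minimal sketch in Python. The pattern list is an assumption chosen for illustration, not a complete list of what a failing disk or NFS mount can report, and the sample lines are invented placeholders, not output from any real system. `find_suspect_lines` is a hypothetical helper name.

```python
import re

# Illustrative patterns only (an assumption); extend them for your own setup.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"EXT4-fs error",
    r"superblock",
    r"nfs: server .* not responding",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return the lines that match any of the suspect patterns."""
    return [line.rstrip("\n") for line in lines if SUSPECT_RE.search(line)]


# Invented sample lines, used only to demonstrate the filter.
sample_log = [
    "example: kernel: EXT4-fs error (device sda1): placeholder message",
    "example: systemd: Started Session 1 of user root.",
    "example: kernel: nfs: server nas.example not responding, still trying",
    "example: cron: job finished",
]

hits = find_suspect_lines(sample_log)
for line in hits:
    print(line)
```

To use it on a real file, pass an open file object instead of the sample list, for example `find_suspect_lines(open(path))` with `path` pointing at a log you exported from the VM, the node, or the NAS.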
 
Hello,

- How many VMs do you run?
Proxmox1: 12
Proxmox2: 7

- Are any other VMs broken?
Proxmox1: The one that could not be recovered.
Proxmox2: I took a backup of one VM when it registered the I/O error (the same error), then rebooted it and it started fine.

- Do the broken VMs all run on the same node or on different ones?
Different Proxmox hosts (physical machines), one node each.

- Did you check for updates for your NAS?
The NAS is up to date.

- How do you use the NAS (NFS, iSCSI, SMB)?
NFS in use.

- What are the HW specs of the nodes?
Proxmox1: 16 CPUs, 2 sockets, 64 GB RAM
Proxmox2: 16 CPUs, 2 sockets, 64 GB RAM

- What about metrics from your nodes, your network, and the NAS?
Proxmox1: Directly connected to the NAS
Proxmox2: Switch > NAS

- Did you check all log files (in the VM, on the nodes, and on your NAS)?
I checked every available log on both Proxmox servers and the NAS, and not a single error or sign of bad behavior was found.

It was just as if someone had deleted the VM's hard drive via the terminal in Proxmox.