All hard disks suddenly fail

ca77

New Member
Hi everyone,

I would like to ask for your advice on what to do. I am running Proxmox PVE 7.2-3 with some virtual machines (VMs).

Suddenly, all my hard disks have failed, except the two on which Proxmox is installed, /dev/nvme0n1 and /dev/sde. The SSD /dev/sdd and all the other HDDs (sda, sdb, sdc) fail (Fig 1). smartctl reports the error (Fig 2): A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
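For reference, the command I ran, and the retry that the error message suggests, would look roughly like this (just a sketch; /dev/sdd is one of the failed disks mentioned above):

Code:
smartctl -a /dev/sdd                 # fails with "A mandatory SMART command failed: exiting"
smartctl -a -T permissive /dev/sdd   # retry with the suggested '-T permissive' option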

All the VMs are installed on /dev/sdd and their backups are stored on /dev/sda. The disks sdb and sdc are attached to one VM as storage. sdd and sda are two months old; the other two are 10 months old. None of them were used extensively.

The VM disks are no longer visible on sdd (Fig 3). Their backups are gone from sda (Fig 4). However, the disk summaries still show the usual usage (Fig 5 and 6).

I can still access some of the VMs over SSH. However, when using commands like ls, lsblk, or du -sh, or when copying data from these VMs to my local machine with scp, I get Input/output error.

On the console of the VMs, there are error messages (Fig 7 and 8).

I have two questions to ask:

1. Could you please suggest a way to retrieve some data from the VMs?

I tried scp and rsync, but they fail due to I/O errors (see the rsync output below).

I could connect a USB stick to the VM. However, I cannot cd to the USB stick; even lsblk fails with an I/O error.

Code:
1,611,661,312  99%    1.49MB/s    0:00:00  rsync: [sender] read errors mapping "path/multisite20221127_19h56m.tar.gz": Input/output error (5)

  1,612,495,320 100%    1.64MB/s    0:15:38 (xfr#1, to-chk=1/3)

WARNING: multisite20221127_19h56m.tar.gz failed verification -- update discarded (will try again).

At least for now I can still SSH to the VMs. Fearing that all connections to the VMs will be lost, I have not yet restarted Proxmox or the VMs.

2. What happened to my installation?

I did two things that might be related to this issue.

Yesterday, I updated Proxmox (I first deactivated the enterprise.proxmox repository, then updated everything). I also ran apt update and apt upgrade on the command line. I did not see anything strange afterwards.
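For completeness, what I did was roughly the following (the repository file path is my assumption of where the enterprise repo is configured, so treat it as such):

Code:
# disable the enterprise repository by commenting out its "deb ..." line (file path assumed)
nano /etc/apt/sources.list.d/pve-enterprise.list
apt update
apt upgrade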

Today, I was copying some files from one VM to another (1.5 GB each, 5 GB in total). In the middle of the transfer, one of the VMs turned off. I restarted it and retried the copy; the other one turned off. I tried another time and both VMs turned off. I suspected that Proxmox had run out of RAM.

After that, when copying some data from one VM to a local machine, I realized that it failed, which had never happened before. I checked the files on the VMs and saw Input/output error everywhere.


Thank you for reading my questions. I'm grateful for any suggestions.

Have a good day. And let us back up our data more often.



Failed disks
Fig 1: all the disks. sdd, sda, sdb, and sdc fail (SMART unknown); sde and nvme0n1, where Proxmox is installed, are still functional (SMART passed).

smart_error.png
Fig 2: SMART error on a failed disk

vms_disk.png
Fig 3: empty virtual machine disk sdd


backups_disk.png
Fig 4: empty backup disk sda


sdd.png
Fig 5: usage on sdd

sda.png
Fig 6: usage on sda

console1.png
Fig 7: error messages on the VM console

console2.png
Fig 8: error messages on the VM console
 
Just an update on the problem. I had no choice but to restart Proxmox; even restarting the virtual machines had failed (VM quit/powerdown failed - got timeout). Luckily, after the restart, the VMs work again, as if nothing had happened.

However, after two days, the VMs suddenly broke again (Input/output error).

I wonder whether this is a hardware or a software problem. If it is software related, is it Proxmox or the VMs?
 
apt upgrade
Never do that or it might screw up the PVE installation. Only use apt dist-upgrade or apt full-upgrade. And you will have to do a reboot after updating the kernel, or the new kernel won't be used.
But I don't think these storage problems relate to that.
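A safe update sequence on the PVE host would look roughly like this:

Code:
apt update
apt dist-upgrade    # or: apt full-upgrade
reboot              # needed if a new kernel was installed, otherwise the old one keeps running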

And your four failed disks are all either SMR HDDs or QLC SSDs. Neither should be used for server workloads because of the terrible write performance. Once the write cache is full, performance drops, and especially with SMR HDDs the disks can become so slow that the OS thinks they are dead, because they can't answer in time (for example, a latency of a few minutes instead of a few milliseconds), and then you will see IO errors.
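You can usually see such timeouts and link resets in the kernel log of the PVE host. A rough way to look for them (exact messages vary by kernel and controller):

Code:
dmesg -T | grep -iE 'ata[0-9]|link reset|timeout|i/o error'
journalctl -k | grep -i blk_update_request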

Is sde connected to the same disk controller as the four failed disks?

sde is QLC too and you run ZFS on it. Performance and life expectancy will be bad. The NVMe could be QLC or TLC. It looks like you bought the worst disks possible, because you only looked at the price per TB.
 
Thank you @Dunuin for your reply.

About the disks, you are right. This machine is not for critical workloads, so I did not choose the best ones out there. However, now I learn that the disks might be the cause of the IO errors. I ran openmediavault on this machine as a NAS for a couple of months, but never saw this error before. Maybe it's time to upgrade now.

I am more intrigued to learn that apt upgrade should not be used. As a newbie, I thought this was the same as using the button Node > Update > Upgrade.

I did use this command anyway. So far, nothing seems broken (yet). But should I now run apt full-upgrade or apt dist-upgrade?
 
I am more intrigued to learn that apt upgrade should not be used. As a newbie, I thought this was the same as using the button Node > Update > Upgrade.
No, pressing the upgrade button in the webUI will run an apt-get dist-upgrade.
About the disks, you are right. This machine is not for critical workloads, so I did not choose the best ones out there. However, now I learn that the disks might be the cause of the IO errors. I ran openmediavault on this machine as a NAS for a couple of months, but never saw this error before. Maybe it's time to upgrade now.
That doesn't mean these problems are caused by the bad disks. But SMR HDDs and QLC SSDs are known for causing IO errors because of terrible latency when the caches get full. That's why I asked whether all failed disks are connected to the same controller or backplane, as a problem with the controller or backplane could also cause all connected disks to fail at the same time.
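To see which controller each disk hangs off, something like this should work on the PVE host:

Code:
lsblk -d -o NAME,HCTL,TRAN,MODEL,SERIAL   # HCTL shows the SCSI host (controller) each disk is attached to
ls -l /sys/block/sd*                      # the symlink targets show the PCI path of the controller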

And by the way, SMR HDDs and QLC SSDs are not cheap, they are expensive. The initial purchase cost might be 10-30% lower than a TLC SSD or CMR HDD, but you only get a very small fraction of the write performance and the disks will die multiple times faster, so you will have to replace them more frequently. So in the long run it would be way cheaper to get better disks, even if the initial cost is a bit higher. It is better to buy a TLC SSD for 100€ that might survive 5 years than a QLC one for 80€ that you need to replace every 2 years.
 
Lesson learnt. Thank you again @Dunuin. I will pay for TLC in the next upgrade.

I'm not sure about the controller or backplane (please excuse my ignorance; I started playing with computers a couple of months ago, and although I managed to build one, I severely lack knowledge of hardware, the disks I chose being an example). All the failed disks are connected to the SATA ports depicted below. The disks that did not fail are on the M.2 ports.

mother_board1.png
 

Lesson learnt. Thank you again @Dunuin. I will pay for TLC in the next upgrade.
And in the case of ZFS you shouldn't even get consumer TLC SSDs; here, enterprise-grade SSDs are highly recommended, which usually cost at least double the price of a consumer TLC SSD. Have a look at this table to see how different grades of SSDs compare in terms of price and rated write endurance: https://forum.proxmox.com/threads/c...-for-vms-nvme-sata-ssd-hdd.118254/post-512801

All the failed disks are connected to the SATA ports depicted below. The disks that did not fail are on the M.2 ports.
Did you upgrade the UEFI or change some settings there, so that the disk controller is operating in another mode? Or do you use PCI passthrough, and the disk controller was maybe in the same IOMMU group as a passed-through device, so that starting that VM would remove the disk controller from the PVE host? Did you check the SATA power cable, in case all 4 failed disks are connected to the same cable?
When all disks on the same controller fail at the same time, it is probably a problem with the disk controller (so your mainboard) or the power.
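If you do use PCI passthrough, you can check which IOMMU group the SATA controller is in with a snippet like this (a generic example, not specific to your board):

Code:
#!/bin/bash
# print every PCI device together with its IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU group %s: ' "$n"
    lspci -nns "${d##*/}"
done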
 
