Intel S3510 1.2TB | Lost all the partitions after reboot

FlorinMarian

Hello!
I have several Proxmox nodes and several dozen Intel S3510 SSDs of 1.2TB each.
The SSDs report no errors and the lifetime is between 80% and 89% for each of them.
For some unknown reason, the SSDs lose their partitions after every reboot.
Has this happened to anyone else?
I think it's impossible for more than 30 SSDs to have the same fault, so I'd rather go with the idea that it's an incompatibility with Proxmox.
The same problem exists on both a Dell node and an HP one, so we can rule out the HBA/RAID controller (in any case, I use RAID directly in Proxmox).
Thank you!
 
Hello, could you please share the output of lsblk? What kind of storage setup is each node using: which RAID level, filesystem, etc.?
 
Do you have the latest firmware installed on all disks? Do you see anything strange in journalctl before or after booting? You could also run testdisk to see if there is any issue with the disk or the partition table itself.
 
No idea about the firmware version.
I just tested one of the disks and the same thing also happens on Windows.
No idea what can be wrong with all of them.
 
You can use this tool to update the firmware and to change a lot of settings: https://www.solidigm.de/support-page/drivers-downloads/ka-00085.html

Is your PSU actually powerful enough to handle 30x S3510? This could be up to 139W on the 5V rail when powering it with 5V+12V and up to 284W on the 5V rail when not providing 12V in addition: https://www.intel.com/content/dam/w.../product-specifications/ssd-dc-s3510-spec.pdf
Hey!
The PSU has 1600W, no worries about it.
I found something about TPM with `fwupd` that may be related to my issue.
Gvx4aab.jpg


QLjOeab.jpg
 
But probably 1450W of that 1600W can only be provided on the 12V rails with something like 120-150W max on the 5V rail.
It's a 4 node server cluster, 2x 1600W for 4 nodes dual CPU.
This is not a power issue because power consumption is under 300W right now and just one node is up.
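The power figures quoted in this exchange can be cross-checked with a little arithmetic. The sketch below only reuses numbers from the thread (139W and 284W on the 5V rail for 30 drives, and the guessed 120-150W rail limit); the rail limit is a poster's estimate rather than a measured value, and the function names are made up for illustration.

```python
# Figures quoted in the thread. The rail capacity is an estimate from a
# forum post, not a value taken from the PSU datasheet.
DRIVES = 30
TOTAL_5V_WITH_12V_W = 139.0   # 5V draw for 30 drives when 12V is also supplied
TOTAL_5V_ONLY_W = 284.0       # 5V draw for 30 drives when only 5V is supplied
RAIL_5V_ESTIMATE_W = (120.0, 150.0)


def per_drive_watts(total_watts: float, drives: int) -> float:
    """Average 5V draw per drive implied by the quoted total."""
    return total_watts / drives


def fits_rail(load_watts: float, rail_watts: float) -> bool:
    """True if the load does not exceed the given rail capacity."""
    return load_watts <= rail_watts


if __name__ == "__main__":
    low, high = RAIL_5V_ESTIMATE_W
    for label, total in (("5V+12V", TOTAL_5V_WITH_12V_W), ("5V only", TOTAL_5V_ONLY_W)):
        print(
            f"{label}: {per_drive_watts(total, DRIVES):.2f} W per drive, "
            f"fits {low:.0f} W rail: {fits_rail(total, low)}, "
            f"fits {high:.0f} W rail: {fits_rail(total, high)}"
        )
```

With these numbers, 30 drives would sit inside the estimated range only in the 5V+12V case and only near its upper end, which is why the question was raised at all.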
 
I used this tool and a firmware upgrade was available for 5 out of 6 SSDs (from version XXXX40 to XXXX50).
I did the upgrade, I rebooted but the firmware version was still the old one.
After that I did a shutdown / bootup and that's how I lost the partitions again.
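When checking whether a firmware update really took effect on many drives, it may help to compare the "Firmware Revision" field of `hdparm -I` before and after the power cycle instead of reading screenshots. This is a minimal sketch that assumes the field labels look like the ones in the `hdparm -I` output posted in this thread; `parse_identity` and `firmware_changed` are hypothetical helper names.

```python
from typing import Dict

# Field labels as they appear in the `hdparm -I` output posted in this thread.
_FIELDS = ("Model Number", "Serial Number", "Firmware Revision")


def parse_identity(hdparm_output: str) -> Dict[str, str]:
    """Extract model, serial and firmware revision from `hdparm -I` text."""
    found: Dict[str, str] = {}
    for line in hdparm_output.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in _FIELDS:
            found[key] = value.strip()
    return found


def firmware_changed(before: str, after: str) -> bool:
    """True if the firmware revision differs between two captures.

    Returns False when the field is missing from both captures.
    """
    old = parse_identity(before).get("Firmware Revision")
    new = parse_identity(after).get("Firmware Revision")
    return old != new
```

The two text captures would come from saving the command output to files before and after the update; how they are collected is left open here.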
 
The first two screenshots are from before any restart, the third one is after a full shutdown/bootup cycle.
zbHLdab.jpg

RqGvgab.jpg

U7Ugcab.jpg


SSD parameters (maybe something relevant?):
```
root@node03:~# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: INTEL SSDSC2BB012T6
Serial Number: PHWA602400GZ1P2JGN
Firmware Revision: G2010150
Media Serial Num:
Media Manufacturer:
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Standards:
Used: unknown (minor revision code 0x0110)
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 2344225968
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 1144641 MBytes
device size with M = 1000*1000: 1200243 MBytes (1200 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 1 Current = 1
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* IDLE_IMMEDIATE with UNLOAD
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* unknown 119[6]
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* SANITIZE feature set
* CRYPTO_SCRAMBLE_EXT command
* BLOCK_ERASE_EXT command
* Device encrypts all user data
* Data Set Management TRIM supported (limit 4 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
4min for SECURITY ERASE UNIT. 4min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 55cd2e404c6d60c7
NAA : 5
IEEE OUI : 5cd2e4
Unique ID : 04c6d60c7
Checksum: correct
```
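As a sanity check on the output above, the two "device size" lines can be reproduced from the LBA48 sector count and the 512-byte logical sector size, which suggests the drive still reports its full capacity. The sketch uses only numbers from the output; the function name is made up, and integer division is an assumption that happens to reproduce both printed figures.

```python
# Values copied from the hdparm output above.
LBA48_SECTORS = 2344225968
LOGICAL_SECTOR_BYTES = 512


def size_in_units(sectors: int, sector_bytes: int, unit_bytes: int) -> int:
    """Capacity in whole units of `unit_bytes`, truncated."""
    return sectors * sector_bytes // unit_bytes


if __name__ == "__main__":
    print(size_in_units(LBA48_SECTORS, LOGICAL_SECTOR_BYTES, 1024 * 1024), "MBytes (M = 1024*1024)")
    print(size_in_units(LBA48_SECTORS, LOGICAL_SECTOR_BYTES, 1000 * 1000), "MBytes (M = 1000*1000)")
```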
 
Proxmox is a hypervisor product that uses many freely available technologies with smarts on top of them.
Proxmox does not re-invent the OS - it uses Debian. The kernel is based on Ubuntu (a Debian derivative). Your drives could be at the end of their life. If they were all used concurrently, it's very possible that all of them are at a similar point.

A year or so ago there was a rush of complete drive failures (HPE, Intel, others) due to firmware bugs that would render disks useless after a specific runtime. Whole corporations went down when all of their drives reached 40000 hours at the same time.

You tried Proxmox/Debian and Windows with the same result. The drives are almost 10 years old. I would try to pop out all but one and test it like that (following @Dunuin's train of thought), and then start planning for new hardware.

P.S. It's also possible that some or all of your disks had OEM firmware (i.e. HP) and that by upgrading with the wrong firmware you made it worse.


 
I vaguely remember the existence of some rootkit protection (in the motherboard BIOS) that would (silently) prevent writing to the first sector(s) of a drive. But then the backup GPT at the end of the disk should still be there. I'm intrigued by this bizarre issue but can offer no solutions.
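The point about the backup GPT can be checked directly: a GPT disk keeps its primary header at LBA 1 and a backup header at the last LBA, each starting with the signature `EFI PART`. The sketch below is one way to test for both on any seekable binary stream; the function names and the in-memory test image are illustrative assumptions, and 512 bytes is the logical sector size reported by `hdparm -I` earlier in the thread.

```python
import io
from typing import BinaryIO, Dict

GPT_SIGNATURE = b"EFI PART"  # first 8 bytes of every GPT header


def has_gpt_header(disk: BinaryIO, lba: int, sector_size: int = 512) -> bool:
    """True if the sector at `lba` starts with the GPT signature."""
    disk.seek(lba * sector_size)
    return disk.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE


def gpt_header_status(disk: BinaryIO, total_sectors: int, sector_size: int = 512) -> Dict[str, bool]:
    """Check the primary header (LBA 1) and the backup header (last LBA)."""
    return {
        "primary": has_gpt_header(disk, 1, sector_size),
        "backup": has_gpt_header(disk, total_sectors - 1, sector_size),
    }


def make_image(total_sectors: int, with_primary: bool, with_backup: bool,
               sector_size: int = 512) -> io.BytesIO:
    """Hypothetical helper: build a tiny zero-filled disk image for testing."""
    data = bytearray(total_sectors * sector_size)
    if with_primary:
        data[sector_size:sector_size + len(GPT_SIGNATURE)] = GPT_SIGNATURE
    if with_backup:
        start = (total_sectors - 1) * sector_size
        data[start:start + len(GPT_SIGNATURE)] = GPT_SIGNATURE
    return io.BytesIO(bytes(data))


if __name__ == "__main__":
    # Simulates a disk whose first sectors were wiped but whose backup survived.
    image = make_image(total_sectors=8, with_primary=False, with_backup=True)
    print(gpt_header_status(image, total_sectors=8))
```

Against a real drive, the stream would be the block device opened read-only in binary mode (for example a placeholder path such as `/dev/sdX`, as root) with the sector count taken from `hdparm -I`. If only the backup header is present, tools such as testdisk can usually rebuild the primary from it.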
 
This output scares me a lot..!
Screenshot 2023-06-22 095454.png
 
I tested all the SSDs: 9 out of 39 are just fine, but I have no idea what is wrong with the other 30, since they are in a good state according to smartctl.
 
@leesteken
I discovered the problem!
Comparing the 9 functional SSDs with the other 30 non-functional ones, I found with the `hdparm -I /dev/sdX` command that all the functional disks have the "not frozen" flag, while the other 30 have the "frozen" flag set.
Now the question is: can that frozen state be removed if I don't have any password?
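With 39 drives to compare, the frozen check described above could be scripted rather than read by eye. This is a minimal sketch that assumes the Security section of `hdparm -I` looks like the output posted earlier in the thread, with `frozen` or `not frozen` on its own line; `security_frozen` is a hypothetical helper name.

```python
from typing import Optional


def security_frozen(hdparm_output: str) -> Optional[bool]:
    """Return True if the drive reports 'frozen', False for 'not frozen'.

    Only lines after the 'Security:' heading are considered. Returns None
    when the flag cannot be found in the text.
    """
    in_security = False
    for line in hdparm_output.splitlines():
        words = line.split()
        if words == ["Security:"]:
            in_security = True
            continue
        if not in_security:
            continue
        if words == ["frozen"]:
            return True
        if words == ["not", "frozen"]:
            return False
    return None


if __name__ == "__main__":
    sample = "Security:\nsupported\nnot enabled\nnot locked\nnot frozen\n"
    print(security_frozen(sample))
```

Feeding it the saved output of each drive and grouping the serial numbers by result would reproduce the 9 versus 30 split without manual comparison.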
 
