create ceph disk, error wiping

lknite

I had an ssd go out, so I ordered a new one.

I marked the old disk as out, shut down the PVE host, installed the new disk in the same slot as the old one, and booted back up. When I try to create a new Ceph disk, I get this error:
TASK ERROR: error wiping '/dev/sdc': 209715200 bytes (210 MB, 200 MiB) copied, 91.0836 s, 2.3 MB/s

After this error the disk is no longer available to be added in Ceph unless I shut down the PVE host and start it back up, which I tried. I also tried to 'destroy' the disk. After a reboot I tried to add it again and got the same error.

I'm hoping it's not just a bad disk. Any ideas on troubleshooting steps?

Code:
create OSD on /dev/sdc (bluestore)
wiping block device /dev/sdc
dd: fdatasync failed for '/dev/sdc': Input/output error
dd: fsync failed for '/dev/sdc': Input/output error
200+0 records in
200+0 records out
TASK ERROR: error wiping '/dev/sdc': 209715200 bytes (210 MB, 200 MiB) copied, 91.0836 s, 2.3 MB/s
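
As a quick sanity check on that last line, the figures can be recomputed from the byte count and elapsed time. This is only a sketch: the regular expression is an assumption about the dd summary format shown above, and `dd_throughput` is a made-up helper name. It confirms that all 200 records (200 MiB) were handed to the device and that the effective rate, including the failed sync, was about 2.3 MB/s, which is far below what a SATA SSD normally sustains.

```python
import re

# Summary text copied from the task log above.
SUMMARY = "209715200 bytes (210 MB, 200 MiB) copied, 91.0836 s, 2.3 MB/s"

def dd_throughput(summary: str) -> tuple[float, float]:
    """Return (MiB written, MB/s) parsed from a dd summary line."""
    m = re.search(r"(\d+) bytes .*copied, ([\d.]+) s", summary)
    if m is None:
        raise ValueError("not a dd summary line")
    nbytes, seconds = int(m.group(1)), float(m.group(2))
    # MiB uses 2**20 bytes; dd reports its rate in decimal MB (10**6 bytes).
    return nbytes / 2**20, nbytes / seconds / 1e6

mib, mb_per_s = dd_throughput(SUMMARY)
print(f"{mib:.0f} MiB written at {mb_per_s:.1f} MB/s")
```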
 
Marked the old disk as out. I shut down the pvc, added the disk in the same spot as the old one was, booted back up.
To be clear, did you destroy it before removing it? I would: out, down, destroy, physically replace, add new. I have moved several that way. A new disk shouldn't be destroyable, only an OSD...?
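
A rough sketch of that out, down, destroy, replace, add sequence as commands, in case it helps. The helper name `osd_replacement_commands` and the OSD id are hypothetical (the thread never states the id); the commands themselves are the usual `ceph`, `systemctl`, and `pveceph` calls, printed here rather than executed so nothing is touched by accident.

```python
def osd_replacement_commands(osd_id: int, device: str) -> list[str]:
    """Return the shell commands for the out/down/destroy/replace/add sequence."""
    return [
        f"ceph osd out {osd_id}",                     # out: let data migrate off the OSD
        f"systemctl stop ceph-osd@{osd_id}.service",  # down: stop the OSD daemon
        f"pveceph osd destroy {osd_id}",              # destroy: remove the OSD from the cluster
        "# physically replace the disk here",
        f"pveceph osd create {device}",               # add new: create an OSD on the new disk
    ]

# Hypothetical values: OSD id 2 is made up; /dev/sdc is the device from this thread.
for cmd in osd_replacement_commands(2, "/dev/sdc"):
    print(cmd)
```

Run each command by hand and check `ceph -s` between steps rather than scripting the whole thing.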
 
After a reboot, /dev/sdc passes smartctl:
Code:
root@pve-c:~# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM1T9HMJP-00005
Serial Number:    S2TVNX0JB02112
LU WWN Device Id: 5 002538 c408f4b0d
Firmware Version: GXT5204Q
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 14 16:46:26 2025 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 6000) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 100) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       34699
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       51
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       44
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       6575
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   055   054   000    Old_age   Always       -       45
194 Temperature_Celsius     0x0022   055   054   000    Old_age   Always       -       45 (Min/Max 21/46)
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       46
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       119915830204
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       24464603316
243 SATA_Downshift_Ct       0x0032   100   100   000    Old_age   Always       -       0
244 Thermal_Throttle_St     0x0032   100   100   000    Old_age   Always       -       0
245 Timed_Workld_Media_Wear 0x0032   100   100   000    Old_age   Always       -       65535
246 Timed_Workld_RdWr_Ratio 0x0032   100   100   000    Old_age   Always       -       65535
247 Timed_Workld_Timer      0x0032   100   100   000    Old_age   Always       -       65535
251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       191904914432

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     34699         -
# 2  Short offline       Completed without error       00%      8790         -
# 3  Short offline       Aborted by host               70%      8789         -
# 4  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was completed without error
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
To be clear did you destroy it before removing it? I would: out, down, destroy, physically replace, add new. Have moved several that way. A new disk shouldn't be destroyable, only an OSD...?
Steve ... I'd pretty much just be guessing. We should proceed as if we aren't sure.

I believe the sequence was:
It was down, and I marked it as out a couple of days ago. I did not destroy it. I swapped in the new disk today, and it was recognized as a new disk when I did a Ceph add. I got an error while it was being prepared, so I then did a destroy, which removed it from the Proxmox GUI. I rebooted and tried to add it as a Ceph disk again, and got the same error as in the opening post.
 
I started the long test; it says it needs 100 minutes to complete:

Code:
# smartctl --test=long /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 100 minutes for test to complete.
Test will complete after Wed May 14 18:33:10 2025 MDT
Use smartctl -X to abort test.
 
Well, the SSD isn't responding anymore. I guess that means it's a defective disk, yes?

Code:
# smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
 
New disk is behaving exactly the same:
```
TASK ERROR: error wiping '/dev/sdc': 209715200 bytes (210 MB, 200 MiB) copied, 91.0836 s, 2.3 MB/s
```

Then it doesn't show up as an available disk to add via Ceph until I shut down the Proxmox host and power it back on (a reboot doesn't have the same effect).
 
Two disks acting this way makes me think maybe the disk isn't the issue?

Code:
create OSD on /dev/sdc (bluestore)
wiping block device /dev/sdc
dd: fdatasync failed for '/dev/sdc': Input/output error
dd: fsync failed for '/dev/sdc': Input/output error
200+0 records in
200+0 records out
TASK ERROR: error wiping '/dev/sdc': 209715200 bytes (210 MB, 200 MiB) copied, 91.1325 s, 2.3 MB/s
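
With two disks failing the same way in the same slot, one thing worth checking is what the kernel logged around the failed wipe, since a cable, port, backplane, or controller problem would normally leave traces there. A minimal sketch for filtering kernel log lines follows. The substrings in `PATTERNS` are assumptions about typical block-layer and libata messages, and `suspicious_lines` is a made-up helper, so adjust both to whatever the host actually logs.

```python
import re

# Assumed substrings worth flagging; not an exhaustive or authoritative list.
PATTERNS = re.compile(r"I/O error|hard resetting link|failed command", re.IGNORECASE)

def suspicious_lines(log_lines):
    """Return the kernel log lines that match one of the error patterns."""
    return [line for line in log_lines if PATTERNS.search(line)]

# Example use (dmesg usually needs root):
#   import subprocess
#   out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print("\n".join(suspicious_lines(out.splitlines())))
```

If the flagged lines point at the link or port rather than the device itself, trying the disk in a different bay or on a different cable would be a cheap next test.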

thinking ...