Hi guys.
I know this may not be a PVE-specific question, but I'm taking a chance here just in case.
One of my PVE servers crashed with a RAID error. I had to reboot it to get the GUI back online, and I have started migrating my VMs to a secondary server. Once it is empty, I'll be able to experiment with it and try to rebuild the RAID, but I may need some advice on how to do that. I know I could reformat, reinstall PVE, and rejoin the cluster, but I want to learn how to repair this kind of problem in Linux.
Here is some information, interpreted with my limited knowledge of RAID; please correct me if I'm wrong.
This shows that I have two disks of equal size, nvme1n1 and nvme0n1, and two RAID arrays, md2 and md4:
Code:
root@proxmox13s:~# fdisk -l
Disk /dev/nvme1n1: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: WDC CL SN720 SDAQNTW-1T00-2000
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BD86A76A-6FF1-4636-A9AB-A43E47CDEB3B
Device Start End Sectors Size Type
/dev/nvme1n1p1 2048 1048575 1046528 511M EFI System
/dev/nvme1n1p2 1048576 42989567 41940992 20G Linux RAID
/dev/nvme1n1p3 42989568 45084671 2095104 1023M Linux swap
/dev/nvme1n1p4 45084672 2000394239 1955309568 932.4G Linux RAID
Disk /dev/nvme0n1: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: WDC CL SN720 SDAQNTW-1T00-2000
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 8E102BE3-D90D-47B4-A323-7B4772C33370
Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 1048575 1046528 511M EFI System
/dev/nvme0n1p2 1048576 42989567 41940992 20G Linux RAID
/dev/nvme0n1p3 42989568 45084671 2095104 1023M Linux RAID
/dev/nvme0n1p4 45084672 2000394239 1955309568 932.4G Linux RAID
/dev/nvme0n1p5 2000406528 2000408575 2048 1M Linux filesystem
Disk /dev/md2: 20 GiB, 21473722368 bytes, 41940864 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md4: 932.36 GiB, 1001118433280 bytes, 1955309440 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/mapper/pve-data: 928.36 GiB, 996818288640 bytes, 1946910720 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
This block shows that my RAID array md2 only contains partition nvme0n1p2 of disk nvme0n1, so it is missing partition nvme1n1p2 of disk nvme1n1:
Code:
root@proxmox13s:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 nvme1n1p4[1] nvme0n1p4[0]
977654720 blocks [2/2] [UU]
bitmap: 7/8 pages [28KB], 65536KB chunk
md2 : active raid1 nvme0n1p2[0]
20970432 blocks [2/1] [U_]
unused devices: <none>
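To make the `[U_]` reading above less manual, here is a minimal sketch that lists every array whose member-status string contains `_` (a missing member). The function name `degraded_arrays` is made up for illustration; it only parses text in the `/proc/mdstat` format shown above and changes nothing on the system.

```shell
#!/bin/sh
# List md arrays whose member-status string (e.g. [U_]) contains "_",
# which marks a missing member. Reads /proc/mdstat-formatted text on stdin.
# "degraded_arrays" is a hypothetical helper name, not an mdadm tool.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        /blocks/ && /\[[U_]+\]/ {            # the status line of that array
            if ($NF ~ /_/) print name        # last field looks like [U_]
        }
    '
}

# On the affected host this would be run as:
#   degraded_arrays < /proc/mdstat
```

With the output above as input, this should report md2 only, which matches the `[2/1] [U_]` line.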
Details about this MD2 RAID array:
Code:
root@proxmox13s:~# mdadm --detail /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Mon Oct 26 19:15:03 2020
Raid Level : raid1
Array Size : 20970432 (20.00 GiB 21.47 GB)
Used Dev Size : 20970432 (20.00 GiB 21.47 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Wed Aug 3 11:08:02 2022
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
UUID : 688c4347:5b16e05d:a4d2adc2:26fd5302
Events : 0.8160
Number Major Minor RaidDevice State
0 259 7 0 active sync /dev/nvme0n1p2
- 0 0 1 removed
Details about this NVME1N1 disk:
Code:
root@proxmox13s:~# smartctl /dev/nvme1n1 -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.39-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: WDC CL SN720 SDAQNTW-1T00-2000
Serial Number: 1851AF801711
Firmware Version: 10109122
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 8215
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b441dc53a
Local Time is: Wed Aug 3 11:10:21 2022 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.00W - - 0 0 0 0 0 0
1 + 3.50W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.1000W - - 3 3 3 3 4000 10000
4 - 0.0025W - - 4 4 4 4 4000 45000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 25%
Data Units Read: 1,270,546,616 [650 TB]
Data Units Written: 290,354,003 [148 TB]
Host Read Commands: 5,258,258,506
Host Write Commands: 5,830,127,572
Controller Busy Time: 30,587
Power Cycles: 27
Power On Hours: 24,639
Unsafe Shutdowns: 23
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
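For reading the SMART output above, a small sketch that pulls out only the wear and error counters from `smartctl -a` text. The helper name `nvme_health` is hypothetical, and the field names are simply the ones printed in the output above; other smartctl versions or drives may label them differently.

```shell
#!/bin/sh
# Extract a few health fields from `smartctl -a` output read on stdin.
# "nvme_health" is a hypothetical helper name; the field labels are taken
# from the NVMe output shown above.
nvme_health() {
    awk -F': *' '
        /^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors|Error Information Log Entries):/ {
            print $1 "=" $2
        }'
}

# On the affected host this would be run as:
#   smartctl -a /dev/nvme1n1 | nvme_health
```

Running the same filter against /dev/nvme0n1 as well would give a side-by-side view of both members.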
Do I only need to remove and re-add partition nvme1n1p2 of disk nvme1n1 to the array md2, with commands like these? Will that be enough to rebuild this partition?
Code:
mdadm /dev/md2 --manage --remove /dev/nvme1n1p2
mdadm /dev/md2 --manage --add /dev/nvme1n1p2
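One detail worth checking first: the `mdadm --detail /dev/md2` output above already shows slot 1 as "removed", so the `--remove` step may have nothing to act on. The sketch below only prints the commands it would suggest, based on `mdadm --detail` text read from stdin; the helper name `plan_readd` is hypothetical, and it assumes a failed member would appear with the word "faulty" on its line.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that look necessary to bring a
# partition back into an array, given `mdadm --detail` output on stdin.
# "plan_readd" is a hypothetical helper name; nothing is executed.
plan_readd() {
    array=$1
    part=$2
    # A member still listed as faulty has to be removed first; one that is
    # already shown as "removed" (not listed any more) only needs --add.
    if grep -q "faulty.*$part"; then
        echo "mdadm --manage $array --remove $part"
    fi
    echo "mdadm --manage $array --add $part"
}

# On the affected host this would be run as:
#   mdadm --detail /dev/md2 | plan_readd /dev/md2 /dev/nvme1n1p2
```

After an `--add`, progress of the resync should be visible in `cat /proc/mdstat`. md2 shows no bitmap line above, so a full copy of the 20 GiB partition is the likely outcome rather than a partial resync.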
Is there anything here that tells you this drive is bad and needs to be replaced, or is it just a software issue that damaged the RAID?
I also saw something on this forum about the Linux boot partition missing from the second disk, so that if the first disk fails, the server won't boot. Do you think that is also the case here, since the partition "/dev/nvme0n1p5 2000406528 2000408575 2048 1M Linux filesystem" isn't present on the nvme1n1 disk? How can I correct this?
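To see exactly where the two partition tables differ, here is a rough sketch that reduces `fdisk -l` partition rows to "partition number, sectors, type" so the two disks can be diffed regardless of device name. The helper name `part_layout` is made up, and the `p<N>` handling assumes NVMe-style names such as nvme0n1p5.

```shell
#!/bin/sh
# Normalise `fdisk -l` partition rows to "p<N> <sectors> <type>" so the
# layouts of two disks can be compared. Reads fdisk output on stdin.
# "part_layout" is a hypothetical helper; assumes NVMe-style partition names.
part_layout() {
    awk '/^\/dev\// {
        n = $1
        sub(/.*p/, "p", n)                  # /dev/nvme0n1p5 -> p5
        type = $6                           # type may span several words
        for (i = 7; i <= NF; i++) type = type " " $i
        print n, $4, type
    }'
}

# On the affected host this would be run as:
#   fdisk -l /dev/nvme0n1 | part_layout > /tmp/nvme0.layout
#   fdisk -l /dev/nvme1n1 | part_layout > /tmp/nvme1.layout
#   diff /tmp/nvme0.layout /tmp/nvme1.layout
```

Going by the `fdisk -l` output above, the diff should flag two things: p5 exists only on nvme0n1, and p3 is typed "Linux RAID" on nvme0n1 but "Linux swap" on nvme1n1.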
Thanks A LOT for your help btw !!