BTRFS crash, bootloader no longer available

Nexo

Mar 30, 2025
Hello,

during the night of 29 March 2025 to 30 March 2025 my Proxmox server stopped working.
In case it is relevant: the clocks changed from winter time to summer time in that window.

The machine showed the following error message:
[Attachment: System Log2.jpg]
After a subsequent reboot the machine did not even get as far as the GRUB bootloader.

Hardware:
Lenovo M920x
2 x Samsung 990 Pro 4TB
BTRFS RAID 1
Proxmox version fairly recent; the exact version is not known at the moment.

The recovery system from the ISO only reported:
Code:
error: no such device: rpool.
ERROR: unable to find boot disk automatically.

Press any key to continue...

I then installed one of the two drives in a rescue system and mounted the BTRFS filesystem with mount -o degraded /dev/nvme0n1p3. This produced a series of errors along these lines:
Code:
sudo mount -o degraded /dev/nvme0n1p3
[  316.175539] BTRFS error (device nvme1n1p3): bad tree block start, mirror 1 want 25493504 have 0
[  316.177848] BTRFS error (device nvme1n1p3): bad tree block start, mirror 1 want 23412736 have 0
[  316.178856] BTRFS error (device nvme1n1p3): bad tree block start, mirror 1 want 28524544 have 0
[  316.179445] BTRFS error (device nvme1n1p3): bad tree block start, mirror 1 want 28524544 have 0
[  316.183546] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 1020051456 mirror 1 wanted 744690 found 724451
[  316.184380] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 829243392 mirror 1 wanted 744690 found 725326
[  316.185984] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 249069568 mirror 1 wanted 744504 found 723705
[  316.187889] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 370065408 mirror 1 wanted 739968 found 725326
[  316.188812] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 893517823 mirror 1 wanted 742353 found 725245
[  316.190400] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 266993664 mirror 1 wanted 744688 found 725325
[  316.192692] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 738541568 mirror 1 wanted 745062 found 711664
[  316.198774] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 422313984 mirror 1 wanted 744840 found 725326
[  316.201844] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 885555200 mirror 1 wanted 739049 found 725326
[  316.204661] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 701054976 mirror 1 wanted 743899 found 725321
[  316.230368] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694106570752 have 0
[  316.243186] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694107258880 have 0
[  316.249118] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694108946432 have 0
[  316.258276] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694029910016 have 0
[  316.273081] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694029991936 have 0
[  316.279500] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 690403017216 have 0
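
To get an overview of output like the above, the kernel messages can be tallied by error type and mirror number. This is only a minimal Python sketch, assuming the log text has been pasted into a string; the function name `tally_btrfs_errors` and the regular expression are made up for illustration and are not part of any btrfs tool.

```python
import re
from collections import Counter

# Matches the two kernel message types seen in the log above.
# Illustrative pattern only; not taken from any btrfs tool.
BTRFS_ERR = re.compile(
    r"BTRFS error \(device (?P<dev>\S+)\): "
    r"(?P<kind>bad tree block start|parent transid verify failed)"
    r".*?mirror (?P<mirror>\d+)"
)

def tally_btrfs_errors(log_text):
    """Count BTRFS errors per (kind, mirror) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = BTRFS_ERR.search(line)
        if m:
            counts[(m.group("kind"), int(m.group("mirror")))] += 1
    return counts

# Three lines copied from the log above as sample input.
sample = """\
[  316.175539] BTRFS error (device nvme1n1p3): bad tree block start, mirror 1 want 25493504 have 0
[  316.183546] BTRFS error (device nvme1n1p3): parent transid verify failed on logical 1020051456 mirror 1 wanted 744690 found 724451
[  316.230368] BTRFS error (device nvme1n1p3): bad tree block start, mirror 2 want 694106570752 have 0
"""

for (kind, mirror), n in sorted(tally_btrfs_errors(sample).items()):
    print(f"{kind} (mirror {mirror}): {n}")
```

Fed with the full dmesg output, this shows at a glance whether one mirror or both are affected.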

The VMs have meanwhile been restored on a replacement machine. From my point of view there is therefore no need to get this machine running again.

What matters to me is whether I should assume that the machine or the drives are damaged.

If further information would help the developers, I could do some more forensics. In that case please let me know exactly what would be of interest.
 
nvme0n1 is gone. On nvme1n1 that seems to have led to follow-on RAID errors caused by nvme0n1, but that drive could also be on its way out:
[Attachments: 1744780966710.png, 1744780976211.png]
 
One thing I would still find interesting: how long had the NVMe drives been in use with this BTRFS formatting?
 
OK, now it gets strange. In the spare system I had only installed one of the two drives. I have now put that drive back into the original system. And lo and behold, it boots as if nothing had ever happened. The bootloader is there, shortly afterwards the web interface is available, and all VMs are running.

The system has been in operation in this form for about a year.

Code:
root@pve2:~# smartctl -x /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7DPNJ0WB14787W
Firmware Version:                   0B2QJXG7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            618,404,855,808 [618 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b31422e47
Local Time is:                      Fri Apr 18 12:08:10 2025 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    7%
Data Units Read:                    6,916,453 [3.54 TB]
Data Units Written:                 152,666,580 [78.1 TB]
Host Read Commands:                 79,424,935
Host Write Commands:                3,516,183,213
Controller Busy Time:               8,097
Power Cycles:                       53
Power On Hours:                     6,321
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               37 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Code:
root@pve2:~# smartctl -x /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7DPNJ0X218114H
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            656,873,381,888 [656 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4241426bdf
Local Time is:                      Fri Apr 18 12:07:16 2025 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    12,285,347 [6.29 TB]
Data Units Written:                 158,694,426 [81.2 TB]
Host Read Commands:                 172,251,618
Host Write Commands:                3,657,013,342
Controller Busy Time:               8,241
Power Cycles:                       54
Power On Hours:                     6,486
Unsafe Shutdowns:                   22
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
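
For reading the write counters: in the NVMe specification one data unit is 1000 blocks of 512 bytes, so the figure smartctl prints in brackets can be recomputed and set against the power-on hours. A small Python sketch using the values from the two outputs above; the helper names are made up for illustration.

```python
# NVMe reports "Data Units" in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 1000 * 512

def data_units_to_tb(units):
    """Convert NVMe data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12

def avg_write_gb_per_hour(units_written, power_on_hours):
    """Average host writes per power-on hour, in decimal gigabytes."""
    return units_written * BYTES_PER_DATA_UNIT / 1e9 / power_on_hours

# Values copied from the smartctl output above.
drives = {
    "nvme0n1": {"written": 152_666_580, "hours": 6_321},
    "nvme1n1": {"written": 158_694_426, "hours": 6_486},
}

for name, d in drives.items():
    tb = data_units_to_tb(d["written"])
    rate = avg_write_gb_per_hour(d["written"], d["hours"])
    print(f"{name}: {tb:.2f} TB written, about {rate:.1f} GB per power-on hour")
```

Both drives come out at a similar rate, which is what one would expect from a RAID 1 pair.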

btrfs check reports a pile of errors:
Code:
root@pve2:~# btrfs check --force /dev/nvme0n1p3
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme0n1p3
UUID: 0b35bc79-33a1-4c6e-92ea-34ede8633ef1
[1/7] checking root items
parent transid verify failed on 381534208 wanted 732189 found 725326
parent transid verify failed on 215367680 wanted 733291 found 725325
parent transid verify failed on 899137536 wanted 731916 found 725326
parent transid verify failed on 808747008 wanted 744002 found 725297
parent transid verify failed on 783466496 wanted 740234 found 725326
parent transid verify failed on 748290048 wanted 739049 found 700213
parent transid verify failed on 773832704 wanted 727465 found 725326
parent transid verify failed on 379076608 wanted 740603 found 725326
parent transid verify failed on 637190144 wanted 743177 found 725326
parent transid verify failed on 64028672 wanted 738343 found 723769
parent transid verify failed on 483065856 wanted 744957 found 723719
parent transid verify failed on 725204992 wanted 743532 found 725326
parent transid verify failed on 879558656 wanted 743563 found 725326
parent transid verify failed on 918372352 wanted 741378 found 725326
parent transid verify failed on 440500224 wanted 740062 found 725326
parent transid verify failed on 392396800 wanted 739968 found 725326
parent transid verify failed on 1078788096 wanted 742608 found 725327
parent transid verify failed on 991625216 wanted 744629 found 712011
parent transid verify failed on 976519168 wanted 743812 found 725327
parent transid verify failed on 1027391488 wanted 742300 found 725327
parent transid verify failed on 1064435712 wanted 743318 found 725322
parent transid verify failed on 1010712576 wanted 739460 found 725327
parent transid verify failed on 952893440 wanted 739905 found 724996
parent transid verify failed on 829358080 wanted 742639 found 725326
parent transid verify failed on 552419328 wanted 737018 found 725326
parent transid verify failed on 870465536 wanted 741873 found 725326
parent transid verify failed on 901922816 wanted 745051 found 725326
parent transid verify failed on 452329472 wanted 739562 found 725326
parent transid verify failed on 452362240 wanted 739562 found 725326
parent transid verify failed on 768098304 wanted 738126 found 725321
parent transid verify failed on 901971968 wanted 745051 found 717449
parent transid verify failed on 242630656 wanted 734918 found 713648
parent transid verify failed on 738246656 wanted 739645 found 713523
parent transid verify failed on 903020544 wanted 737333 found 725326
parent transid verify failed on 948191232 wanted 739901 found 725327
parent transid verify failed on 550141952 wanted 739867 found 725326
parent transid verify failed on 307298304 wanted 744520 found 725325
parent transid verify failed on 1067433984 wanted 743318 found 725327
........
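
In the lines shown, most of the "found" generations sit at 725325 to 725327, while the "wanted" values are thousands of generations newer. To check that pattern over the whole output, the numbers can be summarized with a short Python sketch. The name `summarize_transids` and the regular expression are my own assumptions, and the result is only a reading aid, not a diagnosis.

```python
import re
from collections import Counter

# Pulls the two generation numbers out of each "parent transid" line.
TRANSID = re.compile(r"wanted (?P<wanted>\d+) found (?P<found>\d+)")

def summarize_transids(text):
    """Return (Counter of found generations, smallest gap, largest gap)."""
    found = Counter()
    gaps = []
    for m in TRANSID.finditer(text):
        wanted, got = int(m.group("wanted")), int(m.group("found"))
        found[got] += 1
        gaps.append(wanted - got)
    if not gaps:
        return found, None, None
    return found, min(gaps), max(gaps)

# First five lines of the btrfs check output above as sample input.
sample = """\
parent transid verify failed on 381534208 wanted 732189 found 725326
parent transid verify failed on 215367680 wanted 733291 found 725325
parent transid verify failed on 899137536 wanted 731916 found 725326
parent transid verify failed on 808747008 wanted 744002 found 725297
parent transid verify failed on 783466496 wanted 740234 found 725326
"""

found, min_gap, max_gap = summarize_transids(sample)
print("most common found generation:", found.most_common(1)[0])
print("gap wanted - found: min", min_gap, "max", max_gap)
```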
 
Even though it is running, with that many errors I would not trust anything.
The best approach is to back up the VMs and settings and do a fresh install.
The NVMe drives should be OK; do check whether you already have the current firmware on them. I have two 990 Pro drives in my notebook and they recently received an update.
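
The installed firmware can be read from the smartctl output already posted in this thread. A small Python sketch, assuming the plain-text output format shown there; `firmware_version` is a made-up helper name. It only compares the two strings and says nothing about which version is the current one.

```python
import re

# Field label as printed in the smartctl information section.
FIRMWARE = re.compile(r"^Firmware Version:\s+(\S+)", re.MULTILINE)

def firmware_version(smartctl_text):
    """Extract the 'Firmware Version' field from smartctl text output."""
    m = FIRMWARE.search(smartctl_text)
    return m.group(1) if m else None

# Excerpts copied from the two smartctl outputs earlier in the thread.
reports = {
    "nvme0n1": "Model Number:                       Samsung SSD 990 PRO 4TB\n"
               "Firmware Version:                   0B2QJXG7\n",
    "nvme1n1": "Model Number:                       Samsung SSD 990 PRO 4TB\n"
               "Firmware Version:                   4B2QJXD7\n",
}

versions = {dev: firmware_version(text) for dev, text in reports.items()}
for dev, ver in versions.items():
    print(f"{dev}: firmware {ver}")
if len(set(versions.values())) > 1:
    print("Drives report different firmware versions.")
```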
 