PVE 8.3 node with btrfs booting into grub command line after hard reset

udotirol

Hi,

for my home lab I wanted to give btrfs a try, and since PVE supports btrfs+RAID on root, I added a PVE 8.3 node with exactly that setup to my little two-node lab. Things went smoothly at first, but after a recent hard reset the node doesn't boot anymore and throws me into a grub command line instead.

From there I can successfully access my EFI partition, but not the root partition; grub isn't able to mount it at that stage:

(screenshot grub.jpeg: grub command line, btrfs root partition not accessible)
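
For context, a rough sketch of the kind of commands involved at the grub prompt (the partition numbers here are placeholders, not necessarily what I typed):

Code:
grub> ls                     # list the disks/partitions grub can see
grub> insmod btrfs           # make sure the btrfs module is available
grub> ls (hd0,gpt2)/EFI      # the EFI partition is readable ...
grub> ls (hd0,gpt3)/         # ... but the btrfs root partition fails here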

I then tried to do a recovery boot from a fresh ISO, but apparently the recovery mode doesn't work for btrfs root devices.

Next I tried to boot the server with a live system and even from there, I can't access any btrfs partition.
(screenshot rescue.jpeg: live system, btrfs partitions not mountable)

My knowledge about btrfs is very limited, so maybe I am missing something essential.

Anyway, I guess it's learning time now ... any help appreciated!
 
?? It's the same UUID (which it should be, as you said it's a btrfs raid1), just the other device by name.
But it looks like a new installation is needed then ... hopefully you have good backups to restore from ... :)
 
First, I suggest checking that there isn't a hardware issue (mainly that the disks are fine, but RAM or other components can also cause corruption). If it isn't hardware related, the error you showed could also just be a mount timeout (on a big filesystem, or one with heavy snapshot usage and very high fragmentation, though that's unlikely on a good NVMe disk); you can try increasing it via the mount options: https://www.reddit.com/r/btrfs/comments/ppggbm/btrfs_open_ctree_failed_on_boot/
If it is corrupted instead, a more thorough check is needed.
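
A rough sketch of the quick hardware checks I mean (device names are just examples):

Code:
smartctl -H /dev/nvme0        # overall health of each NVMe
smartctl -H /dev/nvme1
nvme smart-log /dev/nvme1     # media errors, unsafe shutdowns, temperature, ...
# RAM is better checked with memtest86+ from the boot menu, not from the running system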
 
I'm afraid I wasn't clear enough, sorry.

blkid shows that both partitions on both NVMes have the same UUID (see my last screenshot).

What differs is the error message I get.

One says "device 1 uuid 3f3d..." and when trying with the other NVMe I get a different "device 2 uuid 6....", I don't remember exactly.

Hmm, but my previous "live system" was just a proxmox ISO where I interrupted the installation process and then worked from the command line.

After trying with clonezilla, I can successfully mount and access both btrfs partitions ... not sure what's going on there. I will try to reinstall grub from there.
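
The rough plan for the grub reinstall, for the record (a sketch; the partition numbers and whether this install uses proxmox-boot-tool or plain grub are things I still need to verify):

Code:
# from the live system: mounting any member device assembles the whole btrfs filesystem
mount /dev/nvme0n1p3 /mnt
for d in dev proc sys run; do mount --bind /$d /mnt/$d; done
chroot /mnt
proxmox-boot-tool status          # check how the ESPs are managed on this install
proxmox-boot-tool refresh         # if proxmox-boot-tool manages them
# otherwise the classic grub route (assuming the ESP is partition 2):
mount /dev/nvme0n1p2 /boot/efi
grub-install /dev/nvme0n1
update-grub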

Haha, and yes, a backup is available, but restoring would be the easy way out. I'm more interested in figuring out what went wrong after the hard reset and, of course, how to get back on track :D
 
Well, and it gets even more fascinating. After simply mounting and then unmounting the two partitions in clonezilla, I attempted a plain reboot into proxmox, and that worked.

After booting, the ring buffer contains heaps of lines like these, however:

Code:
[Mon Jan  6 15:47:54 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 21628747776 (dev /dev/nvme1n1p3 sector 931730376)
[Mon Jan  6 15:47:56 2025] btrfs_print_data_csum_error: 4 callbacks suppressed
[Mon Jan  6 15:47:56 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78513 off 4096 csum 0xd5ffab47 expected csum 0xbbd013fc mirror 2
[Mon Jan  6 15:47:56 2025] btrfs_dev_stat_inc_and_print: 4 callbacks suppressed
[Mon Jan  6 15:47:56 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90280, gen 0
[Mon Jan  6 15:47:56 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78509 off 4096 (dev /dev/nvme1n1p3 sector 312043192)
[Mon Jan  6 15:47:56 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78513 off 4096 (dev /dev/nvme1n1p3 sector 311272416)
[Mon Jan  6 15:47:56 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78509 off 8192 csum 0xcab94f36 expected csum 0x97ba6de4 mirror 2

So the filesystem must have suffered pretty bad corruption (which isn't entirely unexpected after a hard reset), but the scale of it seems wild if it rendered the server unable to boot.
 
Hmm, are the partitions in sync now if you mount them individually via btrfs in clonezilla after the new boot - hopefully?!
btrfs is still tagged as experimental by PVE; that may get better in the future, or maybe not. I appreciate your rocky path anyway - good luck :)
 
That disks of the same btrfs filesystem have the same UUID is correct.
But then it's not clear what you did.
What did you do with clonezilla?
It is important to know at least the essentials about btrfs to avoid creating problems: when a btrfs filesystem spans multiple disks, all of them must be present, and mounting one device automatically "mounts" the filesystem with all of its disks. When mounting, it is usually better to go by UUID to avoid surprises (the device path may change if something changes; mounting by UUID is useful for any filesystem) - see the sketch below.
Also, btrfs does not mirror "sector by sector"; it keeps multiple copies of the data on multiple disks, based on the selected profile, at different positions. It is flexible enough that you can use disks of different sizes and add more disks later. Did you make a sector-by-sector copy from one disk to the other? Cloning them would be wrong; that is only done in rare cases with defective disks, from a failing disk to a new one - for example, if you have two disks with problems in a profile that tolerates only one lost or failing disk (such as raid1 or raid5), you ddrescue at least one of the defective-but-still-working disks to limit the possible data loss.
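
For example, a minimal sketch of mounting by UUID (the UUID is the one from your btrfs check output, the mount point is just an example):

Code:
btrfs filesystem show                               # lists the filesystem and all member devices
blkid | grep btrfs                                  # both members report the same filesystem UUID
mount -U 61425a22-f2bb-481c-817c-da10f7e17cec /mnt  # any member brings in the whole filesystem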
 
Thanks for the extensive explanation.

What I did with clonezilla was just check whether I could manually mount the two btrfs partitions, like I did before with the proxmox live system. And apparently I could, it just worked. I didn't perform any cloning or anything like that; I went directly to the console/shell.

And about the UUIDs: I am aware of that; all I wanted to do was see whether I could access the individual devices, and that worked.

Like I said before, my knowledge about btrfs is limited, but I am interested in learning more about it. After all, this is my homelab, so things are expected to break every now and then :D

However, now that the server is running again, I am trying to figure out what went wrong.

Code:
$ btrfs device stats -c /
[/dev/nvme1n1p3].write_io_errs    2342179
[/dev/nvme1n1p3].read_io_errs     11266
[/dev/nvme1n1p3].flush_io_errs    15027
[/dev/nvme1n1p3].corruption_errs  90557
[/dev/nvme1n1p3].generation_errs  0
[/dev/nvme0n1p3].write_io_errs    0
[/dev/nvme0n1p3].read_io_errs     0
[/dev/nvme0n1p3].flush_io_errs    0
[/dev/nvme0n1p3].corruption_errs  0
[/dev/nvme0n1p3].generation_errs  0

This probably corresponds to the messages in the ring buffer; for whatever reason, my second NVMe doesn't seem to behave correctly ... but maybe those errors are just leftovers from the hard reboot.
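
I guess the next step is a scrub, so btrfs can repair the bad copies from the healthy mirror; roughly what I have in mind (a sketch, assuming the raid1 profile):

Code:
btrfs scrub start -Bd /      # foreground run, per-device statistics at the end
btrfs scrub status /         # progress/result (from another shell while it runs)
btrfs device stats -z /      # once things look clean, reset the error counters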

And when I check the partition, no errors are reported:

Code:
$ btrfs check --force /dev/nvme1n1p3
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme1n1p3
UUID: 61425a22-f2bb-481c-817c-da10f7e17cec
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 675886034944 bytes used, no error found
total csum bytes: 652967220
total tree bytes: 1543831552
total fs tree bytes: 400801792
total extent tree bytes: 369868800
btree space waste bytes: 239259682
file data blocks allocated: 700220489728
 referenced 671974440960

Right now, while everything appears to be working, I still keep getting the btrfs errors in the ring buffer:

Code:
[Mon Jan  6 22:54:51 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78524 off 56961204224 csum 0x64eee23c expected csum 0xa7b1523b mirror 2
[Mon Jan  6 22:54:51 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90562, gen 0
[Mon Jan  6 22:54:51 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78524 off 56961208320 csum 0x6de4b6f9 expected csum 0x51ad5db7 mirror 2
[Mon Jan  6 22:54:51 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90563, gen 0
[Mon Jan  6 22:54:51 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 56961204224 (dev /dev/nvme1n1p3 sector 1375316968)
[Mon Jan  6 22:54:51 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 56961208320 (dev /dev/nvme1n1p3 sector 1375316976)

Those errors have only been occurring since I managed to revive the node, i.e. there was nothing like this before the hard reboot.

In order to rule out hardware issues, I've run both a short and an extended self-test on that NVMe; both complete successfully:

Code:
# nvme self-test-log /dev/nvme1
Device Self Test Log for NVME device:nvme1
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x13d3
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x13d3
  Vendor Specific              : 0 0

Not sure what to make of it ...
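
For completeness, roughly how those self-tests were kicked off with nvme-cli (flags may differ slightly between versions):

Code:
nvme device-self-test /dev/nvme1 -n 1 -s 1   # short self-test
nvme device-self-test /dev/nvme1 -n 1 -s 2   # extended self-test
nvme self-test-log /dev/nvme1                # poll for completion and results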
 
It seems these were correctable errors; however, errors are always a symptom of a problem, almost always hardware in my experience. It is important not to assume that checksums and profiles with two or more copies can always self-repair and avoid all problems (it may seem obvious to many, but I say it because unfortunately I have read of several cases where people kept using defective hardware as if nothing had happened).

In most cases, to check the integrity of the data of the entire filesystem via checksums you need to use scrub (and, more rarely, check); obviously, data without checksums cannot be detected as damaged (for example where CoW is disabled with nodatacow).
The systemd journal files have CoW disabled by default, and to avoid such errors with raid you need to re-enable it; at least, I re-enable it on all servers where I have btrfs in raid1 (see the sketch below).
I also advise against using btrfs for VM disks if you do not have enterprise disks (and for very write-intensive workloads and/or heavy database use, even with enterprise disks); instead, use storage without a host-side filesystem, or with a non-CoW filesystem.
If you do use it for VM disks, it can be very useful to set noatime, especially if you use snapshots.
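
A sketch of how I check/re-enable it on the journal directory (note that chattr -C only affects files created afterwards, and a tmpfiles.d rule such as journal-nocow.conf may re-apply the attribute unless masked):

Code:
lsattr -d /var/log/journal        # 'C' in the output means No_COW is set
chattr -C /var/log/journal        # new files in the directory get CoW (and checksums) again
journalctl --rotate               # rotate so new journal files are created
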
There would be other useful things to say, but I don't have the time.

About the disk: also check the SMART data, for example values like "Media and Data Integrity Errors"; try to post the full SMART data, if you want.
 
Well, looks like I'm back at square one: the node again only boots into grub. Yesterday evening I decided to start a scrub, and that ended in low-level block storage errors like these:

Code:
[27979.988958] nvme nvme1: I/O tag 604 (b25c) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988987] nvme nvme1: I/O tag 605 (125d) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988991] nvme nvme1: I/O tag 606 (925e) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988994] nvme nvme1: I/O tag 607 (725f) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988997] nvme nvme1: I/O tag 608 (2260) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989000] nvme nvme1: I/O tag 610 (5262) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989003] nvme nvme1: I/O tag 611 (1263) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989006] nvme nvme1: I/O tag 613 (4265) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989068] nvme nvme1: Abort status: 0x0
[27979.989129] nvme nvme1: Abort status: 0x0
[27979.989192] nvme nvme1: Abort status: 0x0
[27979.989257] nvme nvme1: Abort status: 0x0
[27979.989338] nvme nvme1: Abort status: 0x0
[27979.989399] nvme nvme1: Abort status: 0x0
[27979.989476] nvme nvme1: Abort status: 0x0
[27979.989537] nvme nvme1: Abort status: 0x0

so this really looks like some hardware issue ...

After that I booted into clonezilla once more and did the same as before: successfully mounted and unmounted both btrfs partitions, then rebooted. Unfortunately, this time it didn't "magically" fix my issue; I still end up at the grub command line.

I remember from other RAID capable filesystems that sometimes there are issues when you attempt to boot a degraded array. Can this be the case for btrfs too? I actually found an old - very old - post from the btrfs ML stating just that:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg46598.html

Can this still be the case?
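
If a degraded array really is the blocker, btrfs does indeed refuse a normal mount when a member is missing; a sketch of a one-time degraded boot (assuming grub itself can still load a kernel, which may not be the case here):

Code:
# at the grub menu, press 'e' on the boot entry and append to the 'linux' line:
#   rootflags=degraded
# or, from a live system, a degraded mount would look like:
mount -o degraded /dev/nvme0n1p3 /mnt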

And as suggested by @Fantu, here's my smart data:

Code:
$ smartctl -x /dev/nvme1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7DPNJ0WC34795J
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            669,775,155,200 [669 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4c3144eec0
Local Time is:                      Mon Jan  6 22:19:33 2025 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    6,649,601 [3.40 TB]
Data Units Written:                 32,917,461 [16.8 TB]
Host Read Commands:                 25,680,641
Host Write Commands:                652,533,956
Controller Busy Time:               4,594
Power Cycles:                       85
Power On Hours:                     5,074
Unsafe Shutdowns:                   68
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               52 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
There is a hardware issue on the disk or on the adapter (if you use a PCIe adapter for the M.2).
Samsung disks also have many firmware issues; I own many of them, and the first thing I do is check and update the firmware. I'm not well-informed about the 990 Pro, I don't have that model, but from a quick search it also had problems with some firmware versions.
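
A sketch of how to check the installed firmware before looking for an update (whether fwupd covers this model is not something I can confirm; Samsung also ships bootable update ISOs):

Code:
nvme id-ctrl /dev/nvme1 | grep -i '^fr '   # installed firmware revision
nvme fw-log /dev/nvme1                     # firmware slots and active slot
fwupdmgr get-devices                       # see if fwupd knows the drive
fwupdmgr get-updates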
 
Good idea to do a firmware update on the Samsung first and then try the NVMe again, as the SMART values aren't that bad yet.
But that might only buy you a further short period of working time ...
 
