PVE 8.3 node with btrfs booting into grub command line after hard reset

udotirol

Hi,

for my home lab I wanted to give btrfs a try, and since PVE supports btrfs+RAID on root, I added a PVE 8.3 node with exactly that setup to my little two-node lab. Things went smoothly at first, but after a recent hard reset the node doesn't boot anymore and throws me into a grub command line instead.

From there I can successfully access my EFI partition, but not the root partition; grub isn't able to mount it at that stage:

(screenshot grub.jpeg: grub command line, btrfs root partition not accessible)
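
For context, a rough sketch of the kind of commands involved at the grub prompt (the partition numbers here are placeholders, not necessarily what I typed):

Code:
grub> ls                     # list the disks/partitions grub can see
grub> insmod btrfs           # make sure the btrfs module is available
grub> ls (hd0,gpt2)/EFI      # the EFI partition is readable ...
grub> ls (hd0,gpt3)/         # ... but the btrfs root partition fails here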

I then tried to do a recovery boot from a fresh ISO, but apparently the recovery mode doesn't work for btrfs root devices.

Next I tried to boot the server with a live system and even from there, I can't access any btrfs partition.
(screenshot rescue.jpeg: live system, btrfs partitions not mountable)

My knowledge about btrfs is very limited, so maybe I am missing something essential.

Anyway, I guess it's learning time now ... any help appreciated!
 
?? It's the same UUID (which it should be, as you said it's a btrfs raid1), just the other device by name.
But it looks like a new installation is needed then ... hopefully you have good backups to restore from ... :)
 
First, I suggest checking that there isn't a hardware issue (mainly that the disks are fine, but RAM or other components can also cause corruption). If it isn't hardware related, the error you showed could also just be a mount timeout (on a big filesystem, or one with heavy snapshot usage and very high fragmentation, though that's unlikely on a good NVMe disk); you can try increasing it via the mount options: https://www.reddit.com/r/btrfs/comments/ppggbm/btrfs_open_ctree_failed_on_boot/
If it is corrupted instead, a more thorough check is needed.
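
A rough sketch of the quick hardware checks I mean (device names are just examples):

Code:
smartctl -H /dev/nvme0        # overall health of each NVMe
smartctl -H /dev/nvme1
nvme smart-log /dev/nvme1     # media errors, unsafe shutdowns, temperature, ...
# RAM is better checked with memtest86+ from the boot menu, not from the running system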
 
I'm afraid I wasn't clear enough, sorry.

blkid shows that both partitions on both NVMes have the same UUID (see my last screenshot).

What differs is the error message I get.

One says "device 1 uuid 3f3d..." and when trying with the other NVMe I get a different "device 2 uuid 6....", I don't remember exactly.

Hmm, but my previous "live system" was just a proxmox ISO where I interrupted the installation process and then worked from the command line.

After trying with clonezilla, I can successfully mount and access both btrfs partitions ... not sure what's going on there. I will try to reinstall grub from there.
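
The rough plan for the grub reinstall, for the record (a sketch; the partition numbers and whether this install uses proxmox-boot-tool or plain grub are things I still need to verify):

Code:
# from the live system: mounting any member device assembles the whole btrfs filesystem
mount /dev/nvme0n1p3 /mnt
for d in dev proc sys run; do mount --bind /$d /mnt/$d; done
chroot /mnt
proxmox-boot-tool status          # check how the ESPs are managed on this install
proxmox-boot-tool refresh         # if proxmox-boot-tool manages them
# otherwise the classic grub route (assuming the ESP is partition 2):
mount /dev/nvme0n1p2 /boot/efi
grub-install /dev/nvme0n1
update-grub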

Haha, and yes, a backup is available, but restoring would be the easy way out. I'm more interested in figuring out what went wrong after the hard reset and, of course, how to get back on track :D
 
Well, and it gets even more fascinating. After simply mounting and then unmounting the two partitions in clonezilla, I attempted a plain reboot into proxmox, and that worked.

After booting, the ring buffer contains heaps of lines like these, however:

Code:
[Mon Jan  6 15:47:54 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 21628747776 (dev /dev/nvme1n1p3 sector 931730376)
[Mon Jan  6 15:47:56 2025] btrfs_print_data_csum_error: 4 callbacks suppressed
[Mon Jan  6 15:47:56 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78513 off 4096 csum 0xd5ffab47 expected csum 0xbbd013fc mirror 2
[Mon Jan  6 15:47:56 2025] btrfs_dev_stat_inc_and_print: 4 callbacks suppressed
[Mon Jan  6 15:47:56 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90280, gen 0
[Mon Jan  6 15:47:56 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78509 off 4096 (dev /dev/nvme1n1p3 sector 312043192)
[Mon Jan  6 15:47:56 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78513 off 4096 (dev /dev/nvme1n1p3 sector 311272416)
[Mon Jan  6 15:47:56 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78509 off 8192 csum 0xcab94f36 expected csum 0x97ba6de4 mirror 2

So the filesystem must have suffered pretty bad corruption (which isn't entirely unexpected after a hard reset), but the scale of it seems wild if it rendered the server unable to boot.
 
Hmm, are the partitions in sync now if you mount them individually via btrfs in clonezilla after the new boot - hopefully?!
btrfs is still tagged as experimental by PVE; that may get better in the future, or maybe not. I appreciate your rocky path anyway - good luck :)
 
That disks of the same btrfs filesystem have the same UUID is correct.
But then it's not clear what you did.
What did you do with clonezilla?
It is important to know at least the essentials about btrfs to avoid creating problems: when a btrfs filesystem spans multiple disks, all of them must be present, and mounting one device automatically "mounts" the filesystem with all of its disks. When mounting, it is usually better to go by UUID to avoid surprises (the device path may change if something changes; mounting by UUID is useful for any filesystem) - see the sketch below.
Also, btrfs does not mirror "sector by sector"; it keeps multiple copies of the data on multiple disks, based on the selected profile, at different positions. It is flexible enough that you can use disks of different sizes and add more disks later. Did you make a sector-by-sector copy from one disk to the other? Cloning them would be wrong; that is only done in rare cases with defective disks, from a failing disk to a new one - for example, if you have two disks with problems in a profile that tolerates only one lost or failing disk (such as raid1 or raid5), you ddrescue at least one of the defective-but-still-working disks to limit the possible data loss.
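
For example, a minimal sketch of mounting by UUID (the UUID is the one from your btrfs check output, the mount point is just an example):

Code:
btrfs filesystem show                               # lists the filesystem and all member devices
blkid | grep btrfs                                  # both members report the same filesystem UUID
mount -U 61425a22-f2bb-481c-817c-da10f7e17cec /mnt  # any member brings in the whole filesystem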
 
Thanks for the extensive explanation.

What I did with clonezilla was just check whether I could manually mount the two btrfs partitions, like I did before with the proxmox live system. And apparently I could, it just worked. I didn't perform any cloning or anything like that; I went directly to the console/shell.

And about the UUIDs: I am aware of that; all I wanted to do was see whether I could access the individual devices, and that worked.

Like I said before, my knowledge about btrfs is limited, but I am interested in learning more about it. After all, this is my homelab, so things are expected to break every now and then :D

However, now that the server is running again, I am trying to figure out what went wrong.

Code:
$ btrfs device stats -c /
[/dev/nvme1n1p3].write_io_errs    2342179
[/dev/nvme1n1p3].read_io_errs     11266
[/dev/nvme1n1p3].flush_io_errs    15027
[/dev/nvme1n1p3].corruption_errs  90557
[/dev/nvme1n1p3].generation_errs  0
[/dev/nvme0n1p3].write_io_errs    0
[/dev/nvme0n1p3].read_io_errs     0
[/dev/nvme0n1p3].flush_io_errs    0
[/dev/nvme0n1p3].corruption_errs  0
[/dev/nvme0n1p3].generation_errs  0

This probably corresponds to the messages in the ring buffer; for whatever reason, my second NVMe doesn't seem to behave correctly ... but maybe those errors are just leftovers from the hard reboot.
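
I guess the next step is a scrub, so btrfs can repair the bad copies from the healthy mirror; roughly what I have in mind (a sketch, assuming the raid1 profile):

Code:
btrfs scrub start -Bd /      # foreground run, per-device statistics at the end
btrfs scrub status /         # progress/result (from another shell while it runs)
btrfs device stats -z /      # once things look clean, reset the error counters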

And when I check the partition, no errors are reported:

Code:
$ btrfs check --force /dev/nvme1n1p3
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme1n1p3
UUID: 61425a22-f2bb-481c-817c-da10f7e17cec
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 675886034944 bytes used, no error found
total csum bytes: 652967220
total tree bytes: 1543831552
total fs tree bytes: 400801792
total extent tree bytes: 369868800
btree space waste bytes: 239259682
file data blocks allocated: 700220489728
 referenced 671974440960

Right now, while everything appears to be working, I still keep getting the btrfs errors in the ring buffer:

Code:
[Mon Jan  6 22:54:51 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78524 off 56961204224 csum 0x64eee23c expected csum 0xa7b1523b mirror 2
[Mon Jan  6 22:54:51 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90562, gen 0
[Mon Jan  6 22:54:51 2025] BTRFS warning (device nvme1n1p3): csum failed root 5 ino 78524 off 56961208320 csum 0x6de4b6f9 expected csum 0x51ad5db7 mirror 2
[Mon Jan  6 22:54:51 2025] BTRFS error (device nvme1n1p3): bdev /dev/nvme1n1p3 errs: wr 2342179, rd 11266, flush 15027, corrupt 90563, gen 0
[Mon Jan  6 22:54:51 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 56961204224 (dev /dev/nvme1n1p3 sector 1375316968)
[Mon Jan  6 22:54:51 2025] BTRFS info (device nvme1n1p3): read error corrected: ino 78524 off 56961208320 (dev /dev/nvme1n1p3 sector 1375316976)

Those errors have only been occurring since I managed to revive the node, i.e. there was nothing like this before the hard reboot.

In order to rule out hardware issues, I've run both a short and an extended self-test on that NVMe; both complete successfully:

Code:
# nvme self-test-log /dev/nvme1
Device Self Test Log for NVME device:nvme1
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x13d3
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x13d3
  Vendor Specific              : 0 0

Not sure what to make of it ...
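
For completeness, roughly how those self-tests were kicked off with nvme-cli (flags may differ slightly between versions):

Code:
nvme device-self-test /dev/nvme1 -n 1 -s 1   # short self-test
nvme device-self-test /dev/nvme1 -n 1 -s 2   # extended self-test
nvme self-test-log /dev/nvme1                # poll for completion and results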
 
It seems these were correctable errors; however, errors are always a symptom of a problem, almost always hardware in my experience. It is important not to assume that checksums and profiles with two or more copies can always self-repair and avoid all problems (it may seem obvious to many, but I say it because unfortunately I have read of several cases where people kept using defective hardware as if nothing had happened).

In most cases, to check the integrity of the data of the entire filesystem via checksums you need to use scrub (and, more rarely, check); obviously, data without checksums cannot be detected as damaged (for example where CoW is disabled with nodatacow).
The systemd journal files have CoW disabled by default, and to avoid such errors with raid you need to re-enable it; at least, I re-enable it on all servers where I have btrfs in raid1 (see the sketch below).
I also advise against using btrfs for VM disks if you do not have enterprise disks (and for very write-intensive workloads and/or heavy database use, even with enterprise disks); instead, use storage without a host-side filesystem, or with a non-CoW filesystem.
If you do use it for VM disks, it can be very useful to set noatime, especially if you use snapshots.
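
A sketch of how I check/re-enable it on the journal directory (note that chattr -C only affects files created afterwards, and a tmpfiles.d rule such as journal-nocow.conf may re-apply the attribute unless masked):

Code:
lsattr -d /var/log/journal        # 'C' in the output means No_COW is set
chattr -C /var/log/journal        # new files in the directory get CoW (and checksums) again
journalctl --rotate               # rotate so new journal files are created
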
There would be other useful things to say, but I don't have the time.

About the disk: also check the SMART data, for example values like "Media and Data Integrity Errors"; try to post the full SMART data, if you want.
 
Well, looks like I'm back at square one: the node again only boots into grub. Yesterday evening I decided to start a scrub, and that ended in low-level block storage errors like these:

Code:
[27979.988958] nvme nvme1: I/O tag 604 (b25c) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988987] nvme nvme1: I/O tag 605 (125d) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988991] nvme nvme1: I/O tag 606 (925e) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988994] nvme nvme1: I/O tag 607 (725f) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.988997] nvme nvme1: I/O tag 608 (2260) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989000] nvme nvme1: I/O tag 610 (5262) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989003] nvme nvme1: I/O tag 611 (1263) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989006] nvme nvme1: I/O tag 613 (4265) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:524288
[27979.989068] nvme nvme1: Abort status: 0x0
[27979.989129] nvme nvme1: Abort status: 0x0
[27979.989192] nvme nvme1: Abort status: 0x0
[27979.989257] nvme nvme1: Abort status: 0x0
[27979.989338] nvme nvme1: Abort status: 0x0
[27979.989399] nvme nvme1: Abort status: 0x0
[27979.989476] nvme nvme1: Abort status: 0x0
[27979.989537] nvme nvme1: Abort status: 0x0

so this really looks like some hardware issue ...

After that I booted into clonezilla once more and did the same as before: successfully mounted and unmounted both btrfs partitions, then rebooted. Unfortunately, this time it didn't "magically" fix my issue; I still end up at the grub command line.

I remember from other RAID capable filesystems that sometimes there are issues when you attempt to boot a degraded array. Can this be the case for btrfs too? I actually found an old - very old - post from the btrfs ML stating just that:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg46598.html

Can this still be the case?
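
If a degraded array really is the blocker, btrfs does indeed refuse a normal mount when a member is missing; a sketch of a one-time degraded boot (assuming grub itself can still load a kernel, which may not be the case here):

Code:
# at the grub menu, press 'e' on the boot entry and append to the 'linux' line:
#   rootflags=degraded
# or, from a live system, a degraded mount would look like:
mount -o degraded /dev/nvme0n1p3 /mnt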

And as suggested by @Fantu, here's my smart data:

Code:
$ smartctl -x /dev/nvme1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7DPNJ0WC34795J
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            669,775,155,200 [669 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4c3144eec0
Local Time is:                      Mon Jan  6 22:19:33 2025 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    6,649,601 [3.40 TB]
Data Units Written:                 32,917,461 [16.8 TB]
Host Read Commands:                 25,680,641
Host Write Commands:                652,533,956
Controller Busy Time:               4,594
Power Cycles:                       85
Power On Hours:                     5,074
Unsafe Shutdowns:                   68
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               52 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
There is a hardware issue on the disk or on the adapter (if you use a PCIe adapter for the M.2).
Samsung disks also have many firmware issues; I own many of them, and the first thing I do is check and update the firmware. I'm not well-informed about the 990 Pro, I don't have that model, but from a quick search it also had problems with some firmware versions.
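
A sketch of how to check the installed firmware before looking for an update (whether fwupd covers this model is not something I can confirm; Samsung also ships bootable update ISOs):

Code:
nvme id-ctrl /dev/nvme1 | grep -i '^fr '   # installed firmware revision
nvme fw-log /dev/nvme1                     # firmware slots and active slot
fwupdmgr get-devices                       # see if fwupd knows the drive
fwupdmgr get-updates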
 
Good idea to do a firmware update on the Samsung first and then try the NVMe again, as the SMART values aren't that bad yet.
But that might only buy you a further short period of working time ...
 
