USB Flash boot drive keeps going read only

Reliant8275 · Mar 16, 2024

Yes, I know I'm not working within expected parameters. I took three identical e-waste SFF PCs, installed Proxmox 8.1 on Transcend 128GB JetFlash 920 USB drives for boot and reasonably similar 1TB SATA SSDs in the only internal storage for Ceph. Ceph is working well and I'm able to migrate an LXC from PVE2 to PVE4 in about two seconds. However, one of the nodes (PVE3) keeps going read-only. For instance, I try to do most any command that would write to the disk and I get

Code:

-bash: /usr/bin/*command*: Input/output error

. Nano says the disk is read only. Locally, the errors on the screen looked like

Code:

[269117.049596] systemd-journald[312]: Failed to rotate /var/log/journal/very-long-number/system.journal: Read-only file system

. The other two systems seem to be fine but this one is going into this state within hours of reboot. I cannot properly reboot because a shutdown -r now returns

Code:

Call to Reboot failed: Access denied

via SSH and, while the Web GUI acts like it is going to reboot, it does not.
I have tried to reduce the stress on the flash drives by disabling SWAP (should be irrelevant with 16GB of RAM and nothing running yet), reducing logging (a long term issue for drive health but not a problem for today, right?), disabling TRIM (USB flash drive doesn't support TRIM but that was another read only issue I read about), and checking drive health (another missing feature of USB flash drives).

Any thoughts as to why one of three identical systems would be having this issue. Could one of the flash drives be defective? Could a 10 year old SFF PC have a failed USB port? Please save me from buying good new hardware and keep this cluster alive.

What else could be useful?

Code:

root@pve3:/etc/ssh# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--f04948df--44a3--441f--99b6--6f4dbdd29ab4-osd--block--20ffda48--2cc5--45d5--897c--f506939f011e
                                                                                             252:0    0 931.5G  0 lvm  
sdc                                                                                            8:32   0 115.2G  0 disk 
├─sdc1                                                                                         8:33   0  1007K  0 part 
├─sdc2                                                                                         8:34   0     1G  0 part 
└─sdc3                                                                                         8:35   0 114.2G  0 part

Dunuin · Mar 17, 2024

PVE isn't meant to be installed on pen drives or SD cards. Its writing too much and will kill those very fast as they usually use the crappiest NAND flash available and are lacking all those features reducing wear like wear leveling, GC, DRAM caching, PLP and so on.
And yes, not unusual that a flash device switches read-only on the hardwre level if the NAND cells are worn out too much.
Other common problems are corrupted filesystem or 100% filled root filesystem where Linux will switch read-only to prevent additional data-loss.
If your only option is USB, then at least use a USB SSD, USB HDD or external USB to M.2 enclosure. And USB isn't that reliable in the first place.

Reliant8275 · Mar 17, 2024

Dunuin said:
PVE isn't meant to be installed on pen drives or SD cards. Its writing too much and will kill those very fast as they usually use the crappiest NAND flash available and are lacking all those features reducing wear like wear leveling, GC, DRAM caching, PLP and so on.
And yes, not unusual that a flash device switches read-only on the hardwre level if the NAND cells are worn out too much.
Other common problems are corrupted filesystem or 100% filled root filesystem where Linux will switch read-only to prevent additional data-loss.
If your only option is USB, then at least use a USB SSD, USB HDD or external USB to M.2 enclosure. And USB isn't that reliable in the first place.

While I understand your point in the long term, this is a brand new drive. It is advertised as high endurance MLC flash for purposes just like this. And the other two are operating correctly. I'm looking for other options before I send the drive back.

Dunuin · Mar 17, 2024

Reliant8275 said:
It is advertised as high endurance MLC flash for purposes just like this.

There are some industrial pen drives that might be up for the task but then you pay 4 times the price for 1/4 of the capacity of those JetFlashs. A 20$ SATA SSD/HDD + a 10$ USB-to-SATA cable will probably do a way better job. Will work for some time with pen drives...at least you got a cluster with ceph for HA so not that bad if those nodes are failing all the time...just keep in mind that you are running your nodes below the minimum hardware requirements.

Reliant8275 · Mar 17, 2024

Update: I ran a badblocks destructive test on the flash boot drive and found no errors. I ran memtest 6 times and found no errors. I am restoring the backup image of the flash drive now.

gfngfn256 · Mar 17, 2024

Reliant8275 said:
sdc 8:32 0 115.2G 0 disk ├─sdc1 8:33 0 1007K 0 part ├─sdc2 8:34 0 1G 0 part └─sdc3 8:35 0 114.2G 0 part

Assuming the above is the actual lsblk output from the installed /sdc (flashdrive) boot partitions/pve drive; something doesn't look right. I don't see the normal named partitions that PVE creates here.

Try and compare it to an lsblk from the other nodes.

Maybe I'm missing something here.

Reliant8275 · Mar 17, 2024

lsblk from pve1 (running for a year or so)

Code:

root@pve1:~# lsblk
NAME                                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                         8:0    1     0B  0 disk 
nvme0n1                                   259:0    0 465.8G  0 disk 
├─nvme0n1p1                               259:1    0  1007K  0 part 
├─nvme0n1p2                               259:2    0     1G  0 part /boot/efi
└─nvme0n1p3                               259:3    0 464.8G  0 part 
  ├─pve-swap                              252:0    0     8G  0 lvm  
  ├─pve-root                              252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta                        252:2    0   3.4G  0 lvm  
  │ └─pve-data-tpool                      252:4    0 337.9G  0 lvm  
  │   ├─pve-data                          252:5    0 337.9G  1 lvm  
  │   ├─pve-vm--100--disk--0              252:6    0    80G  0 lvm  
  │   ├─pve-vm--102--disk--0              252:7    0   120G  0 lvm  
  │   ├─pve-vm--102--state--Working       252:8    0  16.5G  0 lvm  
  │   ├─pve-vm--102--state--reboots--well 252:9    0  16.5G  0 lvm  
  │   ├─pve-vm--101--disk--0              252:10   0     2G  0 lvm  
  │   ├─pve-vm--102--state--pre--QOS      252:11   0  16.5G  0 lvm  
  │   ├─pve-vm--103--disk--0              252:12   0     8G  0 lvm  
  │   ├─pve-vm--107--disk--0              252:13   0     2G  0 lvm  
  │   ├─pve-vm--108--disk--0              252:14   0     8G  0 lvm  
  │   ├─pve-vm--109--disk--0              252:15   0    20G  0 lvm  
  │   ├─pve-vm--104--disk--0              252:16   0    32G  0 lvm  
  │   ├─pve-vm--105--disk--0              252:17   0     2G  0 lvm  
  │   ├─pve-vm--110--disk--0              252:18   0     4G  0 lvm  
  │   └─pve-vm--111--disk--0              252:19   0     2G  0 lvm  
  └─pve-data_tdata                        252:3    0 337.9G  0 lvm  
    └─pve-data-tpool                      252:4    0 337.9G  0 lvm  
      ├─pve-data                          252:5    0 337.9G  1 lvm  
      ├─pve-vm--100--disk--0              252:6    0    80G  0 lvm  
      ├─pve-vm--102--disk--0              252:7    0   120G  0 lvm  
      ├─pve-vm--102--state--Working       252:8    0  16.5G  0 lvm  
      ├─pve-vm--102--state--reboots--well 252:9    0  16.5G  0 lvm  
      ├─pve-vm--101--disk--0              252:10   0     2G  0 lvm  
      ├─pve-vm--102--state--pre--QOS      252:11   0  16.5G  0 lvm  
      ├─pve-vm--103--disk--0              252:12   0     8G  0 lvm  
      ├─pve-vm--107--disk--0              252:13   0     2G  0 lvm  
      ├─pve-vm--108--disk--0              252:14   0     8G  0 lvm  
      ├─pve-vm--109--disk--0              252:15   0    20G  0 lvm  
      ├─pve-vm--104--disk--0              252:16   0    32G  0 lvm  
      ├─pve-vm--105--disk--0              252:17   0     2G  0 lvm  
      ├─pve-vm--110--disk--0              252:18   0     4G  0 lvm  
      └─pve-vm--111--disk--0              252:19   0     2G  0 lvm  
nvme1n1                                   259:4    0   3.7T  0 disk 
└─nvme1n1p1                               259:5    0   3.7T  0 part /mnt/nvme
                                                                    /mnt/pve/teamgroup

PVE2, one of the three "identical" systems:

Code:

root@pve2:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--10655fa6--84d2--4c1f--8d92--5b9bfd5e7721-osd--block--6bd15324--9102--4758--aef5--091ff3a28c87
                                                                                             252:0    0 931.5G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part 
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:2    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:3    0     1G  0 lvm  
  │ └─pve-data-tpool                                                                         252:5    0  51.4G  0 lvm  
  │   └─pve-data                                                                             252:6    0  51.4G  1 lvm  
  └─pve-data_tdata                                                                           252:4    0  51.4G  0 lvm  
    └─pve-data-tpool                                                                         252:5    0  51.4G  0 lvm  
      └─pve-data                                                                             252:6    0  51.4G  1 lvm

PVE3, the problem system (now after dumping to a .img and restoring):

Code:

root@pve3:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--f04948df--44a3--441f--99b6--6f4dbdd29ab4-osd--block--20ffda48--2cc5--45d5--897c--f506939f011e
                                                                                             252:0    0 931.5G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part /boot/efi
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:1    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:2    0     1G  0 lvm  
  │ └─pve-data                                                                               252:4    0  51.4G  0 lvm  
  └─pve-data_tdata                                                                           252:3    0  51.4G  0 lvm  
    └─pve-data                                                                               252:4    0  51.4G  0 lvm

PVE4, another "identical" system:

Code:

root@pve4:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 953.9G  0 disk 
└─ceph--89ebb9fe--bf62--43eb--8888--23e46edc76ea-osd--block--79fa5bbd--97c3--43cc--86d5--c64571e3403a
                                                                                             252:0    0 953.9G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part /boot/efi
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:1    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:2    0     1G  0 lvm  
  │ └─pve-data                                                                               252:4    0  51.4G  0 lvm  
  └─pve-data_tdata                                                                           252:3    0  51.4G  0 lvm  
    └─pve-data                                                                               252:4    0  51.4G  0 lvm

gfngfn256 · Mar 17, 2024

Now PVE3 looks correctly populated in lsblk. Look at you original output. Maybe you just redacted it. IDK.
Is it now functioning correctly?

Anyway the flashcards aren't going to hold out too long anyway.

Reliant8275 · Mar 17, 2024

gfngfn256 said:
Now PVE3 looks correctly populated in lsblk. Look at you original output. Maybe you just redacted it. IDK.
Is it now functioning correctly?

Anyway the flashcards aren't going to hold out too long anyway.

The lsblk from the earlier post was when the system was in read-only mode. A reboot fixes it, for a little while.

RolandK · Mar 17, 2024

you should check your kernel log / dmesg for errors. i guess your usb connection isn't stable or your usb stick is not reliable . could also be related to powersave settings on usb ports

twistacatz · Jul 3, 2024

I'm running into the same issue and it's driving me crazy. In my case I'm trying to use USB M.2 drives.

Does anyone have any other ideas?

gfngfn256 · Jul 3, 2024

twistacatz said:
I'm running into the same issue and it's driving me crazy. In my case I'm trying to use USB M.2 drives.

Most probably the problem is linked to the USB bandwidth/latency/port & general inconsistency not the drives themselves. The system not managing to write communicate with the disk, almost immediately marks the system as read-only?

Relooking at the OP's problem, since he checked the for USB flash drive for bad blocks etc. most probably was suffering the same, i.e. : the USB communication was failing/stuttering so he got a read-only system.

I must point out - that recently reports have been seen on the recent & newer kernels of USB problems. I'm not sure if they have as of yet been cleaned up completely.

AFAIR, most of these problems are linked to faster USB links, such as 3.1,3.2 & Thunderbolt etc.

So maybe try using an older 3.0 USB connection device or even a USB 2.0 one (& port). For you that may not be practical/possible on that m.2 adapter - but for the OP, I don't see the big problem in trying out on an old USB 2.0 flash drive (& USB 2.0 port?). Except for the kernel concerns I mentioned, the slower USB may in fact be able to "keep up" with the system more consistently. Give it a try.

twistacatz · Jul 3, 2024

Thank you for your response. I appreciate your insights. I understand your point about a potential I/O issue; however, I haven't observed this behavior with other operating systems, like TrueNAS, using the same USB 2.0 ports. Also Proxmox runs smoothly on these machines until the filesystem switches to read-only mode. I also don't believe Proxmox is very read/write intensive when idle or without any containers or VMs, which is the state these machines are in when I encounter this issue (currently).

I do have a USB 3.1 card that I can install in one of my machines to test your theory.

After posting my initial query, I continued investigating and I might have identified the culprit. I noticed in the logs that right before a crash, the smartd service would read the disk attributes. Each time Proxmox crashed, the logs would show attributes for all disks except the USB M.2 drive (which is the last drive in the list). So I ran "/sbin/service smartd stop," and the filesystem immediately switched to read-only mode. I then rebooted, disabled smartd from starting at boot, and rebooted again. Crashes typically occur within 1-3 days, so I will monitor the situation and report back with any updates. However I think me stopping the smartd service and in-turn the filesystem going into read only mode is very telling. Fingers crossed.

twistacatz · Jul 5, 2024

Unfortunately, I'm still having the same issue. I give up on the idea of using a USB drive.

Search

Search

USB Flash boot drive keeps going read only

Reliant8275

New Member

Dunuin

Distinguished Member

Reliant8275

New Member

Dunuin

Distinguished Member

Reliant8275

New Member

gfngfn256

Renowned Member

Reliant8275

New Member

gfngfn256

Renowned Member

Reliant8275

New Member

RolandK

Renowned Member

twistacatz

New Member

gfngfn256

Renowned Member

twistacatz

New Member

twistacatz

New Member