USB Flash boot drive keeps going read only

Reliant8275 · New Member · Sep 4, 2023
Yes, I know I'm not working within expected parameters. I took three identical e-waste SFF PCs, installed Proxmox 8.1 on Transcend 128GB JetFlash 920 USB drives for boot, and put reasonably similar 1TB SATA SSDs in the only internal storage bays for Ceph. Ceph is working well and I can migrate an LXC from PVE2 to PVE4 in about two seconds. However, one of the nodes (PVE3) keeps going read-only. For instance, almost any command that would write to the disk gives me
Code:
-bash: /usr/bin/*command*: Input/output error
Nano says the disk is read-only. Locally, the errors on the screen looked like
Code:
[269117.049596] systemd-journald[312]: Failed to rotate /var/log/journal/very-long-number/system.journal: Read-only file system
The other two systems seem fine, but this one goes into this state within hours of a reboot. I cannot properly reboot, because shutdown -r now returns
Code:
Call to Reboot failed: Access denied
via SSH and, while the Web GUI acts like it is going to reboot, it does not.
I have tried to reduce the stress on the flash drives by disabling swap (should be irrelevant with 16GB of RAM and nothing running yet), reducing logging (a long-term issue for drive health, but not a problem for today, right?), disabling TRIM (the USB flash drive doesn't support TRIM, but that was another read-only issue I read about), and checking drive health (another missing feature of USB flash drives).
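For anyone following along, the wear-reduction steps above can be sketched roughly like this. This is a hedged sketch, not the OP's exact commands: paths and unit names assume a stock Debian/PVE install, and it should be run as root.

```shell
# 1. Disable swap now and keep it off across reboots
#    (comments out any swap line in fstab; makes a .bak copy first)
swapoff -a
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

# 2. Keep the systemd journal in RAM only instead of writing /var/log/journal
mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nStorage=volatile\n' > /etc/systemd/journald.conf.d/volatile.conf
systemctl restart systemd-journald

# 3. Stop the periodic TRIM job (the stick doesn't support TRIM anyway)
systemctl disable --now fstrim.timer
```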

Any thoughts as to why one of three identical systems would be having this issue? Could one of the flash drives be defective? Could a 10-year-old SFF PC have a failed USB port? Please save me from buying good new hardware and keep this cluster alive.

What else could be useful?
Code:
root@pve3:/etc/ssh# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--f04948df--44a3--441f--99b6--6f4dbdd29ab4-osd--block--20ffda48--2cc5--45d5--897c--f506939f011e
                                                                                             252:0    0 931.5G  0 lvm  
sdc                                                                                            8:32   0 115.2G  0 disk 
├─sdc1                                                                                         8:33   0  1007K  0 part 
├─sdc2                                                                                         8:34   0     1G  0 part 
└─sdc3                                                                                         8:35   0 114.2G  0 part
 
PVE isn't meant to be installed on pen drives or SD cards. It writes too much and will kill those very fast, as they usually use the cheapest NAND flash available and lack all the features that reduce wear, like wear leveling, garbage collection, DRAM caching, PLP and so on.
And yes, it's not unusual for a flash device to switch read-only at the hardware level once the NAND cells are too worn out.
Other common causes are a corrupted filesystem or a 100% full root filesystem, where Linux switches to read-only to prevent additional data loss.
If your only option is USB, then at least use a USB SSD, USB HDD or an external USB-to-M.2 enclosure. And USB isn't that reliable in the first place.
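To tell those cases apart (worn-out hardware vs. a full root vs. a corrupted filesystem), a few read-only checks usually suffice. A minimal sketch, assuming ext4 on the LVM root that the PVE installer creates; exact kernel messages vary:

```shell
# Is the root filesystem currently mounted read-only?
if findmnt -no OPTIONS / | grep -qw ro; then
    echo "root is mounted read-only"
fi

# Is the root filesystem (nearly) full?
df -h /

# Did the kernel log an I/O error or a forced remount? (may need root to read)
dmesg 2>/dev/null | grep -iE 'i/o error|remount|read-only' | tail -n 20
```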
 
While I understand your point in the long term, this is a brand new drive. It is advertised as high endurance MLC flash for purposes just like this. And the other two are operating correctly. I'm looking for other options before I send the drive back.
 
It is advertised as high endurance MLC flash for purposes just like this.
There are some industrial pen drives that might be up to the task, but then you pay four times the price for a quarter of the capacity of those JetFlashes. A $20 SATA SSD/HDD plus a $10 USB-to-SATA cable will probably do a much better job. It will work for some time with pen drives... at least you have a cluster with Ceph for HA, so it's not that bad if those nodes keep failing. Just keep in mind that you are running your nodes below the minimum hardware requirements.
 
Update: I ran a badblocks destructive test on the flash boot drive and found no errors. I ran memtest 6 times and found no errors. I am restoring the backup image of the flash drive now.
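For reference, a destructive surface test like the one described above looks roughly like this. Hedged sketch: /dev/sdX is a placeholder for the flash drive, and the -w mode erases everything on the target.

```shell
# DESTRUCTIVE: -w overwrites every block with test patterns (-s shows progress,
# -v is verbose). /dev/sdX is a placeholder; double-check the device with lsblk first.
badblocks -wsv -b 4096 /dev/sdX

# Non-destructive read-only alternative:
badblocks -sv -b 4096 /dev/sdX
```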
 
Code:
sdc        8:32   0 115.2G  0 disk 
├─sdc1     8:33   0  1007K  0 part 
├─sdc2     8:34   0     1G  0 part 
└─sdc3     8:35   0 114.2G  0 part
Assuming the above is the actual lsblk output from the installed /dev/sdc (flash drive) boot partitions/PVE drive, something doesn't look right. I don't see the normal named partitions that PVE creates here.

Try and compare it to an lsblk from the other nodes.

Maybe I'm missing something here.
 
lsblk from pve1 (running for a year or so)
Code:
root@pve1:~# lsblk
NAME                                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                         8:0    1     0B  0 disk 
nvme0n1                                   259:0    0 465.8G  0 disk 
├─nvme0n1p1                               259:1    0  1007K  0 part 
├─nvme0n1p2                               259:2    0     1G  0 part /boot/efi
└─nvme0n1p3                               259:3    0 464.8G  0 part 
  ├─pve-swap                              252:0    0     8G  0 lvm  
  ├─pve-root                              252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta                        252:2    0   3.4G  0 lvm  
  │ └─pve-data-tpool                      252:4    0 337.9G  0 lvm  
  │   ├─pve-data                          252:5    0 337.9G  1 lvm  
  │   ├─pve-vm--100--disk--0              252:6    0    80G  0 lvm  
  │   ├─pve-vm--102--disk--0              252:7    0   120G  0 lvm  
  │   ├─pve-vm--102--state--Working       252:8    0  16.5G  0 lvm  
  │   ├─pve-vm--102--state--reboots--well 252:9    0  16.5G  0 lvm  
  │   ├─pve-vm--101--disk--0              252:10   0     2G  0 lvm  
  │   ├─pve-vm--102--state--pre--QOS      252:11   0  16.5G  0 lvm  
  │   ├─pve-vm--103--disk--0              252:12   0     8G  0 lvm  
  │   ├─pve-vm--107--disk--0              252:13   0     2G  0 lvm  
  │   ├─pve-vm--108--disk--0              252:14   0     8G  0 lvm  
  │   ├─pve-vm--109--disk--0              252:15   0    20G  0 lvm  
  │   ├─pve-vm--104--disk--0              252:16   0    32G  0 lvm  
  │   ├─pve-vm--105--disk--0              252:17   0     2G  0 lvm  
  │   ├─pve-vm--110--disk--0              252:18   0     4G  0 lvm  
  │   └─pve-vm--111--disk--0              252:19   0     2G  0 lvm  
  └─pve-data_tdata                        252:3    0 337.9G  0 lvm  
    └─pve-data-tpool                      252:4    0 337.9G  0 lvm  
      ├─pve-data                          252:5    0 337.9G  1 lvm  
      ├─pve-vm--100--disk--0              252:6    0    80G  0 lvm  
      ├─pve-vm--102--disk--0              252:7    0   120G  0 lvm  
      ├─pve-vm--102--state--Working       252:8    0  16.5G  0 lvm  
      ├─pve-vm--102--state--reboots--well 252:9    0  16.5G  0 lvm  
      ├─pve-vm--101--disk--0              252:10   0     2G  0 lvm  
      ├─pve-vm--102--state--pre--QOS      252:11   0  16.5G  0 lvm  
      ├─pve-vm--103--disk--0              252:12   0     8G  0 lvm  
      ├─pve-vm--107--disk--0              252:13   0     2G  0 lvm  
      ├─pve-vm--108--disk--0              252:14   0     8G  0 lvm  
      ├─pve-vm--109--disk--0              252:15   0    20G  0 lvm  
      ├─pve-vm--104--disk--0              252:16   0    32G  0 lvm  
      ├─pve-vm--105--disk--0              252:17   0     2G  0 lvm  
      ├─pve-vm--110--disk--0              252:18   0     4G  0 lvm  
      └─pve-vm--111--disk--0              252:19   0     2G  0 lvm  
nvme1n1                                   259:4    0   3.7T  0 disk 
└─nvme1n1p1                               259:5    0   3.7T  0 part /mnt/nvme
                                                                    /mnt/pve/teamgroup

PVE2, one of the three "identical" systems:
Code:
root@pve2:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--10655fa6--84d2--4c1f--8d92--5b9bfd5e7721-osd--block--6bd15324--9102--4758--aef5--091ff3a28c87
                                                                                             252:0    0 931.5G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part 
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:2    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:3    0     1G  0 lvm  
  │ └─pve-data-tpool                                                                         252:5    0  51.4G  0 lvm  
  │   └─pve-data                                                                             252:6    0  51.4G  1 lvm  
  └─pve-data_tdata                                                                           252:4    0  51.4G  0 lvm  
    └─pve-data-tpool                                                                         252:5    0  51.4G  0 lvm  
      └─pve-data                                                                             252:6    0  51.4G  1 lvm

PVE3, the problem system (now after dumping to a .img and restoring):
Code:
root@pve3:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 931.5G  0 disk 
└─ceph--f04948df--44a3--441f--99b6--6f4dbdd29ab4-osd--block--20ffda48--2cc5--45d5--897c--f506939f011e
                                                                                             252:0    0 931.5G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part /boot/efi
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:1    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:2    0     1G  0 lvm  
  │ └─pve-data                                                                               252:4    0  51.4G  0 lvm  
  └─pve-data_tdata                                                                           252:3    0  51.4G  0 lvm  
    └─pve-data                                                                               252:4    0  51.4G  0 lvm

PVE4, another "identical" system:
Code:
root@pve4:~# lsblk
NAME                                                                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                            8:0    0 953.9G  0 disk 
└─ceph--89ebb9fe--bf62--43eb--8888--23e46edc76ea-osd--block--79fa5bbd--97c3--43cc--86d5--c64571e3403a
                                                                                             252:0    0 953.9G  0 lvm  
sdb                                                                                            8:16   0 115.2G  0 disk 
├─sdb1                                                                                         8:17   0  1007K  0 part 
├─sdb2                                                                                         8:18   0     1G  0 part /boot/efi
└─sdb3                                                                                         8:19   0 114.2G  0 part 
  ├─pve-root                                                                                 252:1    0  38.6G  0 lvm  /
  ├─pve-data_tmeta                                                                           252:2    0     1G  0 lvm  
  │ └─pve-data                                                                               252:4    0  51.4G  0 lvm  
  └─pve-data_tdata                                                                           252:3    0  51.4G  0 lvm  
    └─pve-data                                                                               252:4    0  51.4G  0 lvm
 
Now PVE3 looks correctly populated in lsblk. Look at your original output. Maybe you just redacted it, I don't know.
Is it now functioning correctly?

Anyway, those flash drives aren't going to hold out too long.
 
The lsblk from the earlier post was taken while the system was in read-only mode. A reboot fixes it, for a little while.
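As a stopgap between reboots, it is sometimes possible to remount read-write without a full reboot, though ext4 usually refuses after a journal abort until the error is cleared by fsck or a reboot. A hedged sketch:

```shell
# Why did the kernel flip it read-only? (may need root to read the log)
dmesg 2>/dev/null | grep -iE 'remount|read-only|i/o error' | tail -n 5

# Try to remount; ext4 often refuses after a journal abort
mount -o remount,rw / || echo "remount refused; fsck or reboot needed"
```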
 
You should check your kernel log / dmesg for errors. I guess your USB connection isn't stable or your USB stick is not reliable. It could also be related to power-saving settings on the USB ports.
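A hedged sketch of that check; the grep patterns cover the usual USB reset and I/O error messages, but exact wording varies by kernel version:

```shell
# USB resets, disconnects and I/O errors since boot (may need root to read)
dmesg 2>/dev/null | grep -iE 'usb.*(reset|disconnect)|device descriptor|i/o error' | tail -n 30

# Same from the journal, current boot only (if journald keeps kernel logs)
journalctl -k -b --no-pager 2>/dev/null | grep -iE 'usb.*(reset|disconnect)|i/o error' | tail -n 30
```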
 
I'm running into the same issue and it's driving me crazy. In my case I'm trying to use USB M.2 drives.

Does anyone have any other ideas?
 
I'm running into the same issue and it's driving me crazy. In my case I'm trying to use USB M.2 drives.
Most probably the problem is linked to USB bandwidth/latency/port and general inconsistency, not the drives themselves. When the system fails to write to or communicate with the disk, it presumably marks the filesystem read-only almost immediately.

Looking again at the OP's problem: since he checked the USB flash drive for bad blocks etc., he was most probably suffering the same thing, i.e. the USB communication was failing/stuttering, so he got a read-only system.

I must point out that there have recently been reports of USB problems on the newer kernels. I'm not sure if they have been cleaned up completely yet.

AFAIR, most of these problems are linked to faster USB links, such as 3.1, 3.2 & Thunderbolt etc.

So maybe try using an older USB 3.0 device, or even a USB 2.0 one (and port). That may not be practical or possible with your M.2 adapter, but for the OP I don't see a big problem in trying an old USB 2.0 flash drive (and a USB 2.0 port?). Kernel concerns aside, the slower USB may in fact be able to "keep up" with the system more consistently. Give it a try.
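One concrete thing to rule out on the power-saving front is USB autosuspend. A hedged sketch; the sysfs paths are the standard kernel ones, but adjust for your system:

```shell
# Inspect the current runtime power policy of each USB device (read-only;
# "auto" means the kernel may suspend it, "on" means it stays powered)
for p in /sys/bus/usb/devices/*/power/control; do
    [ -e "$p" ] || continue
    printf '%s: %s\n' "$p" "$(cat "$p")"
done

# Force all USB devices to stay powered for this boot (run as root):
# for p in /sys/bus/usb/devices/*/power/control; do echo on > "$p"; done

# Or disable autosuspend permanently via the kernel command line, e.g. in
# /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1"
# then run update-grub and reboot.
```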
 
Thank you for your response; I appreciate your insights. I understand your point about a potential I/O issue; however, I haven't observed this behavior with other operating systems, like TrueNAS, using the same USB 2.0 ports. Also, Proxmox runs smoothly on these machines until the filesystem switches to read-only mode. And I don't believe Proxmox is very read/write intensive when idle, without any containers or VMs, which is the state these machines are currently in when I encounter this issue.

I do have a USB 3.1 card that I can install in one of my machines to test your theory.

After posting my initial query, I continued investigating and may have identified the culprit. I noticed in the logs that right before a crash, the smartd service would read the disk attributes. Each time Proxmox crashed, the logs would show attributes for all disks except the USB M.2 drive (the last drive in the list). So I ran "/sbin/service smartd stop", and the filesystem immediately switched to read-only mode. I then rebooted, disabled smartd from starting at boot, and rebooted again. Crashes typically occur within 1-3 days, so I will monitor the situation and report back with any updates. However, I think the fact that stopping the smartd service immediately put the filesystem into read-only mode is very telling. Fingers crossed.
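If smartd polling the USB bridge really is the trigger, there is a middle ground between running it as-is and disabling it entirely: telling smartd to skip that one device. A hedged sketch; /dev/sdX is a placeholder, and the -d ignore directive comes from smartd.conf(5):

```shell
# Option 1: stop smartd entirely and keep it from starting at boot
systemctl disable --now smartd

# Option 2: keep smartd but have it skip the USB drive. In /etc/smartd.conf,
# add a line BEFORE the DEVICESCAN entry (smartd uses the first matching line):
#   /dev/sdX -d ignore
# then restart the service:
systemctl restart smartd
```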
 