Read-only file system

When I try to refresh (under updates) I get the following error:

starting worker failed: unable to create output file '/var/log/pve/tasks/F/UPID:TEAPHOMEPVE02:00001E64:00025ADC:6578CB3F:aptupdate::root@pam:' - Read-only file system (500)

The journal showed the following:

[Screenshot: journal output, 1702415140296.png]

It seems that the NVMe storage is corrupted. Is there any way to resolve this, or should I replace the NVMe storage? The system has three nodes, and the failure is on the secondary node.
 
Whether the NVMe needs replacing depends on the cause. The controller in the CPU or on the mainboard could also be defective; reseating the CPU or the NVMe might help. You could also check the SMART values. Unfortunately, you haven't said much about your setup, so I can only offer generic suggestions.
Just to avoid misunderstandings: you have a cluster (not a system) of three nodes, and the error is on the second node?
 
nvme0n1 appears to be failing. Boot a live CD and verify; you can use the NVMe manufacturer's utility and/or smartmontools.
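
As a rough sketch of the smartmontools check suggested above: the snippet below reads the NVMe health log through `smartctl --json` and flags a few fields. It assumes smartmontools 7.0 or later (for JSON output), root privileges, and the device path `/dev/nvme0n1`. The function names and the selection of checks are only an illustration, not an official health test.

```python
import json
import subprocess


def read_health_log(device="/dev/nvme0n1"):
    """Return the NVMe SMART/health log as a dict (needs root, smartmontools >= 7.0)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(result.stdout)
    return report.get("nvme_smart_health_information_log", {})


def find_problems(log):
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    if log.get("critical_warning", 0) != 0:
        problems.append(f"critical_warning is set: {log['critical_warning']:#04x}")
    if log.get("media_errors", 0) > 0:
        problems.append(f"{log['media_errors']} media/data integrity errors")
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        problems.append(f"available spare {spare}% is below threshold {threshold}%")
    if log.get("percentage_used", 0) >= 100:
        problems.append("rated endurance is used up (percentage_used >= 100)")
    return problems


# Usage (as root, on the affected node or from a live CD):
#   print(find_problems(read_health_log()) or "nothing flagged")
```

An empty result only means none of these particular fields look bad; the kernel journal and the drive's error log are still worth reading.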
Using smartmontools:

[Screenshot: smartmontools output, 1702448617867.png]

[Screenshot: smartmontools output, 1702448586065.png]

My conclusion is that the NVMe should be replaced.

According to Samsung, I have a 5-year or 600 TBW limited warranty.

The NVMe is 3 years old and has written less than 600 TBW (180 TB).

Is the best way forward a fresh install, or cloning this NVMe?
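
For the warranty question, the written volume can be derived from the "Data Units Written" SMART field; per the NVMe specification, one data unit is 1000 × 512 bytes. A minimal sketch follows. The helper names are hypothetical, and the 600 TBW and 5-year limits are simply the figures quoted above, so check Samsung's actual terms for the exact model.

```python
BYTES_PER_DATA_UNIT = 512 * 1000  # NVMe counts reads/writes in units of 1000 x 512 bytes


def terabytes_written(data_units_written):
    """Convert the 'Data Units Written' counter to decimal terabytes (10**12 bytes)."""
    return data_units_written * BYTES_PER_DATA_UNIT / 1e12


def within_warranty(data_units_written, age_years, tbw_limit=600, year_limit=5):
    """True if both the age and the written volume are below the quoted limits."""
    return age_years < year_limit and terabytes_written(data_units_written) < tbw_limit


# 351,562,500 data units correspond to the 180 TB mentioned above.
print(terabytes_written(351_562_500))   # 180.0
print(within_warranty(351_562_500, 3))  # True
```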
 
It is indeed a cluster (three nodes in total). See also the smartmontools results in this thread. It looks like a failing NVMe.
 
You have at least 60 unsafe shutdowns. If cached data has not been written out when that happens, your file system can be damaged. So you could also try formatting the NVMe and then checking whether the problems still occur. But you should definitely find the cause of the many unsafe shutdowns.
 
They are at least not caused by power loss, as there is a UPS. If I assume that each unsafe shutdown coincides with the node going down, it looks like they relate to Proxmox kernel updates.
 
As far as I know, an unsafe shutdown can only occur if the OS was not able to tell the NVMe beforehand to write out everything and shut down.

For me, "unsafe shutdown" means you pulled the plug, pressed the power button for 4 seconds, simply removed the NVMe or the system failed and restarted.
 
That would then have to happen when a kernel update is applied, as this is almost the only reason to restart. A shutdown almost never happens.
 
We seem to be talking past each other here. An unsafe shutdown occurs when the system is not shut down properly. If you simply pull the plug on a kernel update, then your statement would be correct. But if you type reboot or something similar, then it's actually not an unsafe shutdown because your OS tells the NVMe that it now wants to restart and the NVMe therefore writes everything from the cache.
 
I agree. But in that case I cannot explain the number of unsafe shutdowns. The node is connected to a UPS, and after a kernel update I type "reboot".
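
One way to narrow this down would be to record the drive's `unsafe_shutdowns` counter before a reboot and compare it afterwards. A minimal sketch, assuming smartmontools 7.0 or later with JSON output, root privileges, and a hypothetical state file path; the function names are illustrative only:

```python
import json
import subprocess
from pathlib import Path

STATE_FILE = Path("/root/unsafe_shutdowns.json")  # hypothetical location, adjust as needed


def current_count(device="/dev/nvme0n1"):
    """Read the unsafe_shutdowns counter from the NVMe health log (needs root)."""
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    ).stdout
    return json.loads(out)["nvme_smart_health_information_log"]["unsafe_shutdowns"]


def compare_and_store(count, state_file=STATE_FILE):
    """Return the increase since the last run (None on the first run) and save the new value."""
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text())["unsafe_shutdowns"]
    state_file.write_text(json.dumps({"unsafe_shutdowns": count}))
    return None if previous is None else count - previous


# Usage: run once before typing "reboot" and once after the node is back up:
#   print(compare_and_store(current_count()))
```

If the second run reports an increase after a normal `reboot`, the counter is going up even on clean restarts, which would point away from power loss and towards the drive, firmware, or shutdown sequence.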
 
