Read only file system

John245 · Dec 12, 2023

When I try to refresh (under updates) I get the following error:

starting worker failed: unable to create output file '/var/log/pve/tasks/F/UPID:TEAPHOMEPVE02:00001E64:00025ADC:6578CB3F:aptupdate::root@pam:' - Read-only file system (500)

The journal showed the following:

It seems that the NVMe storage is corrupted. Any possibilities to resolve this or should I replace the NVMe storage? System has three (Nodes) and the failure is on the secondary Node.
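
A quick way to confirm which filesystems are currently read-only is to look at the mount options in /proc/mounts. The sketch below is only an illustration: the function name `read_only_mounts` is made up for this example, and it assumes the usual Linux mount-table format (device, mount point, type, options). The thread does not say which filesystem is affected, so treat the result as a starting point.

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, device, fstype) for every entry mounted read-only.

    `mounts_text` is the content of /proc/mounts (hypothetical helper, not a
    Proxmox tool).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        # Match the exact option "ro"; "errors=remount-ro" must not count.
        if "ro" in options.split(","):
            found.append((mount_point, device, fstype))
    return found
```

On the affected node this could be called as `read_only_mounts(open("/proc/mounts").read())`. Some pseudo filesystems are read-only by design, so only an entry for the root or data filesystem would match the error above.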

sb-jw · Dec 12, 2023

John245 said:
Any possibilities to resolve this or should I replace the NVMe storage?

It depends. The controller in the CPU or on the mainboard could be defective; reseating the CPU or the NVMe could help. You could also check the SMART values. Unfortunately, you don't reveal much about your setup, which is why only a generic statement of possible solutions is possible.

John245 said:
System has three (Nodes) and the failure is on the secondary Node.

Just to avoid misunderstandings. You have a cluster (not system) of three nodes and the error is on the second node?

alexskysilk · Dec 13, 2023

nvme0n1 appears to be failing. boot to a livecd and verify- you can either use the nvme mfg utility and/or smartmontools.
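
As a rough sketch of what to look for in the smartmontools output, the following pulls a few NVMe health fields out of the text printed by `smartctl -a`. The field names are the ones smartctl prints for NVMe devices. The function names and the sample values in the usage note are hypothetical and are not the readings from this thread.

```python
# Field names as printed by smartctl for NVMe devices.
FIELDS = (
    "Critical Warning",
    "Percentage Used",
    "Unsafe Shutdowns",
    "Media and Data Integrity Errors",
)


def parse_nvme_health(smartctl_text):
    """Extract selected health fields from `smartctl -a` text as integers."""
    health = {}
    for line in smartctl_text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or name not in FIELDS:
            continue
        value = value.strip().rstrip("%").replace(",", "")
        health[name] = int(value, 0)  # base 0 accepts both "0x04" and "60"
    return health


def health_problems(health):
    """Return human-readable findings that suggest a failing device."""
    problems = []
    if health.get("Critical Warning", 0) != 0:
        problems.append("critical warning flag is set")
    if health.get("Media and Data Integrity Errors", 0) > 0:
        problems.append("media and data integrity errors reported")
    return problems
```

For example, a report containing `Critical Warning: 0x04` and `Media and Data Integrity Errors: 1,024` would yield both findings, while a clean report yields an empty list. A high unsafe-shutdown count on its own is not flagged here, since it points at how the system was powered off rather than at the device.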

John245 · Dec 13, 2023

alexskysilk said:
nvme0n1 appears to be failing. boot to a livecd and verify- you can either use the nvme mfg utility and/or smartmontools.

Using smartmontools:

My conclusion is that the NVMe should be replaced.

According to Samsung, the limited warranty covers 5 years or 600 TBW.

The NVMe is 3 years old and has written about 180 TB, well below the 600 TBW limit.

Is the best way forward to do a fresh install or clone this NVMe?
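
The warranty check above can be written out as a small calculation. This is a minimal sketch that assumes the warranty ends when either limit (age or terabytes written) is reached; the function names are made up for illustration. The conversion helper relies on the NVMe convention that "Data Units Written" counts units of 1000 x 512 bytes.

```python
def data_units_to_tb(data_units):
    """Convert the NVMe 'Data Units Written' counter to terabytes (10**12 bytes).

    One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
    """
    return data_units * 512_000 / 1e12


def warranty_status(age_years, written_tb, limit_years=5, limit_tbw=600):
    """Return how much of each warranty limit is used and whether both hold.

    Assumption: the warranty expires as soon as either limit is reached.
    """
    age_fraction = age_years / limit_years
    write_fraction = written_tb / limit_tbw
    return {
        "age_fraction": age_fraction,
        "write_fraction": write_fraction,
        "in_warranty": age_fraction < 1 and write_fraction < 1,
    }
```

With the figures from this post, `warranty_status(3, 180)` reports roughly 60% of the warranty period and 30% of the write allowance used, so under the stated assumption the drive would still be covered.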

John245 · Dec 13, 2023

sb-jw said:
It depends. The controller in the CPU or on the mainboard could be defective; reseating the CPU or the NVMe could help. You could also check the SMART values. Unfortunately, you don't reveal much about your setup, which is why only a generic statement of possible solutions is possible.

Just to avoid misunderstandings. You have a cluster (not system) of three nodes and the error is on the second node?

It is indeed a cluster (3 nodes in total). See also the smartmontools results earlier in this thread. It looks like a failing NVMe.

sb-jw · Dec 13, 2023

You have at least 60 unsafe shutdowns. If data hasn't been written from the cache yet, your FS could suffer. So you could also try formatting the NVMe and then checking again whether the problems still occur. But you should definitely find the cause of the many unsafe shutdowns.

John245 · Dec 13, 2023

sb-jw said:
You have at least 60 unsafe shutdowns. If data hasn't been written from the cache yet, your FS could suffer. So you could also try formatting the NVMe and then checking again whether the problems still occur. But you should definitely find the cause of the many unsafe shutdowns.

These are at least not from power loss, as there is a UPS. If I assume that each unsafe shutdown also shut down the node, it looks like they relate to Proxmox kernel updates.

sb-jw · Dec 13, 2023

As far as I know, an unsafe shutdown can only occur if the OS has not previously been able to instruct the NVMe to write everything away and shut down.

For me, "unsafe shutdown" means you pulled the plug, pressed the power button for 4 seconds, simply removed the NVMe or the system failed and restarted.

John245 · Dec 13, 2023

sb-jw said:
As far as I know, an unsafe shutdown can only occur if the OS has not previously been able to instruct the NVMe to write everything away and shut down.

For me, "unsafe shutdown" means you pulled the plug, pressed the power button for 4 seconds, simply removed the NVMe or the system failed and restarted.

That should then happen when a kernel update is applied, as this is almost the only reason to restart. A shutdown will almost never happen.

sb-jw · Dec 13, 2023

We seem to be talking past each other here. An unsafe shutdown occurs when the system is not shut down properly. If you simply pull the plug on a kernel update, then your statement would be correct. But if you type reboot or something similar, then it's actually not an unsafe shutdown because your OS tells the NVMe that it now wants to restart and the NVMe therefore writes everything from the cache.

John245 · Dec 13, 2023

sb-jw said:
We seem to be talking past each other here. An unsafe shutdown occurs when the system is not shut down properly. If you simply pull the plug on a kernel update, then your statement would be correct. But if you type reboot or something similar, then it's actually not an unsafe shutdown because your OS tells the NVMe that it now wants to restart and the NVMe therefore writes everything from the cache.

I agree. But in that case I cannot explain that number of unsafe shutdowns. The node is connected to a UPS, and after a kernel update I type "reboot".

alexskysilk · Dec 13, 2023

sb-jw said:
You have at least 60 unsafe shutdowns.

https://www.youtube.com/watch?v=TDCjw0TKQ8E
