Write-error on swap-device on brand new hardware

Smoochii

New Member
Jun 2, 2024
I just bought a brand new mini PC and installed Proxmox on it. I then restored all of my VMs from my backup drive and everything was working. It looks like the computer crashed in the middle of the day, and when I rebooted it I got this error message repeatedly: `Write-error on swap-device`. I power cycled it again and it booted normally. Later I shut it down properly and removed it from the rack to move it to a new rack, and on the next boot I got this error message again.

I've read this can be related to a bad drive but this thing is brand new. If I just reinstall proxmox could that fix the issue? Is there anything else I can do? Thanks!

 
If this error occurs, you can also try to read other parts of the disk in order to check if they also fail.
Is the SSD maybe too hot?
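
A rough sketch of what "reading other parts of the disk" could look like: read the whole device in fixed-size chunks with `dd` and list the chunks that fail. It only reads, nothing is written. The function name `scan_reads` is made up for this example, and `/dev/nvme0n1` is an assumed device name, so adjust it to your system.

```shell
#!/bin/sh
# Hypothetical helper: read a block device (or a regular file) chunk by
# chunk and report every chunk that dd could not read. Read-only.
scan_reads() {
    target=$1              # e.g. /dev/nvme0n1 (assumed device name)
    chunk_mib=${2:-64}     # chunk size in MiB, default 64
    flags=""
    if [ -b "$target" ]; then
        total_bytes=$(blockdev --getsize64 "$target")
        flags="iflag=direct"   # bypass the page cache on real devices
    else
        total_bytes=$(wc -c < "$target")
    fi
    chunk_bytes=$((chunk_mib * 1024 * 1024))
    chunks=$(( (total_bytes + chunk_bytes - 1) / chunk_bytes ))
    bad=0
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # dd exits non-zero when the read fails (e.g. Input/output error)
        if ! dd if="$target" of=/dev/null bs="${chunk_mib}M" \
                skip="$i" count=1 $flags 2>/dev/null; then
            echo "read error in chunk $i (offset $((i * chunk_mib)) MiB)"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    echo "$bad of $chunks chunks failed"
    [ "$bad" -eq 0 ]
}

# Usage (as root, device name is an assumption):
#   scan_reads /dev/nvme0n1 64
```

If only a few chunks fail, that points at specific bad areas; if reads fail everywhere or the device disappears, the controller, slot, or heat is a more likely suspect.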
 
Ya, I guess that could be. It's in a really small enclosure but the room has really good air flow (it's where the rest of my server stuff is). It's just strange to me that a reboot fixes it. I guess if I keep seeing it I'll just reformat the disk and start over.

How can I "try to read other parts of the disk" if it gets in that state again?
 
Ugh, now I'm getting this in the web GUI, I think something might just be corrupt. `file '/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js' exists but open for reading failed - Input/output error`
 
Run a long SMART test (using whatever tools you like from any of the many guides on the internet) and if it fails (indicating that the device agrees that it is dying) return it for RMA.

EDIT: If the long SMART test succeeds, then reseat or replace cables and other connectors. Try a different slot, test the RAM, and maybe try a different computer or a different drive.

EDIT2: `nvme device-self-test --help`
 
I tried this but it doesn't look like SMART supports nvme drives. I just ended up wiping the drive and starting over. If it keeps happening I'll replace the drive or computer.
 
What's the drive model? You can find it via `smartctl -i /dev/nvme...` or `nvme list` or `lsblk -do+MODEL,SERIAL`.
The SMART data of it would be interesting too. You can use `smartctl -A /dev/nvme...` or `nvme smart-log /dev/nvme...`.
Try to run `update-smart-drivedb` first.
 
The model is CT1000P3PSSD8. Also, I installed Proxmox on the second drive that I bought to see if it still crashes. I'll keep an eye on it overnight and check tomorrow.

Here is the output of smartctl:

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 67,064 [34.3 GB]
Data Units Written: 163,246 [83.5 GB]
Host Read Commands: 794,402
Host Write Commands: 1,097,745
Controller Busy Time: 4
Power Cycles: 22
Power On Hours: 50
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 37 Celsius
 
It's a QLC drive and, from what I've read, not a very good one. It might work okay with the default LVM-Thin though. Besides the Unsafe Shutdowns, the values look okay to me. I'd also check for a firmware update and, as mentioned, monitor the drive temperature under load and do a self-test.
To monitor other temperatures you can use something like `watch -c -d -n1 sensors`. Run `apt install lm-sensors` first.
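
To make the "values look okay" check repeatable, here is a small sketch that scans `smartctl -A` text for the fields that matter most in the output quoted above. The function name `check_nvme_health` is made up for this example, and the rules (any critical warning flag, any media error, spare below its threshold, 100% used) are my own assumptions about what is worth flagging, not official limits.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output for an NVMe drive on
# stdin and print a line for each field that looks suspicious.
check_nvme_health() {
    awk -F': *' '
        /^Critical Warning/                { if ($2 != "0x00") print "WARN: critical warning flags set: " $2 }
        /^Available Spare:/                { spare = $2 + 0 }
        /^Available Spare Threshold/       { thresh = $2 + 0 }
        /^Percentage Used/                 { if ($2 + 0 >= 100) print "WARN: rated endurance used up: " $2 }
        /^Unsafe Shutdowns/                { if ($2 + 0 > 0) print "NOTE: unsafe shutdowns: " $2 }
        /^Media and Data Integrity Errors/ { if ($2 + 0 > 0) print "WARN: media errors: " $2 }
        END { if (spare < thresh) print "WARN: spare " spare "% below threshold " thresh "%" }
    '
}

# Usage (device name is an assumption):
#   smartctl -A /dev/nvme0 | check_nvme_health
```

No output means none of these rules triggered. That does not prove the drive is healthy, since the errors in this thread showed up with clean SMART counters.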
 
I've never updated the firmware for an SSD before, how do I do that? Also, I tried running a SMART test and it didn't seem to do anything. I ran `nvme device-self-test /dev/nvme0n1 -s 1` and all the results are 0xf, what does that mean?
 
Not every vendor provides firmware updates, and for those that do you have to download their tool. It seems like there are none for those models though.
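
On the 0xf results: in the NVMe device self-test log, result code 0xf marks an entry that is not used, so it does not hold a test result at all. `-s 1` only starts a short test in the background, so the log may simply have been read before the test finished (that part is a guess). Below is a small sketch that translates the result codes as I understand them from the NVMe specification; the function name `describe_selftest_result` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: translate an NVMe device self-test result code
# (as shown by `nvme self-test-log`) into a short description.
describe_selftest_result() {
    code=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # accept 0xF as well as 0xf
    case "${code#0x}" in
        0) echo "completed without error" ;;
        1) echo "aborted by a Device Self-test command" ;;
        2) echo "aborted by a controller reset" ;;
        3) echo "aborted because the namespace was removed" ;;
        4) echo "aborted by a Format NVM command" ;;
        5) echo "fatal or unknown error during the test" ;;
        6) echo "completed, a segment failed (segment unknown)" ;;
        7) echo "completed, one or more segments failed" ;;
        8) echo "aborted for an unknown reason" ;;
        9) echo "aborted by a sanitize operation" ;;
        f) echo "entry not used (no test recorded in this slot)" ;;
        *) echo "reserved or unrecognized code" ;;
    esac
}

# Typical sequence (device name taken from the post above):
#   nvme device-self-test /dev/nvme0n1 -s 2   # 2 = extended test
#   nvme self-test-log /dev/nvme0n1           # check again once it has finished
```

If the newest entry still shows 0xf well after starting a test, the drive may not support the self-test command at all.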
 
So far on the new drive I bought for extra storage it hasn't crashed yet so I think it's just a bad drive issue. I actually think I'm just going to return this machine and pick up a minisforum MS-01 instead.
 
Hello, I can't help with the error, but I have the same Crucial NVMe and after 2 years of use I got the same error. I think it's really bad quality...

EDIT: I have to redo my post... After setting
echo max_performance | tee /sys/class/scsi_host/host*/link_power_management_policy

there are no more errors here with the Crucial NVMe...
 