Hello, I'm new to Proxmox. I set up my Proxmox server about a week ago, and since then it has crashed a few times. When a crash happens, I lose access to all of my LXC containers and VMs. However, I can still access the Proxmox web interface and SSH into the server. If I force a reboot, the server boots up and all of the LXCs and VMs come back online.
The longest the server has stayed online is almost 4 days. Other crashes have happened after about 12 hours of uptime, and others after a little over 24 hours.
I'm wondering if you fine folks have any ideas on how I can dig a bit deeper and figure out what the cause of the issue is. I think it may be a bad NVMe, but I wanted to be a bit more certain before pulling the trigger on a new one.
When the crash happens, if I log in to the web interface and try to reboot one of the LXCs, I receive this message:
Connection failed (Error 500: unable to open file '/var/tmp/pve-reserved-ports.tmp.1543363' - Read-only file system)
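Since the error says "Read-only file system", one thing I could check over SSH while the server is in this state is which filesystems the kernel has remounted read-only. Here is a minimal sketch of that check, assuming the standard /proc/mounts format (device, mount point, filesystem type, options). The function name and the sample text are made up for illustration and are not taken from my server.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point, fstype) for each mount whose options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly so 'rw' or 'errors=remount-ro' do not count.
        if "ro" in options.split(","):
            hits.append((device, mount_point, fstype))
    return hits


# Hypothetical sample in /proc/mounts format (illustration only, not real output).
SAMPLE = """\
/dev/mapper/pve-root / ext4 ro,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

for device, mount_point, fstype in read_only_mounts(SAMPLE):
    print(f"{mount_point} ({fstype} on {device}) is mounted read-only")
```

On the real server the input would be the contents of /proc/mounts, for example `read_only_mounts(open("/proc/mounts").read())`. If the root filesystem shows up there after a crash, the kernel log from around that time would be the next place to look for I/O errors.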
I will post the results of the command "smartctl -a /dev/nvme0" below. The ErrCount is quite high (34,602), and the drive also reports 32 media and data integrity errors. Even though the SMART status is PASSED, I'm wondering if the drive is faulty.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: CT1000P2SSD8
Serial Number: 2150E5F1FC62
Firmware Version: P2CR033
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 59e00002b8
Local Time is: Thu Jan 5 09:43:00 2023 EST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     1.90W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     5000    1900
 4 -   0.0020W       -        -    4  4  4  4    13000  100000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 34 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 10%
Data Units Read: 512,099,059 [262 TB]
Data Units Written: 100,485,372 [51.4 TB]
Host Read Commands: 8,570,439,398
Host Write Commands: 2,519,330,591
Controller Busy Time: 25,983
Power Cycles: 20
Power On Hours: 8,565
Unsafe Shutdowns: 15
Media and Data Integrity Errors: 32
Error Information Log Entries: 34,602
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 16 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 34602 0 0x301b 0x4005 0x028 0 0 -
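To see whether these counters keep climbing between crashes, I could save the smartctl output periodically and compare the numbers. Below is a minimal sketch that pulls a few counters out of the text above, assuming the "Name: value" layout shown in this output. The function name and the choice of fields are my own and not part of smartmontools.

```python
import re

# Counters from the SMART/Health section that seem most relevant to a failing drive.
FIELDS = (
    "Unsafe Shutdowns",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_nvme_counters(smartctl_text, fields=FIELDS):
    """Extract selected integer counters from `smartctl -a` text output."""
    counters = {}
    for line in smartctl_text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or name not in fields:
            continue
        # Values may contain thousands separators, e.g. "34,602".
        match = re.match(r"\s*([\d,]+)", value)
        if match:
            counters[name] = int(match.group(1).replace(",", ""))
    return counters


# Excerpt copied from the output above.
EXCERPT = """\
Power On Hours: 8,565
Unsafe Shutdowns: 15
Media and Data Integrity Errors: 32
Error Information Log Entries: 34,602
"""

print(parse_nvme_counters(EXCERPT))
```

Running this on two saved outputs a day apart would show whether "Media and Data Integrity Errors" is still increasing, which seems like a more direct sign of a drive problem than the overall PASSED result.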