Server Crashes

Gnat2268

New Member
Jan 5, 2023
Hello - I'm new to Proxmox. I set up my Proxmox server about a week ago, and since then it has crashed a few times. When a crash happens I lose access to all of my LXC containers and VMs, but I'm still able to access the Proxmox web interface and SSH into the server. If I force a reboot, the server boots up and all of the LXCs and VMs come back online.

The longest the server has stayed online is almost four days; other crashes have happened after about 12 hours of uptime, or a little over 24.

I'm wondering if you fine folks have any ideas on how I can dig a bit deeper and figure out what the cause of the issue is. I think it may be a bad NVMe, but I wanted to be a bit more certain before pulling the trigger on a new one.

When a crash happens, if I log in to the web interface and try to reboot one of the LXCs, I receive this message:

```
Connection failed (Error 500: unable to open file '/var/tmp/pve-reserved-ports.tmp.1543363' - Read-only file system)
```

I'll post the output of `smartctl -a /dev/nvme0` below. The ErrCount is quite high. Even though the SMART status is PASSED, I'm wondering if the drive is faulty.

```
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P2SSD8
Serial Number:                      2150E5F1FC62
Firmware Version:                   P2CR033
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 59e00002b8
Local Time is:                      Thu Jan  5 09:43:00 2023 EST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     1.90W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     5000    1900
 4 -   0.0020W       -        -    4  4  4  4    13000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    10%
Data Units Read:                    512,099,059 [262 TB]
Data Units Written:                 100,485,372 [51.4 TB]
Host Read Commands:                 8,570,439,398
Host Write Commands:                2,519,330,591
Controller Busy Time:               25,983
Power Cycles:                       20
Power On Hours:                     8,565
Unsafe Shutdowns:                   15
Media and Data Integrity Errors:    32
Error Information Log Entries:      34,602
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0      34602     0  0x301b  0x4005  0x028      0     0     -
```
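As a side note for anyone digging into a similar ErrCount: the same controller logs can also be read directly with nvme-cli. This is a sketch that assumes the nvme-cli package is installed (on Proxmox/Debian: `apt install nvme-cli`).

```shell
# Dump up to 16 entries from the controller's error log (NVMe Log 0x01):
nvme error-log /dev/nvme0 --log-entries=16

# Read the health/SMART log (NVMe Log 0x02) straight from the controller:
nvme smart-log /dev/nvme0
```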
 
Can you provide the complete log from boot until the crash?
/var/tmp being read-only could indicate disk issues.

Are there firmware updates for the NVMe available?
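The next time it happens, one quick check is whether the kernel has remounted the root filesystem read-only after an I/O error. A minimal sketch using standard util-linux/kernel tools (nothing Proxmox-specific):

```shell
# Print the mount options of the root filesystem; "ro" means the kernel
# has remounted it read-only (ext4 defaults to errors=remount-ro, so this
# typically happens right after a disk I/O error).
findmnt -no OPTIONS /

# The kernel log usually records the error that triggered the remount:
dmesg | grep -iE 'remount|i/o error|nvme' | tail -n 20
```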
 
I've verified the NVMe is running the latest firmware.

The NVMe model is CT1000P2SSD8. It's running firmware version P2CR033, which according to Crucial's website is the latest and greatest.

Please excuse my ignorance regarding the logs; do you mind pointing me to where they are stored?
 
You can either access the syslog (/var/log/syslog) or the journal (journalctl).
The journal can be limited to the last boot, or the one before that with the `-b` parameter.
For the previous boot use `journalctl -b -1 > journal.txt`. The output of the command will be redirected to the file `journal.txt`.
Please check the file for any sensitive information and redact if necessary, then attach it here.
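To narrow things down before attaching the full file, the journal can also be filtered; these are standard `journalctl` options:

```shell
# List all boots the journal knows about; the previous boot has offset -1:
journalctl --list-boots

# Kernel messages only, previous boot, priority "err" and worse:
journalctl -b -1 -k -p err

# Full journal of the previous boot, redirected into a file for attaching:
journalctl -b -1 > journal.txt
```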
 
Thank you for the instructions on grabbing the log file. I've attached the previous-boot log. At the end of the log, when the server crashes, it stops recording anything pertinent. I'm thinking that's because the drive goes into read-only mode and the system can no longer write to it.
 
Nothing really visible in the journal, except that Docker was installed in a container.
Please always install it in a VM, since it can influence and break things when installed in a container. See the recent threads here in the forum.

Is root installed on the NVMe or /dev/sda?
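Regarding the Docker note above: whether a container was given the extra privileges Docker typically needs can be seen in its configuration via the Proxmox `pct` tool. A sketch; the VMID 101 below is just an example:

```shell
# List all containers on the node to find the one running Docker:
pct list

# Show a container's configuration; Docker-in-LXC setups usually have
# features like nesting/keyctl enabled here (VMID 101 is illustrative):
pct config 101
```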
 
I was not aware that Docker inside a container could cause issues. Do you think it could be behind the crashes I'm seeing?

/dev/sda is an external drive I use for media storage. Root is installed on the NVMe. Here's the output of `lsblk`:

```
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                              8:0    0   3.6T  0 disk
└─extUSB-vm--101--disk--0      253:13   0   3.6T  0 lvm
nvme0n1                        259:0    0 931.5G  0 disk
├─nvme0n1p1                    259:1    0  1007K  0 part
├─nvme0n1p2                    259:2    0   512M  0 part
└─nvme0n1p3                    259:3    0   931G  0 part
  ├─pve-swap                   253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                   253:1    0    96G  0 lvm  /
  ├─pve-data_tmeta             253:2    0   8.1G  0 lvm
  │ └─pve-data-tpool           253:4    0 794.8G  0 lvm
  │   ├─pve-data               253:5    0 794.8G  1 lvm
  │   ├─pve-vm--100--disk--0   253:6    0    32G  0 lvm
  │   ├─pve-vm--101--disk--0   253:7    0    30G  0 lvm
  │   ├─pve-vm--103--disk--0   253:8    0     8G  0 lvm
  │   ├─pve-vm--104--disk--0   253:9    0    20G  0 lvm
  │   ├─pve-vm--105--disk--0   253:10   0    20G  0 lvm
  │   ├─pve-vm--106--disk--0   253:11   0    30G  0 lvm
  │   └─pve-vm--107--disk--0   253:12   0     8G  0 lvm
  └─pve-data_tdata             253:3    0 794.8G  0 lvm
    └─pve-data-tpool           253:4    0 794.8G  0 lvm
      ├─pve-data               253:5    0 794.8G  1 lvm
      ├─pve-vm--100--disk--0   253:6    0    32G  0 lvm
      ├─pve-vm--101--disk--0   253:7    0    30G  0 lvm
      ├─pve-vm--103--disk--0   253:8    0     8G  0 lvm
      ├─pve-vm--104--disk--0   253:9    0    20G  0 lvm
      ├─pve-vm--105--disk--0   253:10   0    20G  0 lvm
      ├─pve-vm--106--disk--0   253:11   0    30G  0 lvm
      └─pve-vm--107--disk--0   253:12   0     8G  0 lvm
```
 
I wanted to provide everyone with an update: I installed a new NVMe drive and the server has been stable ever since.
 