All VMs crashing every couple of days

Eddie the Eagle

New Member
Apr 20, 2022
I've been a happy home user (7.3.4) for about a year now, but recently I've been running into a problem: every few days, sometimes after a week, sometimes after 2 days, my system becomes unstable. I cannot reach any of the VMs; it looks like a complete crash. As you can see below, sda1, which represents a Crucial SSD, shows a question mark as if it is unavailable. A soft reboot from within Proxmox does not work; a soft shutdown does work, after which I can start the server via the hardware button, and that makes the system available again for some days. The log in the GUI shows nothing. Any help getting this diagnosed and resolved is highly appreciated.

1673867098927.png
 
Didn't you find anything interesting in the syslog (/var/log/syslog) either?
Thanks for mentioning that; I'm still rather new to Proxmox. Indeed, a lot happened in there, but it is hard to interpret, so I'll attach it here. I hope we can narrow this down to maybe one VM causing the issue.

At 05:56 this morning it happened again, starting with this first "Exception Emask" entry. The log before that entry was regular and as expected, so the file starts here.
At 11:00 you'll see me rebooting the system; it has worked fine since then.
 
Hi,

Thank you for the syslog!

The first log messages indicate an issue with an ATA device on your PVE server: the device tried to execute a "WRITE FPDMA QUEUED" command and it failed with "Emask 0x4 (timeout)". I would check the `smartctl` output for that disk to see whether it reports any problems, and also check the data cable connecting it.
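A minimal sketch of those two checks, assuming the data SSD is /dev/sda and the log is /var/log/syslog as in this thread. The function name `ata_errors` is made up for illustration, and the pattern only covers a few common libata error lines, so treat it as a starting point rather than a complete filter.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): read syslog text on stdin and
# keep only kernel ATA error lines such as "exception Emask",
# "failed command: WRITE FPDMA QUEUED" or a link reset.
ata_errors() {
    grep -iE 'ata[0-9]+(\.[0-9]+)?: (exception Emask|failed command|hard resetting link)'
}

# Example use on the host (left commented out; needs root and smartmontools):
#   ata_errors < /var/log/syslog
#   smartctl -a /dev/sda            # full SMART report for the data SSD
#   smartctl -t short /dev/sda      # start a short self-test
#   smartctl -l selftest /dev/sda   # read the self-test log afterwards
```

If the filter shows errors clustered at the time of each hang while SMART stays clean, that would point more towards the link (cable, port, controller) than towards worn-out flash, though it does not prove it.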
 
Thanks for your assistance (and patience) ;); that worked better; here we go:

root@pve:/# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.83-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT1000MX500SSD1
Serial Number: 2151E5F4932F
LU WWN Device Id: 5 00a075 1e5f4932f
Firmware Version: M3CR043
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 17 13:37:09 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7862
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   097   097   000    Old_age   Always       -       46
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       5
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       63
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   063   035   000    Old_age   Always       -       37 (Min/Max 0/65)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   097   097   001    Old_age   Offline      -       3
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       12720634769
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       227715392
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       166145158

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@pve:/#
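
To scan an attribute table like the one above without reading every row, a small filter can print only the error counters whose raw value is not zero. This is a rough sketch: the function name `smart_red_flags` and the choice of attribute IDs (5, 187, 196, 197, 198, 199) are assumptions made for illustration, not a smartmontools feature.

```shell
#!/bin/sh
# Hypothetical helper: read a SMART attribute table on stdin and print
# "ID NAME RAW" for selected error counters whose raw value is not zero.
# The ID list is this sketch's own choice of "worth a look" counters.
smart_red_flags() {
    awk '($1 == 5 || $1 == 187 || $1 == 196 || $1 == 197 || $1 == 198 || $1 == 199) \
         && $10 + 0 != 0 { print $1, $2, $10 }'
}

# Example use on the host (left commented out; needs root and smartmontools):
#   smartctl -A /dev/sda | smart_red_flags
```

Fed the table posted here, the filter prints nothing, because every one of those counters is 0, including UDMA_CRC_Error_Count. The drive itself reports no errors, even though the kernel logged command timeouts.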
 
Could be an out-of-memory condition.
Are you booting from sda?
I once had similar problems with USB sticks.
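
One way to test the out-of-memory theory is to look for OOM-killer messages in the kernel log. A minimal sketch, assuming the log is /var/log/syslog as elsewhere in this thread; the function name `oom_lines` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and keep only lines written
# by the kernel's OOM killer.
oom_lines() {
    grep -iE 'invoked oom-killer|out of memory: killed process'
}

# Example use on the host (left commented out):
#   oom_lines < /var/log/syslog
#   free -h        # current memory and swap usage
```

No matching lines around the time of a hang would make an out-of-memory condition less likely as the cause.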
 
I'm booting from a separate M.2 2280 SSD; this one does not seem to give any problems. Proxmox itself stays alive; only the VMs on that 2.5" SSD are having problems (visible in syslog). Uptime is 1d5h now.
 
That means it is likely related to your "data SSD".
Which SSD model are you using?
 
Not sure, but it seems your drive is reaching its EOL...
Yeah, sharp observation, thanks. I did some digging on that, since the SSD is less than a year old. It seems the numbers are shown in reverse with Corsair SSDs. Below is a Corsair SSD that is failing in smartctl because percent lifetime used is at 99%. Or maybe this is just the way smartctl displays attribute 202 (so confusing). Anyway... roughly 30 years to go at this rate ;).

Code:
202 Percent_Lifetime_Used 0x0030 001 001 001 Old_age Offline FAILING_NOW 99
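
The arithmetic behind that estimate can be sketched as follows. It assumes, as the post above concludes, that for the Crucial drive the normalized VALUE of attribute 202 counts percent remaining while the raw value counts percent used, and that wear continues at the same average rate. The function names `attr202` and `years_left` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for reading attribute 202 and extrapolating wear.

# Print the normalized VALUE and RAW_VALUE of attribute 202 from a SMART
# attribute table read on stdin.
attr202() {
    awk '$1 == 202 { printf "value=%d raw=%d\n", $4, $10 }'
}

# Linear extrapolation: years of power-on time left, given the power-on
# hours so far ($1) and the percent of rated life used ($2, must be > 0).
years_left() {
    awk -v hours="$1" -v used="$2" \
        'BEGIN { printf "%.1f\n", hours * (100 - used) / used / 8760 }'
}

# Example use on the host (left commented out; needs root and smartmontools):
#   smartctl -A /dev/sda | attr202
```

With the figures posted in this thread (7862 power-on hours, raw value 3), `years_left 7862 3` prints 29.0, so flash wear looks like an unlikely explanation for the hangs.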
 
As always -> only Enterprise-Disks.

I'll take your advice, highly appreciated. I'm a home user and rather new to the world of hypervisors, so without high availability (yet). If you could recommend a specific 1TB SSD, that would make me very happy.

But you have to sort out the issue before buying a new disk.

That is why I started this topic in the first place, but maybe there is some benefit in installing a 2nd SSD in parallel anyway, so no money would be wasted. I also had a slight suspicion that one VM (W10) is causing the issue. I have stopped this VM as a test to see what happens. Uptime is 2d5h now.

Is your Server AMD-based?

Correct; AMD Ryzen 7 4800U
 
My advice for home use (low budget) is the Kingston DC500M (personal experience, price/performance ratio).

I read in other forums (Unraid or so) that there is an incompatibility between AMD chipsets and Crucial/Samsung disks.

I think your Win10 VM stresses your disk (and subsystem), so the issue shows up more often (it is also present without this VM, but less "noisy").
 
Thanks so far, to the others too! I really appreciate the help; I'll leave the W10 VM offline for now and see what this does. I'll definitely look into the Kingston and keep this topic updated.
 
OK, short update for the interested folks. Uptime is >7 days now, and I'm shutting down the W10 VM daily. I'm really starting to think that keeping W10 online for several days is causing the issue. Maybe a memory leak makes the VM use the SSD too heavily; I just don't know. I also realized that allocating 4GB of RAM may not be enough, so I increased it to 8GB and also doubled the storage from 64GB to 128GB.

Happy to hear your thoughts.
 
