All VMs crashing every couple of days

Eddie the Eagle · Jan 16, 2023

I've been a happy home user (7.3.4) for about a year now, but recently I've been running into a problem: every couple of days, sometimes after a week, sometimes after 2 days, my system becomes unstable. I cannot reach any of the VMs. It looks like a complete crash. As you can see below, sda1, which represents a Crucial SSD, shows a question mark as if it is unavailable. A soft reboot from within Proxmox does not work; a soft shutdown does work, after which I can start the server via the hardware button, and that makes the system available again for some days. The log in the GUI shows nothing. Any help getting this diagnosed and resolved is highly appreciated.

Moayad · Jan 16, 2023

Eddie the Eagle said:
The log in GUI is showing nothing on it

Even in the Syslog /var/log/syslog you didn't find anything interesting?

Eddie the Eagle · Jan 16, 2023

Moayad said:
Even in the Syslog /var/log/syslog you didn't find anything interesting?

Thanks for mentioning that; I'm still rather new to Proxmox. Indeed, a lot happened in there, but it is hard to interpret, so I'll attach it here. I hope we can narrow this down to maybe one VM causing the issue.

At 05:56 this morning it happened again, starting with this first "Exception Emask". The log before that entry was regular and expected, so the file starts here.
At 11:00 you'll see me rebooting the system; it has worked fine since then.

Moayad · Jan 17, 2023

Hi,

Thank you for the syslog!

The first log messages indicate an issue with an ATA device on your PVE server: a "WRITE FPDMA QUEUED" command sent to the device failed with "Emask 0x4 (timeout)". I would check the `smartctl` output for that disk to see if it reports any problem, and also check the data cable connecting it.
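For a long syslog it can help to pull out only the lines that carry these ATA messages. The following is a minimal sketch, not a definitive tool: the marker strings are the ones quoted in this thread, while the names `ATA_ERROR_MARKERS`, `find_ata_errors` and `scan_file` are hypothetical helpers introduced for illustration.

```python
from typing import Iterable, List

# Substrings taken from the messages quoted in this thread.
ATA_ERROR_MARKERS = (
    "Exception Emask",
    "WRITE FPDMA QUEUED",
    "Emask 0x4 (timeout)",
)


def find_ata_errors(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain at least one of the ATA error markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in ATA_ERROR_MARKERS)
    ]


def scan_file(path: str = "/var/log/syslog") -> List[str]:
    """Scan a log file (default: the syslog path mentioned above)."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as log:
        return find_ata_errors(log)
```

Calling `scan_file()` as root on the PVE host and printing the result should list only the matching lines, which makes it easier to see when the errors start and whether they always name the same ATA port.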

Eddie the Eagle · Jan 17, 2023

Mm, trying the command but nothing happens; am I doing something wrong ?

leesteken · Jan 17, 2023

Eddie the Eagle said:
Mm, trying the command but nothing happens; am I doing something wrong ?

View attachment 45729

Don't run smartctl on a partition (like sda1) but run it on a whole drive (like sda) with smartctl -a /dev/sda. Or use /dev/disk/by-id/... to identify drives by their brand and model.
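If the device name comes from a script or a copy-paste, the partition suffix can be stripped before calling smartctl. This is a rough sketch under the assumption that the usual Linux naming applies (`sdX1`, `vdX1`, `nvme0n1p1`, `mmcblk0p1`); `whole_disk` and `smartctl_command` are hypothetical helper names, not part of smartmontools.

```python
import re
from typing import List


def whole_disk(device: str) -> str:
    """Map a partition path such as /dev/sda1 to its whole-disk path."""
    # NVMe and MMC partitions use a 'p' separator: /dev/nvme0n1p1 -> /dev/nvme0n1
    match = re.fullmatch(r"(/dev/(?:nvme\d+n\d+|mmcblk\d+))p\d+", device)
    if match:
        return match.group(1)
    # SATA/SCSI/virtio style names: /dev/sda1 -> /dev/sda
    match = re.fullmatch(r"(/dev/[sv]d[a-z]+)\d+", device)
    if match:
        return match.group(1)
    # Anything else (already a whole disk, or a by-id path) is returned unchanged.
    return device


def smartctl_command(device: str) -> List[str]:
    """Build the argument list for 'smartctl -a' on the whole disk."""
    return ["smartctl", "-a", whole_disk(device)]
```

For example, `smartctl_command("/dev/sda1")` yields `["smartctl", "-a", "/dev/sda"]`, which could then be passed to `subprocess.run` on the host.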

Eddie the Eagle · Jan 17, 2023

Thanks for your assistance (and patience); that worked better. Here we go:

root@pve:/# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.83-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT1000MX500SSD1
Serial Number: 2151E5F4932F
LU WWN Device Id: 5 00a075 1e5f4932f
Firmware Version: M3CR043
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 17 13:37:09 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7862
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   097   097   000    Old_age   Always       -       46
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       5
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       63
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   063   035   000    Old_age   Always       -       37 (Min/Max 0/65)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   097   097   001    Old_age   Offline      -       3
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       12720634769
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       227715392
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       166145158

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@pve:/#
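To read an attribute table like the one above without going through it by eye, the rows can be parsed and the error-related counters checked. This is only a sketch: the sample rows are copied from the output in this post, the choice of attribute IDs in `ERROR_IDS` is an assumption about which counters matter most, and it assumes the raw value starts with a plain integer, which may not hold for every drive.

```python
from typing import Dict, List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int
    thresh: int
    raw: int     # first token of RAW_VALUE


# Assumed selection of error-related attribute IDs for this drive family.
ERROR_IDS = (5, 187, 197, 198, 199)

# Rows copied from the smartctl output in this thread.
SAMPLE = """\
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 063 035 000 Old_age Always - 37 (Min/Max 0/65)
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
"""


def parse_attributes(text: str) -> Dict[int, SmartAttr]:
    """Parse attribute rows; lines that do not look like rows are skipped."""
    attrs: Dict[int, SmartAttr] = {}
    for line in text.splitlines():
        fields = line.split()  # works for single or aligned spacing
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = SmartAttr(
            int(fields[0]), fields[1],
            int(fields[3]), int(fields[4]), int(fields[5]),
            int(fields[9]),
        )
    return attrs


def nonzero_error_counters(attrs: Dict[int, SmartAttr]) -> List[SmartAttr]:
    """Return the error-related attributes whose raw counter is above zero."""
    return [attrs[i] for i in ERROR_IDS if i in attrs and attrs[i].raw > 0]
```

With the rows above, `nonzero_error_counters(parse_attributes(SAMPLE))` returns an empty list, which matches what the table shows: the drive itself has not recorded reallocated blocks, uncorrectable errors or CRC errors.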

apoc · Jan 17, 2023

Could be an out of memory condition.
Are you booting from SDA?
Had once similar problems with USB sticks

Eddie the Eagle · Jan 17, 2023

apoc said:
Could be an out of memory condition.
Are you booting from SDA?
Had once similar problems with USB sticks

I'm booting from a separate M.2 2280 SSD; this one does not seem to give any problems. Proxmox as such stays alive; it is only the VMs on that 2.5" SSD that are giving problems (visible in syslog). Uptime is 1d5hrs now.

apoc · Jan 17, 2023

That means it is likely related to your "data ssd"
Which ssd type are you using?

Eddie the Eagle · Jan 17, 2023

apoc said:
That means it is likely related to your "data ssd"
Which ssd type are you using?

Indeed & thx; that one seems to have issues:

Crucial MX500 1TB 3D NAND SATA 2.5 Inch Internal SSD - Up to 560MB/s - CT1000MX500SSD1

jaceqp · Jan 18, 2023

202 Percent_Lifetime_Remain 0x0030 097 097 001 Old_age Offline - 3

Not sure, but it seems your drive is reaching its EOL...

Eddie the Eagle · Jan 18, 2023

jaceqp said:
Not sure, but it seems your drive is reaching its EOL...

Yeah, sharp notice, thx. I did some digging on that, since the SSD is less than a year old. It seems the numbers are shown in reverse with Crucial SSDs. Below is the line from an SSD that is failing in smartctl because percent lifetime used is at 99%. Or this is just the way smartctl displays attribute 202 (so confusing). Anyway, at this rate there are over 30 years to go.

Code:
202 Percent_Lifetime_Used 0x0030 001 001 001 Old_age Offline FAILING_NOW 99
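The "years to go" estimate can be checked with a little arithmetic on the values from the smartctl output earlier in the thread. This is a rough sketch that assumes the raw value 3 of attribute 202 means 3% of the rated life is used (as the post suggests), that wear continues linearly, and that Total_LBAs_Written counts 512-byte logical sectors; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def years_remaining(power_on_hours: int, percent_used: int) -> float:
    """Linear extrapolation of remaining life from wear so far."""
    if not 0 < percent_used <= 100:
        raise ValueError("percent_used must be between 1 and 100")
    hours_per_percent = power_on_hours / percent_used
    return hours_per_percent * (100 - percent_used) / HOURS_PER_YEAR


def terabytes_written(total_lbas: int, sector_bytes: int = 512) -> float:
    """Convert an LBA count to decimal terabytes (assumes 512-byte LBAs)."""
    return total_lbas * sector_bytes / 1e12


# Values from the smartctl output in this thread:
# Power_On_Hours = 7862, attribute 202 raw = 3, Total_LBAs_Written = 12720634769
print(round(years_remaining(7862, 3), 1))        # about 29.0
print(round(terabytes_written(12720634769), 2))  # about 6.51
```

Under these assumptions the drive has written roughly 6.5 TB and would last on the order of 29 more years at the current rate, so wear-out looks like an unlikely explanation for the crashes.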

ITT · Jan 18, 2023

Change your SATA cable or try a better disk.
Every day postings pop up with issues related to Crucial disks.

Eddie the Eagle · Jan 18, 2023

ITT said:
try a better Disk.

What would be a better disk? And would it make sense to add it in the empty second 2.5" slot and work towards some high-availability setup with 2 disks, and then later on replace the Crucial?

ITT · Jan 18, 2023

As always -> only Enterprise-Disks.
But you have to sort out the issue before buying a new disk.
Is your Server AMD-based?

Eddie the Eagle · Jan 18, 2023

ITT said:
As always -> only Enterprise-Disks.

I'll take your advice, highly appreciated. I'm a home user and rather new to the world of hypervisors, so without high availability (yet). If you could recommend a specific 1TB SSD, that would make me very happy.

ITT said:
But you have to sort out the issue before buying a new disk.

That is why I started this topic in the first place, but maybe there is some benefit in installing a second SSD in parallel anyway, so no money would be wasted. I also had a slight suspicion that one VM (W10) is causing the issue. I have stopped this VM as a test to see what happens. Uptime is 2d5h now.

ITT said:
Is your Server AMD-based?

Correct; AMD Ryzen 7 4800U

ITT · Jan 18, 2023

My advice for Home-Use (low Budget) is Kingston DC500M (personal experience, price/performance ratio)

I read in other forums (Unraid or so) that there is an incompatibility between AMD-Chipset <-> Crucial/Samsung disks.

I think your Win10-VM stresses your Disk (and Subsystem) so the issue is more present (also present without this VM, but less "noisy")

Eddie the Eagle · Jan 18, 2023

ITT said:
My advice for Home-Use (low Budget) is Kingston DC500M (personal experience, price/performance ratio)

I read in other forums (Unraid or so) that there is an incompatibility between AMD-Chipset <-> Crucial/Samsung disks.

I think your Win10-VM stresses your Disk (and Subsystem) so the issue is more present (also present without this VM, but less "noisy")

Thanks so far, and to the others too! I really appreciate the help; I'll leave the W10 VM offline for now and see what this does. I'll definitely look into the Kingston and keep this topic updated.

Eddie the Eagle · Jan 24, 2023

OK, short update for the interested folks. Uptime is >7 days now, and I'm shutting down the W10 VM daily. I'm really starting to think that keeping W10 online for some days is causing the issue. Maybe a memory leak makes the VM use the SSD too heavily; I just don't know. I also realized that allocating 4GB of RAM may not be enough, so I increased it to 8GB and also doubled storage from 64GB to 128GB.

Happy to hear your thoughts.
