Linux VMs in Proxmox 8.x get corrupted but the Proxmox host has no issues

phanos

Renowned Member
Oct 23, 2015
Hi, I have upgraded my old Proxmox server (an Atom C2550 CPU) to an Intel Xeon CPU. Specifically, I now have:

Code:
1) Supermicro X11SSM-F motherboard
2) Intel Xeon E3-2145L v5 CPU
3) 64 GB RAM
4) 500 GB SSD (CT500BX500SSD1) -- I know this disk is DRAM-less and not fast, but I am not sure if this is the issue.

I am running the latest Proxmox version on top of Debian 12, and I have 4 VMs running. I have a remote storage option for taking backups every night, running on another machine on my network.

Everything runs smoothly until I do some I/O-intensive task on the server, such as restoring a VM onto the disk (backing up to the remote storage works fine, though). Even if I leave the server alone and let the restore proceed (from the remote backup to the local disk), all the VMs start spitting I/O errors on their consoles and in their kernel logs ("dmesg -wH") until the restore is completed (please see pictures).

If I check the server during that time, there are NO errors on its console or in the kernel messages, but it slows to a halt on disk requests. Afterwards the server returns to normal, but the VMs need either a reboot or a restore, since their disks can get corrupted in the process.
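For what it's worth, one way to see what the host disk is actually doing while a restore runs, with no extra tools, is to sample /proc/diskstats twice. This is only a sketch; the device-name pattern is an assumption and may need adjusting for a given setup:

```shell
# Sample /proc/diskstats twice to estimate per-device write throughput.
# Field 3 is the device name, field 10 is total sectors written (512 B each).
sample() { awk '$3 ~ /^(sd|nvme|vd)/ {print $3, $10}' /proc/diskstats; }
sample > /tmp/io_before.txt
sleep 2
sample > /tmp/io_after.txt
# Pair the two samples line by line and print KiB/s written per device
paste /tmp/io_before.txt /tmp/io_after.txt |
  awk '{printf "%-10s %8.0f KiB/s written\n", $1, ($4-$2)*512/1024/2}'
```

If the restore target device shows sustained writes while the VMs report errors, that points at the disk; if writes stall to near zero for long stretches, that is the kind of behavior a saturated SLC cache would produce.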

So far I have checked the hard disk for issues using the smartctl command but found none (that I can tell).

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     CT500BX500SSD1
Serial Number:    2448E9972018
LU WWN Device Id: 5 00a075 1e9972018
Firmware Version: M6CR061
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar  9 11:21:09 2025 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocated_Sector_Ct   -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    255
 12 Power_Cycle_Count       -O--CK   100   100   000    -    18
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
173 Unknown_Attribute       -O--CK   099   099   000    -    9
174 Unknown_Attribute       -O--CK   100   100   000    -    18
180 Unused_Rsvd_Blk_Cnt_Tot PO--CK   100   100   000    -    5
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   075   068   000    -    25 (Min/Max 13/32)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Unknown_SSD_Attribute   ----CK   099   099   001    -    1
206 Unknown_SSD_Attribute   -OSR--   100   100   000    -    0
210 Unknown_Attribute       -O--CK   100   100   000    -    0
246 Unknown_Attribute       -O--CK   100   100   000    -    2526808869
247 Unknown_Attribute       -O--CK   100   100   000    -    78962777
248 Unknown_Attribute       -O--CK   100   100   000    -    117822464
249 Unknown_Attribute       -O--CK   100   100   000    -    0
251 Unknown_Attribute       -O--CK   100   100   000    -    3085416729
252 Unknown_Attribute       -O--CK   100   100   000    -    0
253 Unknown_Attribute       -O--CK   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O     88  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        51         -
# 2  Short offline       Completed without error       00%        42         -

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0005  4            1  R_ERR response for non-data FIS
0x000a  4            2  Device-to-host register FISes sent due to a COMRESET

I have also tried:

1) changing the SATA cable for the hard disk
2) changing the port on the motherboard
3) checking for any visible issue on the motherboard or disk that could prevent the cable from seating correctly, or any other (visible) fault
4) upgrading the IPMI and BIOS to the latest versions
5) checking the BIOS settings for anything related

Could this be related to the compatibility of my hardware with Linux? Or could the hard disk be so slow that it is causing these issues? I ask because I have used slower disks with Proxmox before and never had issues, but could the combination with a slow SSD be the culprit?

The same VMs (with the same hardware options) run just fine on my old machine, so I am running out of ideas. Does anyone have any other ideas about what I could try or what the issue could be?

Thanks

Phanos
 

Attachments

  • console2.jpg (498.8 KB)
  • console1.jpg (491.2 KB)
  • console.png (551.5 KB)
Crucial BX models are the worst.
Replace it and the errors will be gone.
I throw some away every month (they were running on Windows bare metal), mainly because they are so slow that they are unusable.
Thanks, but are they really that bad? I have run Proxmox on mechanical drives in the past and never had issues with it. I also ran it on my old machine on slower SSDs (also DRAM-less, but other brands) and had no issues either.

How can an SSD produce such errors when under some stress? I mean, it is not just really slow, it is completely useless. If a drive loses data like that, then it is simply unreliable and as good as dead, right?

Which 512 GB SSD do you recommend?
 
Avoid QLC drives; they are worse than HDDs, as they do not have predictable performance.

Crucial MX models are a little better, but there are plenty of recommended drives.
The cheapest ones are not recommended.

Datacenter models are recommended, like Kingston DC (easily available); even a second-hand one is better than a new consumer drive.
 
Thanks @_gabriel, I will try to get a new drive and try again, but I am not 100% sure the SSD is the real culprit here. What if I get a faster disk and simply put more VMs on my host? Will the problem reappear?

If the Proxmox host gave any kind of error, I would say it is 100% a hardware fault (probably the disk or the motherboard controller), but the host only becomes slow and returns to normal after a while. "dmesg -wH" and "journalctl" report nothing, which makes me believe this could be related to some driver/configuration issue.

Maybe a really silly question here, but what is the recommended way to migrate a VM to a new machine in Proxmox? In the past, and even across different Proxmox versions (7.x and 8.x), I simply backed up the entire machine and restored it on another Proxmox host. Is this OK, or should I re-create the machine on the new host and then re-attach the disk from the old host?
 
Hi @_gabriel ,

a result of a slow SSD; they lie about their real speed because they use part of the flash as an SLC cache. Once it is full, the system hangs/gets stuck because the SSD cannot sustain the writes.
The disk is not full; it is brand new, and only around 120 GB out of 500 GB has been written.


simply backed up the entire machine and restored it on another Proxmox host
What is this?
I mean using the Proxmox built-in backup/restore feature. I back up my VMs on network storage and then use this backup image to restore them on the new Proxmox host. Both hosts run the latest Proxmox 8.x. Is this not recommended?
 
The disk is not full
I mean the "SLC cache" is full (which seems to be 32 GB), not the disk itself. You can read up on how SSDs work.
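The SLC-cache theory can be checked directly with a sustained write through dd. This is only a sketch: the path and the small 64 MiB size are placeholders; on the Proxmox host you would point it at the BX500-backed filesystem and write well past the cache size (e.g. count=65536 for 64 GiB) to see whether throughput collapses:

```shell
# Sustained-write test: conv=fdatasync forces the data to the drive before
# dd reports its rate, so the number reflects the disk, not the page cache.
dd if=/dev/zero of=/tmp/slc-test.bin bs=1M count=64 conv=fdatasync
rm -f /tmp/slc-test.bin
```

On a DRAM-less QLC drive, the reported rate typically starts near the advertised speed and then drops sharply once the SLC cache is exhausted.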
I mean using the Proxmox built-in backup/restore feature. I back up my VMs on network storage and then use this backup image to restore them on the new Proxmox host. Both hosts run the latest Proxmox 8.x. Is this not recommended?
Of course, this is the recommended way.
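For reference, the built-in flow discussed above boils down to two commands on the Proxmox hosts. This is only a sketch: the VMID 100, the storage names, and the archive filename are placeholders, not values from this thread; the web UI drives the same vzdump/qmrestore machinery underneath.

```shell
# On the old host: full backup of VM 100 to a backup storage
vzdump 100 --storage backupstore --mode snapshot --compress zstd

# On the new host: restore the archive as VM 100 onto local storage
qmrestore /mnt/pve/backupstore/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --storage local-lvm
```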
 
The BX500 has terrible write performance and is slow, but it should not give errors (except time-outs with ZFS because of the use of QLC flash).
but I am not 100% sure the SSD is the real culprit here.
You are getting real I/O errors (on the host). It could be a drive, cable, controller, memory, motherboard, or CPU issue. Try replacing each part in the chain?
 
The BX500 has terrible write performance and is slow, but it should not give errors (except time-outs with ZFS because of the use of QLC flash).

You are getting real I/O errors (on the host). It could be a drive, cable, controller, memory, motherboard, or CPU issue. Try replacing each part in the chain?
Hi @leesteken, no errors on the host at all. The host only slows down and then recovers after a while. Only the VMs are affected. Some need only a reboot; some need to be restored from backup since they get so corrupted that they cannot even boot afterwards. But this only happens when I restore a previously created backup onto the host disk (BX500); otherwise the server runs just fine.
 
Hi @leesteken, no errors on the host at all.
Sorry, my mistake.
The host only slows down and then recovers after a while. Only the VMs are affected. Some need only a reboot; some need to be restored from backup since they get so corrupted that they cannot even boot afterwards. But this only happens when I restore a previously created backup onto the host disk (BX500); otherwise the server runs just fine.
Maybe you have bad memory, which is known to cause weird symptoms (masquerading as other problems) and to get worse over time. Did you run a memtest (for several hours)?
EDIT: But it could still be data corruption on the drive (or cable or controller).
 
Sorry, my mistake.

Maybe you have bad memory, which is known to cause weird symptoms (masquerading as other problems) and to get worse over time. Did you run a memtest (for several hours)?
EDIT: But it could still be data corruption on the drive (or cable or controller).
@leesteken, I have not run a memory test yet, but please note that this happens every time I try to write something big to the disk. The issue is consistently tied to disk writes, and it starts only a few seconds after a write has been initiated. Otherwise, the VMs run for days without any problem (the VMs consume around 30% of memory at all times and have never had any error). I will try to run memtest as soon as possible, of course.
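As an interim check before a full boot-time memtest, a userspace pass can be run without taking the host offline, though it is weaker than a MemTest86+ boot test because it cannot test memory the kernel is using. This assumes the memtester package (available in Debian) is installed; the size and loop count are illustrative only:

```shell
# Test 2 GiB of RAM for one pass; run as root so memtester can lock the pages.
# On a 64 GB host you would test larger chunks, ideally with the VMs stopped.
memtester 2048M 1
```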

I did try changing both the cable and the SATA port on the motherboard. I also checked for visible defects on the motherboard and the hard disk ports. BIOS and firmware are up to date, and smartctl reports nothing unusual. I also checked the BIOS settings for anything related and played around with the options, with no luck.

The way I understand it, the problem is either:

1) A hardware issue, although I believe this is less likely at this point since the host would have reported something by now,
2) The write performance of the BX500 is really SO BAD that it causes the VMs to time out and report I/O errors. From what I know about computers, I do not understand how this is possible. I mean, the VMs should be slowed down but never drop to zero and hit errors and timeouts unless there is a hardware error, correct? The host always recovers, never reports anything, and I have never seen any corruption, so why does corruption happen in the VMs?

I am clueless at this point...
 
1) A hardware issue, although I believe this is less likely at this point since the host would have reported something by now,
Please check your memory.
2) The write performance of the BX500 is really SO BAD that it causes the VMs to time out and report I/O errors.
In the case of ZFS, yes, this would certainly give (write) errors/time-outs (on the host). On ext4/LVM, I would expect writes to drop to tens of KB/s, but not errors.
From what I know about computers, I do not understand how this is possible. I mean, the VMs should be slowed down but never drop to zero and hit errors and timeouts unless there is a hardware error, correct? The host always recovers, never reports anything, and I have never seen any corruption, so why does corruption happen in the VMs?
QLC flash becomes much slower than old HDDs (maybe even down to old floppy speeds) when doing big/sustained writes. Eventually, a write takes longer than some time-out, so it is definitely possible.
Do a search on this forum for QLC and see that most people refuse to believe that QLC SSDs are really that bad; people don't want to believe they wasted their money. Those threads also have suggestions for better drives.
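The time-out side of this is visible inside the guests: the Linux SCSI layer gives each command a fixed window (30 seconds by default) before declaring an I/O error. A hedged sketch for inspecting (and optionally raising) it in an affected guest, as a mitigation rather than a fix:

```shell
# List the SCSI command timeout for each block device in the guest.
# A host disk that stalls longer than this makes the guest report I/O errors.
for t in /sys/block/*/device/timeout; do
  [ -f "$t" ] || continue              # skip if the glob matched nothing
  echo "$t is currently $(cat "$t") s"
  # echo 180 > "$t"                    # as root: raise to 180 s (runtime only)
done
```

This does not make the drive faster; it only trades guest errors for longer stalls, which may be enough to survive a restore without corruption.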
I am clueless at this point...
A better drive (like a second-hand enterprise model with PLP) will make your life better anyway. But please check your memory too.
 
Please check your memory.

In the case of ZFS, yes, this would certainly give (write) errors/time-outs (on the host). On ext4/LVM, I would expect writes to drop to tens of KB/s, but not errors.

QLC flash becomes much slower than old HDDs (maybe even down to old floppy speeds) when doing big/sustained writes. Eventually, a write takes longer than some time-out, so it is definitely possible.
Do a search on this forum for QLC and see that most people refuse to believe that QLC SSDs are really that bad; people don't want to believe they wasted their money. Those threads also have suggestions for better drives.

A better drive (like a second-hand enterprise model with PLP) will make your life better anyway. But please check your memory too.
Will do, @leesteken, thanks. I have already ordered a temporary drive (TLC NAND) to test, and if everything is fine I will invest in an enterprise-level drive. QLC is really bad, but I did not realize that it could be so bad...
 
QLC is really bad, but I did not realize that it could be so bad...
Yes, it is really sad that people with those drives always find out after buying them. And it takes an unexpected amount of effort to convince them.
However, I have not seen your specific issues with your kind of setup. And I do think it could well be bad memory (since your RAM and filesystems have no checksums).
 
Yes, it is really sad that people with those drives always find out after buying them. And it takes an unexpected amount of effort to convince them.
However, I have not seen your specific issues with your kind of setup. And I do think it could well be bad memory (since your RAM and filesystems have no checksums).
I will test to make sure.
 
May I ask why on top of Debian? Did you have some incompatibility when installing Proxmox directly?
I have been running Proxmox for many years now, and this is how I set it up on my first machine, so I have kept installing it this way: first Debian, then Proxmox. Do you think I lose something by doing it like this?