Jan 1, 2016
I have a Windows VM running on LVM, and a few days ago the backup started failing with disk I/O errors.

The VM continues to run perfectly. I have run a chkdsk in Windows and no errors were reported.

I am also unable to move the disk to another storage group: the move fails at 49% with I/O errors.

The underlying physical storage is an SSD hardware RAID 1 device.
Hi Henry
What is the exact error message?
If the underlying disks have hardware defects, you should also see errors in the kernel log, which you can view with the dmesg command.
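One way to follow this suggestion is to filter the kernel log for the messages that usually accompany a failing block device. The sketch below is only a starting point: the function name `filter_io_errors` and the pattern list are illustrative assumptions, not an exhaustive set of kernel messages.

```shell
# Keep only kernel log lines that commonly indicate block-device read problems.
# The pattern list is an assumption; extend it for your driver or controller.
filter_io_errors() {
    grep -iE 'i/o error|medium error|unrecovered read error|blk_update_request'
}

# Typical use on the host (needs root):
#   dmesg -T | filter_io_errors
```

An empty result does not rule out a disk problem, since a hardware RAID controller may handle some errors without the kernel ever seeing them.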
The backup log reports:

INFO: status: 8% (68275142656/852551008256), sparse 0% (1875738624), duration 1041, 54/53 MB/s
INFO: status: 9% (76751568896/852551008256), sparse 0% (2160574464), duration 1197, 54/52 MB/s
INFO: status: 9% (79479046144/852551008256), sparse 0% (2205278208), duration 1245, 56/55 MB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
ERROR: Backup of VM 1002 failed - job failed with err -5 - Input/output error

When I try to move the disk to another storage location it also fails.

What are my options?
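The last status line before the error gives a rough idea of where in the virtual disk the read failed. A minimal sketch, assuming the log format shown above (the function name `last_good_offset` is hypothetical): it reads a backup log on stdin and prints the byte offset from the last `INFO: status:` line seen before the first `ERROR:` line.

```shell
# Print the offset reported by the last status line before the first error.
last_good_offset() {
    awk '
        /^INFO: status:/ {
            # Field 4 looks like "(79479046144/852551008256),"
            split($4, parts, "/")
            gsub(/[^0-9]/, "", parts[1])
            gsub(/[^0-9]/, "", parts[2])
            pos = parts[1] + 0
            total = parts[2] + 0
        }
        /^ERROR:/ { exit }   # stop at the first error; END still runs
        END {
            if (total > 0)
                printf "%.0f bytes (%.1f GiB, %.1f%% of %.0f bytes)\n",
                       pos, pos / 1073741824, 100 * pos / total, total
        }
    '
}
```

Status lines are only printed periodically, so the unreadable region lies somewhere after this offset rather than exactly at it. The offset is a position inside the VM's disk image, not a physical LBA on either SSD.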
This seems like a serious problem to me. Before doing anything else, I would try to save all the user data I can (from within the VM, since, as you say, the VM itself does not report any problem)...
root@pve:~# smartctl --all -d megaraid,2 /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.44-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Device Model: Samsung SSD 850 PRO 2TB
Serial Number: S2KMNWAG801986X
LU WWN Device Id: 5 002538 c70008499
Firmware Version: EXM02B6Q
User Capacity: 2,048,408,248,320 bytes [2.04 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 5 06:24:51 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 600) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 8
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 8172
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 17
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 18
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 8
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 8
187 Reported_Uncorrect 0x0032 099 099 000 Old_age Always - 18
190 Airflow_Temperature_Cel 0x0032 077 061 000 Old_age Always - 23
195 Hardware_ECC_Recovered 0x001a 199 199 000 Old_age Always - 18
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 13
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 34730102596

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 8043 hours (335 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 80 92 be 00 00 00:29:02.763 READ FPDMA QUEUED
60 80 00 00 92 be 00 00 00:29:02.763 READ FPDMA QUEUED
60 80 00 80 91 be 00 00 00:29:02.763 READ FPDMA QUEUED
60 80 00 00 91 be 00 00 00:29:02.763 READ FPDMA QUEUED
60 80 00 80 90 be 00 00 00:29:02.763 READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 7995 hours (333 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 00 1c 7e 40 00 00:26:10.405 READ FPDMA QUEUED
60 80 00 80 1b 7e 00 00 00:26:10.405 READ FPDMA QUEUED
60 80 00 00 1b 7e 00 00 00:26:10.405 READ FPDMA QUEUED
60 80 00 80 1a 7e 00 00 00:26:10.405 READ FPDMA QUEUED
60 80 00 00 1a 7e 00 00 00:26:10.405 READ FPDMA QUEUED

Error 16 occurred at disk power-on lifetime: 7947 hours (331 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 80 d2 be 00 00 00:23:17.295 READ FPDMA QUEUED
60 80 00 00 d2 be 00 00 00:23:17.295 READ FPDMA QUEUED
60 80 00 80 d1 be 00 00 00:23:17.295 READ FPDMA QUEUED
60 80 00 00 d1 be 00 00 00:23:17.295 READ FPDMA QUEUED
60 80 00 80 d0 be 00 00 00:23:17.295 READ FPDMA QUEUED

Error 15 occurred at disk power-on lifetime: 7707 hours (321 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 00 d4 be 40 00 00:08:53.280 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:08:53.280 READ LOG EXT
60 80 00 00 d4 be 00 00 00:08:53.280 READ FPDMA QUEUED
60 80 00 80 d3 be 00 00 00:08:53.280 READ FPDMA QUEUED
60 80 00 00 d3 be 00 00 00:08:53.280 READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 7707 hours (321 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 00 d4 be 00 00 00:08:53.278 READ FPDMA QUEUED
60 80 00 80 d3 be 00 00 00:08:53.278 READ FPDMA QUEUED
60 80 00 00 d3 be 00 00 00:08:53.278 READ FPDMA QUEUED
60 80 00 80 d2 be 00 00 00:08:53.278 READ FPDMA QUEUED
60 80 00 00 d2 be 00 00 00:08:53.278 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8154 -
# 2 Short offline Completed without error 00% 8153 -

SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
255 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@pve:~# smartctl --all -d megaraid,3 /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.44-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Device Model: Samsung SSD 850 PRO 2TB
Serial Number: S2KMNWAG801987H
LU WWN Device Id: 5 002538 c7000849a
Firmware Version: EXM02B6Q
User Capacity: 2,048,408,248,320 bytes [2.04 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 5 06:25:33 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 600) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 17
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 8960
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 19
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 10
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 17
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 17
187 Reported_Uncorrect 0x0032 099 099 000 Old_age Always - 26
190 Airflow_Temperature_Cel 0x0032 076 061 000 Old_age Always - 24
195 Hardware_ECC_Recovered 0x001a 199 199 000 Old_age Always - 26
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 13
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 32606934686

SMART Error Log Version: 1
ATA Error Count: 26 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 26 occurred at disk power-on lifetime: 8855 hours (368 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 00 fa 57 00 00 01:17:46.788 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 01:17:46.788 READ LOG EXT
60 80 00 00 fa 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 80 f9 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 00 f9 57 00 00 01:17:46.788 READ FPDMA QUEUED

Error 25 occurred at disk power-on lifetime: 8855 hours (368 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 00 fa 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 80 f9 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 00 f9 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 80 f8 57 00 00 01:17:46.788 READ FPDMA QUEUED
60 80 00 00 f8 57 00 00 01:17:46.788 READ FPDMA QUEUED

Error 24 occurred at disk power-on lifetime: 8283 hours (345 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 08 00 10 40 00 01 00:43:26.467 READ FPDMA QUEUED
60 80 00 00 a0 7a 00 00 00:43:26.467 READ FPDMA QUEUED
60 80 00 80 9f 7a 00 00 00:43:26.467 READ FPDMA QUEUED
60 80 00 00 9f 7a 00 00 00:43:26.467 READ FPDMA QUEUED
60 80 00 80 9e 7a 00 00 00:43:26.467 READ FPDMA QUEUED

Error 23 occurred at disk power-on lifetime: 7755 hours (323 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 80 45 69 00 00 00:11:45.536 READ FPDMA QUEUED
60 80 00 00 45 69 00 00 00:11:45.536 READ FPDMA QUEUED
60 80 00 80 44 69 00 00 00:11:45.536 READ FPDMA QUEUED
60 80 00 00 44 69 00 00 00:11:45.536 READ FPDMA QUEUED
60 80 00 80 43 69 00 00 00:11:45.536 READ FPDMA QUEUED

Error 22 occurred at disk power-on lifetime: 7703 hours (320 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
00 51 01 10 00 00 00 Error: at LBA = 0x00000010 = 16

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 80 c9 ba 00 00 00:08:39.610 READ FPDMA QUEUED
60 80 00 00 c9 ba 00 00 00:08:39.610 READ FPDMA QUEUED
60 80 00 80 c8 ba 00 00 00:08:39.610 READ FPDMA QUEUED
60 80 00 00 c8 ba 00 00 00:08:39.610 READ FPDMA QUEUED
60 80 00 80 c7 ba 00 00 00:08:39.610 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
255 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
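Both outputs are long, and the lines that matter most are easy to miss: each disk shows nonzero raw values for Reallocated_Sector_Ct and Reported_Uncorrect. A small sketch for pulling those lines out of the attribute table follows. The function name `smart_flags` and the choice of which attributes to flag are assumptions; the attribute names themselves are taken from the output above.

```shell
# List selected SMART attributes whose raw value (last column) is nonzero.
smart_flags() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Used_Rsvd_Blk_Cnt_Tot|Runtime_Bad_Block|Reported_Uncorrect|UDMA_CRC_Error_Count)$/ && $NF + 0 > 0 {
            printf "%s raw=%s\n", $2, $NF
        }
    '
}

# Typical use on the host (needs root):
#   smartctl -A -d megaraid,2 /dev/sdc | smart_flags
```

A nonzero count is a reason to look closer, not proof of failure. Here, though, the same counters are raised on both mirror members.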
I see the first disk has reported no errors for the short and long tests:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8154 -
# 2 Short offline Completed without error 00% 8153 -

Running tests on the 2nd disk now.
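For anyone following along, the self-tests can be started and checked from the host with smartctl, using the same `-d megaraid,N` addressing as in the output above. A minimal sketch; the function names and the `SMARTCTL` override variable are hypothetical conveniences:

```shell
# Allow the smartctl binary to be overridden (e.g. for a dry run).
SMARTCTL=${SMARTCTL:-smartctl}

# Start an extended (long) self-test on one disk behind the RAID controller.
# $1 = megaraid device number, $2 = block device, e.g. 3 /dev/sdc
start_long_test() {
    "$SMARTCTL" -t long -d "megaraid,$1" "$2"
}

# Show the self-test log once the test has had time to finish.
show_selftest_log() {
    "$SMARTCTL" -l selftest -d "megaraid,$1" "$2"
}

# Typical use on the host (needs root):
#   start_long_test 3 /dev/sdc
#   show_selftest_log 3 /dev/sdc
```

According to the output above, the recommended polling time for the extended test on these drives is 600 minutes, so the log may not show a result for several hours.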


