Hi everybody, I know this is not specific to Proxmox, but I always find good help here, so I'm trying anyway, if that's OK with the staff.
I have Proxmox running on a machine with 6 disks in RAID10 + 1 spare.
The disks were connected to an LSI 9211-8i, and I always had strange logs, so I decided to change the controller's SAS/SATA cables: I bought 2 original LSI cables and swapped them in.
After that I lost one disk... so I launched an mdadm recovery by re-adding the disk, and then I lost two during the recovery :S
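For reference, the re-add went roughly like this (/dev/sdX2 is a placeholder for whichever member had dropped out of md10):
Code:
# see which member got kicked out of the array
cat /proc/mdstat
mdadm --detail /dev/md10

# re-add the dropped member so the array resyncs it
mdadm /dev/md10 --re-add /dev/sdX2

# follow the resync progress
watch cat /proc/mdstat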
So I decided to turn the server off, put every drive directly on the motherboard, and give it a try like that: recovery done, no errors...
My LSI was on the P20 firmware, so I downgraded it to P18, which seems more reliable...
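(In case anyone wants the details, the downgrade was done with LSI's sas2flash utility, roughly like this; the file names are the ones from the P18 IT-mode package for the 9211-8i, adjust for your card:)
Code:
# list the controllers and the firmware version currently flashed
sas2flash -listall

# flash the P18 IT firmware and BIOS (advanced mode needed for a downgrade)
sas2flash -o -f 2118it.bin -b mptsas2.rom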
I did an extended SMART test on each drive and everything is OK.
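(For the record, the tests were run with smartctl, something like this:)
Code:
# start the long (extended) self-test on one drive
smartctl -t long /dev/sdb

# ...or kick it off on all members at once
for d in /dev/sd[a-g]; do smartctl -t long "$d"; done

# when it's finished, check the results and attributes
smartctl -a /dev/sdb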
So I was going to put the disks back on the HBA to give the new firmware a try, BUT:
since this morning the server has been showing high iowait, even though it is running nothing heavy! Almost idle!
I found this in the dmesg output:
Code:
[188489.228510] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[188489.315586] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[188489.395651] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[188731.962971] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[188731.963042] ata2.00: failed command: FLUSH CACHE EXT
[188731.963101] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 19
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[188731.963216] ata2.00: status: { DRDY }
[188731.963287] ata2: hard resetting link
[188732.898874] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[188732.899461] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[188732.899535] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[188732.900264] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[188732.900368] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[188732.900619] ata2.00: configured for UDMA/133
[188732.900741] ata2.00: retrying FLUSH 0xea Emask 0x4
[188732.900897] ata2.00: device reported invalid CHS sector 0
[188732.901022] ata2: EH complete
[189089.461664] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[189089.571193] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[189089.651375] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[189689.718649] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[189689.798596] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[189689.877976] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[190289.943671] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[190290.023620] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
We can see that the controller gives the same strange logs as before swapping the cables,
but the scariest part is the device resets; I have maybe 20 entries of this kind in the log for today.
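(I got that rough count by just grepping the kernel log:)
Code:
# rough count of the link resets in dmesg since boot
dmesg | grep -c "hard resetting link"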
Now look at this line in the "atop" output:
Code:
DSK | sdb | | busy 99% | read 6 | | write 40 | KiB/r 4 | | KiB/w 20 | MBr/s 0.00 | | MBw/s 0.08 | avq 6.49 | | avio 215 ms |
/dev/sdb is at its maximum, everything is waiting for it, and the server is really slow!
Now take a look at the SMART data:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
3 Spin_Up_Time 0x0027 172 171 021 Pre-fail Always - 2375
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 72
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7707
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 72
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 59
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 188
194 Temperature_Celsius 0x0022 119 106 000 Old_age Always - 24
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 13450
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 7696 -
# 2 Extended offline Interrupted (host reset) 10% 7674 -
# 3 Short offline Completed without error 00% 7671 -
# 4 Short offline Completed without error 00% 7648 -
# 5 Short offline Completed without error 00% 7624 -
# 6 Short offline Completed without error 00% 7601 -
# 7 Extended offline Completed without error 00% 7579 -
# 8 Short offline Completed without error 00% 7553 -
# 9 Short offline Completed without error 00% 7529 -
#10 Short offline Completed without error 00% 7505 -
#11 Short offline Completed without error 00% 7483 -
#12 Short offline Completed without error 00% 7459 -
#13 Short offline Completed without error 00% 7435 -
#14 Short offline Completed without error 00% 7411 -
#15 Short offline Completed without error 00% 7387 -
#16 Short offline Completed without error 00% 7363 -
#17 Short offline Completed without error 00% 7363 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Nothing really scary here... just the device resets :S
mdadm is not complaining:
Code:
Personalities : [raid1] [raid10]
md1 : active raid1 sde1[6](S) sdd1[4] sdg1[3] sda1[0] sdf1[5] sdc1[2] sdb1[1]
204736 blocks [6/6] [UUUUUU]
md10 : active raid10 sde2[6](S) sdd2[9] sdg2[8] sda2[7] sdf2[5] sdc2[2] sdb2[1]
2929276416 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
unused devices: <none>
So what should I do?
Can it be a SATA cable problem?
Should I try to put the drives back on the HBA?
Maybe I can FAIL the drive manually in mdadm to restore I/O performance?
Or do I just have to stop thinking and change this fu****g drive for a new one?
One more strange thing: the pve-root LVM seems to be loaded too (atop output):
Code:
LVM | pve-root | | busy 82% | read 6 | | write 34 | KiB/r 4 | | KiB/w 4 | MBr/s 0.00 | | MBw/s 0.01 | avq 2.22 | | avio 206 ms |
LVM | 100--disk--1 | | busy 77% | read 1 | | write 6 | KiB/r 4 | | KiB/w 6 | MBr/s 0.00 | | MBw/s 0.00 | avq 1.00 | | avio 1106 ms
Please give me some advice.
Have a nice night or day,
Hugo
EDIT: 5 minutes after writing this post:
Code:
[202440.493712] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[202440.493792] ata2.00: failed command: FLUSH CACHE EXT
[202440.493853] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 18
res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[202440.493979] ata2.00: status: { DRDY }
[202440.494043] ata2: hard resetting link
[202440.981683] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[202440.982264] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[202440.982340] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[202440.983107] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[202440.983200] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[202440.983422] ata2.00: configured for UDMA/133
[202440.983541] ata2.00: retrying FLUSH 0xea Emask 0x4
[202440.983710] ata2: EH complete
[202894.779638] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[202894.859945] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[202894.939835] mpt2sas0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
[202920.475286] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[202920.475381] ata2.00: failed command: FLUSH CACHE EXT
[202920.475474] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[202920.475960] ata2.00: status: { DRDY }
[202920.476303] ata2: hard resetting link
[202920.963216] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[202920.963829] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[202920.963939] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[202920.964658] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150619/psargs-359)
[202920.964786] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT1._GTF] (Node ffff88040e0bfe88), AE_NOT_FOUND (20150619/psparse-536)
[202920.965041] ata2.00: configured for UDMA/133
[202920.965341] ata2.00: retrying FLUSH 0xea Emask 0x4
[202920.965885] ata2.00: device reported invalid CHS sector 0
[202920.966282] ata2: EH complete
root@proxmox /home/admin #
EDIT 2:
I tested the drive with hdparm:
Code:
root@proxmox /home/admin # hdparm -tT /dev/sdb
/dev/sdb:
Broadcast message from nut@proxmox (somewhere) (Tue Oct 13 13:51:49 2015):
UPS ups@localhost is unavailable
Timing cached reads: 27226 MB in 2.00 seconds = 13626.76 MB/sec
Timing buffered disk reads: 4 MB in 6.65 seconds = 615.65 kB/sec
root@proxmox /home/admin # hdparm -tT /dev/sdb
/dev/sdb:
Timing cached reads: 27052 MB in 2.00 seconds = 13539.38 MB/sec
Timing buffered disk reads: 148 MB in 4.60 seconds = 32.18 MB/sec
root@proxmox /home/admin # hdparm -tT /dev/sdb
/dev/sdb:
Timing cached reads: 2 MB in 51.76 seconds = 39.57 kB/sec
Timing buffered disk reads: 94 MB in 6.97 seconds = 13.49 MB/sec
It really looks like a faulty cable or port on the motherboard: the cached reads are fast, then slow, then fast again, like a bad connection.
I will fail the drive to restore performance...
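If I go that way, the plan is roughly this (member names taken from the mdstat output above):
Code:
# mark the slow member as faulty; the spare (sde2) should take over and resync
mdadm /dev/md10 --fail /dev/sdb2

# then pull it out of the array
mdadm /dev/md10 --remove /dev/sdb2

# sdb1 in md1 would need the same treatment
mdadm /dev/md1 --fail /dev/sdb1
mdadm /dev/md1 --remove /dev/sdb1

# watch the rebuild onto the spare
cat /proc/mdstat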