Hey all,
I received an email notification with the subject "SMART error (FailedOpenDevice) detected on host:" After looking this up a bit, I wasn't able to find anything matching the issue I'm experiencing, hence this post. The full email notification is attached as "host3_sdd.txt".
Currently 4 of the 5 nodes in this cluster are experiencing this issue. The affected hosts all have VMs running; the 5th node does NOT have any VMs - perhaps there's a correlation there?
The cluster is 5 nodes, each with 6 drives:
1x SAS drive for the OS
1x SSD for fast I/O requirements
4x HDDs for larger storage requirements
An example scenario: host 3
- /dev/sdc & /dev/sdd are down
- relevant time of issue: Thu Oct 31 03:59:19 2019 PDT
- the GUI no longer lists the two drives
- ls -l /dev/sd* no longer shows the two drives (the checks I used to confirm this are below)
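For reference, these are roughly the checks behind the points above - ls plus a couple of equivalent ways to confirm the drives really are gone (exact output omitted):

ls -l /dev/sd*                    # sdc and sdd no longer appear
lsblk -o NAME,SIZE,MODEL,STATE    # the two drives are missing here as well
smartctl -a /dev/sdd              # fails to open the device, consistent with the smartd email
dmesg -T | grep -i sdd | tail     # shows the same kernel errors that ended up in syslog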
That said, I started with the syslog as suggested and worked my way backwards to find the last time the drive was healthy and to identify where the error messages begin (the commands I used to search are after the excerpt). Below is an excerpt from the syslog:
(drive was being detected beforehand and SMART was reporting back)
Oct 31 03:56:42 k1n3 kernel: [914232.144604] sd 0:0:3:0: [sdd] tag#1 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
...
Oct 31 03:56:42 k1n3 kernel: [914232.144609] print_req_error: I/O error, dev sdd, sector 231254656
...
Oct 31 03:56:42 k1n3 kernel: [914232.144679] sd 0:0:3:0: [sdd] killing request
...
Oct 31 03:59:19 k1n3 smartd[1087]: Device: /dev/sdd [SAT], open() failed: No such device
...
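For completeness, this is roughly how I searched the logs to find that window (paths assume the default syslog location on these nodes; journalctl is just another way to view the same kernel messages):

grep -n 'sdd' /var/log/syslog | less     # jump to the end and work backwards from the last entries
zgrep 'sdd' /var/log/syslog*             # also covers rotated logs, in case the window is older
journalctl -k --since "2019-10-31 03:50" --until "2019-10-31 04:05"   # kernel messages around the failure time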
Please help!