Hello everybody,
this turns out to be an experience report rather than a request for help, as I finally found a solution (scroll to the very end of the post if you don't care about the journey). Since I don't have a blog of my own, I think this is a good place to make this information available to others.
It is going to be quite a wall of text, but I need to explain the situation properly, and I really hope someone can make use of it some time.
Since the hardware would probably be the first question, here is what I am using:
- Supermicro H8SGL-F Mainboard (BIOS 3.5b)
- Opteron 6276 16C
- 8x8GB Reg ECC DDR3
- LSI 9211-8i in IT mode (FW 20.0.07, BIOS 7.39.02)
- HP SAS Expander 468406-B21 487738-001 (FW 2.10)
- Disks are not server-grade:
- 12x WD Black 750GB 2.5" consisting of WD7500BPKT (3Gbit) and WD7500BPKX (6Gbit) + 2x SSD (Kingston + Samsung)
- Disks are housed in 5x 4x SATA carriers (hot-swap).
- Each SATA carrier is connected to an individual SAS break-out port on the expander.
- 2 of the SATA Carriers were equipped with brand new cables while the other 3 were connected through existing ones.
I migrated my home server to PVE 5 over the last few weeks.
This was done by passing the SAS HBA + expander through via PCI passthrough to a VM on the existing ESXi host. I added 8 drives (HDD-POOL) and 2 SSDs (SSD-POOL) using ZFS as the storage backend. Since I was using a RAID 10 configuration before the migration, I decided to go with mirrored vdevs as well.
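Just to illustrate the layout: a pool built from mirrored vdevs looks roughly like this (shortened to two mirrors here; the device names are only placeholders, not my actual disks):
Code:
# Create a pool from pairs of disks; each "mirror" group becomes its own vdev.
zpool create HDD-POOL \
  mirror /dev/disk/by-id/ata-WDC_WD7500BPKT_SERIAL1 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL2 \
  mirror /dev/disk/by-id/ata-WDC_WD7500BPKT_SERIAL3 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL4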
During the migration I moved one VM after another to the new hypervisor (which was itself running as a VM on the existing hypervisor). Everything went fine; a few minor, conceptual challenges, but nothing I couldn't solve with the help of a popular search engine.
Two days ago the "switchover" took place. All the VMs were migrated and I shut everything down. Moved ESXi out of the way, got the PVE installation in place and imported the pools and all configs back. All smooth.
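Getting the pools back on the fresh PVE install was just a plain import (-f is only needed if a pool was not exported cleanly on the old system):
Code:
# Show which pools are available for import, then import them by name.
zpool import
zpool import -f HDD-POOL
zpool import -f SSD-POOL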
Added another mirror to the HDD-POOL (since the original storage is not needed anymore) and a bunch of spare drives.
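For reference, extending the pool that way is a one-liner per vdev (placeholder device names again):
Code:
# Add another mirrored vdev (the one that later shows up as "mirror-4") plus hot spares.
zpool add HDD-POOL mirror /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL9 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL10
zpool add HDD-POOL spare /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL11 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL12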
It ran well for about 2 days, then all of a sudden I experienced strange behavior on my backup server and mail server: they stopped randomly. After digging around for some time and not finding any possible cause at the VM level (I thought they were running out of memory due to the missing RAID cache), I checked the host.
The freshly added "mirror-4" showed a failed drive with "too many errors".
OK, I replaced it with another drive and started resilvering, which was very slow. It finally finished, but showed uncorrectable errors like the ones below:
After a resilver the mirror shows errors
Code:
# mirror-4 ONLINE 0 0 30
# C4-P1_SLOT-40 ONLINE 0 0 30
# C3-P1_SLOT-36 ONLINE 0 0 30
#...
#errors: Permanent errors have been detected in the following files:
# HDD-POOL/vm-100-disk-3:<0x1>
That is bad!
Moving to a solution that is supposed to protect me from exactly this kind of thing just seemed to create corruption...
Luckily, the affected zvol was only the backup disk of the mail server.
I deleted the zvol, issued a scrub, and the pool as well as the mirror are healthy again.
Code:
# mirror-4 ONLINE 0 0 0
# C4-P1_SLOT-40 ONLINE 0 0 0
# C3-P1_SLOT-36 ONLINE 0 0 0
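For completeness, the cleanup was nothing more than this (zvol name taken from the error output above):
Code:
# Remove the corrupted zvol, then scrub so ZFS verifies the whole pool again.
zfs destroy HDD-POOL/vm-100-disk-3
zpool scrub HDD-POOL
zpool status -v HDD-POOL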
I checked /var/log/messages and found a lot of ugly stuff in there. The log is basically flooded with this kind of message, so I'll just provide an example:
Code:
#Oct 12 21:57:38 proxmox kernel: [ 7655.326518] sd 0:0:3:0: attempting task abort! scmd(ffff9cc2f6801e00)
#Oct 12 21:57:38 proxmox kernel: [ 7655.326523] sd 0:0:3:0: [sdd] tag#76 CDB: Write(10) 2a 00 03 a0 be 48 00 00 08 00
#Oct 12 21:57:38 proxmox kernel: [ 7655.326525] scsi target0:0:3: handle(0x000d), sas_address(0x5001438021185707), phy(7)
#Oct 12 21:57:38 proxmox kernel: [ 7655.326527] scsi target0:0:3: enclosure_logical_id(0x5001438021185725), slot(36)
#Oct 12 21:57:42 proxmox kernel: [ 7658.660569] sd 0:0:3:0: task abort: SUCCESS scmd(ffff9cc2f6801e00)
#Oct 12 21:57:42 proxmox kernel: [ 7658.660579] sd 0:0:3:0: [sdd] tag#76 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
#Oct 12 21:57:42 proxmox kernel: [ 7658.660582] sd 0:0:3:0: [sdd] tag#76 CDB: Write(10) 2a 00 03 a0 be 48 00 00 08 00
#Oct 12 21:57:42 proxmox kernel: [ 7658.661515] sd 0:0:6:0: attempting task abort! scmd(ffff9cc2f70fd980)
#Oct 12 21:57:42 proxmox kernel: [ 7658.661519] sd 0:0:6:0: [sdg] tag#103 CDB: Write(10) 2a 00 03 a0 c7 48 00 01 00 00
#Oct 12 21:57:42 proxmox kernel: [ 7658.661521] scsi target0:0:6: handle(0x0010), sas_address(0x500143802118570b), phy(11)
#Oct 12 21:57:42 proxmox kernel: [ 7658.661523] scsi target0:0:6: enclosure_logical_id(0x5001438021185725), slot(40)
#Oct 12 21:57:45 proxmox kernel: [ 7661.660931] sd 0:0:6:0: task abort: SUCCESS scmd(ffff9cc2f70fd980)
Guess what: sdd and sdg are both part of mirror-4. Having already replaced one drive, I did it again with yet another one. Same result. So how likely is it to have 4 defective drives when they all worked fine on a RAID controller just a day before?
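(In case anyone wonders how to tie the sdX names from the kernel log to the members of mirror-4, something like this does the job; the replacement itself is a single command, with a placeholder path for the new disk:)
Code:
# See which physical disks (by model/serial) currently sit behind sdd and sdg.
ls -l /dev/disk/by-id/ | grep -E 'sd[dg]$'
# Swap out a failed member of mirror-4 and let it resilver onto the new disk.
zpool replace HDD-POOL C4-P1_SLOT-40 /dev/disk/by-id/ata-WDC_WD7500BPKX_NEWDISK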
Additionally, I see a lot of the following type of message, about which I could not find anything on the web.
Code:
#Oct 12 21:57:51 proxmox kernel: [ 7667.911504] mpt2sas_cm0: log_info(0x31120112): originator(PL), code(0x12), sub_code(0x0112)
I downgraded the LSI 9211-8i to a P19 firmware, since one of the early P20 versions had given me trouble in the past, but that was more of a lucky shot (which missed). Nothing changed.
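The flashing itself is done with LSI's sas2flash utility, roughly along these lines (the firmware/BIOS file names depend on the package you download, so treat them as placeholders):
Code:
# Show the installed controller(s) with their current firmware and BIOS versions.
sas2flash -listall
# Flash the P19 IT firmware image together with the matching boot BIOS.
sas2flash -o -f 2118it.bin -b mptsas2.rom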
Since I had not had any issues with the 8 disks that were connected to the drive carriers using new cables, I decided to purchase new cables for the remaining ones.
Other messages that appeared while doing a scrub (with the newly purchased cables) and which I had not noticed before:
Code:
#Oct 14 18:45:01 proxmox kernel: [19282.799482] mpt2sas_cm0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
#Oct 14 18:55:02 proxmox kernel: [19883.449078] mpt2sas_cm0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
However, when doing real I/O to the pool, the already known messages are back:
Code:
#Oct 14 19:31:26 proxmox kernel: [22067.843513] mpt2sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
#Oct 14 19:31:29 proxmox kernel: [22070.843870] mpt2sas_cm0: log_info(0x31120112): originator(PL), code(0x12), sub_code(0x0112)
#Oct 14 19:33:19 proxmox kernel: [22181.112188] sd 0:0:3:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
#Oct 14 19:33:19 proxmox kernel: [22181.112191] sd 0:0:3:0: [sdd] Sense not available.