Hello everybody,
this turns out to be an experience report rather than a request for help, as I finally found a solution (scroll to the very end of the post if you don't care about the journey). Since I don't have a blog of my own, I think this is a good place to make this information available to others.
It is going to be quite a wall of text, but I need to explain the situation properly, and I really hope someone can make use of it some time.
Since the hardware would probably be the first question, here is what I am using:
- Supermicro H8SGL-F Mainboard (BIOS 3.5b)
- Opteron 6276 16C
- 8x8GB Reg ECC DDR3
- LSI 9211-8i in IT mode (FW 20.0.07, BIOS 7.39.02)
- HP SAS Expander 468406-B21 487738-001 (FW 2.10)
- Disks are not server-grade:
- 12x WD Black 750GB 2.5" consisting of WD7500BPKT (3Gbit) and WD7500BPKX (6Gbit) + 2x SSD (Kingston + Samsung)
- Disks are housed in 5x 4x SATA carriers (hot-swap).
- Each SATA carrier is connected to an individual SAS break-out port on the expander.
- 2 of the SATA Carriers were equipped with brand new cables while the other 3 were connected through existing ones.
I migrated my home server to PVE 5 over the last few weeks.
This was done by passing the SAS HBA + expander through via PCI passthrough to a VM on the existing ESXi host. I added 8 drives (HDD-POOL) and 2 SSDs (SSD-POOL) using ZFS as the storage backend. Since I was using a RAID 10 configuration before the migration, I decided to go with mirrored vdevs as well.
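Just to illustrate the layout: a pool built from mirrored vdevs looks roughly like this (shortened to two mirrors here; the device names are only placeholders, not my actual disks):
Code:
# Create a pool from pairs of disks; each "mirror" group becomes its own vdev.
zpool create HDD-POOL \
  mirror /dev/disk/by-id/ata-WDC_WD7500BPKT_SERIAL1 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL2 \
  mirror /dev/disk/by-id/ata-WDC_WD7500BPKT_SERIAL3 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL4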
During the migration I moved one VM after another to the new hypervisor (which was itself running as a VM on the existing hypervisor). Everything went fine; a few minor, conceptual challenges, but nothing I couldn't solve with the help of a popular search engine.
Two days ago the "switchover" took place. All the VMs were migrated and I shut everything down. Moved ESXi out of the way, got the PVE installation in place and imported the pools and all configs back. All smooth.
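Getting the pools back on the fresh PVE install was just a plain import (-f is only needed if a pool was not exported cleanly on the old system):
Code:
# Show which pools are available for import, then import them by name.
zpool import
zpool import -f HDD-POOL
zpool import -f SSD-POOL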
Added another mirror to the HDD-POOL (since the original storage is not needed anymore) and a bunch of spare drives.
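For reference, extending the pool that way is a one-liner per vdev (placeholder device names again):
Code:
# Add another mirrored vdev (the one that later shows up as "mirror-4") plus hot spares.
zpool add HDD-POOL mirror /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL9 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL10
zpool add HDD-POOL spare /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL11 /dev/disk/by-id/ata-WDC_WD7500BPKX_SERIAL12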
It ran well for about 2 days, then all of a sudden I experienced strange behavior on my backup server and mail server: they stopped randomly. After digging around for some time and not finding any possible cause at the VM level (I thought they were running out of memory due to the missing RAID cache), I checked the host.
The freshly added "mirror-4" showed a failed drive with "too many errors".
OK, I replaced it with another drive and started resilvering, which was very slow. It finally finished, but showed uncorrectable errors like the ones below:
After a resilver the mirror shows errors
Code:
# mirror-4 ONLINE 0 0 30
# C4-P1_SLOT-40 ONLINE 0 0 30
# C3-P1_SLOT-36 ONLINE 0 0 30
#...
#errors: Permanent errors have been detected in the following files:
# HDD-POOL/vm-100-disk-3:<0x1>
That is bad!
Moving to a solution that is supposed to protect me from exactly this kind of thing just seemed to create corruption...
Luckily, the affected zvol was only the backup disk of the mail server.
I deleted the zvol, issued a scrub, and the pool as well as the mirror are healthy again.
Code:
# mirror-4 ONLINE 0 0 0
# C4-P1_SLOT-40 ONLINE 0 0 0
# C3-P1_SLOT-36 ONLINE 0 0 0
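For completeness, the cleanup was nothing more than this (zvol name taken from the error output above):
Code:
# Remove the corrupted zvol, then scrub so ZFS verifies the whole pool again.
zfs destroy HDD-POOL/vm-100-disk-3
zpool scrub HDD-POOL
zpool status -v HDD-POOL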
I checked /var/log/messages and found a lot of ugly stuff in there. The log is basically flooded with this kind of message, so I'll just provide an example:
Code:
#Oct 12 21:57:38 proxmox kernel: [ 7655.326518] sd 0:0:3:0: attempting task abort! scmd(ffff9cc2f6801e00)
#Oct 12 21:57:38 proxmox kernel: [ 7655.326523] sd 0:0:3:0: [sdd] tag#76 CDB: Write(10) 2a 00 03 a0 be 48 00 00 08 00
#Oct 12 21:57:38 proxmox kernel: [ 7655.326525] scsi target0:0:3: handle(0x000d), sas_address(0x5001438021185707), phy(7)
#Oct 12 21:57:38 proxmox kernel: [ 7655.326527] scsi target0:0:3: enclosure_logical_id(0x5001438021185725), slot(36)
#Oct 12 21:57:42 proxmox kernel: [ 7658.660569] sd 0:0:3:0: task abort: SUCCESS scmd(ffff9cc2f6801e00)
#Oct 12 21:57:42 proxmox kernel: [ 7658.660579] sd 0:0:3:0: [sdd] tag#76 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
#Oct 12 21:57:42 proxmox kernel: [ 7658.660582] sd 0:0:3:0: [sdd] tag#76 CDB: Write(10) 2a 00 03 a0 be 48 00 00 08 00
#Oct 12 21:57:42 proxmox kernel: [ 7658.661515] sd 0:0:6:0: attempting task abort! scmd(ffff9cc2f70fd980)
#Oct 12 21:57:42 proxmox kernel: [ 7658.661519] sd 0:0:6:0: [sdg] tag#103 CDB: Write(10) 2a 00 03 a0 c7 48 00 01 00 00
#Oct 12 21:57:42 proxmox kernel: [ 7658.661521] scsi target0:0:6: handle(0x0010), sas_address(0x500143802118570b), phy(11)
#Oct 12 21:57:42 proxmox kernel: [ 7658.661523] scsi target0:0:6: enclosure_logical_id(0x5001438021185725), slot(40)
#Oct 12 21:57:45 proxmox kernel: [ 7661.660931] sd 0:0:6:0: task abort: SUCCESS scmd(ffff9cc2f70fd980)
Guess what: sdd and sdg are both part of mirror-4. Having already replaced one drive, I did it again with yet another one. Same result. So how likely is it to have 4 defective drives when they all worked fine on a RAID controller just a day before?
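(In case anyone wonders how to tie the sdX names from the kernel log to the members of mirror-4, something like this does the job; the replacement itself is a single command, with a placeholder path for the new disk:)
Code:
# See which physical disks (by model/serial) currently sit behind sdd and sdg.
ls -l /dev/disk/by-id/ | grep -E 'sd[dg]$'
# Swap out a failed member of mirror-4 and let it resilver onto the new disk.
zpool replace HDD-POOL C4-P1_SLOT-40 /dev/disk/by-id/ata-WDC_WD7500BPKX_NEWDISK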
Additionally, I see a lot of the following type of message, about which I could not find anything on the web.
Code:
#Oct 12 21:57:51 proxmox kernel: [ 7667.911504] mpt2sas_cm0: log_info(0x31120112): originator(PL), code(0x12), sub_code(0x0112)
I downgraded the LSI 9211-8i to a P19 firmware, since one of the early P20 versions had given me trouble in the past, but that was more of a lucky shot (which missed). Nothing changed.
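The flashing itself is done with LSI's sas2flash utility, roughly along these lines (the firmware/BIOS file names depend on the package you download, so treat them as placeholders):
Code:
# Show the installed controller(s) with their current firmware and BIOS versions.
sas2flash -listall
# Flash the P19 IT firmware image together with the matching boot BIOS.
sas2flash -o -f 2118it.bin -b mptsas2.rom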
Since I had not had any issues with the 8 disks that were connected to the drive carriers using new cables, I decided to purchase new cables for the remaining ones.
Other messages that appeared while doing a scrub (with the newly purchased cables) and which I had not noticed before:
Code:
#Oct 14 18:45:01 proxmox kernel: [19282.799482] mpt2sas_cm0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
#Oct 14 18:55:02 proxmox kernel: [19883.449078] mpt2sas_cm0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
However, when doing real I/O to the pool, the already known messages are back:
Code:
#Oct 14 19:31:26 proxmox kernel: [22067.843513] mpt2sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
#Oct 14 19:31:29 proxmox kernel: [22070.843870] mpt2sas_cm0: log_info(0x31120112): originator(PL), code(0x12), sub_code(0x0112)
#Oct 14 19:33:19 proxmox kernel: [22181.112188] sd 0:0:3:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
#Oct 14 19:33:19 proxmox kernel: [22181.112191] sd 0:0:3:0: [sdd] Sense not available.