LVM lvcreate getting stuck in pve8 kernel on some JBOD controller/SSD (Intel DC S3500) combinations?

tconnors

Member
Jun 24, 2020
I was having a great deal of trouble creating Ceph OSDs on one of my machines: a Dell PowerEdge R730xd with a PERC H730P Mini, passing through Intel SSDSC2BB120G4 120GB DC S3500 drives in JBOD mode. I tried both the D2010370 and D2012370 firmware; the latter claims to fix an issue where the drive stops responding after receiving certain invalid SCSI commands. The machine could open previously created OSDs, but failed to create even a trivial LV in a simple volume group on a whole-disk partition, and the entire drive got kicked out after a timeout of about 1 minute once the lvcreate command was started. Doing the drive initialisation in one of my other pve8 boxes, which does not use a RAID controller, works just fine. So hardware problem, right? Wrong!

Running strace on the call under the 6.2.16-15 kernel shows the timeout happening in the first call to:

```
1707870 ioctl(3, BLKZEROOUT, [0, 4096]) = -1 EIO (Input/output error)
```

After reinstalling pve7's 5.15.116-1 kernel, that same call succeeds immediately, and so does lvcreate!
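To take LVM out of the picture, the failing call can be issued on its own. The following is only a minimal sketch, not what lvcreate does internally: it assumes an x86-64 machine (where `BLKZEROOUT` is `_IO(0x12, 127)`, i.e. `0x127F`), and the helper names `pack_range` and `zero_out` are made up for illustration. Be warned that it overwrites the first 4096 bytes of whatever device you point it at, so only run it against a scratch device.

```python
import fcntl
import os
import struct
import sys

# _IO(0x12, 127) from <linux/fs.h>. ASSUMPTION: x86-64 ioctl encoding;
# some other architectures encode _IO() differently.
BLKZEROOUT = 0x127F


def pack_range(start, length):
    """Pack the uint64_t range[2] = {start, length} argument BLKZEROOUT expects."""
    return struct.pack("=QQ", start, length)


def zero_out(path, start=0, length=4096):
    """Ask the kernel to zero `length` bytes at byte offset `start` of `path`.

    DESTRUCTIVE: the addressed range of the device is overwritten with zeroes.
    Raises OSError (e.g. errno EIO) if the ioctl fails.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BLKZEROOUT, pack_range(start, length))
    finally:
        os.close(fd)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: zeroout.py /dev/SCRATCH_DEVICE")
    try:
        zero_out(sys.argv[1])
    except OSError as exc:
        print(f"BLKZEROOUT failed: {exc}")
    else:
        print("BLKZEROOUT succeeded")
```

If this reproduces the EIO on 6.2.16-15 and succeeds on 5.15.116-1 against the same drive, that would suggest the problem sits below LVM, somewhere in the block layer, SCSI layer, or controller driver.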

It's well after midnight here, so I'll do further debugging tomorrow, but I wanted to get this out there in case it's causing anyone else problems. I have been completely unable to find anything remotely related in the upstream kernel, and no-one has mentioned it here, on reddit, or apparently in Debian yet. Next steps: I'll run lvcreate on both kernels to verify that the trace up to that point is indeed the same. I'll also go back to my previous scenario to make sure I really can recreate the OSDs I had created earlier, evidently while I was still on pve7 (apparently a month has passed since I first started experimenting down this path, and I hadn't realised I had upgraded pve in that time). I might have a spare non-Intel DC SSD to test this on, but otherwise I can't afford to take my 800GB Intel DCs or any other drives out of production. I was under the impression that I had previously tested this on another, non-120GB Intel DC SSD without the error, but I can't be entirely sure of this.
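For comparing the two traces, something like the sketch below might save some eyeballing. It assumes both runs were captured with `strace -f -o FILE ...`, so each line starts with a PID as in the excerpt above, and that the hypothetical file names passed on the command line hold one trace per kernel. It strips the PID and masks hex addresses, then reports the first line where the two traces differ; the function names are my own.

```python
import re
import sys

PID_PREFIX = re.compile(r"^\d+\s+")       # leading PID written by strace -f -o
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")  # pointers differ between runs


def normalise(line):
    """Drop the PID prefix and mask hex addresses so runs become comparable."""
    line = PID_PREFIX.sub("", line.rstrip("\n"))
    return HEX_ADDR.sub("0xADDR", line)


def first_divergence(lines_a, lines_b):
    """Return (index, line_a, line_b) for the first differing line, else None.

    Only the common prefix is compared: if one trace is simply longer
    than the other, None is returned.
    """
    pairs = zip(map(normalise, lines_a), map(normalise, lines_b))
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index, a, b
    return None


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: tracediff.py TRACE_OLD_KERNEL TRACE_NEW_KERNEL")
    with open(sys.argv[1]) as old, open(sys.argv[2]) as new:
        result = first_divergence(old.readlines(), new.readlines())
    if result is None:
        print("no difference in the common prefix")
    else:
        index, a, b = result
        print(f"first difference at line {index + 1}:\n  old: {a}\n  new: {b}")
```

Device names, timestamps, and anything else that legitimately varies between boots will still show up as differences, so the first hit is a starting point rather than proof.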

But is this 6.2.16-15 problem a fault in the upstream kernel, or is it a Proxmox-specific issue? Where do I start debugging that?
 
I can verify this does not affect all SSDs: on a crappy SCSI SanDisk SSD, lvcreate works just fine on 6.2.16-15 through that same HBA.

It affects all my DC S3500s (SCSI, through that HBA, but not in other machines without the PERC H730P) when running the 6.2.16-15 kernel, and they're all fine on the 5.15.116-1 kernel.

It affects them with and without `issue_discards = 1` in lvm.conf. What is the 6.2 kernel doing differently with the BLKZEROOUT ioctl?
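One thing that might be worth comparing between the two kernels, and between the S3500 and the SanDisk, is what the block layer advertises for each drive in `/sys/block/<dev>/queue/`. As far as I understand it, `BLKZEROOUT` can be offloaded to the device when `write_zeroes_max_bytes` is non-zero and otherwise falls back to writing zero-filled pages, so a difference there would at least narrow things down. This is only a sketch under that assumption: the attribute list is my own selection, the default device name `sdb` is a placeholder, and attributes that do not exist on a given kernel are reported as `None`.

```python
import sys
from pathlib import Path

# Queue attributes that plausibly influence how zeroing and discards are issued.
# ASSUMPTION: this selection is illustrative, not an exhaustive list.
ATTRS = (
    "write_zeroes_max_bytes",
    "write_same_max_bytes",
    "discard_max_bytes",
    "discard_max_hw_bytes",
    "discard_granularity",
    "max_sectors_kb",
    "max_hw_sectors_kb",
)


def queue_limits(device, sysfs_root="/sys/block"):
    """Read the selected queue attributes for `device` (e.g. "sdb").

    Returns a dict mapping attribute name to its string value, or to None
    when the attribute is missing or unreadable on this kernel.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ATTRS:
        try:
            limits[name] = (queue / name).read_text().strip()
        except OSError:
            limits[name] = None
    return limits


if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else "sdb"  # placeholder name
    for name, value in queue_limits(device).items():
        print(f"{name:24} {value}")
```

Running it for the same drive under both kernels and diffing the output would show whether 6.2 advertises different limits for the S3500 behind the H730P than 5.15 did.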
 
