I was having a great deal of trouble creating Ceph OSDs on one of my machines: a Dell PowerEdge R730xd with a PERC H730P Mini, passing through Intel SSDSC2BB120G4 120GB DC S3500 drives in JBOD mode (both D2010370 and D2012370 firmware, the latter of which claims to fix an issue where the drive stops responding after certain invalid SCSI commands). It could open previously created OSDs, but would fail to create even a trivial LV in a simple volume group on a whole-disk partition, kicking out the entire drive after a timeout of about a minute once the lvcreate command was started. Doing the drive initialisation on one of my other pve8 boxes, which doesn't use a RAID controller, works just fine. So hardware problem, right? Wrong!
Running strace on the call under the 6.2.16-15 kernel shows the timeout happening in the first call to:
````1707870 ioctl(3, BLKZEROOUT, [0, 4096]) = -1 EIO (Input/output error)````
After reinstalling pve7's 5.15.116-1 kernel, it goes straight past that call, which succeeds, and lvcreate succeeds!
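For anyone who wants to try the same ioctl without going through lvcreate, here is a minimal sketch in Python. It assumes the request number from `<linux/fs.h>` (`BLKZEROOUT` is `_IO(0x12, 127)`) and the same `[0, 4096]` range seen in the trace. The function names are my own, and the trace does not show which device fd 3 refers to, so the target device is an assumption too. It is destructive: it zeroes the first 4 KiB of whatever block device you point it at, so only use a scratch device.

```python
import fcntl
import os
import struct

# BLKZEROOUT is _IO(0x12, 127) in <linux/fs.h>, i.e. (0x12 << 8) | 127.
BLKZEROOUT = (0x12 << 8) | 127


def pack_range(start: int, length: int) -> bytes:
    """Pack the uint64_t range[2] = {start, length} argument the ioctl expects."""
    return struct.pack("=QQ", start, length)


def zero_out(device: str, start: int = 0, length: int = 4096) -> None:
    """Issue BLKZEROOUT on `device`.

    DESTRUCTIVE: zeroes `length` bytes starting at byte offset `start`.
    Raises OSError (e.g. EIO) if the kernel reports a failure.
    """
    fd = os.open(device, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BLKZEROOUT, pack_range(start, length))
    finally:
        os.close(fd)


# Hypothetical usage, as root, against a scratch device only:
#   zero_out("/dev/<scratch-device>")
```

If this fails with EIO on 6.2.16-15 and succeeds on 5.15.116-1 against the same device, that would take LVM out of the picture.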
It's well after midnight here, so I'll do further debugging tomorrow, but I wanted to get this out there in case it's causing anyone else problems (I have been completely unable to find anything remotely related in the upstream kernel, and no one has mentioned it here, on Reddit, or apparently in Debian yet). And next steps? I'll run lvcreate on both kernels to verify that the trace up to that point is indeed the same. I'll go back to my previous scenario to make sure I really can recreate the OSDs I had created earlier, evidently while I was still on pve7 (apparently a month had passed since I first started experimenting down this path, and I hadn't realised I had upgraded pve in that time). I might have a spare non-Intel DC SSD to test this on, but otherwise I can't afford to take my 800GB Intel DCs or any other drives out of production. I was under the impression I had previously tested this on another, non-120GB Intel DC SSD without the error, but I can't be entirely sure of this.
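For comparing the lvcreate traces from the two kernels, something along these lines should do. This is only a sketch: it assumes both traces were saved as text in the format shown above (leading PID, then the syscall), and the function names are made up for illustration. It strips the PID and masks pointer values so that only real differences show up.

```python
import re


def normalise(line: str) -> str:
    """Strip details that differ between runs but don't matter for comparison."""
    line = re.sub(r"^\d+\s+", "", line.strip())        # leading PID
    line = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", line)   # pointer values
    return line


def first_divergence(trace_a, trace_b):
    """Return (index, line_a, line_b) for the first differing syscall, or None."""
    a = [normalise(l) for l in trace_a]
    b = [normalise(l) for l in trace_b]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    if len(a) != len(b):
        # One trace is a strict prefix of the other.
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None
```

Feed it the lines of each trace file (for example `open(path).read().splitlines()`). If the first divergence is the BLKZEROOUT ioctl, everything before it matches on both kernels.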
But is this 6.2.16-15 problem a fault in the upstream kernel, or is it a Proxmox-specific issue? Where do I start debugging that?