Adaptec Raid - broken requests after upgrade to current PVE 8 - revert to 7.4?

haukenx

Active Member
Dec 16, 2018
Hi there,

I am in the middle of upgrading my 4-node cluster from the latest 7.4 to the current 8.1. While the upgrade procedure itself worked like a charm, I am running into RAID trouble with the current kernel (same as here: https://bugzilla.kernel.org/show_bug.cgi?id=217599#c30 - it really seems to be a kernel issue). I am getting aborted requests, making the system unavailable for several seconds up to minutes.

Two nodes have been updated, and I of course fear losing quorum if two nodes fail at the same time, especially since I also run hyperconverged Ceph (currently 17.2).

Reverting to the latest previous kernel available through apt (6.2.16-5-pve) seems to mitigate the problem, but of course I do not want to keep the cluster in this state.
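Keeping that older kernel as the boot default can be sketched roughly as follows. This is only a sketch: the version string is the one from above, `proxmox-boot-tool` is assumed to be available as on standard PVE installs, and since the real commands need root on the node, they are only printed here:

```shell
# Sketch: keep the unaffected 6.2 kernel as the boot default until a fixed
# 6.5 kernel ships. The version string is the one mentioned above; check
# what is actually installed with `proxmox-boot-tool kernel list`.
KVER="6.2.16-5-pve"

# The real commands need root on the PVE node, so they are only printed here:
echo "proxmox-boot-tool kernel pin ${KVER}"
echo "reboot"
```

`proxmox-boot-tool kernel unpin` should reverse the pin once a fixed kernel is installed.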

Searching the forums, I only found the option of reverting the upgrade by pinning all changed packages to their previous versions, but that does not seem realistic to me for a total of 658 upgraded packages.

So, I am thinking of taking the broken nodes out of the cluster and reinstalling them from the latest 7.4 image. Does that seem like a good approach to you?
 
Just a quick reply while still testing: with your kernel, I cannot reproduce the error using fio on a single node :)

Next steps: I will install the same kernel on the second affected node and bulk-migrate VMs between those two nodes (which always triggered the aborted requests before).

Any other suggestions on how to do further tests?
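One repeatable test would be to hammer the array with fio and watch the kernel log for the aacraid abort messages. The fio parameters and file path below are illustrative assumptions, not values from this thread; the parts that need an affected node are shown as comments, with a testable stand-in for the log check:

```shell
# Pattern of the error reported upstream for this issue:
PATTERN='aacraid.*Outstanding commands'

# On an affected node (needs fio installed and free space on the RAID volume):
#   fio --name=aacraid-repro --filename=/var/lib/vz/fio.test --size=4G \
#       --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
#       --runtime=300 --time_based
#   dmesg | grep -E "$PATTERN"

# Stand-in so the grep itself can be verified against a sample log line:
MATCH="$(echo 'aacraid: Outstanding commands on (0,0,0,0):' | grep -E "$PATTERN")"
echo "$MATCH"
```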
 
Hi there,

after using the patched kernel on both affected nodes, they both behave perfectly fine again :) Thanks a lot!

One question remains: What do I do with my funny cluster now?
  • Upgrade the remaining two nodes (still on 7.4) and use the patched kernel
  • Upgrade the remaining two nodes (still on 7.4) and use the older 6.2 kernel that seems not to be affected
  • Downgrade (i.e.: remove them from the cluster and re-install 7.4) the two affected nodes
  • Wait
Could you give me some directions on that, please?
 
Thanks for the feedback!

While I cannot promise a timeline for the fix, your feedback that the kernel with the patches resolves your issue makes it quite likely that we will pull them in, so they should be available in one of the next 6.5 proxmox-kernels. Additionally, checking the upstream bug report at bugzilla.kernel.org, the revert has already been applied to the mainline kernel, and Debian (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059624) has also pulled in the changes.

The best way forward depends on your preferences/needs:
* 7.4 is still supported for more than half a year (https://pve.proxmox.com/pve-docs/chapter-pve-faq.html), but it won't get any new features.
* The 6.2 kernel will not get further updates from our side, so if a security issue comes up, your systems would be vulnerable as well.
* Running a mixed-version cluster is not something we test very extensively, and migrations from newer nodes to older ones in particular might break (although such a mixed setup can work in many environments for a while).

If you can wait another week or two, I'd suggest seeing whether the next proxmox-kernel contains the fix.

I hope this helps!
 
I found this thread only an hour ago, after we realised we are affected by the issue ("aacraid: Outstanding commands on ...") at random times. It's Christmas time, and here in NZ this is the holiday season, so we upgraded various servers with different Adaptec RAID controllers (models 8405, 8805, 81605Z) while customers are on holiday - and it turns out that most of the upgraded servers are now showing the issue, mainly at I/O-intensive times like nightly vzdump snapshot creation.
There has been very little use of these servers during the day so far due to the holidays, but all will be back to normal use on Monday and we are very worried - hence we will have to try the patched kernel ASAP.
I will report back here.
Michael
 
It seems that the issue does not occur on 100% of systems running one of these controllers - a user who was affected was kind enough to provide an 8405 model to us for further testing, and we did not manage to reproduce the issue (even with the same controller firmware version) - so maybe your systems won't be hit. However, if they are, I'm rather optimistic that the patched kernel addresses the issue!

I hope this helps!
 
Our ratio is about 60% affected, across Adaptec controller models 8405, 8805 and 81605Z. All controllers are on their latest firmware version.
We have updated the affected systems with the patched kernel, and so far (overnight) all have been stable and have not shown the former issue, so that looks very promising.

Hoping there won't be any negative side effects from the patched kernel (e.g. performance degradation).

We don't know why 40% of the systems do not show the issue either, and we cannot find a common denominator. We have left these systems on the official PVE 8 kernel for now and will monitor them closely next week.

BTW, all systems were in-place upgrades from PVE 7.4, not fresh installs (as unlikely as I think it is that this is related).

Thanks.
 
Thanks a lot for your feedback. I'll keep the cluster in its current mixed state (it seems to be running smoothly) and wait for the patched kernel to arrive through the repositories.

Hauke
 
Any news on this issue? I have installed the patched kernel here and for two customers, and everything is running without any problems with the Microsemi 88x adapter. Hence the question: will a regular kernel with this fix be released in the future?
 
The next PVE kernel release (6.5.11-8-pve) will contain the revert/fix; it's in internal testing at the moment.
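Once that release lands, picking it up should amount to a normal kernel update. A sketch only: `proxmox-kernel-6.5` is the PVE 8 meta-package naming (verify with `apt-cache search proxmox-kernel`), and the privileged commands are shown as comments:

```shell
# On the node, as root:
#   apt update
#   apt install proxmox-kernel-6.5
#   reboot
# Afterwards, confirm which kernel version is actually running:
RUNNING_KERNEL="$(uname -r)"
echo "${RUNNING_KERNEL}"
```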
 
