Adaptec Raid - broken requests after upgrade to current PVE 8 - revert to 7.4?

haukenx

Active Member
Dec 16, 2018
Hi there,

I am in the middle of upgrading my 4-node cluster from the latest 7.4 to the current 8.1. While the upgrade procedure itself worked like a charm, I am running into RAID trouble with the current kernel (same as here: https://bugzilla.kernel.org/show_bug.cgi?id=217599#c30 - it really seems to be a kernel issue). I am getting aborted requests, making the system unavailable for several seconds up to minutes.

Two nodes have been updated, and I of course fear losing quorum if two nodes fail at the same time, especially since I also run hyperconverged Ceph (currently 17.2).

Reverting to the latest previous kernel available through apt (6.2.16-5-pve) seems to mitigate the problem, but of course I do not want to keep the cluster in this state.
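Keeping that older kernel as the boot default can be sketched roughly as follows. This is only a sketch: the version string is the one from above, `proxmox-boot-tool` is assumed to be available as on standard PVE installs, and since the real commands need root on the node, they are only printed here:

```shell
# Sketch: keep the unaffected 6.2 kernel as the boot default until a fixed
# 6.5 kernel ships. The version string is the one mentioned above; check
# what is actually installed with `proxmox-boot-tool kernel list`.
KVER="6.2.16-5-pve"

# The real commands need root on the PVE node, so they are only printed here:
echo "proxmox-boot-tool kernel pin ${KVER}"
echo "reboot"
```

`proxmox-boot-tool kernel unpin` should reverse the pin once a fixed kernel is installed.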

Searching the forums, I only found the option of reverting the upgrade by pinning all changed packages to their previous versions, but that does not seem realistic to me for a total of 658 upgraded packages.

So, I am thinking of taking the broken nodes out of the cluster and reinstalling them from the latest 7.4 image. Does that seem like a good approach to you?
 
Just a quick reply while still testing: with your kernel, I cannot reproduce the error using fio on a single node :)

Next steps: I will install the same kernel on the second affected node and bulk-migrate VMs between those two nodes (which always triggered the aborted requests before).

Any other suggestions on how to do further tests?
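One repeatable test would be to hammer the array with fio and watch the kernel log for the aacraid abort messages. The fio parameters and file path below are illustrative assumptions, not values from this thread; the parts that need an affected node are shown as comments, with a testable stand-in for the log check:

```shell
# Pattern of the error reported upstream for this issue:
PATTERN='aacraid.*Outstanding commands'

# On an affected node (needs fio installed and free space on the RAID volume):
#   fio --name=aacraid-repro --filename=/var/lib/vz/fio.test --size=4G \
#       --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
#       --runtime=300 --time_based
#   dmesg | grep -E "$PATTERN"

# Stand-in so the grep itself can be verified against a sample log line:
MATCH="$(echo 'aacraid: Outstanding commands on (0,0,0,0):' | grep -E "$PATTERN")"
echo "$MATCH"
```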
 
Hi there,

after using the patched kernel on both affected nodes, they both behave perfectly fine again :) Thanks a lot!

One question remains: What do I do with my funny cluster now?
  • Upgrade the remaining two nodes (still on 7.4) and use the patched kernel
  • Upgrade the remaining two nodes (still on 7.4) and use the older 6.2 kernel that seems not to be affected
  • Downgrade (i.e.: remove them from the cluster and re-install 7.4) the two affected nodes
  • Wait
Could you give me some directions on that, please?
 
Thanks for the feedback!

While I cannot promise a timeline for the fix, your feedback that the kernel with the patches resolves your issue makes it quite likely that we will pull them in, so they should be available in one of the next 6.5 proxmox-kernels. Additionally, checking the upstream bug report at bugzilla.kernel.org, the revert has already been applied to the mainline kernel, and Debian (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059624) has also pulled in the changes.

The best way forward depends on your preferences/needs:
* 7.4 is still supported for more than half a year (https://pve.proxmox.com/pve-docs/chapter-pve-faq.html), but it won't get any new features.
* The 6.2 kernel will not get further updates from our side, so if a security issue comes up, your systems would be vulnerable as well.
* Running a mixed-version cluster is not something we test very extensively, and migrations from newer nodes to older ones in particular might break (although such a mixed setup can work in many environments for a while).

If you can wait another week or two, I'd suggest seeing whether the next proxmox-kernel contains the fix.

I hope this helps!
 
I found this thread only an hour ago, after we realised we are affected by the issue ("aacraid: Outstanding commands on ...") at random times. It's Christmas time, and here in NZ this is the holiday season, so we upgraded various servers with different Adaptec RAID controllers (models 8405, 8805, 81605Z) while customers are on holiday - and it turns out that most of the upgraded servers are now showing the issue, mainly at I/O-intensive times like nightly vzdump snapshot creation.
There has been very little use of these servers during the day so far due to the holidays, but all will be back to normal use on Monday and we are very worried - hence we will have to try the patched kernel ASAP.
I will report back here.
Michael
 
It seems that the issue does not occur on 100% of systems running one of these controllers - a user who was affected was kind enough to provide an 8405 model to us for further testing, and we did not manage to reproduce the issue (even with the same controller firmware version) - so maybe your systems won't be hit. However, if they are, I'm rather optimistic that the patched kernel addresses the issue!

I hope this helps!
 
Our ratio is about 60% affected, across Adaptec controller models 8405, 8805 and 81605Z. All controllers are on their latest firmware version.
We have updated the affected systems with the patched kernel, and so far (overnight) all have been stable and have not shown the former issue, so that looks very promising.

Hoping there won't be any negative side effects from the patched kernel (e.g. performance degradation).

We don't know why 40% of the systems do not show the issue either, and we cannot find a common denominator. We have left these systems on the official PVE 8 kernel for now and will monitor them closely next week.

BTW, all systems were in-place upgrades from PVE 7.4, not fresh installs (as unlikely as I think it is that this is related).

Thanks.
 
Thanks a lot for your feedback. I'll keep the cluster in its current mixed state (it seems to be running smoothly) and wait for the patched kernel to arrive through the repositories.

Hauke
 
Any news on this issue? I have installed the patched kernel here and for two customers, and everything is running without any problems with the Microsemi 88x adapter. Hence the question: will a regular kernel with this fix be released in the future?
 
The next PVE kernel release (6.5.11-8-pve) will contain the revert/fix; it's in internal testing at the moment.
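Once that release lands, picking it up should amount to a normal kernel update. A sketch only: `proxmox-kernel-6.5` is the PVE 8 meta-package naming (verify with `apt-cache search proxmox-kernel`), and the privileged commands are shown as comments:

```shell
# On the node, as root:
#   apt update
#   apt install proxmox-kernel-6.5
#   reboot
# Afterwards, confirm which kernel version is actually running:
RUNNING_KERNEL="$(uname -r)"
echo "${RUNNING_KERNEL}"
```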
 
