Could this be the cause of my "Failed to import pool 'rpool'" error, or is my system not repairable?

In general it is repairable, yes. You'd only need to take care, when using ZFS, not to upgrade any pool to the newer ZFS features, as the older ZFS in older kernels won't understand those pools anymore.
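For anyone in the same spot: a quick, non-destructive way to see whether a pool already has newer feature flags active before deciding anything (pool name rpool as in this thread; nothing below changes the pool) is:

# List the pool's feature flags and their state (disabled/enabled/active)
zpool get all rpool | grep feature@
# Show pools that don't yet have all supported features enabled.
# Note: 'zpool upgrade rpool' would enable them and can make the pool
# unreadable for older kernels, so skip it if you may still boot an older kernel.
zpool upgrade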
Install here didn't go smoothly; I'm getting: Failed to import pool "rpool."
Booting is stuck... how do I resolve this?
Solved... after I did the install and rebooted, I entered the BIOS and turned on three drives... that confused the new kernel on its first boot. All is well.
We have had a three-node dual Intel(R) Xeon(R) Gold 6426Y cluster here for about a month, running various tests and preparing for our next benchmark paper; most of the latter was done on our 6.5-based kernel with great success so far. The same HW was also tested by our solution partner Thomas Krenn before being delivered to us; they sell those systems with Proxmox VE pre-installed and did not find any issue either (albeit their tests back then probably used 6.2).
FWIW, we also had access to 4th Gen Scalable before its official release through a HW vendor's early-access program, but that was about a year ago, and back then we tested with our (then quite fresh) 6.2 kernel, which worked fine there too.
But yes, we did not run tests with proprietary DBs. If you have more specific details about a use case that can be reproduced without opaque software, we could try to see whether we can reproduce any of that. As it sounds like you do a lot of tinkering, there might also be some guest OS tunables involved; in other words, the more details we get, the more likely we are to find something, if it is an underlying issue and not a misconfiguration.
What kernel is running within the VM?

Running on the 4th Gen Intel with any of the 6.5.x kernels results in the following within a matter of an hour or so.
[Attachment 57625: screenshot of the CPU lockup messages]
I can drop core counts, reduce memory, enable NUMA, etc.; none of it matters and we still hit CPU lockups.
Move the VM back to a 5.15.x kernel and the VM is rock solid. Move the VM to a 2nd or 3rd gen Intel with a 6.x.x kernel and it's solid too.
Also worth mentioning that the VM throws an NMI right on boot with any 6.x.x kernel on the 4th gen Intel.
[Attachment 57626: screenshot of the NMI messages at boot]
It's a Debian 12 VM.
I haven't been able to reproduce the issue locally here even after a few hours, matching your VM configuration as closely as I could (it's only dual socket on my end) and using stress-ng to generate a lot of CPU load and some memory and IO load within the VM.
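For reference, a stress-ng invocation along those lines (worker counts and duration here are illustrative placeholders, not the exact parameters used in the test above) could be:

# CPU load on all vCPUs, some memory pressure and mixed I/O, for a few hours
stress-ng --cpu 0 --vm 4 --vm-bytes 2G --iomix 2 --metrics-brief --timeout 4h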
Could you describe the workload inside the VM in a bit more detail? Is there anything else running on the host system around the time the issue occurs?
How much did you drop the core count? I noticed the configuration has CPU hotplug enabled. Just asking to be sure: was it used during the test?
It's just a shot in the dark, but you could try turning off the numa_balancer as suggested here for a different issue that's also happening with 6.x kernels and not 5.15: https://forum.proxmox.com/threads/p...th-windows-server-2019-vms.130727/post-601617
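If you want to try that, the NUMA balancer is toggled via the standard kernel.numa_balancing sysctl on the host (assuming the linked post refers to the same knob); it can be changed at runtime and only persists if you also add it to the sysctl config:

# Check the current state (1 = automatic NUMA balancing enabled)
cat /proc/sys/kernel/numa_balancing
# Disable it at runtime
sysctl -w kernel.numa_balancing=0
# Optionally make it persistent across reboots
echo 'kernel.numa_balancing = 0' > /etc/sysctl.d/90-numa-balancing.conf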
Why am I getting these on kernel 6.5?
Nov 06 18:42:55 nolliprivatecloud login[1948]: ROOT LOGIN on '/dev/pts/0'
Nov 06 18:44:47 nolliprivatecloud kernel: evict_inodes inode 00000000dc7b1645, i_count = 1, was skipped!
Nov 06 18:44:47 nolliprivatecloud kernel: evict_inodes inode 0000000074b7fd0c, i_count = 1, was skipped!
Nov 06 18:45:03 nolliprivatecloud kernel: evict_inodes inode 00000000165fefb7, i_count = 1, was skipped!
Nov 06 18:45:03 nolliprivatecloud kernel: evict_inodes inode 00000000f7a2b3b6, i_count = 1, was skipped!
Nov 06 18:45:18 nolliprivatecloud kernel: evict_inodes inode 00000000e22e196e, i_count = 1, was skipped!
Nov 06 18:45:18 nolliprivatecloud kernel: evict_inodes inode 00000000997c6818, i_count = 1, was skipped!
Nov 06 18:45:33 nolliprivatecloud kernel: evict_inodes inode 000000006e9fe81d, i_count = 1, was skipped!
Nov 06 18:45:33 nolliprivatecloud kernel: evict_inodes inode 000000005281d993, i_count = 1, was skipped!
Nov 06 18:51:28 nolliprivatecloud pvedaemon[1447]: <root@pam> successful auth for user 'root@pam'
And:
Nov 06 18:39:27 nolliprivatecloud smartd[974]: Device: /dev/nvme1, number of Error Log entries increased from 110 to 124
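As an aside, that smartd message only reports that the drive's error-log entry count increased; to see what the entries actually are, you can dump the NVMe error information log (device name taken from the message above; nvme-cli only if it's installed):

# Full SMART/health report, includes the NVMe error information log
smartctl -a /dev/nvme1
# Or read the error log directly with nvme-cli
nvme error-log /dev/nvme1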
Heavy compile load; cc1plus is the vast majority of it.

While I didn't test with that this time, I did kernel compilations when the issues were first reported here for 6.2, but I couldn't trigger any soft lockups either. Do you have the latest BIOS update and microcode installed?

I will do some testing shortly and report back.
Thanks for the report; this is due to an Ubuntu-specific backport. There was already a report at their bug tracker, but I posted the info there about which patch causes this: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2037214
EDIT: Questions still left from the last post:
How much did you drop the core count? I noticed the configuration has CPU hotplug enabled. Just asking to be sure: was it used during the test?
Just checked Supermicro's site and the host is on the latest BIOS. They don't seem to offer any microcode updates as of yet for this model.

Doing some testing without CPU hotplug enabled, I found the following.
CPU hotplug enabled:
- VM boots fine with all the cores from the host
CPU hotplug disabled:
- VM freezes with core counts past 192
- Makes it through 50% of its boot process, then locks up
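For anyone following along, CPU hotplug is part of the per-VM hotplug option, so a test like the one above could be set up roughly as follows (VMID 100 is just a placeholder):

# Keep disk/network/USB hotplug but drop 'cpu' (and 'memory') from the list
qm set 100 --hotplug network,disk,usb
# Give the VM a fixed vCPU count instead of hotplugging cores
qm set 100 --cores 192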
What about the intel-microcode package?

Did you run a benchmark (on a working kernel) to compare how much you would actually lose from limiting the core count to the number of actual CPUs (and not just hyperthreads)? I know you said you didn't have issues with it in the last decade, but if we can't reproduce the issue, we'll have a really hard time tracking it down, so having a workaround would at least be something.
EDIT: Or even better, compare performance of 5.15 kernel with high CPU count to 6.5 kernel with reduced CPU count.
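Since the real workload is compiles, one rough way to get that comparison without a dedicated benchmark tool would be to time the same parallel build once per kernel/core-count combination (the -j value below is just a placeholder):

# Run once on the 5.15 kernel with the full vCPU count,
# then on 6.5 with the reduced count, and compare wall-clock times
make clean
time make -j128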
We hit soft lockups again with the VM set to 192 cores and our compiles set to use 128 of those cores (NUMA on and CPU hotplug disabled).
No benchmarks; these are production systems, we don't have time to do that kind of stuff. Hence the reason we pay for the enterprise repos.
Just updated the microcode.
root@ccsprogmiscrit1:~# journalctl -k --grep="microcode"
-- Journal begins at Tue 2023-07-25 08:42:43 EDT, ends at Tue 2023-11-07 13:18:20 EST. --
Nov 07 13:17:33 ccsprogmiscrit1 kernel: microcode: updated early: 0x2b0001b0 -> 0x2b0004b1, date = 2023-05-09
Nov 07 13:17:33 ccsprogmiscrit1 kernel: microcode: Microcode Update Driver: v2.2.
Re-running our compile now; the last attempt resulted in CPU lockups. Maybe the microcode update will help.