Upgraded node has several problems after 11 days uptime

Ivan Gersi

Renowned Member
May 29, 2016
I have a cluster of 5 nodes with pve-manager 6.0.9. Pve4 was freshly installed with version 6.4.4 and joined correctly.
Now, after 11 days of uptime, pve4 can't see the NFS backup storage and I can't migrate any machine to another node because all machines' values don't match the regex pattern.
I know it is not recommended to mix pve-manager versions, but is there any way to fix it in this state?
Edit: A node restart helped, but this is a temporary solution.
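To confirm exactly which versions are mixed, each node can report its packages with pveversion; a minimal sketch (the node names pve1..pve5 are examples, and it assumes root SSH between the nodes):

```shell
#!/bin/sh
# Example only: node names are placeholders; assumes root SSH between nodes.
NODES="pve1 pve2 pve3 pve4 pve5"
# Print the commands rather than running them -- drop the echo to execute.
for node in $NODES; do
    echo "ssh $node pveversion -v"
done
```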
 
I know it is not recommended to mix pve-manager versions

You know it, but did it nonetheless. Recommendations generally exist for a reason.

Edit: A node restart helped, but this is a temporary solution.

Since you have a (temporarily) working cluster again, you should now update all the nodes to the most recent PVE 6.4 version (and reboot them one by one so they boot with the new kernel).
Make sure that the appropriate PVE repository [1] is set up correctly on all nodes (especially if you do not have subscriptions [2]).
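For reference, on a non-subscription setup the PVE 6.x repository entry (for Debian Buster) would typically live in /etc/apt/sources.list.d/pve-no-subscription.list and look like this; see [1] for the authoritative list:

```
deb http://download.proxmox.com/debian/pve buster pve-no-subscription
```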

After this, you should also strongly consider upgrading to PVE 7 [3], because PVE 6 is EOL [4].

[1] https://pve.proxmox.com/wiki/Package_Repositories#_proxmox_ve_6_x_repositories
[2] https://proxmox.com/en/proxmox-ve/pricing
[3] https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
[4] https://pve.proxmox.com/wiki/FAQ
 
I'll have to do it (upgrade all nodes) because the node went grey again after 4 days.
Restarting the services (cluster, daemon, proxy...) helped and the node is online/green again, but it still can't see the NFS storage.
And I have a strange indication in the logs:
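For the record, the services in question on a standard PVE install correspond to these systemd units; a sketch (restarting requires root):

```shell
#!/bin/sh
# Unit names from a standard PVE 6 install; restarting needs root.
SERVICES="pve-cluster pvedaemon pveproxy pvestatd"
# Print the restart commands -- drop the echo to actually run them.
for s in $SERVICES; do
    echo "systemctl restart $s"
done
```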
Oct 14 01:22:00 pve4 kernel: [363056.310337] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 14 01:22:00 pve4 kernel: [363056.311608] rcu: 17-...!: (607 GPs behind) idle=734/0/0x0 softirq=23512440/23512440 fqs=1
Oct 14 01:22:00 pve4 kernel: [363056.312503] (detected by 8, t=15002 jiffies, g=47593777, q=8582)
Oct 14 01:22:00 pve4 kernel: [363056.312507] Sending NMI from CPU 8 to CPUs 17:
Oct 14 01:22:00 pve4 kernel: [363056.312560] NMI backtrace for cpu 17 skipped: idling at intel_idle+0x8b/0x130
Oct 14 01:22:00 pve4 kernel: [363056.313525] rcu: rcu_sched kthread starved for 15000 jiffies! g47593777 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=17
Oct 14 01:22:00 pve4 kernel: [363056.315187] rcu: RCU grace-period kthread stack dump:
Oct 14 01:22:00 pve4 kernel: [363056.316722] rcu_sched I 0 12 2 0x80004000
Oct 14 01:22:00 pve4 kernel: [363056.316728] Call Trace:
Oct 14 01:22:00 pve4 kernel: [363056.316741] __schedule+0x2e6/0x700
Oct 14 01:22:00 pve4 kernel: [363056.316745] schedule+0x33/0xa0
Oct 14 01:22:00 pve4 kernel: [363056.316749] schedule_timeout+0x152/0x330
Oct 14 01:22:00 pve4 kernel: [363056.316754] ? rcu_report_qs_rnp+0xb3/0x100
Oct 14 01:22:00 pve4 kernel: [363056.316760] ? __next_timer_interrupt+0xd0/0xd0
Oct 14 01:22:00 pve4 kernel: [363056.316764] rcu_gp_kthread+0x488/0x9a0
Oct 14 01:22:00 pve4 kernel: [363056.316769] kthread+0x120/0x140
Oct 14 01:22:00 pve4 kernel: [363056.316772] ? kfree_call_rcu+0x20/0x20
Oct 14 01:22:00 pve4 kernel: [363056.316775] ? kthread_park+0x90/0x90
Oct 14 01:22:00 pve4 kernel: [363056.316779] ret_from_fork+0x35/0x40
Maybe there is some relationship with the unreachable NFS storage?
df -h hangs, and any operation touching the NFS mount hangs too (e.g. lsof, cd /mnt/...).
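Since a plain df or stat blocks indefinitely on a dead NFS mount, probing the mount point under a timeout avoids wedging the shell; a minimal sketch (the mount path is an example, adjust to your storage):

```shell
#!/bin/sh
# Probe a mount point without blocking forever: stat runs in a child
# process that timeout kills after 5 seconds if the NFS server is dead.
probe_mount() {
    if timeout 5 stat "$1" >/dev/null 2>&1; then
        echo "responsive"
    else
        echo "hung or missing"
    fi
}
probe_mount /mnt/pve/nfs-backup   # example path -- adjust to your storage
```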
 
