Upgraded node has several problems after 11 days uptime

Ivan Gersi

Renowned Member
May 29, 2016
I have a cluster of 5 nodes with pve-manager 6.0.9. Pve4 was freshly installed with version 6.4.4 and joined correctly.
Now, after 11 days of uptime, pve4 can't see the NFS backup storage and I can't migrate any machine to another node because all machines' values don't match the regex pattern.
I know it is not recommended to mix pve-manager versions, but is there any way to fix it in this state?
Edit: A node restart helped, but this is a temporary solution.
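To confirm exactly which versions are mixed, each node can report its packages with pveversion; a minimal sketch (the node names pve1..pve5 are examples, and it assumes root SSH between the nodes):

```shell
#!/bin/sh
# Example only: node names are placeholders; assumes root SSH between nodes.
NODES="pve1 pve2 pve3 pve4 pve5"
# Print the commands rather than running them -- drop the echo to execute.
for node in $NODES; do
    echo "ssh $node pveversion -v"
done
```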
 
I know it is not recommended to mix pve-manager versions

You know it, but did it nonetheless. Recommendations generally exist for a reason.

Edit: A node restart helped, but this is a temporary solution.

Since you have a (temporarily) working cluster again, you should now update all the nodes to the most recent PVE 6.4 version (and reboot them one by one so they boot with the new kernel).
Make sure that the appropriate PVE repository [1] is set up correctly on all nodes (especially if you do not have subscriptions [2]).
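For reference, on a non-subscription setup the PVE 6.x repository entry (for Debian Buster) would typically live in /etc/apt/sources.list.d/pve-no-subscription.list and look like this; see [1] for the authoritative list:

```
deb http://download.proxmox.com/debian/pve buster pve-no-subscription
```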

After this, you should also strongly consider upgrading to PVE 7 [3], because PVE 6 is EOL [4].

[1] https://pve.proxmox.com/wiki/Package_Repositories#_proxmox_ve_6_x_repositories
[2] https://proxmox.com/en/proxmox-ve/pricing
[3] https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
[4] https://pve.proxmox.com/wiki/FAQ
 
I'll have to do it (upgrade all nodes) because the node went grey again after 4 days.
Restarting the services (cluster, daemon, proxy...) helped and the node is online/green again, but it still can't see the NFS storage.
And I have a strange indication in the logs:
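For the record, the services in question on a standard PVE install correspond to these systemd units; a sketch (restarting requires root):

```shell
#!/bin/sh
# Unit names from a standard PVE 6 install; restarting needs root.
SERVICES="pve-cluster pvedaemon pveproxy pvestatd"
# Print the restart commands -- drop the echo to actually run them.
for s in $SERVICES; do
    echo "systemctl restart $s"
done
```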
Oct 14 01:22:00 pve4 kernel: [363056.310337] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 14 01:22:00 pve4 kernel: [363056.311608] rcu: 17-...!: (607 GPs behind) idle=734/0/0x0 softirq=23512440/23512440 fqs=1
Oct 14 01:22:00 pve4 kernel: [363056.312503] (detected by 8, t=15002 jiffies, g=47593777, q=8582)
Oct 14 01:22:00 pve4 kernel: [363056.312507] Sending NMI from CPU 8 to CPUs 17:
Oct 14 01:22:00 pve4 kernel: [363056.312560] NMI backtrace for cpu 17 skipped: idling at intel_idle+0x8b/0x130
Oct 14 01:22:00 pve4 kernel: [363056.313525] rcu: rcu_sched kthread starved for 15000 jiffies! g47593777 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=17
Oct 14 01:22:00 pve4 kernel: [363056.315187] rcu: RCU grace-period kthread stack dump:
Oct 14 01:22:00 pve4 kernel: [363056.316722] rcu_sched I 0 12 2 0x80004000
Oct 14 01:22:00 pve4 kernel: [363056.316728] Call Trace:
Oct 14 01:22:00 pve4 kernel: [363056.316741] __schedule+0x2e6/0x700
Oct 14 01:22:00 pve4 kernel: [363056.316745] schedule+0x33/0xa0
Oct 14 01:22:00 pve4 kernel: [363056.316749] schedule_timeout+0x152/0x330
Oct 14 01:22:00 pve4 kernel: [363056.316754] ? rcu_report_qs_rnp+0xb3/0x100
Oct 14 01:22:00 pve4 kernel: [363056.316760] ? __next_timer_interrupt+0xd0/0xd0
Oct 14 01:22:00 pve4 kernel: [363056.316764] rcu_gp_kthread+0x488/0x9a0
Oct 14 01:22:00 pve4 kernel: [363056.316769] kthread+0x120/0x140
Oct 14 01:22:00 pve4 kernel: [363056.316772] ? kfree_call_rcu+0x20/0x20
Oct 14 01:22:00 pve4 kernel: [363056.316775] ? kthread_park+0x90/0x90
Oct 14 01:22:00 pve4 kernel: [363056.316779] ret_from_fork+0x35/0x40
Maybe there is some relationship with the unreachable NFS storage?
df -h hangs, and any operation touching the NFS mount hangs too (e.g. lsof, cd /mnt/...).
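Since a plain df or stat blocks indefinitely on a dead NFS mount, probing the mount point under a timeout avoids wedging the shell; a minimal sketch (the mount path is an example, adjust to your storage):

```shell
#!/bin/sh
# Probe a mount point without blocking forever: stat runs in a child
# process that timeout kills after 5 seconds if the NFS server is dead.
probe_mount() {
    if timeout 5 stat "$1" >/dev/null 2>&1; then
        echo "responsive"
    else
        echo "hung or missing"
    fi
}
probe_mount /mnt/pve/nfs-backup   # example path -- adjust to your storage
```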
 
