pvesr issue

schinzelh

Active Member
Apr 30, 2018
Hi,

I have been running a 5-node cluster on Proxmox 5.1 for months. Yesterday all nodes suddenly stopped "seeing" each other, and I found the following errors in `dmesg -T`, repeated several times on all servers at around the same time.

Code:
[Sun Apr 29 15:22:56 2018] INFO: task pvesr:19470 blocked for more than 120 seconds.
[Sun Apr 29 15:22:56 2018]       Tainted: P           O    4.13.13-5-pve #1
[Sun Apr 29 15:22:56 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Apr 29 15:22:56 2018] pvesr           D    0 19470      1 0x00000000
[Sun Apr 29 15:22:56 2018] Call Trace:
[Sun Apr 29 15:22:56 2018]  __schedule+0x3cc/0x850
[Sun Apr 29 15:22:56 2018]  ? path_parentat+0x3e/0x80
[Sun Apr 29 15:22:56 2018]  schedule+0x36/0x80
[Sun Apr 29 15:22:56 2018]  rwsem_down_write_failed+0x230/0x3a0
[Sun Apr 29 15:22:56 2018]  call_rwsem_down_write_failed+0x17/0x30
[Sun Apr 29 15:22:56 2018]  ? call_rwsem_down_write_failed+0x17/0x30
[Sun Apr 29 15:22:56 2018]  down_write+0x2d/0x40
[Sun Apr 29 15:22:56 2018]  filename_create+0x7e/0x160
[Sun Apr 29 15:22:56 2018]  SyS_mkdir+0x51/0x100
[Sun Apr 29 15:22:56 2018]  ? exit_to_usermode_loop+0x9b/0xd0
[Sun Apr 29 15:22:56 2018]  entry_SYSCALL_64_fastpath+0x33/0xa3
[Sun Apr 29 15:22:56 2018] RIP: 0033:0x7fd6f89da477
[Sun Apr 29 15:22:56 2018] RSP: 002b:00007fff33652338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Sun Apr 29 15:22:56 2018] RAX: ffffffffffffffda RBX: 000055a6710ee010 RCX: 00007fd6f89da477
[Sun Apr 29 15:22:56 2018] RDX: 000055a66f403484 RSI: 00000000000001ff RDI: 000055a6745ca2d0
[Sun Apr 29 15:22:56 2018] RBP: 0000000000000000 R08: 0000000000000200 R09: 000055a6710ee028
[Sun Apr 29 15:22:56 2018] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a6731e2158
[Sun Apr 29 15:22:56 2018] R13: 000055a6745659f0 R14: 000055a6745ca2d0 R15: 00000000000001ff

Any idea what could be causing this and how to get the cluster back in sync?
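
To compare how often and when the warning shows up on each node, the hung-task lines can be pulled out of the `dmesg -T` output. A minimal sketch in Python, assuming the message format shown in the log above; `find_hung_tasks` and the regex are only an illustration and not part of any Proxmox tooling:

```python
import re

# Matches kernel hung-task warnings as printed by `dmesg -T`, e.g.
# "[<timestamp>] INFO: task <name>:<pid> blocked for more than <n> seconds."
HUNG_TASK = re.compile(
    r"^\[(?P<when>[^\]]+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, task name, pid, seconds) for each hung-task warning."""
    hits = []
    for line in dmesg_text.splitlines():
        m = HUNG_TASK.match(line)
        if m:
            hits.append((m["when"], m["name"], int(m["pid"]), int(m["secs"])))
    return hits


# Sample line copied from the log above.
sample = (
    "[Sun Apr 29 15:22:56 2018] INFO: task pvesr:19470 "
    "blocked for more than 120 seconds."
)
print(find_hung_tasks(sample))
```

In practice the text would come from the saved `dmesg -T` output of each node, so the timestamps can be lined up across servers.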
 
Hi Fabian,

That sounds exactly like what I am seeing: corosync causing high load and pve-ha-lrm being stuck. I will try the new packages on pvetest and see if they fix it for me. Any further gotchas or tips for the update?

Holger
 
Restarting as described in the linked thread and then upgrading should be enough.