All functions became slooow (corosync problem?)

grin

Renowned Member
Dec 8, 2008
172
21
83
Hungary
grin.hu
On some clusters (4.2 and 4.3) every management function started to get slow; GUI refresh calls take 4+ seconds (literally anything, from options to summary), 'pct' on console (w/o params, too) take 4-8 sec to finish. pvecm took 48 sec just now for a 4 node testcluster with full quorum.

I have tried to trace the wait and reached RPCEnvironment->init_request--> PVE::Cluster::cfs_update();, or possibly anything calling ipcc_send_receive_* Every call takes 2-4 sec.

I am lost here, there's no obvious documentation who handles these requests, visibly not pvestatd or pveproxy. Someone's timing out on something I suspect but I just don't see who and where and why. Any idea what do check?
 
corosync-quorumtool -l takes like forever, 150 sec. There's plenty of traffic on 5404 and 5405.

[update] Somehow it seems to be related to corosync: on clusters with fast response quorumtool resonds immediately. I do not see anything in logs, however:

Nov 15 16:36:08 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-3)
Nov 15 16:36:08 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:08 node01 corosync[2941]: [CMAP ] lib_init_fn: conn=0x55fd73ae9fa0
Nov 15 16:36:08 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-23)
Nov 15 16:36:08 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:08 node01 corosync[2941]: [QUORUM] lib_init_fn: conn=0x55fd739e4e40
Nov 15 16:36:08 node01 corosync[2941]: [QUORUM] got quorum_type request on 0x55fd739e4e40
Nov 15 16:36:10 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-24)
Nov 15 16:36:10 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:14 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-25)
Nov 15 16:36:14 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:14 node01 pvestatd[41380]: status update time (16.174 seconds)
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] got trackstart request on 0x55fd739e4e40
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] sending initial status to 0x55fd739e4e40
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] sending quorum notification to 0x55fd739e4e40, length = 64
Nov 15 16:36:18 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 1
Nov 15 16:36:18 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 2
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:22 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 3
Nov 15 16:36:22 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 4
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1

Nov 15 16:36:30 node01 pvestatd[41380]: status update time (16.181 seconds)

Nov 15 16:36:47 node01 pvestatd[41380]: status update time (16.192 seconds)

Nov 15 16:37:03 node01 pvestatd[41380]: status update time (16.188 seconds)

Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-25)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-25) state:2
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_closed()
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_destroyed()
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-response-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-event-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-request-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-24)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-24) state:2
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_closed()
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_destroyed()


 
Last edited:
More info (you won't be happy I guess): due to that *&^%$#@!ing 60 sec fencing (I will murder it someday while praying for a real fencing) one cluster in question have rebooted. The result: it's fast again. corosync-quorumtool returns in 13 ms.

Where to see what (on the still slow clusters)?
 
This keeps happening on various clusters, and only on some member nodes. Right now I have a 3 node cluster (a different one from the above) where node1 and node2 observes this extreme slowdown behaviour (CT/VM are okay but anything related to cluster management, from corosync-* command line tools through pvecm /pvect / qm to the GUI in general are mindboggingly slow; corosync-quorum returns in 65581 ms, which, incidentally, but possibly with no real meaning, quite close to 65536 ms), while node3 is okay (c-q returns in 24ms).

Interesting: corosync-cpgtool [on slow nodes] seem to be running forever (30 minutes now).

corosync-cfgtool -R won't fix it.
Restarting corosync fixes it (but due to a proxmox bug it reboots the node in 1 minute).
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!