All functions became slooow (corosync problem?)

grin

Renowned Member
Dec 8, 2008
174
22
83
Hungary
grin.hu
On some clusters (4.2 and 4.3) every management function started to get slow; GUI refresh calls take 4+ seconds (literally anything, from options to summary), 'pct' on console (w/o params, too) take 4-8 sec to finish. pvecm took 48 sec just now for a 4 node testcluster with full quorum.

I have tried to trace the wait and reached RPCEnvironment->init_request--> PVE::Cluster::cfs_update();, or possibly anything calling ipcc_send_receive_* Every call takes 2-4 sec.

I am lost here, there's no obvious documentation who handles these requests, visibly not pvestatd or pveproxy. Someone's timing out on something I suspect but I just don't see who and where and why. Any idea what do check?
 
corosync-quorumtool -l takes like forever, 150 sec. There's plenty of traffic on 5404 and 5405.

[update] Somehow it seems to be related to corosync: on clusters with fast response quorumtool resonds immediately. I do not see anything in logs, however:

Nov 15 16:36:08 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-3)
Nov 15 16:36:08 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:08 node01 corosync[2941]: [CMAP ] lib_init_fn: conn=0x55fd73ae9fa0
Nov 15 16:36:08 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-23)
Nov 15 16:36:08 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:08 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:08 node01 corosync[2941]: [QUORUM] lib_init_fn: conn=0x55fd739e4e40
Nov 15 16:36:08 node01 corosync[2941]: [QUORUM] got quorum_type request on 0x55fd739e4e40
Nov 15 16:36:10 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-24)
Nov 15 16:36:10 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:10 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:14 node01 corosync[2941]: [QB ] IPC credentials authenticated (2941-19005-25)
Nov 15 16:36:14 node01 corosync[2941]: [QB ] connecting to client [19005]
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
Nov 15 16:36:14 node01 corosync[2941]: [MAIN ] connection created
Nov 15 16:36:14 node01 pvestatd[41380]: status update time (16.174 seconds)
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] got trackstart request on 0x55fd739e4e40
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] sending initial status to 0x55fd739e4e40
Nov 15 16:36:16 node01 corosync[2941]: [QUORUM] sending quorum notification to 0x55fd739e4e40, length = 64
Nov 15 16:36:18 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 1
Nov 15 16:36:18 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 2
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:22 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 3
Nov 15 16:36:22 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 4
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1

Nov 15 16:36:30 node01 pvestatd[41380]: status update time (16.181 seconds)

Nov 15 16:36:47 node01 pvestatd[41380]: status update time (16.192 seconds)

Nov 15 16:37:03 node01 pvestatd[41380]: status update time (16.188 seconds)

Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-25)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-25) state:2
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_closed()
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_destroyed()
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-response-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-event-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] Free'ing ringbuffer: /dev/shm/qb-votequorum-request-2941-19005-25-header
Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-24)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-24) state:2
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_closed()
Nov 15 16:37:14 node01 corosync[2941]: [MAIN ] cs_ipcs_connection_destroyed()


 
Last edited:
More info (you won't be happy I guess): due to that *&^%$#@!ing 60 sec fencing (I will murder it someday while praying for a real fencing) one cluster in question have rebooted. The result: it's fast again. corosync-quorumtool returns in 13 ms.

Where to see what (on the still slow clusters)?
 
This keeps happening on various clusters, and only on some member nodes. Right now I have a 3 node cluster (a different one from the above) where node1 and node2 observes this extreme slowdown behaviour (CT/VM are okay but anything related to cluster management, from corosync-* command line tools through pvecm /pvect / qm to the GUI in general are mindboggingly slow; corosync-quorum returns in 65581 ms, which, incidentally, but possibly with no real meaning, quite close to 65536 ms), while node3 is okay (c-q returns in 24ms).

Interesting: corosync-cpgtool [on slow nodes] seem to be running forever (30 minutes now).

corosync-cfgtool -R won't fix it.
Restarting corosync fixes it (but due to a proxmox bug it reboots the node in 1 minute).