Corosync memory leak

pashadee

Hi guys,

Experiencing the same issue on 2 separate clusters of different sizes in two different locations with completely different hardware.

The issue is that after a week or two of regular cluster operation, corosync's memory usage grows to crazy levels; for instance, on one node it is currently at 27.4 GB reserved. It also takes a CPU core and spikes it to either 100% or 200%.

To get the cluster to become responsive again I always have to stop pve-cluster and then corosync, and restart them in reverse order. This happens on all the nodes in the cluster (11 nodes in one cluster, 3 in the other).
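
The workaround, roughly (this assumes the stock pve-cluster and corosync systemd units on PVE 4.x; adjust the unit names if yours differ):

# see how much memory/CPU corosync is using right now
ps -o pid,rss,%cpu,etime,cmd -C corosync

# stop the cluster filesystem first, then corosync
systemctl stop pve-cluster
systemctl stop corosync

# start them again in reverse order
systemctl start corosync
systemctl start pve-cluster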

Not sure if I am the only one that is experiencing this, but any help would be appreciated.

Take care!
 
I should add this perhaps:

Cluster 1 (11 nodes)
proxmox-ve: 4.3-70 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-7 (running version: 4.3-7/db02a4de)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.21-1-pve: 4.4.21-70
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-76
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-67
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve12~bpo80
ceph: 0.94.9-1~bpo80+1




Cluster 2 (3 nodes)
proxmox-ve: 4.4-92 (running kernel: 4.4.67-1-pve)
pve-manager: 4.4-15 (running version: 4.4-15/7599e35a)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.49-1-pve: 4.4.49-86
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.9-1~bpo80+1
 
I've run multiple large installs of Proxmox 4 and just upgraded my home cluster to 5. I've never seen an issue with corosync (once it's working, that is).

My guess would be that there's something wrong on the network. Can you provide some more details? What switches, network interfaces, etc.?

Does the memory grow on all heads or just one or two?

Are you using local storage or shared storage? If shared which?

Cheers
 
Hi Guy, thanks for the response.

It happens on every single node. The setup is as follows: 3 VM hosts and 8 storage nodes.
The VM hosts are the Ceph monitors, while the storage nodes run the Ceph OSDs.

All the VMs use the Ceph RBD pools. The nodes themselves run on RAID 1 SSDs with ZFS.

Gigabit network on the front end connected by Dell switches, and 10 Gb on the backend provided by Mellanox switches.

The strange thing about this is that I am getting the exact same issue with a much smaller 3-node cluster. Just 8-port TP-Link switches on this setup, on both the front and back networks. Same setup with Ceph storage.

I suspect that if it's not corosync itself, it may be some bug in the pve-cluster system that causes corosync to use memory and lock up the CPU.
 
Have you checked the corosync log? Maybe a multicast problem, retransmits?

I think your corosync memory is high because the CPU is high, so corosync is currently doing something weird.

I have never seen this on my 16-20 node clusters (around 30 MB of memory for the corosync process, almost no CPU usage).
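
For the multicast side, something like this should show whether it works cleanly (assuming omping is installed on every node; run it at the same time on all nodes, the hostnames are placeholders):

omping -c 600 -i 1 -q node1 node2 node3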
 
Thanks for the pointer, spirit! I'm not sure how to check the corosync logs, though; /var/log/corosync has nothing in it. Is there a log redirection on Proxmox nodes?

Thanks!
 
Thanks robhost,

I have 200-300 retransmits every second... this seems excessive, no? If that is pointing to the problem, any pointers on where I can get some guidance on fixing it?

Sample line:
Jul 27 14:05:48 px1-g5 corosync[11545]: [TOTEM ] Retransmit List: 775 776 777
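
In case anyone else goes looking: on these versions corosync seems to log via syslog/journald rather than to /var/log/corosync, and I counted the lines roughly like this (it counts log lines per second, not individual sequence numbers):

# recent retransmit messages from the journal
journalctl -u corosync | grep -i retransmit | tail -n 20
# count 'Retransmit List' lines per second using the syslog timestamp
grep 'Retransmit List' /var/log/syslog | cut -c1-15 | uniq -c | tail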
 
Thanks spirit.

If I make the edit on host 1 and increment config version does corosync automatically replicate to the other 2 hosts or do I need to make same edit on all hosts?

Thanks
 
As it's in /etc/pve, it'll be replicated and applied on all nodes. You just need to edit it once.
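
Roughly like this (the values are made up; the point is to copy the file, make the change, increment config_version in the totem block, and then move it back so it gets pushed out to the other nodes):

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
# edit corosync.conf.new: apply the change and bump config_version, e.g.
#   totem {
#     config_version: 5    (was 4)
#     ...
#   }
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf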
 
I'm using a separate network for Ceph. Should corosync run on the same network, or does it matter if it runs on the front side (VM side)?
What's recommended?
 
Hi,
Cluster communication should be separated from storage traffic!

But I had a 5-node cluster without trouble where ceph-public and pve-cluster used the same 10 Gb network... but as I wrote, it's not recommended.
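
If it helps, a dedicated cluster network with corosync 2.x is usually just the totem interface bound to its own subnet in /etc/pve/corosync.conf, along the lines of the following (the address is made up, and remember to bump config_version when changing it):

totem {
  ...
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0    # subnet used only for cluster traffic
  }
}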

Udo
 
