We've been running Proxmox non-clustered for a while and recently clustered up a number of machines. Clustering went fairly smoothly until we started noticing OOM events, with the oom-killer taking down host systems. After some digging we've isolated the issue to corosync running a system completely out of memory. Searching the corosync lists didn't turn up anything, so maybe others here have tips or suggestions.
We've stopped corosync on the critical boxes and isolated it to running on three machines to maintain quorum.
Memory consumption after roughly 11 hours, with corosync at ~80% memory utilization on each system:
pmox1:
USER PID %CPU %MEM VSZ RSS STAT ELAPSED COMMAND
root 218488 0.2 81.2 3412584 3273512 S<Lsl 11:34:40 corosync -f
pmox2:
USER PID %CPU %MEM VSZ RSS STAT ELAPSED COMMAND
root 204975 0.2 81.8 3437340 3298664 S<Lsl 11:39:36 corosync -f
pmox3:
USER PID %CPU %MEM VSZ RSS STAT ELAPSED COMMAND
root 358776 0.2 80.3 3435464 3296348 S<Lsl 11:38:54 corosync -f
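For reference, the figures above can be sampled over time with a small helper like the one below (a sketch, not what we literally ran; assumes GNU/procps `ps` as shipped on Debian Wheezy):

```shell
#!/bin/sh
# Sketch: log a timestamped resident-set-size sample for a given PID,
# so corosync's memory growth can be graphed over hours.
log_rss() {
    pid="$1"
    # RSS in KiB; tr strips ps's leading padding
    rss=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date +%s) pid=$pid rss_kb=$rss"
}
# e.g., every few minutes from cron:
#   log_rss "$(pidof corosync)"
```

Appending this output to a file makes it easy to see whether RSS climbs linearly or in bursts.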
Debian Wheezy
Proxmox versions:
pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-20-pve
proxmox-ve-2.6.32: 3.0-100
pve-kernel-2.6.32-20-pve: 2.6.32-100
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-22
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1
I should add that when I turn off corosync I don't run into any OOM issues, but then of course I can't cluster or get the nice integrated management.
Cluster.conf:
<?xml version="1.0"?>
<cluster config_version="18" name="clrdev">
<cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<clusternodes>
<clusternode name="int-proxmox2" nodeid="1" votes="1"/>
<clusternode name="int-proxmox1" nodeid="2" votes="1"/>
<clusternode name="proxmox4" nodeid="3" votes="1"/>
<clusternode name="proxmox3" nodeid="4" votes="1"/>
<clusternode name="proxmox7" nodeid="5" votes="1"/>
<clusternode name="proxmox6" nodeid="6" votes="1"/>
</clusternodes>
<rm/>
</cluster>
The corosync.log is fairly chatty; I've pasted it here so someone can confirm whether this is normal chatter: http://pastebin.com/zf13srf5
If anything else is needed or helpful please let me know.