I just had a node go red.
The VMs were off the network.
From the logs:
Code:
Dec 13 10:50:01 sys3 CRON[26588]: (root) CMD ( pve-zsync sync --limit 10000 --source 4444 --dest 10.2.2.181:tank/pve-zsync-bkup --name pro4 --maxsnap 200 --method ssh)
Dec 13 10:50:01 dell1 CRON[14260]: (root) CMD (pve-zsync sync --source 105 --dest tank/pve-zsync-bkup --name imap --maxsnap 101 --method local)
Dec 13 10:50:03 dell1 kernel: [126488.191024] SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
Dec 13 10:50:03 dell1 kernel: [126488.191033] cache: kmalloc-4096(4:107), object size: 4096, buffer size: 4096, default order: 3, min order: 0
Dec 13 10:50:03 dell1 kernel: [126488.191038] node 0: slabs: 14063, objs: 108661, free: 0
Dec 13 10:50:03 dell1 kernel: [126488.294658] cache: kmalloc-4096(4:107), object size: 4096, buffer size: 4096, default order: 3, min order: 0
Dec 13 10:50:03 dell1 kernel: [126488.395777] SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
Dec 13 10:50:03 dell1 kernel: [126488.395789] node 0: slabs: 14063, objs: 108661, free: 0
Dec 13 10:50:03 dell1 kernel: [126488.479726] cache: kmalloc-4096(4:107), object size: 4096, buffer size: 4096, default order: 3, min order: 0
Dec 13 10:50:04 dell1 kernel: [126489.916747] Possible memory allocation deadlock: size=80 lflags=0xc210
Dec 13 10:50:04 dell1 kernel: [126489.916772] 0000000000000000 000000000000c210 ffff8803ee33b3b8 ffffffffc0097bfb
Dec 13 10:50:04 dell1 kernel: [126489.916818] [<ffffffffc0097bfb>] spl_kmem_zalloc+0x17b/0x180 [spl]
Dec 13 10:50:04 dell1 kernel: [126489.917030] [<ffffffff8139787c>] ? generic_make_request_checks+0x1dc/0x3a0
Dec 13 10:50:04 dell1 kernel: [126489.917061] [<ffffffff81397be6>] submit_bio+0x76/0x180
Dec 13 10:50:04 dell1 kernel: [126489.917093] [<ffffffff81195802>] pageout.isra.40+0x182/0x270
Dec 13 10:50:04 dell1 kernel: [126489.917127] [<ffffffff81198ece>] shrink_lruvec+0x5fe/0x7f0
Dec 13 10:50:04 dell1 kernel: [126489.917162] [<ffffffff811999b4>] try_to_free_mem_cgroup_pages+0xb4/0x140
Dec 13 10:50:04 dell1 kernel: [126489.917194] [<ffffffff81182717>] add_to_page_cache_lru+0x37/0x90
Dec 13 10:50:04 dell1 kernel: [126489.917226] [<ffffffff810e93b2>] ? set_cpu_itimer+0x132/0x220
Dec 13 10:50:04 dell1 kernel: [126489.917266] Possible memory allocation deadlock: size=80 lflags=0xc210
A little earlier in the log, the issues started here:
Code:
Dec 13 10:41:26 dell1 kernel: [125970.997643] TCP: request_sock_TCP: Possible SYN flooding on port 7002. Sending cookies. Check SNMP counters.
Dec 13 10:42:24 dell1 kernel: [126028.917202] Possible memory allocation deadlock: size=216 lflags=0xc210
Dec 13 10:42:24 dell1 kernel: [126028.917233] 0000000000000000 000000000000c210 ffff8807c3007458 ffffffffc0097bfb
Dec 13 10:42:24 dell1 kernel: [126028.917276] [<ffffffffc0097bfb>] spl_kmem_zalloc+0x17b/0x180 [spl]
Dec 13 10:42:24 dell1 kernel: [126028.917487] [<ffffffffc02e78c6>] zvol_request+0x226/0x680 [zfs]
Dec 13 10:42:24 dell1 kernel: [126028.917520] [<ffffffff81397be6>] submit_bio+0x76/0x180
Dec 13 10:42:24 dell1 kernel: [126028.917551] [<ffffffff81195802>] pageout.isra.40+0x182/0x270
Dec 13 10:42:24 dell1 kernel: [126028.917582] [<ffffffff810a5e00>] ? try_to_wake_up+0x180/0x340
Dec 13 10:42:24 dell1 kernel: [126028.917613] [<ffffffff811f15ae>] try_charge+0x18e/0x720
Dec 13 10:42:24 dell1 kernel: [126028.917774] [<ffffffff81327f33>] ? security_file_permission+0xa3/0xc0
Dec 13 10:42:24 dell1 kernel: [126028.917790] [<ffffffff810675dd>] __do_page_fault+0x19d/0x410
Dec 13 10:42:24 dell1 kernel: [126028.917804] [<ffffffff81809f48>] page_fault+0x28/0x30
Dec 13 10:42:26 dell1 kernel: [126031.251186] Possible memory allocation deadlock: size=224 lflags=0x4210
Dec 13 10:42:26 dell1 kernel: [126031.251205] Hardware name: Dell Inc. PowerEdge R720/0C4Y3R, BIOS 2.5.2 01/28/2015
Dec 13 10:42:26 dell1 kernel: [126031.251224] 00011200ffffffff 0000000000000296 ffff8809d3a23398 ffff8803f6e20a00
Dec 13 10:42:26 dell1 kernel: [126031.251262] [<ffffffffc0097a74>] spl_kmem_alloc+0x184/0x190 [spl]
Dec 13 10:42:26 dell1 kernel: [126031.251441] [<ffffffffc02e786c>] zvol_request+0x1cc/0x680 [zfs]
Dec 13 10:42:26 dell1 kernel: [126031.251465] [<ffffffff81397b2e>] generic_make_request+0xee/0x130
Dec 13 10:42:26 dell1 kernel: [126031.251487] [<ffffffff811cb4a0>] ? __frontswap_store+0x90/0x120
Dec 13 10:42:26 dell1 kernel: [126031.251510] [<ffffffff81197748>] shrink_page_list+0x408/0x780
Dec 13 10:42:26 dell1 kernel: [126031.251535] [<ffffffff81198ece>] shrink_lruvec+0x5fe/0x7f0
Dec 13 10:42:26 dell1 kernel: [126031.251558] [<ffffffff811994e2>] do_try_to_free_pages+0x172/0x440
Dec 13 10:42:26 dell1 kernel: [126031.251580] [<ffffffff811f237e>] mem_cgroup_try_charge+0x8e/0xf0
Dec 13 10:42:26 dell1 kernel: [126031.251600] [<ffffffff81184186>] filemap_fault+0x1b6/0x3e0
Dec 13 10:42:26 dell1 kernel: [126031.251621] [<ffffffff811b4830>] handle_mm_fault+0xfc0/0x1840
Dec 13 10:42:26 dell1 kernel: [126031.251643] [<ffffffff81067872>] do_page_fault+0x22/0x30
Dec 13 10:42:26 dell1 kernel: [126031.251681] CPU: 3 PID: 32519 Comm: mysqld Tainted: P O 4.2.6-1-pve #1
Dec 13 10:42:26 dell1 kernel: [126031.251694] 0000000000000000 0000000000004210 ffff8809d3a233a8 ffffffffc0097a74
Dec 13 10:42:26 dell1 kernel: [126031.251713] [<ffffffff81801028>] dump_stack+0x45/0x57
Dec 13 10:42:26 dell1 kernel: [126031.251807] [<ffffffff813969ff>] ? part_round_stats+0x4f/0x60
Dec 13 10:42:26 dell1 kernel: [126031.252023] [<ffffffff81184779>] ? mempool_alloc+0x69/0x170
Dec 13 10:42:26 dell1 kernel: [126031.252097] [<ffffffff81397be6>] submit_bio+0x76/0x180
Dec 13 10:42:26 dell1 kernel: [126031.252104] [<ffffffff811c5f5e>] __swap_writepage+0x22e/0x270
Dec 13 10:42:26 dell1 kernel: [126031.252110] [<ffffffff811cb4a0>] ? __frontswap_store+0x90/0x120
Dec 13 10:42:26 dell1 kernel: [126031.252134] [<ffffffff81197748>] shrink_page_list+0x408/0x780
Dec 13 10:42:26 dell1 kernel: [126031.252161] [<ffffffff81198ece>] shrink_lruvec+0x5fe/0x7f0
Dec 13 10:42:26 dell1 kernel: [126031.252184] [<ffffffff811994e2>] do_try_to_free_pages+0x172/0x440
Dec 13 10:42:26 dell1 kernel: [126031.252247] [<ffffffff811f237e>] mem_cgroup_try_charge+0x8e/0xf0
..
Dec 13 10:43:56 dell1 kernel: [126121.651232] INFO: task monit:7349 blocked for more than 120 seconds.
Dec 13 10:43:56 dell1 kernel: [126121.651397] monit D ffff880feea56a00 0 7349 1 0x00000000
..
Dec 13 10:43:56 dell1 kernel: [126121.652403] INFO: task bc-server:14570 blocked for more than 120 seconds.
Dec 13 10:43:56 dell1 kernel: [126121.652676] [<ffffffff81806df2>] rwsem_down_read_failed+0xf2/0x140
Dec 13 10:43:56 dell1 kernel: [126121.652686] [<ffffffff813d6704>] call_rwsem_down_read_failed+0x14/0x30
Dec 13 10:43:56 dell1 kernel: [126121.652690] [<ffffffff81806324>] ? down_read+0x24/0x30
Dec 13 10:43:56 dell1 kernel: [126121.652699] [<ffffffff810677be>] __do_page_fault+0x37e/0x410
Dec 13 10:43:56 dell1 kernel: [126121.652706] [<ffffffff818038ae>] ? __schedule+0x37e/0x950
Dec 13 10:43:56 dell1 kernel: [126121.652711] [<ffffffff81067872>] do_page_fault+0x22/0x30
Dec 13 10:43:56 dell1 kernel: [126121.652715] [<ffffffff81809f48>] page_fault+0x28/0x30
Dec 13 10:43:56 dell1 kernel: [126121.652719] INFO: task bc-server:14593 blocked for more than 120 seconds.
Dec 13 10:43:56 dell1 kernel: [126121.652764] Tainted: P O 4.2.6-1-pve #1
Dec 13 10:43:56 dell1 kernel: [126121.652805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Code:
# pveversion -v
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-1-pve: 4.2.3-18
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
From inside an LXC container:
Code:
Dec 13 10:44:51 imap kernel: [126176.155922] [<ffffffff8106735f>] mm_fault_error+0x7f/0x160
This system has 64GB of ECC RAM.
I had to reboot to fix it.
Normal memory usage is 15GB.
Now I'll make sure that pve-zsync jobs from different systems only run one at a time.
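A rough sketch of one way to serialize the jobs that start on the same host, using `flock` from util-linux. The lock file path, the `LOCK_WAIT` timeout and the `run_exclusive` helper name are my own assumptions, not anything pve-zsync provides. A lock like this only covers jobs started on this node. The job that sys3 pushes over ssh would not take it, so that one would still need a different cron minute or some lock on the receiving side.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only while holding an exclusive lock,
# so two cron jobs on the same host cannot overlap.
# LOCKFILE and LOCK_WAIT are assumed names/values, adjust as needed.
LOCKFILE="${LOCKFILE:-/tmp/pve-zsync.lock}"
LOCK_WAIT="${LOCK_WAIT:-600}"   # seconds to wait for the lock before giving up

run_exclusive() {
    # flock -w N: wait up to N seconds for the lock, then fail
    # (exit status 1 by default) without running the command.
    flock -w "$LOCK_WAIT" "$LOCKFILE" "$@"
}

# Example, using the arguments from the local cron entry in the log:
# run_exclusive pve-zsync sync --source 105 --dest tank/pve-zsync-bkup \
#     --name imap --maxsnap 101 --method local
```

In a crontab the same idea can be written inline as `flock -w 600 /tmp/pve-zsync.lock pve-zsync sync ...`, with every local pve-zsync entry pointing at the same lock file.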
It looks like there is a ZFS / kernel bug. I'll do more research; so far I have run into this: http://www.subly.me/articles/linux-zfs-oom.html . However, that person did not have much memory to start with.
Any clues on preventing this?