[SOLVED] PVE suddenly stopped working, all CTs unreachable

ianux

New Member
Oct 13, 2015
Hello,

Today a weird thing happened: all of a sudden, Icinga started pushing dozens of alerts. All the CTs were unreachable: while still pingable, I couldn't ssh into them, nor pct enter them from the host (the command was just hanging). The web UI showed nothing special; the CTs were up and running.

It's a production server, so I freaked out and attempted a soft reboot, which got stuck (systemd said it had no time limit while waiting for a process to exit, pvedaemon I think), and then I did a hard reboot.

After reboot, everything is fine so far.

Here are the relevant parts of the kernel log from when this happened:

kernel: [172455.308889] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
kernel: [172455.323219] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
kernel: [172455.326029] [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
kernel: [172455.327518] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
kernel: [172455.329815] ffff880d01637be8 0000000000000082 ffff881038532940 ffff880b1cfce040
kernel: [172455.332307] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
kernel: [172455.333529] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
kernel: [172455.335099] [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
kernel: [172455.337775] cat D ffff88103f456a00 0 22418 6479 0x00000104
kernel: [172455.340176] [<ffffffff812fb863>] request_wait_answer+0x163/0x280
kernel: [172455.344816] INFO: task check_mem:22442 blocked for more than 120 seconds.
kernel: [172455.345213] Tainted: P O 4.2.8-1-pve #1
kernel: [172455.346393] ffff880121ec3be8 0000000000000086 ffffffff81e14580 ffff880f650c3700
kernel: [172455.347618] Call Trace:
kernel: [172455.348823] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
kernel: [172455.349238] [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
kernel: [172455.351169] [<ffffffff811fd264>] new_sync_read+0x94/0xd0
kernel: [172455.379977] 0000000000000246 ffff880f0e374000 ffff880f0e373c38 ffff880ff6e8fcf0
kernel: [172455.384370] [<ffffffff811fd264>] new_sync_read+0x94/0xd0

And here is the process that hung during the soft reboot (these entries were repeated many times before the hard reboot):

pvedaemon[11052]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 834.
pvedaemon[11052]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 834.
pvedaemon[11052]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/PVE/Tools.pm line 834.
pvedaemon[24466]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 847, <GEN3306> line 1.
pvedaemon[24466]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 848, <GEN3306> line 2.
pvedaemon[24466]: Argument "\n" isn't numeric in int at /usr/share/perl5/PVE/Tools.pm line 849, <GEN3306> line 3.
lxcfs[1510]: Timed out waiting for scm_cred: Success
 
I had the same problem.

Linux ns3100000 4.2.8-1-pve #1 SMP Fri Feb 26 16:37:36 CET 2016 x86_64 GNU/Linux
pve-manager/4.1-15/8cd55b52 (running kernel: 4.2.8-1-pve)
 
I can confirm the same problem.

Mar 12 22:13:00 pve1 kernel: [202607.321591] INFO: task bash:30206 blocked for more than 120 seconds.
Mar 12 22:13:00 pve1 kernel: [202607.321698] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 12 22:13:00 pve1 kernel: [202607.321792] ffff880174797be8 0000000000000082 ffff88021e5f6040 ffff8800ad1c5280
Mar 12 22:13:00 pve1 kernel: [202607.321801] ffff88041be67840 fffffffffffffe00 ffff880174797c08 ffffffff81806967
Mar 12 22:13:00 pve1 kernel: [202607.321819] [<ffffffff81806967>] schedule+0x37/0x80
Mar 12 22:13:00 pve1 kernel: [202607.321833] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Mar 12 22:13:00 pve1 kernel: [202607.321840] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
Mar 12 22:13:00 pve1 kernel: [202607.321847] [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
Mar 12 22:13:00 pve1 kernel: [202607.321854] [<ffffffff811fd264>] new_sync_read+0x94/0xd0
Mar 12 22:13:00 pve1 kernel: [202607.321861] [<ffffffff811fd91a>] vfs_read+0x8a/0x130



[bře12 22:15] INFO: task bash:30206 blocked for more than 120 seconds.
[ +0,000058] Tainted: P O 4.2.8-1-pve #1
[ +0,000049] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0,000084] bash D 0000000000000006 0 30206 28078 0x00000104
[ +0,000010] ffff880174797be8 0000000000000082 ffff88021e5f6040 ffff8800ad1c5280
[ +0,000005] 0000000000000246 ffff880174798000 ffff880174797c38 ffff8800ca80c580
[ +0,000004] ffff88041be67840 fffffffffffffe00 ffff880174797c08 ffffffff81806967
[ +0,000006] Call Trace:
[ +0,000012] [<ffffffff81806967>] schedule+0x37/0x80
[ +0,000008] [<ffffffff812fb863>] request_wait_answer+0x163/0x280
[ +0,000006] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
[ +0,000004] [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
[ +0,000003] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
[ +0,000004] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
[ +0,000003] [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
[ +0,000003] [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
[ +0,000004] [<ffffffff811fd264>] new_sync_read+0x94/0xd0
[ +0,000004] [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
[ +0,000003] [<ffffffff811fd91a>] vfs_read+0x8a/0x130
[ +0,000003] [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
[ +0,000006] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75

pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-39
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
 
Me too.
My system has hung many times since updating to the latest kernel.


# uname -a
Linux proxmox2 4.2.8-1-pve #1 SMP Fri Feb 26 16:37:36 CET 2016 x86_64 GNU/Linux
 
It happened again this Sunday:
kernel: [268571.791083] [<ffffffff81067d53>] __do_page_fault+0x3e3/0x410
kernel: [268571.953173] Killed process 16400 (/usr/sbin/apach) total-vm:401428kB, anon-rss:10028kB, file-rss:0kB
kernel: [268591.421783] [<ffffffff81186623>] pagefault_out_of_memory+0x13/0x80
kernel: [268594.819443] [<ffffffff811f1201>] mem_cgroup_oom_synchronize+0x531/0x600
kernel: [268594.822463] memory: usage 1048512kB, limit 1048576kB, failcnt 0
kernel: [268597.432210] [<ffffffff81186623>] pagefault_out_of_memory+0x13/0x80
kernel: [268603.937032] 0000000000000000 ffff880e0a7bee00 ffff8802600e7d38 ffffffff81802374
kernel: [268603.939302] [<ffffffff811ed400>] ? mem_cgroup_css_online+0x250/0x250
kernel: [268715.496564] memory+swap: usage 1048528kB, limit 1048576kB, failcnt 8417312
kernel: [268718.514570] CPU: 7 PID: 19794 Comm: doctor.pl Tainted: P O 4.2.8-1-pve #1
kernel: [268718.590747] Memory cgroup stats for /lxc/163: cache:7552KB rss:1040800KB rss_huge:0KB mapped_file:7096KB dirty:0KB writeback:0KB swap:224KB inactive_anon:523944KB active_anon:523656KB inactive_file:24KB active_file:208KB unevictable:0KB
kernel: [268726.904183] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 1.0b 04/21/2015
kernel: [268726.905663] ffff880ecdc6d778 ffff8810328bc030 0000000000000015 0000000000000000
kernel: [268726.908085] [<ffffffff811f1201>] mem_cgroup_oom_synchronize+0x531/0x600
kernel: [268726.909665] [<ffffffff8106788f>] mm_fault_error+0x7f/0x160
kernel: [268743.137023] 0000000000000000 ffff880e0a7bee00 ffff880c8b3a3d38 ffffffff81802374
kernel: [268743.139262] [<ffffffff81802374>] dump_header+0xaf/0x238
kernel: [268743.140393] [<ffffffff811f1201>] mem_cgroup_oom_synchronize+0x531/0x600
kernel: [268743.143400] [<ffffffff8180ca48>] page_fault+0x28/0x30
kernel: [268743.143751] Task in /lxc/163 killed as a result of limit of /lxc/163
kernel: [268746.258383] [<ffffffff8106788f>] mm_fault_error+0x7f/0x160
kernel: [268746.259066] [<ffffffff8180635e>] ? __schedule+0x37e/0x950
kernel: [268746.747548] memory+swap: usage 1048576kB, limit 1048576kB, failcnt 8884441
kernel: [268748.353663] [<ffffffff8180ca48>] page_fault+0x28/0x30
kernel: [268749.751234] Call Trace:
kernel: [268749.836869] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
kernel: [268749.995042] Killed process 12575 (/usr/sbin/apach) total-vm:400956kB, anon-rss:9796kB, file-rss:0kB
kernel: [268750.155926] [<ffffffff81186623>] pagefault_out_of_memory+0x13/0x80
kernel: [268750.237436] memory+swap: usage 1048576kB, limit 1048576kB, failcnt 8933602
kernel: [268751.760978] [<ffffffff81802374>] dump_header+0xaf/0x238
kernel: [268751.763251] [<ffffffff81067d53>] __do_page_fault+0x3e3/0x410
kernel: [268753.044439] /usr/sbin/apach invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
kernel: [268753.046658] 0000000000000000 ffff880e0a7bee00 ffff880685133d38 ffffffff81802374
kernel: [268753.048006] [<ffffffff81802374>] dump_header+0xaf/0x238
kernel: [268753.048348] [<ffffffff81185e53>] oom_kill_process+0x1e3/0x3c0
kernel: [268754.243341] ps invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
kernel: [268754.248388] [<ffffffff81067d53>] __do_page_fault+0x3e3/0x410
kernel: [268754.412081] Killed process 23974 (/usr/sbin/apach) total-vm:400940kB, anon-rss:9788kB, file-rss:0kB
kernel: [268754.747136] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 1.0b 04/21/2015
kernel: [268754.755216] memory: usage 1048328kB, limit 1048576kB, failcnt 0
kernel: [268755.057941] [<ffffffff811ed400>] ? mem_cgroup_css_online+0x250/0x250
kernel: [268755.560108] Call Trace:
kernel: [268755.561702] [<ffffffff811ed400>] ? mem_cgroup_css_online+0x250/0x250
kernel: [268756.763248] 0000000000000000 ffff880e0a7bee00 ffff880685157d38 ffffffff81802374
kernel: [268757.159017] ffff880ecdc6d778 ffff8810328bc030 0000000000000015 0000000000000000
kernel: [268757.159527] Call Trace:
kernel: [268757.162025] [<ffffffff81067d53>] __do_page_fault+0x3e3/0x410
kernel: [268757.162376] [<ffffffff81067da2>] do_page_fault+0x22/0x30
kernel: [268757.321250] Memory cgroup out of memory: Kill process 27967 (/usr/sbin/apach) score 15 or sacrifice child
kernel: [268757.459310] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 1.0b 04/21/2015
kernel: [268757.461781] [<ffffffff811f1201>] mem_cgroup_oom_synchronize+0x531/0x600
kernel: [268758.159373] CPU: 6 PID: 27302 Comm: in:imklog Tainted: P O 4.2.8-1-pve #1
kernel: [268758.159868] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 1.0b 04/21/2015
kernel: [268758.160566] 0000000000000000 ffff880e0a7bee00 ffff880e0a633d38 ffffffff81802374
kernel: [268758.162925] [<ffffffff81186623>] pagefault_out_of_memory+0x13/0x80

And so on...

All those messages were interleaved with output like this:

kernel: [268591.425604] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
kernel: [268591.429440] [27967] 0 27967 100231 3828 147 4 0 0 /usr/sbin/apach
kernel: [268591.429883] [28068] 104 28068 5866 132 16 3 0 0 nrpe
kernel: [268591.432855] [16397] 33 16397 46346 2196 81 4 0 0 /usr/sbin/apach
kernel: [268591.437350] [27158] 104 27158 2118 29 10 3 0 0 uptime
kernel: [268591.445494] [ 4426] 104 4426 2171 24 10 3 0 0 ps
kernel: [268591.449468] [17208] 104 17208 1098 19 8 3 0 0 cat
kernel: [268591.463996] [21865] 104 21865 2171 23 10 3 0 0 ps
kernel: [268591.468059] [ 3881] 104 3881 1098 20 8 3 0 0 cat
kernel: [268591.469193] [ 5934] 0 5934 1084 22 8 3 0 0 sh
kernel: [268591.470226] [ 7073] 104 7073 2171 23 10 3 0 0 ps
kernel: [268591.470924] [ 7743] 104 7743 2171 23 10 3 0 0 ps
kernel: [268591.472143] [12948] 0 12948 12109 103 29 3 0 0 cron
kernel: [268591.475752] [20569] 104 20569 14365 1260 34 3 0 0 apt-get
kernel: [268591.485121] [19910] 0 19910 12109 103 29 3 0 0 cron
kernel: [268591.497463] [25973] 104 25973 2118 29 10 3 0 0 uptime
kernel: [268591.497726] [27146] 104 27146 1098 19 8 3 0 0 cat
kernel: [268591.502252] [10824] 104 10824 2171 23 10 3 0 0 ps
kernel: [268591.503470] [16757] 104 16757 1098 20 7 3 0 0 cat
kernel: [268591.504678] [19897] 104 19897 2171 23 10 3 0 0 ps
kernel: [268591.508823] [ 835] 104 835 2171 24 10 3 0 0 ps
kernel: [268591.512698] [11502] 104 11502 2171 23 8 3 0 0 ps
kernel: [268591.525053] [30171] 104 30171 2171 23 10 3 0 0 ps
kernel: [268591.554927] [ 1720] 104 1720 2171 23 9 3 0 0 ps
kernel: [268591.556293] [12396] 104 12396 1098 19 6 3 0 0 cat
kernel: [268591.560862] [ 5689] 104 5689 2171 23 10 3 0 0 ps
kernel: [268591.561821] [13743] 104 13743 2118 29 10 3 0 0 uptime
kernel: [268591.562376] [20435] 0 20435 12109 103 29 3 0 0 cron
kernel: [268591.563329] [21148] 104 21148 2171 24 9 3 0 0 ps
kernel: [268591.564770] [ 339] 104 339 2118 28 10 3 0 0 uptime

This time, even the host was unreachable. I had to hard reboot it from the OVH web manager.

Is it a kernel problem? A PVE problem? Is it caused by a CT behaving weirdly? Could a DDoS against a CT have caused this?
 
You seem to have run out of memory (possibly only in your container? If you post the full log as an attachment, it might be possible to find out), in which case the kernel will start to kill processes to free memory.
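
For reference, a quick way to check whether a container's memory cgroup is at its limit (a sketch; CTID 163 is taken from the kernel log above, and the paths assume the cgroup v1 layout used on PVE 4.x):

Code:
# Current usage vs. configured limit of the container's memory cgroup
cat /sys/fs/cgroup/memory/lxc/163/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/lxc/163/memory.limit_in_bytes
# How many times the limit was hit (the "failcnt" from the kernel log)
cat /sys/fs/cgroup/memory/lxc/163/memory.failcnt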
 

OK then, what do you want? syslog? kern.log? The only CT appearing in the log is a DNS server, which is why I suspected a DNS amplification attack.
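
In case it helps, here is a rough way to check that theory from the host the next time it happens (a sketch; the interface name eth0 is an assumption):

Code:
# Capture 100 packets of DNS traffic; an amplification attack shows up
# as a flood of queries from spoofed sources answered with large replies
tcpdump -nni eth0 -c 100 udp port 53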
 
"pveversion -v", config of containers, output of "journalctl -b-1" if you have activated persistent journal logging (replace 1 with the number of reboots since the issue occured), otherwise traditional persistent log files (syslog, kern.log, messages, ..).

The process killed was an apache process, so the apache logs might also be relevant (but it is also possible that the memory limit for your container is set too low).
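
If the limit does turn out to be too low, a sketch of checking and raising it with pct (CTID 163 is just the example from the kernel log; use your real CTID):

Code:
# Show the container's current configuration, including its memory limit
pct config 163
# Raise the memory limit to 2048 MB
pct set 163 -memory 2048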
 
It seems more CPU-bound than memory-bound, looking at the Icinga reports (Icinga runs in a CT on the crashed host, I know, bad move). The missing data on the right side starts when everything stopped responding.

Attachments: icinga_cpu.png, icinga_ram.png
 
pveversion -v
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

I can't upload gzip files or raw files > 1 MB, so here are links to get them. I've grepped out the lines from the day of the crash.

kern.log.gz http://dl.free.fr/m5cG7O2WQ
messages.gz http://dl.free.fr/q6j4ks3SO
syslog.gz http://dl.free.fr/kib9VztqO
 
Upload them as .zip (they are all smaller than 1 MB, if I understand that French page correctly ;))
 
Here they are.
 

Attachments

  • kern.log.zip (128.1 KB)
  • messages.zip (126.4 KB)
  • syslog.zip (171 KB)

I think your issue with the out-of-memory errors is unrelated to what the other people in this thread reported. The kernel messages are just a symptom of whatever problem you experienced regarding high load (whether that was caused by a DDoS, a lot of users for your services, or something else is impossible to tell from the logs; maybe your ISP can give you more insight here?). If the issue persists and is definitely not caused by real load on your system, please open a new thread and try to gather as much information as possible.
 
So, what about this?
Mar 13 08:54:18 xxxx kernel: [255738.570646] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
Mar 13 08:54:18 xxxx kernel: [255738.613973] [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
Mar 13 08:54:18 xxxx kernel: [255738.632098] uptime D ffff88103f496a00 0 24538 10938 0x00000104
Mar 13 08:54:18 xxxx kernel: [255738.633718] Call Trace:
Mar 13 08:54:18 xxxx kernel: [255738.635683] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
Mar 13 08:54:18 xxxx kernel: [255738.652103] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0

That trace appeared before the first signs of trouble, which came around noon:
Mar 13 12:27:52 xxxx kernel: [268571.791083] [<ffffffff81067d53>] __do_page_fault+0x3e3/0x410
Mar 13 12:27:52 xxxx kernel: [268571.953173] Killed process 16400 (/usr/sbin/apach) total-vm:401428kB, anon-rss:10028kB, file-rss:0kB
Mar 13 12:28:11 xxxx kernel: [268591.421783] [<ffffffff81186623>] pagefault_out_of_memory+0x13/0x80
Mar 13 12:28:15 xxxx kernel: [268594.819443] [<ffffffff811f1201>] mem_cgroup_oom_synchronize+0x531/0x600

And later, everything stopped working ("Ressource temporairement non disponible" is French for "Resource temporarily unavailable"):
Mar 13 14:18:07 xxxx pve-firewall[1771]: status update error: command '/sbin/iptables-save' failed: open3: fork failed: Ressource temporairement non disponible at /usr/share/perl5/PVE/Tools.pm line 396.
Mar 13 14:19:07 xxxx pve-firewall[1771]: status update error: command '/sbin/iptables-save' failed: open3: fork failed: Ressource temporairement non disponible at /usr/share/perl5/PVE/Tools.pm line 396.
Mar 13 14:20:01 xxxx cron[1751]: (CRON) error (can't fork)

In fact, I lost all the metrics in Icinga at exactly the time of the first error (08:54). More precisely, all the NRPE requests to all the CTs stopped working at that time, but the other external checks (SSH, FTP, HTTP status) were still receiving data, until they too stopped at 14:18, the time everything fell apart.

Based on the Icinga metrics, everything was fine on the CTs themselves, but on the host, as you can see in my previous screenshots, the CPU load climbed steadily beginning at 9:00 (the time of the first error) and saturated completely around 14:00.

It seems it all began with a FUSE problem, as in my first post, so lxcfs must be involved.
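
A hedged way to test that theory the next time it happens (the path assumes the default lxcfs mountpoint on PVE 4.x): when lxcfs hangs, any read on its FUSE files blocks in state D, so wrap the read in a timeout:

Code:
# Returns within 5 seconds if lxcfs answers; otherwise the FUSE layer is stuck
timeout 5 cat /var/lib/lxcfs/proc/uptime || echo "lxcfs is not responding"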
 
I tried running these commands:
Code:
# Lower the dirty page-cache thresholds at runtime
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5

# Reload settings from /etc/sysctl.conf
sysctl -p
For now I don't know if the problem is fixed, because I only did this this morning.
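
To make those values survive a reboot, a sketch (assuming the stock /etc/sysctl.conf is in use):

Code:
# Persist the settings; sysctl -p then reloads and verifies them
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf
sysctl -p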
 
I am also affected by the same problem, multiple times. There are some OOM errors, but inside the container I had free memory. I also have tasks "blocked for more than 120 seconds" and many processes stuck in state D.
As far as I managed to trace it, I am strongly convinced that the problem is with lxcfs-2.0.0-pve1. I think this bug is related: https://github.com/lxc/lxcfs/issues/73
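
For what it's worth, a quick sketch for listing the stuck processes and the kernel function they are blocked in (a "D" in the STAT column means uninterruptible sleep):

Code:
# Print the header plus every task whose state starts with D
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'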
 
