HELP! PVE 5.1 - NETWORK CRASH with 32 LXC containers

Arthur

New Member
Nov 1, 2016
This problem is driving me nuts.

We have 3 new servers where we installed PVE 5.1-36.

I tried to start 32 LXC containers at once through the API, and it turns out there is a new parameter for creating containers: instead of passing --cpulimit X, I now have to pass --cores X. That's fine; we made the adjustment and everything looked OK. We were happy that PVE 5.1 was going to work out! Nope.

Problem 1) After creating containers I encountered the following error:

Code:
 Nov  8 02:55:00 marshall pvesr[28315]: Unable to create new inotify object: Too many open files at /usr/share/perl5/PVE/INotify.pm line 390.

Googled the shit out of this error and figured out that we should increase the following values in /etc/sysctl.conf. For other people encountering the same problem, here are my two cents:

Code:
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=8192

Note: Proxmox Team, would it be possible to consider including these variables in /etc/sysctl.d/pve.conf by default?
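For anyone applying these, here is a sketch of staging the values as a sysctl drop-in and loading them without a reboot. The file name 90-inotify.conf is just my own choice; any file under /etc/sysctl.d/ works:

```shell
# Stage the drop-in locally; 90-inotify.conf is an arbitrary name,
# any file placed under /etc/sysctl.d/ is read at boot.
printf '%s\n' \
    'fs.inotify.max_user_watches=1048576' \
    'fs.inotify.max_user_instances=8192' > 90-inotify.conf
cat 90-inotify.conf

# Then, as root on the host:
#   install -m 644 90-inotify.conf /etc/sysctl.d/90-inotify.conf
#   sysctl --system                        # reload all drop-ins
#   sysctl fs.inotify.max_user_instances   # verify the new limit
```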

Problem 2) The server CRASHED every time after running for a certain period (20 to 30 minutes)...

The last log entries I saw in /var/log/syslog in the reported times of the crash were:

[screenshot of the syslog entries]


So then again, I googled a lot and found someone on this forum talking about removing the swap partition from /etc/fstab. I did that, and of course after a reboot the swap partition was no longer mounted. Is this good or bad? No idea. One thing is for sure: the warning about the timeout waiting on that device disappeared. Expected, since it is not mounted anymore.
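For the record, this is roughly how I took swap out of the picture. The demo below runs against a sample fstab (the UUIDs are placeholders); on the real system you would make the same sed edit to /etc/fstab, after running swapoff:

```shell
# Sample fstab; the UUIDs are placeholders for this demo.
cat > fstab.sample <<'EOF'
UUID=1234-abcd /    ext4 errors=remount-ro 0 1
UUID=5678-efgh none swap sw                0 0
EOF

# Comment out active swap entries instead of deleting them,
# so the change is easy to revert later.
sed -i -E 's/^([^#].*[[:space:]]swap[[:space:]])/#\1/' fstab.sample
cat fstab.sample

# On the real host: sudo swapoff -a, then apply the same edit
# to /etc/fstab and reboot.
```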

I was almost about to get happy... If only...

Problem 3) Start log of one LXC Container. Notice that there’s an EXT4-fs warning (device loopXX): ext4_multi_mount_protect

[screenshot of the container start log]


Does this,

Code:
EXT4-fs warning: ext4_multi_mount_protect:324: MMP interval 42 higher than expected, please wait.

actually indicate a problem? After waiting a bit, the containers start correctly.
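From what I have read, MMP (multi-mount protection) is there to stop the same ext4 filesystem from being mounted by two hosts at once; at mount time the kernel waits out the MMP interval before proceeding, so the warning means a delay rather than damage. A sketch for inspecting the interval; /dev/loopX is a placeholder, and the sample output below is illustrative:

```shell
# On the host, tune2fs -l prints the ext4 superblock fields,
# including the MMP settings (device name is a placeholder):
#   sudo tune2fs -l /dev/loopX | grep -i mmp

# Illustrative sample of the relevant lines:
sample='MMP block number:         9252
MMP update interval:      42'
printf '%s\n' "$sample" | awk -F': *' '/MMP update interval/ {print $2}'
```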

The real nightmare starts after the HOST has been running for about 30 minutes...

It crashes! No network, no SSH, no IPMI (KVM), no access at all! Only after a full reboot can I recover the server.

I read a lot in this forum and I've seen people reporting their pveperf results. I decided to try that and here's what I got:

Code:
CPU BOGOMIPS:      38410.86
REGEX/SECOND:      2271408
HD SIZE:           1813.94 GB (/dev/root)
unable to open HD at /usr/bin/pveperf line 150.

I am worried that this "unable to open HD at /usr/bin/pveperf" error could mean that something is really wrong with my disks, or point to some other hardware problem, and might cause trouble in the future.

Could you guys share your thoughts on whether this is something to be worried about? I am also still monitoring the application to check for CRASHES. Are there any known issues in the new PVE 5.1 kernel with multiple containers?
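In case it helps, my guess is that pveperf fails here because the root filesystem is reported under the pseudo-name /dev/root, which it cannot open. A sketch for finding the real device behind / (findmnt is from util-linux; smartctl is from the smartmontools package):

```shell
# Resolve the real block device behind the root mount.
rootdev=$(findmnt -n -o SOURCE /)
echo "root is on: $rootdev"

# With the real device name, the disk itself can be checked, e.g.:
#   sudo smartctl -H "$rootdev"   # overall SMART health
```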
 
I am worried that this "unable to open HD at /usr/bin/pveperf" error could mean that something is really wrong with my disks, or point to some other hardware problem, and might cause trouble in the future.
How did you setup your root partition? You would have entries in kernel.log and syslog.
 
How did you setup your root partition? You would have entries in kernel.log and syslog.

Hi Alwin. Thanks for the note. After spending about a week on this I finally got to the bottom of it and fixed it. It turns out that it was a networking problem in our physical server.

Basically, the physical NIC eno3 and the bridge vmbr0 both had valid IP addresses, and the datacenter monitoring was cutting off the network because multiple MAC addresses were assigned to one IP. They treated this as a security issue, which from my point of view is absolutely correct.

My description in this thread was therefore misleading: since I lost network connectivity, we thought the server was crashing because of Proxmox, but that was not the case. I will try to change the title of this thread to Network Crash.

For people encountering a similar issue in the future, the lesson I learned was:

- If you lose networking in Proxmox after a while (in my case, about 20 to 30 minutes), check whether eno3/eth0 and vmbr0 both have valid IP addresses. If they do, make sure only one of them (vmbr0) is assigned an IP, because an anti-DDoS tool in your network may be blocking connectivity due to multiple MACs on the same IP.
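A quick way to check for the double-IP situation, plus the general shape of /etc/network/interfaces I ended up with (the 192.0.2.x addresses below are placeholders from the documentation range):

```shell
# List every interface carrying an IPv4 address; on a typical PVE
# bridge setup only lo and vmbr0 should show one, and the physical
# NIC (eno3/eth0) should be enslaved to the bridge with no address.
ip -brief -4 addr show

# Shape of /etc/network/interfaces that fixed it for me
# (addresses are placeholders):
cfg='auto eno3
iface eno3 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1
    bridge-ports eno3
    bridge-stp off
    bridge-fd 0'
printf '%s\n' "$cfg"
```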
 
