[SOLVED] OpenVZ containers going offline every night

nbrogi · Nov 28, 2016

Hi, everyone!

I can't figure this out.

Some of my containers go offline every night at approximately the same time: about 5 minutes at a time, roughly 2-3 times over a couple of hours.

This is almost certainly related to the backups, but the weird thing is that I disabled backups for these particular containers (which are supposed to be available at all times), and they still go down while the backup is running.

Any idea why this might be, and how I can solve the problem?

I'm using OpenVZ containers on Proxmox 3.4, pveversion "pve-manager/3.4-6/102d4547 (running kernel: 2.6.32-39-pve)"

morph027 · Nov 28, 2016

Can you post your /etc/pve/vzdump.cron?

nbrogi · Nov 28, 2016

Absolutely!

# cluster wide vzdump cron schedule
# Automatically generated file - do not edit

PATH="/usr/sbin:/usr/bin:/sbin:/bin"

0 0 * * * root vzdump --quiet 1 --mailnotification failure --mode snapshot --mailto email@example.com --all 1 --compress gzip --storage local --exclude 603,627

morph027 · Nov 28, 2016

Hm, looks good to me. Is there anything in the task log of these 2 vms?

nbrogi · Nov 28, 2016

Sorry, where would I find that?

I've never had to mess with Proxmox before, and things usually just work.

morph027 · Nov 28, 2016

Right now I do not have any Proxmox 3 instance running, but if you click on the VM itself, there are tabs in the right pane. There should be a tab called "Task History"...

nbrogi · Nov 28, 2016

Got it. I appreciate your help.

I don't see anything out of the ordinary. It's just me stopping and starting the machine, and restoring it 10 days ago.

morph027 · Nov 28, 2016

Probably a cron job inside the VZ?

grin · Nov 28, 2016

Maybe anything relevant in your syslog?

nbrogi · Nov 28, 2016

morph027 said:
Probably a cron job inside the VZ?

Unfortunately I don't have much. It's just Let's Encrypt, but only on Mondays:
30 2 * * 1 /usr/local/sbin/certbot-auto renew >> /var/log/le-renew.log

nbrogi · Nov 28, 2016

grin said:
Maybe anything relevant in your syslog?

I see this which might be relevant, but I don't see the machine that goes down, since backups are disabled:

2016-11-28T03:10:43+0100 vzctl : CT 631 : Setting up checkpoint...
2016-11-28T03:10:43+0100 vzctl : CT 631 : suspend...
2016-11-28T03:10:43+0100 vzctl : CT 631 : get context...
2016-11-28T03:10:43+0100 vzctl : CT 631 : Checkpointing completed successfully
2016-11-28T03:10:43+0100 vzctl : CT 631 : Resuming...

Somehow, it's still taken down.

grin · Nov 28, 2016

nbrogi said:
I see this which might be relevant, but I don't see the machine that goes down, since backups are disabled:

2016-11-28T03:10:43+0100 vzctl : CT 631 : Setting up checkpoint...
2016-11-28T03:10:43+0100 vzctl : CT 631 : suspend...
2016-11-28T03:10:43+0100 vzctl : CT 631 : get context...
2016-11-28T03:10:43+0100 vzctl : CT 631 : Checkpointing completed successfully
2016-11-28T03:10:43+0100 vzctl : CT 631 : Resuming...

Somehow, it's still taken down.

Could you tell us exactly when the machine goes down and when it comes back up? (Since it's a CT, the clocks should be in sync.)

nbrogi · Nov 28, 2016

The problem seems to be here:

2016-11-28T01:20:31+0100 vzctl : CT 627 : Killing container ...
2016-11-28T01:20:31+0100 vzctl : CT 627 : Container was stopped
2016-11-28T01:20:32+0100 vzctl : CT 627 : Container is unmounted
2016-11-28T01:24:51+0100 vzctl : CT 627 : Starting container ...
2016-11-28T01:24:51+0100 vzctl : CT 627 : Container is mounted
2016-11-28T01:24:51+0100 vzctl : CT 627 : Adding IP address(es): xxx.xx.xx.xx
2016-11-28T01:24:52+0100 vzctl : CT 627 : Setting CPU units: 1000
2016-11-28T01:24:52+0100 vzctl : CT 627 : Setting CPUs: 8
2016-11-28T01:24:52+0100 vzctl : CT 627 : Setting devices
2016-11-28T01:24:52+0100 vzctl : CT 627 : Container start in progress...

I have no idea why it would stop it, though.
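To pin down exactly when the container goes down and comes back up, the window can be read straight off these vzctl lines. Below is a minimal Python sketch that does this; the function name `downtime_windows` and the embedded `LOG` sample are illustrative only, and it assumes every relevant line has the same `timestamp vzctl : CT <id> : message` layout as the excerpt above.

```python
from datetime import datetime

# Sample lines copied from the syslog excerpt above.
LOG = """\
2016-11-28T01:20:31+0100 vzctl : CT 627 : Killing container ...
2016-11-28T01:20:31+0100 vzctl : CT 627 : Container was stopped
2016-11-28T01:20:32+0100 vzctl : CT 627 : Container is unmounted
2016-11-28T01:24:51+0100 vzctl : CT 627 : Starting container ...
2016-11-28T01:24:52+0100 vzctl : CT 627 : Container start in progress...
"""


def downtime_windows(log_text, ctid):
    """Pair each 'Container was stopped' with the next 'Starting container'."""
    windows = []
    stopped_at = None
    for line in log_text.splitlines():
        parts = line.split(" : ", 2)
        # Skip lines that are not vzctl messages for the requested container.
        if len(parts) != 3 or parts[1] != f"CT {ctid}":
            continue
        stamp = datetime.strptime(parts[0].split()[0], "%Y-%m-%dT%H:%M:%S%z")
        message = parts[2]
        if message.startswith("Container was stopped"):
            stopped_at = stamp
        elif message.startswith("Starting container") and stopped_at is not None:
            windows.append((stopped_at, stamp))
            stopped_at = None
    return windows


for down, up in downtime_windows(LOG, 627):
    print(f"CT 627 down from {down.isoformat()} to {up.isoformat()} ({up - down})")
```

On the sample above this reports a single window of 4 minutes 20 seconds, which fits the "about 5 minutes" outages. To scan a whole night, the contents of /var/log/syslog could be passed in instead of `LOG`.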

grin · Nov 28, 2016

Look for the reasons before that. Who's initiating it, HA manager, or something else? Is there some related stuff in dmesg (or in the case of this bloody damned systemd, the journalctl)? Maybe OOM, or any lxc or kernel errors?

morph027 · Nov 28, 2016

Anything in /var/log/pve/tasks/index around the same time?

nbrogi · Nov 28, 2016

morph027 said:
Anything in /var/log/pve/tasks/index around the same time?

I'm not sure, it might be this but I don't have timestamps:

UPID:FR-2:00010DBB:A9D2BD3D:583B784F:vzstop:627:root@pam: 583B7859 OK
UPID:FR-2:000118A4:A9D319D6:583B793C:vzstart:627:root@pam: 583B7957 OK
UPID:FR-2:000FFD96:A9CB5E9E:583B6571:vzdump::root@pam: 583B9256 OK
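These index lines do carry timestamps, just in hexadecimal. As far as I understand the format, a UPID is laid out as `UPID:node:pid:pstart:starttime:type:id:user:` with the start time as hex Unix seconds, and the index appends the end time (also hex) and the status. The Python sketch below decodes the three lines on that assumption; `parse_index_line` is a hypothetical helper, and the fixed +0100 offset is chosen only to match the syslog excerpts in this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: display times at +0100 so they line up with the syslog lines.
CET = timezone(timedelta(hours=1))

LINES = [
    "UPID:FR-2:00010DBB:A9D2BD3D:583B784F:vzstop:627:root@pam: 583B7859 OK",
    "UPID:FR-2:000118A4:A9D319D6:583B793C:vzstart:627:root@pam: 583B7957 OK",
    "UPID:FR-2:000FFD96:A9CB5E9E:583B6571:vzdump::root@pam: 583B9256 OK",
]


def parse_index_line(line, tz=CET):
    """Decode one task index line, assuming hex Unix timestamps."""
    upid, end_hex, status = line.split(" ", 2)
    # Assumed layout: UPID:node:pid:pstart:starttime:type:id:user:
    _, node, _pid, _pstart, start_hex, task, vmid, user = upid.split(":")[:8]
    return {
        "node": node,
        "task": task,
        "vmid": vmid,  # empty for tasks not tied to one guest, e.g. vzdump --all
        "user": user,
        "start": datetime.fromtimestamp(int(start_hex, 16), tz),
        "end": datetime.fromtimestamp(int(end_hex, 16), tz),
        "status": status.strip(),
    }


for line in LINES:
    t = parse_index_line(line)
    print(t["task"], t["vmid"] or "-", t["user"],
          t["start"].isoformat(), "->", t["end"].isoformat(), t["status"])
```

Decoded this way, the vzstop for 627 starts at 01:20:31 +0100, the same second as the "Killing container" line in syslog, and the vzdump task runs from 00:00:01 to 03:11:34 +0100. So the stop falls inside the backup window, but it is recorded as a separate vzstop task owned by root@pam rather than as part of the vzdump task.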

nbrogi · Nov 28, 2016

grin said:
Look for the reasons before that. Who's initiating it, HA manager, or something else? Is there some related stuff in dmesg (or in the case of this bloody damned systemd, the journalctl)? Maybe OOM, or any lxc or kernel errors?

I'm not sure, /var/log/dmesg doesn't have timestamps, but I also don't see the VMID.

I did "locate journalctl" and it didn't return anything, so it's possible that I'm not using it.

What component besides the backup daemon (or whatever it's called) would be stopping containers? Maybe there's some setting that I can change...

grin · Nov 28, 2016

nbrogi said:
I'm not sure, /var/log/dmesg doesn't have timestamps, but I also don't see the VMID.
I did "locate journalctl" and it didn't return anything, so it's possible that I'm not using it.

Try journalctl -xn 500 or similar, I don't know the optimal command line since it's not really my favourite tool, to put it nicely. Should show all kinds of logs with timestamps, unless it decides it's not in the mood.

nbrogi said:
What component besides the backup daemon (or whatever it's called) would be stopping containers? Maybe there's some setting that I can change...

ha-manager comes to my mind first, but others may come up with better ideas based on the data you have presented.

morph027 · Nov 28, 2016

Proxmox 3.4 is based on Debian Wheezy, isn't it? There is no systemd and no journalctl.

But there is good old /var/log/syslog ...

grin · Nov 28, 2016

Oh okay, you're using 3.4. My 4.x is based on Jessie. Sorry.
Then you should have the info in /var/log/kern.log, with timestamps. But I believe that without systemd the syslog alone should contain all the relevant info, so if you didn't see anything relevant there, I'm out of [basic] ideas and will leave it to others to chime in.
