PVE lockup

cspiegel

New Member
May 17, 2009
Hi,

I've been testing Proxmox VE 1.2 for three weeks now. This morning, the machine locked up for the second time during the test period. The PVE host still answers pings, but I cannot log in (neither via SSH nor via the attached console, which does not show a login prompt). The virtual machines are no longer reachable in this situation.

The last pre-lockup messages I can see in

Code:
/var/log/messages
/var/log/syslog
do not show any special incidents. Neither do the virtual machines' logs (two Windows Server 2003 guests, three Linux guests).

After rebooting (I had to push the reset button), everything works fine again.

Any idea? Your help is highly appreciated.

Regards,
Christoph
 
Please can you send the syslog (/var/log/syslog) to dietmar@proxmox.com.

What hardware do you run on?

Also run memtest (from the boot prompt) - is that OK?

What's the output of:

# pveversion -v

Code:
# pveversion -v
pve-manager: 1.2-1 (pve-manager/1.2/3982)
qemu-server: 1.0-12
pve-kernel: 2.6.24-6
pve-kvm: 85-1
pve-firmware: 1
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-1
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1

We're running on a dual socket Supermicro board (Intel 5000P chipset) with 2 Intel Xeon E5440 processors, making a total of eight cores. We're using SATA storage attached to a 3ware 9690SA controller.

I will run a memtest today.

Please note that the host machine still responds to ping requests.
 
Looks like a problem with your IO system. You do not have a HW RAID?

Anyway, please do not run vzdump in snapshot mode, and test if the error still occurs.

What is the output of:

# pveperf
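For reference, the backup mode is just a vzdump flag, so the nightly job only needs a one-word change. A sketch of such a cron entry; the VM IDs, schedule, and dump directory below are made-up examples, not taken from this setup:

```shell
# Hypothetical nightly backup entry (e.g. in /etc/cron.d/vzdump):
# '--suspend' replaces '--snapshot'; IDs 101-103 and /mnt/backup are examples
30 0 * * * root vzdump --suspend --compress --dumpdir /mnt/backup 101 102 103
```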
 
Looks like a problem with your IO system. You do not have a HW RAID?

Anyway, please do not run vzdump in snapshot mode, and test if the error still occurs.

What is the output of:

# pveperf

I'm running a memtest right now and will invoke pveperf later.

We have a RAID 10 running on the host.

I noticed that the machine's last log entry was made during the nightly snapshot backup. But what could be wrong with snapshot backups?

However, I only added the vzdump backup last week. The first hang of the PVE server occurred without vzdump ever having been executed.
 
I noticed that the machine's last log entry was made during the nightly snapshot backup. But what could be wrong with snapshot backups?

I am just trying to find the problem. LVM snapshots put a high IO load on the server, and that can trigger further problems. Such problems are usually due to bad hardware or driver incompatibilities.

- Dietmar
 
I am just trying to find the problem. LVM snapshots put a high IO load on the server, and that can trigger further problems. Such problems are usually due to bad hardware or driver incompatibilities.

Okay, I see. I will try to figure out if I can reproduce the problem when doing snapshot backup. If so, I will try another backup mode.

Thank you.
 
What is the output of:
# pveperf

Code:
# pveperf
CPU BOGOMIPS:      45339.82
REGEX/SECOND:      709800
HD SIZE:           94.49 GB (/dev/pve/root)
BUFFERED READS:    264.60 MB/sec
AVERAGE SEEK TIME: 8.82 ms
FSYNCS/SECOND:     1801.64
DNS EXT:           171.57 ms
DNS INT:           0.61 ms
 
I will try to figure out if I can reproduce the problem when doing snapshot backup. If so, I will try another backup mode.

Hi,

I switched from 'snapshot' to 'suspend' backup mode for testing. Everything worked fine for more than a week (I'm running backups each night), but then the same problem occurred again. The system completely locks up during the backup of the third VM. No login is possible, neither via SSH nor from the console. Ping still works, though. I had to push the reset button again.

These are the last entries in /var/log/syslog before the lockup:
Code:
Jun  6 00:45:09 virt1 vzdump[26651]: Starting Backup of VM 102 (qemu)
Jun  6 00:46:01 virt1 vzdump[26651]: Finished Backup of VM 102 (00:00:52)
Jun  6 00:46:01 virt1 vzdump[26651]: Starting Backup of VM 103 (qemu)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27305]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27308]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27310]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 00:55:01 virt1 /USR/SBIN/CRON[27378]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 00:55:01 virt1 /USR/SBIN/CRON[27380]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27448]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27450]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27452]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:05:01 virt1 /USR/SBIN/CRON[27521]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:05:01 virt1 /USR/SBIN/CRON[27523]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27587]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27589]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27591]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:15:01 virt1 /USR/SBIN/CRON[27667]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:15:01 virt1 /USR/SBIN/CRON[27670]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:17:01 virt1 /USR/SBIN/CRON[27710]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27739]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27743]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27744]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:25:01 virt1 /USR/SBIN/CRON[27817]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:25:01 virt1 /USR/SBIN/CRON[27819]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:25:50 virt1 qm[27843]: VM 103 suspend
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27891]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27892]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27893]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume
Jun  6 01:35:01 virt1 /USR/SBIN/CRON[27964]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:35:01 virt1 /USR/SBIN/CRON[27965]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28036]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28037]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28039]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
The last line I can see is the call to atsa1. I don't know whether this is a coincidence, but it was exactly the same when the machine locked up the last time. Could the counter collection (I guess it is used for the performance monitoring in the Proxmox VE web interface) or another Proxmox task be the reason for the lockup? However, I'm fairly certain that the lockup only happens during backup; maybe it is triggered by something else while the system is in this state.

Any comments appreciated.

Regards,
Christoph
 
The critical part of the backup seems to have already been done by that time:

Code:
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume

Please try to disable the atsar cron job (vi /etc/cron.d/atsar) and test again.
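Disabling just means commenting the entry out, so it can be re-enabled later. A minimal sketch of the edit; the cron line below is an illustrative guess at what /etc/cron.d/atsar contains, not copied from this system:

```shell
# Illustrative atsar cron entry (a guess; check the actual /etc/cron.d/atsar)
line='*/10 * * * * root test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1'
# Prefix the line with '#' to disable it without losing it:
disabled=$(printf '%s\n' "$line" | sed 's/^\([^#]\)/#\1/')
printf '%s\n' "$disabled"
```

In practice you would run the same substitution in place, e.g. `sed -i 's/^\([^#]\)/#\1/' /etc/cron.d/atsar`, or simply open the file in vi as suggested above.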
 
The critical part of the backup seems to have already been done by that time:

Code:
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume
Please try to disable the atsar cron job (vi /etc/cron.d/atsar) and test again.

Thanks for the quick response.

What exactly does atsar do? Are there any features of Proxmox VE that won't work if it's disabled?
 
Hi all,

I did not have this problem for several weeks. But this morning, the PVE host hung again. I had no chance to log in from the console and thus had to reset the machine.

These are the last lines in /var/log/syslog:
Code:
Jul 28 04:25:01 virt1 /USR/SBIN/CRON[4450]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jul 28 04:25:01 virt1 /USR/SBIN/CRON[4452]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jul 28 04:30:01 virt1 /USR/SBIN/CRON[4564]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jul 28 04:30:01 virt1 /USR/SBIN/CRON[4565]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
pveversion -v
Code:
pveversion -v
pve-manager: 1.3-1 (pve-manager/1.3/4023)
qemu-server: 1.0-14
pve-kernel: 2.6.24-8
pve-kvm: 86-3
pve-firmware: 1
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-2
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
Any ideas?

Best regards,
Christoph
 
Such problems are usually due to bad hardware or driver incompatibilities - maybe you can try different hardware.

- Dietmar
 
Such problems are usually due to bad hardware or driver incompatibilities - maybe you can try different hardware.

- Dietmar


Hi Dietmar,

unfortunately, I can't change the hardware right now. We will have another machine available shortly; then I can cross-check with it.

In the meantime, I think the problem is related to CIFS... I found thousands of lines (hundreds per second) in /var/log/kern.log like the following:

Code:
CIFS VFS: Write2 ret -28, wrote 0
Maybe this is caused by a Windows update on the machine where I store the backups. I think it is not related to the PVE core.
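Error code -28 corresponds to -ENOSPC ("No space left on device"), which fits a backup target that filled up. A quick way to gauge how bad it got is to count the failures in the log; the sketch below runs against a small sample file it creates itself, since the real /var/log/kern.log is machine-specific:

```shell
# -28 is -ENOSPC: the CIFS server reported no space left for the write.
# Build a tiny sample log to demonstrate the count (real file: /var/log/kern.log):
printf 'CIFS VFS: Write2 ret -28, wrote 0\nCIFS VFS: Write2 ret -28, wrote 0\n' > /tmp/kern.sample
count=$(grep -c 'Write2 ret -28' /tmp/kern.sample)
echo "$count failed CIFS writes"
```

On the live box that would be `grep -c 'Write2 ret -28' /var/log/kern.log`, plus a `df -h` on the share's mount point to confirm it really is full.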

I found a discussion which is quite old but deals with the same problem I have:

http://lists.samba.org/archive/linux-cifs-client/2007-June/001998.html

The author says:
"It was impossible to kill the copying process, and in a few tens of seconds the system became practically not responsible: I could not login as root on the other console, Ctl-Alt-Del was not effective, etc, so I had to hard reset."

This is exactly what I experience after the nightly backup in the morning.

Best regards,
Christoph