PVE lockup

cspiegel

New Member
May 17, 2009
Hi,

I've been testing Proxmox VE 1.2 for three weeks now. This morning, the machine locked up for the second time during the test period. The PVE host still answers pings, but I cannot log in (neither via SSH nor via the attached console, which does not show a login prompt). The virtual machines are no longer reachable in this situation.

The last pre-lockup messages I can see in

Code:
/var/log/messages
/var/log/syslog
do not show any special incidents. Neither do the virtual machines' logs (two Windows Server 2003 guests, three Linux guests).

After rebooting (I had to push the reset button), everything works fine again.

Any idea? Your help is highly appreciated.

Regards,
Christoph
 
Please can you send the syslog (/var/log/syslog) to dietmar@proxmox.com.

What hardware do you run on?

Also run memtest (from the boot prompt) - is that OK?

What's the output of:

# pveversion -v

Code:
# pveversion -v
pve-manager: 1.2-1 (pve-manager/1.2/3982)
qemu-server: 1.0-12
pve-kernel: 2.6.24-6
pve-kvm: 85-1
pve-firmware: 1
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-1
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1

We're running on a dual socket Supermicro board (Intel 5000P chipset) with 2 Intel Xeon E5440 processors, making a total of eight cores. We're using SATA storage attached to a 3ware 9690SA controller.

I will run a memtest today.

Please note that the host machine still responds to ping requests.
 
Looks like a problem with your IO system. You do not have a HW RAID?

Anyway, please do not run vzdump in snapshot mode, and test if the error still occurs.

What is the output of:

# pveperf
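For reference, the backup mode is just a vzdump flag, so the nightly job only needs a one-word change. A sketch of such a cron entry; the VM IDs, schedule, and dump directory below are made-up examples, not taken from this setup:

```shell
# Hypothetical nightly backup entry (e.g. in /etc/cron.d/vzdump):
# '--suspend' replaces '--snapshot'; IDs 101-103 and /mnt/backup are examples
30 0 * * * root vzdump --suspend --compress --dumpdir /mnt/backup 101 102 103
```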
 
Looks like a problem with your IO system. You do not have a HW RAID?

Anyway, please do not run vzdump in snapshot mode, and test if the error still occurs.

What is the output of:

# pveperf

I'm running a memtest right now and will invoke pveperf later.

We have a RAID 10 running on the host.

I noticed that the machine's last log entry was made during the nightly snapshot backup. But what could be wrong with snapshot backups?

However, I only added the vzdump backup last week. The first hang of the PVE server occurred without vzdump ever having been executed.
 
I noticed that the machine's last log entry was made during the nightly snapshot backup. But what could be wrong with snapshot backups?

I am just trying to find the problem. LVM snapshots put a high IO load on the server, and that can trigger further problems. Such problems are usually due to bad hardware or driver incompatibilities.

- Dietmar
 
I am just trying to find the problem. LVM snapshots put a high IO load on the server, and that can trigger further problems. Such problems are usually due to bad hardware or driver incompatibilities.

Okay, I see. I will try to figure out if I can reproduce the problem when doing snapshot backup. If so, I will try another backup mode.

Thank you.
 
What is the output of:
# pveperf

Code:
# pveperf
CPU BOGOMIPS:      45339.82
REGEX/SECOND:      709800
HD SIZE:           94.49 GB (/dev/pve/root)
BUFFERED READS:    264.60 MB/sec
AVERAGE SEEK TIME: 8.82 ms
FSYNCS/SECOND:     1801.64
DNS EXT:           171.57 ms
DNS INT:           0.61 ms
 
I will try to figure out if I can reproduce the problem when doing snapshot backup. If so, I will try another backup mode.

Hi,

I switched from 'snapshot' to 'suspend' backup mode for testing. Everything worked fine for more than a week (I'm running backups each night), but then the same problem occurred again. The system completely locks up during the backup of the third VM. No login is possible, neither via SSH nor from the console. Ping still works, though. I had to push the reset button again.

These are the last entries in /var/log/syslog before the lockup:
Code:
Jun  6 00:45:09 virt1 vzdump[26651]: Starting Backup of VM 102 (qemu)
Jun  6 00:46:01 virt1 vzdump[26651]: Finished Backup of VM 102 (00:00:52)
Jun  6 00:46:01 virt1 vzdump[26651]: Starting Backup of VM 103 (qemu)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27305]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27308]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 00:50:01 virt1 /USR/SBIN/CRON[27310]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 00:55:01 virt1 /USR/SBIN/CRON[27378]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 00:55:01 virt1 /USR/SBIN/CRON[27380]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27448]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27450]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:00:01 virt1 /USR/SBIN/CRON[27452]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:05:01 virt1 /USR/SBIN/CRON[27521]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:05:01 virt1 /USR/SBIN/CRON[27523]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27587]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27589]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:10:01 virt1 /USR/SBIN/CRON[27591]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:15:01 virt1 /USR/SBIN/CRON[27667]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:15:01 virt1 /USR/SBIN/CRON[27670]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:17:01 virt1 /USR/SBIN/CRON[27710]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27739]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27743]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:20:01 virt1 /USR/SBIN/CRON[27744]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:25:01 virt1 /USR/SBIN/CRON[27817]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:25:01 virt1 /USR/SBIN/CRON[27819]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:25:50 virt1 qm[27843]: VM 103 suspend
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27891]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27892]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:30:02 virt1 /USR/SBIN/CRON[27893]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume
Jun  6 01:35:01 virt1 /USR/SBIN/CRON[27964]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:35:01 virt1 /USR/SBIN/CRON[27965]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28036]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28037]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jun  6 01:40:01 virt1 /USR/SBIN/CRON[28039]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
The last line I can see is the call to atsa1. I don't know whether this is a coincidence, but it was exactly the same when the machine locked up the last time. Could the counter collection (I guess it is used for the performance monitoring in the Proxmox VE web interface) or another Proxmox task be the reason for the lockup? However, I'm fairly certain that the lockup only happens during backup; maybe it is triggered by something else while the system is in this state.

Any comments appreciated.

Regards,
Christoph
 
The critical part of the backup seems to have already been done by that time:

Code:
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume

Please try to disable the atsar cron job (vi /etc/cron.d/atsar) and test again.
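Disabling just means commenting the entry out, so it can be re-enabled later. A minimal sketch of the edit; the cron line below is an illustrative guess at what /etc/cron.d/atsar contains, not copied from this system:

```shell
# Illustrative atsar cron entry (a guess; check the actual /etc/cron.d/atsar)
line='*/10 * * * * root test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1'
# Prefix the line with '#' to disable it without losing it:
disabled=$(printf '%s\n' "$line" | sed 's/^\([^#]\)/#\1/')
printf '%s\n' "$disabled"
```

In practice you would run the same substitution in place, e.g. `sed -i 's/^\([^#]\)/#\1/' /etc/cron.d/atsar`, or simply open the file in vi as suggested above.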
 
The critical part of the backup seems to have already been done by that time:

Code:
Jun  6 01:34:29 virt1 qm[27949]: VM 103 resume
Please try to disable the atsar cron job (vi /etc/cron.d/atsar) and test again.

Thanks for the quick response.

What exactly does atsar do? Are there any features of Proxmox VE that won't work if it's disabled?
 
Hi all,

I did not have this problem for several weeks. But this morning, the PVE host hung again. I had no chance to log in from the console and thus had to reset the machine.

These are the last lines in /var/log/syslog:
Code:
Jul 28 04:25:01 virt1 /USR/SBIN/CRON[4450]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jul 28 04:25:01 virt1 /USR/SBIN/CRON[4452]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
Jul 28 04:30:01 virt1 /USR/SBIN/CRON[4564]: (root) CMD (/usr/share/vzctl/scripts/vpsreboot)
Jul 28 04:30:01 virt1 /USR/SBIN/CRON[4565]: (root) CMD (/usr/share/vzctl/scripts/vpsnetclean)
pveversion -v
Code:
pveversion -v
pve-manager: 1.3-1 (pve-manager/1.3/4023)
qemu-server: 1.0-14
pve-kernel: 2.6.24-8
pve-kvm: 86-3
pve-firmware: 1
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-2
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
Any ideas?

Best regards,
Christoph
 
Such problems are usually due to bad hardware or driver incompatibilities - maybe you can try different hardware.

- Dietmar
 
Such problems are usually due to bad hardware or driver incompatibilities - maybe you can try different hardware.

- Dietmar


Hi Dietmar,

unfortunately, I can't change the hardware right now. We will have another machine available shortly; then I can cross-check with it.

In the meantime, I think the problem is related to CIFS... I found thousands of lines (hundreds per second) in /var/log/kern.log like the following:

Code:
CIFS VFS: Write2 ret -28, wrote 0
Maybe this is caused by a Windows update on the machine where I store the backups. I think it is not related to the PVE core.
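Error code -28 corresponds to -ENOSPC ("No space left on device"), which fits a backup target that filled up. A quick way to gauge how bad it got is to count the failures in the log; the sketch below runs against a small sample file it creates itself, since the real /var/log/kern.log is machine-specific:

```shell
# -28 is -ENOSPC: the CIFS server reported no space left for the write.
# Build a tiny sample log to demonstrate the count (real file: /var/log/kern.log):
printf 'CIFS VFS: Write2 ret -28, wrote 0\nCIFS VFS: Write2 ret -28, wrote 0\n' > /tmp/kern.sample
count=$(grep -c 'Write2 ret -28' /tmp/kern.sample)
echo "$count failed CIFS writes"
```

On the live box that would be `grep -c 'Write2 ret -28' /var/log/kern.log`, plus a `df -h` on the share's mount point to confirm it really is full.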

I found a discussion which is quite old but deals with the same problem I have:

http://lists.samba.org/archive/linux-cifs-client/2007-June/001998.html

The author says:
"It was impossible to kill the copying process, and in a few tens of seconds the system became practically not responsible: I could not login as root on the other console, Ctl-Alt-Del was not effective, etc, so I had to hard reset."

This is exactly what I experience after the nightly backup in the morning.

Best regards,
Christoph