strange things on vzdump

chipitsine · Aug 2, 2011

I seen similar messages regarding bug in kernel 2.6.18, but that is definetly not our case. we are running pvetest (2.6.32-5).
/var/log./messages contains regular errors about EXT3 things

Aug 1 00:42:57 KT0015 kernel: kjournald starting. Commit interval 5 seconds
Aug 1 00:42:57 KT0015 kernel: EXT3 FS on dm-3, internal journal
Aug 1 00:42:57 KT0015 kernel: EXT3-fs: dm-3: 17 orphan inodes deleted
Aug 1 00:42:57 KT0015 kernel: EXT3-fs: recovery complete.
Aug 1 00:42:57 KT0015 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 1 01:04:03 KT0015 kernel: lost page write due to I/O error on dm-3

usually it is ok, but often system becomes unresponsive during vzdump (virtual machines do not respond, host refuses to reboot using "reboot" command").
it looks like old known issue, but it is not.

according to that forum, people are using ProxMox in production environment and it is not common issue to stop at nigthly vzdump ?
what could I miss ?

P.S. it is not about our RAID hardware. we are running different RAID arrays, problem with EXT3 occures from time to time on any of them.

tom · Aug 2, 2011

do you have the vzdump backup log for this job producing this error?

if vzdump locks the server there is most times a quite simple reason for that. if you provide all details about your backup configuration I can give some hints.

chipitsine · Aug 2, 2011

Jul 31 01:51:04 INFO: Starting Backup of VM 208 (openvz)
Jul 31 01:51:04 INFO: CTID 208 exist mounted running
Jul 31 01:51:04 INFO: status = CTID 208 exist mounted running
Jul 31 01:51:05 INFO: backup mode: snapshot
Jul 31 01:51:05 INFO: bandwidth limit: 20480 KB/s
Jul 31 01:51:05 INFO: ionice priority: 7
Jul 31 01:51:05 INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-KT0015-0')
Jul 31 01:51:15 INFO: Logical volume "vzsnap-KT0015-0" created
Jul 31 01:51:15 INFO: creating archive '/mnt/nas/pve/vzdump-openvz-208-2011_07_31-01_51_04.tgz'
Jul 31 02:02:55 INFO: tar: ./var/local/yandex/xxx/workindex/indexinv: Warning: Read error at byte 789561344, while reading 10240 bytes: Input/output er
ror
Jul 31 02:03:09 INFO: tar: ./var/local/yandex/xxx/newindex/0394plk: Warning: Read error at byte 0, while reading 2048 bytes: Input/output error
Jul 31 02:03:09 INFO: tar: ./var/local/yandex/xxx/newindex/0279pak: Warning: Read error at byte 0, while reading 9728 bytes: Input/output error
Jul 31 02:03:09 INFO: tar: ./var/local/yandex/xxx/newindex/0166pai: Warning: Read error at byte 0, while reading 5632 bytes: Input/output error
......
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/eo/LC_MESSAGES/tasksel.mo: Warning: Read error at byte 0, while reading 1233 bytes: Input/output error
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/hu/: Warning: Cannot savedir: Input/output error
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/hu: Warning: Cannot close: Bad file descriptor
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/hu/LC_MESSAGES/: Warning: Cannot savedir: Input/output error
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/hu/LC_MESSAGES: Warning: Cannot close: Bad file descriptor
Jul 31 02:06:06 INFO: tar: ./usr/share/locale/hu/LC_MESSAGES/gnupg.mo: Warning: Read error at byte 0, while reading 1536 bytes: Input/output error

.......
Jul 31 02:06:23 INFO: tar: ./emul/ia32-linux/lib/libnss_files-2.7.so: Warning: Read error at byte 0, while reading 512 bytes: Input/output error
Jul 31 02:06:23 INFO: Total bytes written: 21355284480 (20GiB, 23MiB/s)
Jul 31 02:06:24 INFO: archive file size: 4.27GB
Jul 31 02:06:24 INFO: delete old backup '/mnt/nas/pve/vzdump-openvz-208-2011_07_28-03_10_00.tgz'
Jul 31 02:06:25 INFO: Logical volume "vzsnap-KT0015-0" successfully removed
Jul 31 02:06:25 INFO: Finished Backup of VM 208 (00:15:21)

also, during regular server operation (without vzdump) the following messages are seen:

KT0016:~#
Message from syslogd@KT0016 at Jul 30 01:00:09 ...
kernel:journal commit I/O error

Message from syslogd@KT0016 at Jul 31 01:01:13 ...
kernel:journal commit I/O error

Message from syslogd@KT0016 at Aug 2 01:17:36 ...
kernel:journal commit I/O error

KT0016:~#

dmesg shows:

EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=35865317, block=143458352
EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=35865317, block=143458352
EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=35865318, block=143458352
EXT3-fs error (device dm-2): ext3_get_inode_loc: unable to read inode block - inode=35865318, block=143458352
EXT3-fs error (device dm-2): ext3_find_entry: reading directory #35741701 offset 0
ext3_abort called.
EXT3-fs error (device dm-2): ext3_put_super: Couldn't clean up the journal

tom · Aug 2, 2011

looks like you have files system errors, check your hardware and manually run a fsck.

what kind of storage do you got on /var/lib/vz?

chipitsine · Aug 2, 2011

tom said:
looks like you have files system errors, check your hardware and manually run a fsck.

what kind of storage do you got on /var/lib/vz?

looks like file system errors occur "on the fly". yes, when server boots up it says "checking filesystem ... blah-blah-blah .... recovering from ...." and it happens again and again.
not sure about hardware, regular hardware RAID (different kinds, this, for instance is MPT Fusion raid5).

/var/lib/vz is ext3 over LVM (just default install)

KT0015:~# mount
/dev/mapper/pve-root on / type ext3 (rw,errors=remount-ro)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/mapper/pve-data on /var/lib/vz type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
KT0015:~#

chipitsine · Aug 3, 2011

problem definetely reproduces on vzdump (in snapshot mode):

KT0016:/mnt/nas/pve# vzdump --node 5 --snapshot --compress --storage nas --mailto xxx@xxx 113
INFO: starting new backup job: vzdump --node 5 --snapshot --compress --storage nas --mailto xxx@xxx 113
INFO: Starting Backup of VM 113 (openvz)
INFO: CTID 113 exist mounted running
INFO: status = CTID 113 exist mounted running
INFO: backup mode: snapshot
INFO: bandwidth limit: 20480 KB/s
INFO: ionice priority: 7
INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-KT0016-0')
INFO: Logical volume "vzsnap-KT0016-0" created
INFO: creating archive '/mnt/nas/pve/vzdump-openvz-113-2011_08_03-08_29_36.tgz'
INFO: tar: ./lib/plymouth/renderers/vga16fb.so: Warning: Read error at byte 0, while reading 3072 bytes: Input/output error
INFO: tar: ./lib/plymouth/renderers/frame-buffer.so: Warning: Read error at byte 0, while reading 4096 bytes: Input/output error
INFO: tar: ./lib/plymouth/script.so: Warning: Read error at byte 0, while reading 512 bytes: Input/output error

......

Message from syslogd@KT0016 at Aug 3 08:30:57 ...
kernel:journal commit I/O error

EXT3-fs: mounted filesystem with ordered data mode.
device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Aborting journal on device dm-2.
Buffer I/O error on device dm-2, logical block 75137538
lost page write due to I/O error on dm-2
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-2): ext3_put_super: Couldn't clean up the journal
Remounting filesystem read-only

lvdisplay says its ok:

--- Logical volume ---
LV Name /dev/pve/data
VG Name pve
LV UUID 8FtdHj-tdYT-ZIe4-K7mp-FXmB-CZsG-5iiHMq
LV Write Access read/write
LV snapshot status source of
/dev/pve/vzsnap-KT0016-0 [active]
LV Status available
# open 1
LV Size 573.60 GB
Current LE 146841
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:3

--- Logical volume ---
LV Name /dev/pve/vzsnap-KT0016-0
VG Name pve
LV UUID QMW0nM-PWF2-SECg-exZe-4rwq-0ifx-SKHzH0
LV Write Access read/write
LV snapshot status active destination for /dev/pve/data
LV Status available
# open 1
LV Size 573.60 GB
Current LE 146841
COW-table size 1.00 GB
COW-table LE 256
Allocated to snapshot 0.24%
Snapshot chunk size 4.00 KB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:2

Erwin123 · Aug 3, 2011

I have the exact same issues since ages.
I discovered that small things like a cronjob that makes a Mysqldump can screw the backup although the load is not visibly affected.
Also the problem seems to comes and go. At the moment its worse then ever. Some nodes are affected, some not, very weird.

Aug 03 01:09:14 INFO: Starting Backup of VM 103 (openvz)
Aug 03 01:09:14 INFO: CTID 103 exist mounted running
Aug 03 01:09:14 INFO: status = CTID 103 exist mounted running
Aug 03 01:09:14 INFO: backup mode: snapshot
Aug 03 01:09:14 INFO: ionice priority: 7
Aug 03 01:09:14 INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-node1-0')
Aug 03 01:09:14 INFO: Logical volume "vzsnap-node1-0" created
Aug 03 01:09:15 INFO: creating archive '/mnt/usbdrive/vzdump-openvz-103-2011_08_03-01_09_14.tar'
Aug 03 05:14:03 INFO: tar: ./home/httpd/vhosts/domain.nl/httpdocs/cache/cache_94afbfb2f291e0bf253fcf222e9d238e_5ca2e8e32653a10c9937f84f974fb846: Warning: Read error at byte 0, while reading 666 bytes: Input/output error
Aug 03 05:14:03 INFO: tar: ./home/httpd/vhosts/domain.nl/httpdocs/cache/cache_94afbfb2f291e0bf253fcf222e9d238e_721112e51ba5395c76bf22dd980cc0cb: Warning: Read error at byte 0, while reading 666 bytes: Input/output error
etc. this goes on for a 78MB logfile

In the morning, I check which containers failed and those backups are started again and practicly always succeed.
If it was a disk issue I guess they should fail also?

Messages log:

Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:03 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:08 node1 kernel: printk: 7780 messages suppressed.
Aug 3 05:14:08 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:13 node1 kernel: printk: 7778 messages suppressed.
Aug 3 05:14:13 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:22 node1 kernel: inode=39211255, block=78413961
Aug 3 05:14:23 node1 kernel: printk: 13541 messages suppressed.
Aug 3 05:14:23 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:28 node1 kernel: printk: 7842 messages suppressed.
Aug 3 05:14:28 node1 kernel: lost page write due to I/O error on dm-3
Aug 3 05:14:33 node1 kernel: printk: 7875 messages suppressed.
etc...

Backups are placed on USBdisk I tried NFS but this seems very 30 to 50% slower.
We had this problem with every kernel and PMVE version.

chipitsine · Aug 9, 2011

it happens again and again.
more suspection on vzdump/lvm issues.

however, such things are not known in regular OpenVZ world. maybe ProxMox just picked up bad OpenVZ patchset ?

dietmar · Aug 9, 2011

chipitsine said:
however, such things are not known in regular OpenVZ world. maybe ProxMox just picked up bad OpenVZ patchset ?

Our 2.6.18 kernel include the official/stable OpenVZ patches.

chipitsine · Aug 9, 2011

dietmar said:
Our 2.6.18 kernel include the official/stable OpenVZ patches.

ok, we'll downgrade and see what will happen.

Erwin123 · Aug 12, 2011

We've had this problem in all versions. Currently we're using 2.6.18 and its the same problem. We've also tried changing io scheduler but this makes no difference.

BiagioParuolo · Aug 13, 2011

Have you manually unmouted the dump mount point?

Erwin123 · Aug 13, 2011

We always mount and umount the usbdisks manually...howso?

BiagioParuolo · Aug 14, 2011

Dump mount is where ProxMox backup write tmp file.

Erwin123 · Aug 14, 2011

Ah sorry, no everything is standard configuration. Only change we made was altering the dump filesuze to 2048.
But as said before, the weird thing is that when you restart the failed backups again in the morning they run fine most of the time.
We also had a case where suddenly the backup failed every night at the same time. I stayed awake to see what was happening on the system that could cause the problem. I expected to see a extreme load but all I saw was mysqldump running which didn't even raised the load a bit but obviously was enough to cause the backup to fail.
We changedbthe time the mysqldump ran and that solved it.
Forbthe current issues we can't find any cronjobs that possibly causes this.

Ps this is written on my iphone using tapatalk, so sorry if its a bit hard to read...

BiagioParuolo · Aug 16, 2011

Solved?

chipitsine · Aug 16, 2011

BiagioParuolo said:
Solved?

we installed kernel 2.6.18, everything is ok.
will see in two weeks.

Erwin123 · Aug 22, 2011

Erwin123 said:
I have the exact same issues since ages.
I discovered that small things like a cronjob that makes a Mysqldump can screw the backup although the load is not visibly affected.
Also the problem seems to comes and go. At the moment its worse then ever. Some nodes are affected, some not, very weird.

In the morning, I check which containers failed and those backups are started again and practicly always succeed.
If it was a disk issue I guess they should fail also?

Messages log:

Backups are placed on USBdisk I tried NFS but this seems very 30 to 50% slower.
We had this problem with every kernel and PMVE version.

Does anyone please have any hints or tips how we can solve this?
Its driving me crazy, every morning I have to manually check and restart backups on three nodes that have failed with these errors.

node1:/mnt/usbdrive# pveversion -v
pve-manager: 1.8-18 (pve-manager/1.8/6070)
running kernel: 2.6.18-6-pve
proxmox-ve-2.6.18: 1.8-15
pve-kernel-2.6.24-7-pve: 2.6.24-8
pve-kernel-2.6.24-1-pve: 2.6.24-4
pve-kernel-2.6.18-6-pve: 2.6.18-15
pve-kernel-2.6.24-5-pve: 2.6.24-6
pve-kernel-2.6.24-2-pve: 2.6.24-5
qemu-server: 1.1-30
pve-firmware: 1.0-11
libpve-storage-perl: 1.0-17
vncterm: 0.9-2
vzctl: 3.0.28-1pve1
vzdump: 1.2-14
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm-2.6.18: 0.9.1-15

dietmar · Aug 22, 2011

Erwin123 said:
Does anyone please have any hints or tips how we can solve this?

So everything runs fine if you do not use snapshot mode?

Erwin123 · Aug 22, 2011

We've only used snapshotmode. I understood this is the preferred way of backup up.
'Stop' is not an option since we would get a lot of complaints then (they are all hostingservers).

strange things on vzdump

chipitsine

Guest

Proxmox Staff Member

chipitsine

Guest

Proxmox Staff Member

chipitsine

Guest

chipitsine

Guest

Member

chipitsine

Guest

Proxmox Staff Member

chipitsine

Guest

Member

Active Member

Member

Active Member

Member

Active Member

chipitsine

Guest

Member

Proxmox Staff Member

Member