Problems, need advice urgent!

Erwin123

Member
May 14, 2008
I have a serious problem on one of my nodes.
The RAID (3ware RAID-10) gives errors during the backup of one of the containers.
Yesterday I needed to reboot the node to be able to get into the container again.
While rebooting, Linux stopped the disk check, telling me that fsck couldn't continue because of problems that needed to be fixed by hand.
Since the node had already been down for a while, I just let it boot and everything runs fine.

The backups of that container give a lot of errors about 'couldn't stat' and missing files.
This results in a backup log that is gigabytes in size.

How do I get this container safely off this node?
I guess if I migrate it, it will give the same errors and I will be left with a container with a lot of missing files (or lose it completely?).
If I stop the node again and do something manually (what?) with fsck, will it corrupt the container or the files in it?
How do I save this container?

This is part of the 3w kernel errors:
Jan 3 08:14:22 node3 kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jan 3 08:14:25 node3 kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x101A): Retry queued command:.
Jan 3 08:14:25 node3 kernel: sd 0:0:0:0: [sda] Sense Key : No Sense [deferred] [descriptor]
Jan 3 08:14:25 node3 kernel: Descriptor sense data with sense descriptors (in hex):
Jan 3 08:14:25 node3 kernel: 7f 00 00 00 00 00 00 28 00 00 00 00 00 00 00 00
Jan 3 08:14:25 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 3 08:14:25 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 3 08:14:25 node3 kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jan 3 08:14:29 node3 kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x101A): Retry queued command:.
Jan 3 08:14:29 node3 kernel: sd 0:0:0:0: [sda] Sense Key : No Sense [deferred] [descriptor]
Jan 3 08:14:29 node3 kernel: Descriptor sense data with sense descriptors (in hex):
Jan 3 08:14:29 node3 kernel: 7f 00 00 00 00 00 00 28 00 00 00 00 00 00 00 00
Jan 3 08:14:29 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 3 08:14:29 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 3 08:14:29 node3 kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:53 node3 kernel: WARNING: at fs/buffer.c:1173 mark_buffer_dirty()
Jan 3 08:22:53 node3 kernel: Pid: 2960, comm: updatedb.mlocat Not tainted 2.6.24-7-pve #1
Jan 3 08:22:53 node3 kernel:
Jan 3 08:22:53 node3 kernel: Call Trace:
Jan 3 08:22:53 node3 kernel: [<ffffffff802fd117>] mark_buffer_dirty+0x87/0xa0
Jan 3 08:22:53 node3 kernel: [<ffffffff8033d7f7>] ext3_commit_super+0x57/0xa0
Jan 3 08:22:53 node3 kernel: [<ffffffff8033f512>] ext3_handle_error+0x52/0xd0
Jan 3 08:22:53 node3 kernel: [<ffffffff8033f696>] ext3_error+0x96/0xc0
Jan 3 08:22:53 node3 kernel: [<ffffffff802fca91>] __find_get_block+0xb1/0x1e0
Jan 3 08:22:53 node3 kernel: [<ffffffff802fc26c>] submit_bh+0xfc/0x130
Jan 3 08:22:53 node3 kernel: [<ffffffff80334a3a>] __ext3_get_inode_loc+0x31a/0x380
Jan 3 08:22:53 node3 kernel: [<ffffffff80334ad2>] ext3_read_inode+0x32/0x3c0
Jan 3 08:22:53 node3 kernel: [<ffffffff8033bc23>] ext3_lookup+0x143/0x170
Jan 3 08:22:53 node3 kernel: [<ffffffff802db675>] do_lookup+0x255/0x280
Jan 3 08:22:53 node3 kernel: [<ffffffff802dda50>] __link_path_walk+0x810/0x1380
Jan 3 08:22:53 node3 kernel: [<ffffffff802fc26c>] submit_bh+0xfc/0x130
Jan 3 08:22:53 node3 kernel: [<ffffffff802de665>] link_path_walk+0xa5/0x170
Jan 3 08:22:53 node3 kernel: [<ffffffff802decf5>] do_path_lookup+0xe5/0x380
Jan 3 08:22:53 node3 kernel: [<ffffffff802dd185>] getname+0xc5/0x180
Jan 3 08:22:53 node3 kernel: [<ffffffff802dfc6b>] __user_walk_fd+0x4b/0x80
Jan 3 08:22:53 node3 kernel: [<ffffffff802d64bc>] vfs_lstat_fd+0x2c/0x70
Jan 3 08:22:53 node3 kernel: [<ffffffff802d6527>] sys_newlstat+0x27/0x50
Jan 3 08:22:53 node3 kernel: [<ffffffff8020c69e>] system_call+0x7e/0x83
Jan 3 08:22:53 node3 kernel:
Jan 3 08:23:09 node3 kernel: printk: 93 messages suppressed.
Jan 3 08:23:09 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:23:09 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:23:09 node3 kernel: lost page write due to I/O error on dm-3
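
An excerpt like this is easier to judge once the repeated messages are counted. The following is a minimal sketch, not a finished tool: it assumes the message formats are exactly as pasted above, and the sample lines are copied from the excerpt rather than read from a real log file (the path of that file is not given in the post).

```python
import re
from collections import Counter

# Sample lines copied from the excerpt above; in practice they would be
# read from the syslog file instead.
SAMPLE = """\
Jan 3 08:14:25 node3 kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x101A): Retry queued command:.
Jan 3 08:14:29 node3 kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x101A): Retry queued command:.
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:22:52 node3 kernel: lost page write due to I/O error on dm-3
Jan 3 08:23:09 node3 kernel: printk: 93 messages suppressed.
Jan 3 08:23:09 node3 kernel: lost page write due to I/O error on dm-3
"""

LOST_WRITE = re.compile(r"lost page write due to I/O error on (\S+)")
CONTROLLER = re.compile(
    r"3w-9xxx: scsi\d+: ERROR: \((0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\)"
)
SUPPRESSED = re.compile(r"printk: (\d+) messages suppressed")


def summarize_kernel_log(lines):
    """Count lost writes per device, controller errors per code,
    and the number of messages the kernel says it suppressed."""
    lost_writes = Counter()
    controller_errors = Counter()
    suppressed = 0
    for line in lines:
        match = LOST_WRITE.search(line)
        if match:
            lost_writes[match.group(1)] += 1
            continue
        match = CONTROLLER.search(line)
        if match:
            controller_errors[match.group(1)] += 1
            continue
        match = SUPPRESSED.search(line)
        if match:
            suppressed += int(match.group(1))
    return {
        "lost_writes": lost_writes,
        "controller_errors": controller_errors,
        "suppressed": suppressed,
    }


if __name__ == "__main__":
    summary = summarize_kernel_log(SAMPLE.splitlines())
    print("lost page writes per device:", dict(summary["lost_writes"]))
    print("controller errors per code:", dict(summary["controller_errors"]))
    print("messages suppressed by printk:", summary["suppressed"])
```

The suppressed count matters because the visible "lost page write" lines are only a fraction of what the kernel actually reported.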
 
This looks like a serious hardware failure. Replace the hardware, and hopefully you have a valid backup.
 
The problem is that the backups of this one container all have errors because of this.
They have many missing files:

Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/: Warning: Cannot savedir: Input/output error
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335: Warning: Cannot close: Bad file descriptor
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/migration.result: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/migration.log: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/migration.status: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/dump-plesk.xml: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/dump.xml: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/scout-result.xml: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/supervisor.log: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/archives: Warning: Cannot stat: No such file or directory
etc...
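
With a log this large, it can help to group the tar warnings by kind and by parent directory. In the excerpt above, the first warning is an Input/output error on a directory, and the "Cannot stat" warnings that follow are all for entries under that same directory. The sketch below assumes the warning format is exactly as pasted; the sample lines are taken from the excerpt, and the function name is only illustrative.

```python
import posixpath
import re
from collections import Counter

# Sample lines copied from the backup log excerpt above.
SAMPLE = """\
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/: Warning: Cannot savedir: Input/output error
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335: Warning: Cannot close: Bad file descriptor
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/migration.result: Warning: Cannot stat: No such file or directory
Jan 03 08:23:09 INFO: tar: ./usr/local/psa/PMM/var/2009-03-23-13.27.35.44335/migration.log: Warning: Cannot stat: No such file or directory
"""

WARNING = re.compile(
    r"tar: (?P<path>.+?): Warning: (?P<what>Cannot \w+): (?P<reason>.+)$"
)


def summarize_tar_warnings(lines):
    """Count warnings by (operation, reason) and by parent directory."""
    kinds = Counter()
    dirs = Counter()
    for line in lines:
        match = WARNING.search(line.rstrip("\n"))
        if not match:
            continue
        kinds[(match.group("what"), match.group("reason"))] += 1
        # Drop a trailing slash so a directory is attributed to its parent.
        parent = posixpath.dirname(match.group("path").rstrip("/"))
        dirs[parent] += 1
    return {"kinds": kinds, "dirs": dirs}


if __name__ == "__main__":
    summary = summarize_tar_warnings(SAMPLE.splitlines())
    for (what, reason), count in summary["kinds"].most_common():
        print(f"{count:6d}  {what}: {reason}")
    for directory, count in summary["dirs"].most_common(10):
        print(f"{count:6d}  {directory}")
```

If a few directories account for most of the warnings, those are the places to inspect by hand first.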

Other containers are fine.

The container itself is running fine, nothing seems wrong with it.
The files it says are not there really are there.
What should I do?

1. Try to migrate it (will I get the same errors as with the backups, leaving me with a useless container?).

2. Stop the node and try fsck (or could fsck possibly screw up the container?).

3.?

I have all other containers migrated without problems from the node.
I have a backup planned tonight with the 'stop' option. Will that possibly do any good?

Thanks!
 
Stop the container and try to migrate it offline. If this does not work, reboot the host and try again. But since it looks like the hardware could be faulty, there is no guarantee.
 
Hi Tom,

We didn't dare to move the container, so we moved everything inside it into a new container on our backup server.
This morning a backup should have been made of this new container on the backup server, but it seems as if the RAID problems travelled with it:

Jan 5 07:32:17 node2 kernel: printk: 7965 messages suppressed.
Jan 5 07:32:17 node2 kernel: lost page write due to I/O error on dm-3
Jan 5 07:32:22 node2 kernel: printk: 7959 messages suppressed.
Jan 5 07:32:22 node2 kernel: lost page write due to I/O error on dm-3
Jan 5 07:32:27 node2 kernel: printk: 7954 messages suppressed.
Jan 5 07:32:27 node2 kernel: lost page write due to I/O error on dm-3
Jan 5 07:32:32 node2 kernel: printk: 7948 messages suppressed.
Jan 5 07:32:32 node2 kernel: lost page write due to I/O error on dm-3
Jan 5 07:32:38 node2 kernel: printk: 7134 messages suppressed.
Jan 5 07:32:38 node2 kernel: lost page write due to I/O error on dm-3

And I have a 49 MB log full of: Warning: Cannot stat: No such file or directory.

All other backups of other containers have no problems although I also see the kernel messages.

The first server has a 3ware controller, this one an Areca. Both servers are less than half a year old. How on earth is this possible :(
 
Maybe you are able to solve the problem with the suggestion that the operating system gave you:

Try to boot your server from a live CD such as Knoppix and then do a manual fsck.

I had a similar problem, and when I did a manual check, fsck found several errors and missing inodes that could be fixed.
After the run (it took about 15 minutes and I had to press the Y key several times...) I rebooted and all was fine again.
 
This is a different server.
I don't dare to run fsck if I haven't got a proper backup of what's on the disk.
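
One cautious first step might be a read-only check, which reports problems without repairing anything. The sketch below only builds and runs an `e2fsck` command line with `-n` (open the filesystem read-only and answer "no" to every question) and `-f` (check even if the filesystem is marked clean). It assumes the filesystem is ext3, as the call trace earlier in the thread suggests. The device path is hypothetical: the thread only names the device as dm-3. Results are also only reliable when the filesystem is not mounted.

```python
import shlex
import subprocess


def build_readonly_check(device):
    """Return an e2fsck command line that never writes to the device."""
    # -n: open the filesystem read-only, answer "no" to all questions
    # -f: force a check even if the filesystem looks clean
    return ["e2fsck", "-n", "-f", device]


def run_readonly_check(device):
    """Run the check and return (exit_code, combined_output).

    Needs root privileges and the e2fsprogs package on the machine.
    """
    proc = subprocess.run(
        build_readonly_check(device),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Hypothetical device path, used for illustration only.
    print(shlex.join(build_readonly_check("/dev/dm-3")))
```

A non-zero exit code from such a run indicates that errors were found; nothing on the disk is changed by it.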
 
It turns out all my nodes give kernel RAID errors in the messages log.
The logs of the cards themselves show no errors.

The only thing the servers have in common is that they are Supermicro servers with PVE.
Two have 3ware cards (9xxx series), two have Areca cards.
One has SAS disks, the others SATA, all RAID-10.

I updated the firmware of the 3ware card in the first server mentioned in this thread and fixed the disk error with fsck.
After I put pressure on the I/O, it starts all over again:

Jan 5 20:33:11 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x1115FD4B.
Jan 5 20:33:14 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x1115FD7F.
Jan 5 20:33:18 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160AED.
Jan 5 20:33:33 node3 kernel: 3w-9xxx: scsi0: ERROR: (0x03:0x101A): Retry queued command:.
Jan 5 20:33:33 node3 kernel: sd 0:0:0:0: [sda] Sense Key : No Sense [deferred] [descriptor]
Jan 5 20:33:33 node3 kernel: Descriptor sense data with sense descriptors (in hex):
Jan 5 20:33:33 node3 kernel: 7f 00 00 00 00 00 00 28 00 00 00 00 00 00 00 00
Jan 5 20:33:33 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 5 20:33:33 node3 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 5 20:33:33 node3 kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jan 5 20:33:35 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160AFE.
Jan 5 20:33:38 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160B05.
Jan 5 20:33:41 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160B2E.
Jan 5 20:33:46 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x1116115C.
Jan 5 20:33:50 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x111611CE.
Jan 5 20:33:53 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x111611FD.
Jan 5 20:33:56 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11161206.
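
Every "Sector repair completed" line in this excerpt reports the same phy. A quick way to check whether that holds for the whole log is to group the repairs by phy and look at the LBA range. This is a rough sketch that assumes the message format shown above; what a phy number maps to physically is not stated in this thread, so reading it as one controller port is an assumption.

```python
import re

# Sample lines copied from the excerpt above.
SAMPLE = """\
Jan 5 20:33:18 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160AED.
Jan 5 20:33:35 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160AFE.
Jan 5 20:33:38 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160B05.
Jan 5 20:33:41 node3 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:phy=3, LBA=0x11160B2E.
"""

REPAIR = re.compile(
    r"Sector repair completed:phy=(\d+), LBA=(0x[0-9A-Fa-f]+)"
)


def repairs_by_phy(lines):
    """Group sector repairs by phy: count plus lowest and highest LBA."""
    result = {}
    for line in lines:
        match = REPAIR.search(line)
        if not match:
            continue
        phy = int(match.group(1))
        lba = int(match.group(2), 16)
        entry = result.setdefault(
            phy, {"count": 0, "min_lba": lba, "max_lba": lba}
        )
        entry["count"] += 1
        entry["min_lba"] = min(entry["min_lba"], lba)
        entry["max_lba"] = max(entry["max_lba"], lba)
    return result


if __name__ == "__main__":
    for phy, entry in sorted(repairs_by_phy(SAMPLE.splitlines()).items()):
        print(
            f"phy={phy}: {entry['count']} repairs, "
            f"LBA {entry['min_lba']:#x} to {entry['max_lba']:#x}"
        )
```

If all repairs fall on one phy and within a narrow LBA range, that would point toward a single drive or cable rather than the whole array, though only the controller's own diagnostics can confirm it.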

Is anyone else here using Supermicro servers with PVE?
Do you have errors?
Any ideas or suggestions as to what the **** is going on?