VM stuck on boot recovering journal

serged

New Member
Dec 16, 2011
New Zealand
Hi,

I have a CentOS 6 VM running under Proxmox/KVM which crashed for some reason. When I try to boot it to diagnose the crash, it starts mounting a 500GB LVM volume, but gets stuck with the message:

/dev/mapper/vg_data-lv_data: recovering journal

It's been like this for 5 hours and the console is not responding to the keyboard at all. Has anyone run into this problem before? This isn't the first time I've run into I/O problems with VMs running on Proxmox/KVM. Are there any known I/O issues which could be causing these kinds of problems?
 
Hi,
you can try changing the caching. I have one system where I must append ",cache=writethrough" to the disk entry in the VM config (/etc/qemu-server/VMID.conf).
You must stop and start the VM to activate the change.

Udo
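
As a rough illustration of the config change described above, here is a minimal Python sketch that appends the cache option to the hard-disk lines of a VM config. The function name `add_cache_mode` and the set of disk keys (ide, scsi, virtio) are assumptions made for this example; it works on a string rather than on /etc/qemu-server/VMID.conf directly, so nothing is modified on the host.

```python
import re

# Assumption: disk entries look like "ide0: local:102/vm-102-disk-1.raw",
# and disk keys are ide/scsi/virtio followed by a number.
DISK_KEY = re.compile(r"^(ide|scsi|virtio)\d+$")


def add_cache_mode(config_text, mode="writethrough"):
    """Return config_text with ',cache=<mode>' appended to each hard-disk line."""
    out = []
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        is_disk = sep and DISK_KEY.match(key.strip())
        # Leave CD-ROM drives and lines that already set a cache mode alone.
        if is_disk and "media=cdrom" not in value and "cache=" not in value:
            line = f"{line.rstrip()},cache={mode}"
        out.append(line)
    return "\n".join(out)


sample = """ide2: ISOs:iso/CentOS-6.0-x86_64-bin-DVD1.iso,media=cdrom
bootdisk: ide0
ide0: local:102/vm-102-disk-1.raw
memory: 6144"""

print(add_cache_mode(sample))
```

The VM still has to be stopped and started afterwards for the new setting to take effect.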
 
Tried adding the ",cache=writethrough" parameter to all of the disks the VM has. The full configuration file looks like:

name: akl-centos6
ide2: ISOs:iso/CentOS-6.0-x86_64-bin-DVD1.iso,media=cdrom
vlan0: rtl8139=CE:0E:AF:83:9B:D8
bootdisk: ide0
ostype: l26
ide0: local:102/vm-102-disk-1.raw,cache=writethrough
memory: 6144
onboot: 1
sockets: 4
ide1: local:102/vm-102-disk-2.raw,cache=writethrough
ide3: local:102/vm-102-disk-3.raw,cache=writethrough

However, still no luck :( I will try booting from a live CD to see if I can repair the filesystem journal from there, and then boot the VM normally once the filesystem is repaired.
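
For anyone following along, the live CD repair could look roughly like the sketch below. It is only an assumption about how one might do it, not something confirmed in this thread: the helper names `repair_commands` and `run` are made up for the example, the volume group and logical volume names are inferred from the device path in the boot message, and the script only prints the commands unless `dry_run` is turned off.

```python
import subprocess


def repair_commands(vg="vg_data", lv="lv_data"):
    """Build the commands to activate the volume group and check the filesystem."""
    device = f"/dev/mapper/{vg}-{lv}"
    return [
        ["vgchange", "-ay", vg],   # activate the logical volumes in the group
        ["e2fsck", "-f", device],  # force a check; e2fsck replays the journal first
    ]


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run by default: nothing is executed, the commands are only shown.
run(repair_commands())
```

The filesystem must not be mounted while e2fsck runs on it.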
 
Hi,
is your disk on a fast RAID, or on a single SATA disk which is also used by other VMs? In the latter case it's normal that an fsck takes a long time (which filesystem?).

How is the read performance with "pveperf /var/lib/vz"? (only if the VM disk is on local storage)

Udo
 
Hi,

The disk is on a fast RAID (RAID 10). There are other VMs using the disk, but this hasn't caused other I/O issues. The filesystem on the host is ext3 and on the guest is ext4.

The output of pveperf is as follows:

# pveperf /var/lib/vz
CPU BOGOMIPS: 18619.85
REGEX/SECOND: 764141
HD SIZE: 556.26 GB (/dev/mapper/pve-data)
BUFFERED READS: 209.66 MB/sec
AVERAGE SEEK TIME: 8.07 ms
DNS EXT: 4020.14 ms
DNS INT: 2601.25 ms (company.internal)

I have left the VM running over the weekend, in order to see if it really is a case of fsck taking that long, but it still isn't finished after about three days :(

This definitely doesn't seem to me to be a case of fsck taking a long time, but rather something else going wrong and causing the VM to become completely unresponsive after launching fsck. After this happens, the keyboard doesn't work and sending "Ctrl+Alt+Delete" doesn't work either. The only thing the VM responds to at that point is a "Stop" command from the host.
 
Hi,
right, with this performance and ext4 the fsck should take only a short time.

Strange. I don't really have an idea, but what happens if you use only one CPU core? I guess it's the same, but perhaps...

Udo
 
Hi,

I don't see the FSYNCS/SECOND line in the pveperf output, and the DNS resolution times look very bad to me. The fsyncs-per-second figure is the key parameter for judging I/O performance, so it could be that it is indeed very bad...

Alain
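
To make this kind of check less error-prone, here is a small Python sketch that parses pveperf output and lists which of the usual lines are missing. The list of expected labels is an assumption taken from the two outputs posted in this thread, and the function names are made up for the example.

```python
# Assumption: these are the labels pveperf normally prints, based on the
# outputs shown in this thread.
EXPECTED = [
    "CPU BOGOMIPS", "REGEX/SECOND", "HD SIZE", "BUFFERED READS",
    "AVERAGE SEEK TIME", "FSYNCS/SECOND", "DNS EXT", "DNS INT",
]


def parse_pveperf(output):
    """Map each 'LABEL: value' line of pveperf output to its value string."""
    metrics = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and not line.startswith("#"):
            metrics[name.strip()] = value.strip()
    return metrics


def missing_metrics(output):
    """Return the expected labels that do not appear in the output."""
    found = parse_pveperf(output)
    return [name for name in EXPECTED if name not in found]


sample = """# pveperf /var/lib/vz
CPU BOGOMIPS: 18619.85
REGEX/SECOND: 764141
HD SIZE: 556.26 GB (/dev/mapper/pve-data)
BUFFERED READS: 209.66 MB/sec
AVERAGE SEEK TIME: 8.07 ms
DNS EXT: 4020.14 ms
DNS INT: 2601.25 ms (company.internal)"""

print(missing_metrics(sample))
```

This only reports that a line is absent; it says nothing about why pveperf skipped the test.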
 
Why would the FSYNCS/SECOND line not be showing in the output? From looking at the man page for pveperf, I can't see that the command takes any options to enable/disable any of the parameters.

It seems counter-intuitive to me that the command wouldn't be showing this metric due to it being "too low", as this is the command I am using to try to identify these kinds of problems in the first place.
 
Hi,
normally even very low values are displayed ;-)

BTW, which version of PVE do you use (pveversion -v)? Is it a current one (with which kernel)?

Udo
 
Yes, I don't understand why this line does not show up in the output. AFAIK there is no option for this command, so it is strange... In my opinion, it would be displayed even if the result was very low...

Here is a rather bad result from one of my test servers with Proxmox 2.0 beta3, on RAID 10, but using ext4 (the cause of the bad result, I guess...):
# pveperf /var/lib/vz
CPU BOGOMIPS: 36265.42
REGEX/SECOND: 744706
HD SIZE: 5388.60 GB (/dev/mapper/pve-data)
BUFFERED READS: 609.26 MB/sec
AVERAGE SEEK TIME: 10.22 ms
FSYNCS/SECOND: 310.29
DNS EXT: 57.79 ms
DNS INT: 15.07 ms


Alain
 
The output of the pveversion command is as follows:

Code:
# pveversion -v
pve-manager: 1.9-24 (pve-manager/1.9/6542)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.9-43
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.18-2-pve: 2.6.18-5
pve-kernel-2.6.32-6-pve: 2.6.32-43
qemu-server: 1.1-32
pve-firmware: 1.0-13
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.28-1pve5
vzdump: 1.2-15
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6

I've tried setting the VM to run on only one core, but still no luck.
 
Hi,

You are not at the latest version. For example, my kernel is 2.6.32-50, pve-firmware is 1.0-14, and so on..., but it is not a big deal for this problem.
Can you give more details on your hardware? What type of RAID controller do you use, how many disks are in your RAID 10, what kind of disks (SATA, SAS...), and which filesystem do you use (ext3, ext4...)? How much memory do you have on your host? Do you swap?

It is very strange that the FSYNCS/SECOND line is not displayed. Is pveperf not able to perform the test?

Alain
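
A quick way to compare two installations is to parse the "pveversion -v" output and compare package versions. The Python sketch below is a naive illustration only: the function names are made up, and `version_key` simply compares the numbers found in the version string, which is an assumption and not the real Debian version ordering.

```python
import re


def parse_pveversion(output):
    """Map each 'package: version' line of 'pveversion -v' to its version string."""
    versions = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            # Keep the first token, dropping extras like "(pve-manager/1.9/6542)".
            versions[name.strip()] = value.split()[0]
    return versions


def version_key(version):
    """Naive sort key: the integers in the version string, in order."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def is_older(installed, reference):
    """True if the installed version sorts before the reference version."""
    return version_key(installed) < version_key(reference)


sample = """# pveversion -v
pve-manager: 1.9-24 (pve-manager/1.9/6542)
running kernel: 2.6.32-6-pve
pve-kernel-2.6.32-6-pve: 2.6.32-43
pve-firmware: 1.0-13"""

installed = parse_pveversion(sample)
print(is_older(installed["pve-firmware"], "1.0-14"))
print(is_older(installed["pve-kernel-2.6.32-6-pve"], "2.6.32-50"))
```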
 
