After Backup: boot failed: not a bootable disk

Oli Nasar

Jan 2, 2016
Hi,

I have a very basic setup: Proxmox 4.1-2 running 4 VMs (some Debian wheezy, some Debian jessie, each with a dedicated IP) on local storage (software RAID 1).

All VMs were running fine until this morning, when I discovered that all of them were offline.
There was no host restart last night, and even if there had been, all VMs are set to "start on boot".
But as I said, all VMs were offline.

Now the fatal problem:
When I start the VMs again, they start and run, but in the console I see them boot over and over again, stating "boot failed: not a bootable disk".
All of them!

Nothing happened last night except the Proxmox backup job of all VMs (lzo, snapshot mode).

Restarting the host did not help.
Using a rescue CD with a VM only shows that there are no partitions at all.

Restoring last night's backup gives the same result: no bootable disk.
Restoring the backup from the night before works; the VM starts (with the loss of 1 day, of course).

So I believe the backup job kills the VMs (and backs up the damaged data).

No idea if that helps:
- Under Proxmox 3 the VMs ran with disk cache mode Default (no cache); this did not work with Proxmox 4, so I set them all to Write through.
- The disk type is SATA; I also tried VirtIO.

So the best thing to do at this moment is to not back up the VMs, which is of course not an option!

Any advice is really welcome.
 
Hello,
What's in your host logs, e.g. /var/log/messages, between the time it was working and the time it crashed? Anything relating to underlying disk issues, access issues, extended wait times, or an I/O freeze?

Do you have anything like a weekly RAID check that could be scheduled to fire off at the same time as the backups, thus putting extra load on the disks? On one of my RAID1 [not Proxmox] installations it takes in excess of three hours to complete a full sync check. Any heavy I/O during this time will surely slow it down.
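
To see whether such a check or resync is running at a given moment, /proc/mdstat can be inspected. Here is a minimal Python sketch, assuming Linux software RAID (md) as in the original post. The function name and the sample progress line in the comment are illustrative only and not taken from anyone's system in this thread.

```python
import re

# Matches md progress lines such as:
#   [==>..................]  check = 12.6% (123/456) finish=120.1min speed=118000K/sec
PROGRESS = re.compile(r"\b(check|resync|recovery|repair|reshape)\s*=\s*([\d.]+)%")


def running_md_operations(mdstat_text):
    """Return (array, operation, percent_done) for every busy md array."""
    busy = []
    array = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\S*)\s*:", line)
        if header:
            array = header.group(1)  # e.g. "md0"
            continue
        progress = PROGRESS.search(line)
        if progress and array:
            busy.append((array, progress.group(1), float(progress.group(2))))
    return busy


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as fh:
            operations = running_md_operations(fh.read())
    except OSError:
        operations = []  # no md driver loaded, or not a Linux host
    for array, operation, percent in operations:
        print(f"{array}: {operation} at {percent:.1f}%")
```

Running this from cron shortly before and during the backup window would show whether the two jobs overlap.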

What kind of I/O wait times do you experience when operating the vms normally?
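
In case it helps, one rough way to put a number on I/O wait is to sample the aggregate "cpu" line of /proc/stat twice and look at the share of time spent in the iowait column. This is only a sketch, assuming a Linux host; the function names are made up for illustration, and tools such as iostat -x from the sysstat package give more detailed per-device figures.

```python
import time


def parse_cpu_line(stat_text):
    """Return user, nice, system, idle, iowait, irq, softirq, steal from /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Only the first eight counters: guest time is already part of user/nice.
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent waiting for I/O between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


if __name__ == "__main__":
    try:
        with open("/proc/stat") as fh:
            first = parse_cpu_line(fh.read())
        time.sleep(1)
        with open("/proc/stat") as fh:
            second = parse_cpu_line(fh.read())
        print(f"iowait over the last second: {iowait_percent(first, second):.1f}%")
    except OSError:
        print("/proc/stat is not available on this system")
```

Comparing the value during normal operation with the value during the backup or rsync window would show how much the disks are being saturated.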
 
Hi,

thanks for helping!

Proxmox Host:
I attached the uncut log (only added some comments for easier understanding) as it exceeds 10000 chars.
In short: There hasn't been anything interesting within the timeframe imho.


Guest (VM 112):
There is no entry in that timeframe; the last entry was on December 31st and the next was from the first successful restart (the one after the restore).

Code:
Dec 31 19:47:45 s112 kernel: [ 1026.320122] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Dec 31 19:47:45 s112 kernel: [ 1026.320326] ata1.00: configured for UDMA/100
Dec 31 19:47:45 s112 kernel: [ 1026.320339] ata1.00: device reported invalid CHS sector 0
Dec 31 19:47:45 s112 kernel: [ 1026.320344] ata1: EH complete
Jan  2 14:25:35 s112 kernel: imklog 5.8.11, log source = /proc/kmsg started.
Jan  2 14:25:35 s112 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2045" x-info="http://www.rsyslog.com"] start
Jan  2 14:25:35 s112 kernel: [  0.000000] Initializing cgroup subsys cpuset
Jan  2 14:25:35 s112 kernel: [  0.000000] Initializing cgroup subsys cpu


I/O wait times... how do I measure them?
But wait a second, there is another server pushing files via rsync to this Proxmox host at about 04:30. That job is limited to 60MB/s, but that probably brings the hard drives (RAID 1) to their maximum.
But even so, how could that affect the backup of the VMs? I mean, it's slower, okay, but how does that ruin the VM guests' bootloader?
 

Attachments

  • log.txt
    145.5 KB
Update: It just happened again a few minutes ago, without a running backup but while files were being transferred from another server to this Proxmox host.

Code:
Jan  5 21:18:00 root941 kernel: [284768.892072] kvm[3455]: segfault at 2 ip 0000559937ec765d sp 00007f5566205ea0 error 4 in kvm[559937c82000+4c6000]

One VM just appeared offline. I only noticed because I wanted to connect to it.
So it has nothing to do with the backup but somehow with the server load... ?

Update2:
The other server (moving the files to the proxmox host) shows entries like:
Code:
Jan  5 21:07:01 server2 kernel: ct0 nfs: server 192.168.1.4 not responding, still trying
Jan  5 21:07:48 server2 kernel: ct0 nfs: server 192.168.1.4 not responding, still trying
Jan  5 21:07:48 server2 kernel: ct0 nfs: server 192.168.1.4 OK
Jan  5 21:07:57 server2 kernel: ct0 nfs: server 192.168.1.4 OK

So it really seems as if the proxmox host is under heavy load, the VMs get killed somehow, maybe a timeout. The question is: Why are their partitions killed?
 
I have experienced the same thing with PVE 5.4. I had 3 VMs running and was restoring a large (29G) VM from backup; the system load was around 30 as reported by top. When the restore finished, the disks of all 4 VMs were unbootable: "not a bootable disk".
So it seems that there is a bug when the I/O load is high.
 
Is this a joke? Same problem here: the VM was running fine, then it suddenly crashed. When I stop and start it again, the VM says "No bootable device", although the QEMU hard disk shows up in the boot list.
OK, so I restored yesterday's backup: same problem. The restore completed fine, but there is no bootable disk. WTF? OK, then I restored the backup from 3 days ago, and ta-da, same result!

It's a nightmare!

But on the same Proxmox host, another VM is running fine. What happened? This issue also happened to me with the same Proxmox version six months ago, but I reinstalled without investigating.

One more data point: I tried booting the VM with Rescatux and other tools. When I try to recover data from the disk, they say: no partition, no data.

Any clue? HELP!
 
It happened to me as well: the server crashed, then I restarted it, and now almost all VMs show "not a bootable disk". Please help me, somebody.
 
Is there a common element here? I see one post that indicates an NFS share in use. What about the other users? Also, the original posts in this thread are from 2016. That's a long time ago. What happened back then is not necessarily an issue now. I think it would have been better to create a new post.

That being said, for me I find it terrifying that there are situations where a VM's disk can be destroyed somehow, just by a crash or high disk load. @Ricardo Bernao your situation is even stranger. In your situation it seems like there was file corruption in your VM that was not evident until something happened to require a reboot. That's the only reason I can think of for your backups to also be unbootable.

I'm sure there's more to this than meets the eye, and to get to the bottom of it I think we need more details.

Could you all please post your Proxmox version details, and a bit more about your systems? What type of virtual disk interface do you use?
 
Same error here, any solution?
Thanks!!

Code:
# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
Four days ago I installed all updates on my server. No backup errors were reported after that. Today I checked some old VMs and found that two of them are broken: no bootable disk.
I booted one of these VMs with the pmagic ISO and opened the disk with GParted. It was unpartitioned, like a new, empty VM disk.
It seems the problem came with one of the latest updates, and there is still no well-founded solution to be found.
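
For anyone who wants to check a disk without booting a rescue ISO, the first sector can be inspected directly. This is a rough Python sketch, assuming a raw image file or a block device (an LVM volume or ZFS zvol) with a classic MBR layout; qcow2 files have their own header and would need converting first, and GPT disks only show a protective entry here. The function name and the dictionary keys are made up for this example.

```python
import struct
import sys

SECTOR_SIZE = 512


def inspect_boot_sector(sector):
    """Summarize the MBR found in the first 512 bytes of a raw disk."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    partitions = []
    for index in range(4):
        # The four 16-byte partition entries start at offset 446.
        entry = sector[446 + 16 * index:446 + 16 * (index + 1)]
        boot_flag, part_type = entry[0], entry[4]
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if part_type != 0:  # type 0 marks an unused slot
            partitions.append({
                "index": index + 1,
                "bootable": boot_flag == 0x80,
                "type": part_type,
                "start_lba": start_lba,
                "sectors": sector_count,
            })
    return {
        "all_zero": not any(sector[:SECTOR_SIZE]),
        "boot_signature": sector[510:512] == b"\x55\xaa",
        "partitions": partitions,
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 check_mbr.py <path to raw image or block device>
    with open(sys.argv[1], "rb") as disk:
        print(inspect_boot_sector(disk.read(SECTOR_SIZE)))
```

A result with all_zero set to True matches what GParted shows for a "new empty" disk. Running the same check on a restored backup before starting the VM would tell whether the backup itself already contains the damage.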
 
If you need to recover this VM, contact me.
 
I experienced the same problem today: several of my VMs keep rebooting and say there is no bootable drive. The last thing I did was create a backup. I tried restoring from backup and ended up with the same result. My backups are stored on a NAS server that I mount on my Proxmox host.

Has anyone found a solution to this?
 
If you need support, I can do this for you and recover your VM data; you can pay me via PayPal. Send me a PM. Bye!
 
So here is what I have learned about this issue.
I have observed that when performing restores to an SSD drive, if the aggregate restore bandwidth is higher than 10MB/s, the restore process will corrupt the disk images of other VMs.

So I only ever restore 1 VM at a time, at a maximum of 10MB/s, and have never encountered this issue since.

My VM filesystem is ZFS.
My backups are coming from an NFS-mounted system.

Interestingly, when I perform the same restores on a spinning disk, it has never corrupted the other VMs and no issues are encountered.

So find a valid backup and restore the VMs one at a time at a max of 10MB/s.
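
As a rough illustration of that procedure, the Python sketch below builds one rate-limited restore command per VM. It assumes the installed qmrestore accepts a --bwlimit option given in KiB/s, as recent Proxmox VE releases document; please check the qmrestore man page on your version first. The archive paths and VM IDs are placeholders, and the script only prints the commands rather than running them.

```python
import shlex


def qmrestore_command(archive, vmid, limit_mb_per_s=10.0):
    """Build a qmrestore call capped at roughly limit_mb_per_s megabytes per second."""
    if limit_mb_per_s <= 0:
        raise ValueError("limit must be positive")
    # Assumption: --bwlimit is interpreted in KiB/s; convert from MB/s accordingly.
    limit_kib_per_s = int(limit_mb_per_s * 1_000_000 / 1024)
    return ["qmrestore", archive, str(vmid), "--bwlimit", str(limit_kib_per_s)]


if __name__ == "__main__":
    # Placeholder archives and VM IDs, for illustration only.
    backups = [
        ("/path/to/backup-of-first-vm.vma.lzo", 100),
        ("/path/to/backup-of-second-vm.vma.lzo", 101),
    ]
    for archive, vmid in backups:
        # Dry run: print each command. Running them one after another
        # (for example with subprocess.run(cmd, check=True)) keeps the
        # restores strictly sequential.
        print(shlex.join(qmrestore_command(archive, vmid)))
```

Because each command would only start after the previous one has finished, the aggregate restore bandwidth never exceeds the single limit.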

At this point it looks like Proxmox is not dealing well with the high I/O capability of the SSD and gets confused at a low level, hence touching the other VMs' partitions in some unpleasant way.
 
