VE 3.4 boot problem after disk replace

Seppo

New Member
May 23, 2017
Hi All,

First time poster.

I've been running a 3.4 install for several years now (24/7) without any problems. It's a small NAS box running 4x 2TB 3.5" hard drives in a RAID10 configuration (single zpool). I recently decided to upgrade drives 3 & 4 to 6TB drives. I had previously upgraded both of these drives from 1TB to 2TB without problems.
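For context, each swap followed the usual replace-and-resilver flow, roughly like this (the pool name rpool and the device IDs below are placeholders, not my exact ones):

zpool status rpool                              # note the by-id name of the old disk
ls -l /dev/disk/by-id/                          # map that name back to the physical drive
zpool replace rpool ata-OLD_2TB_DRIVE ata-NEW_6TB_DRIVE
zpool status -v rpool                           # watch the resilver progress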

After 16 hours of resilvering drive 4 (3.5TB of data), and with 43 minutes to go, my NAS just rebooted (?!)

A little worrying, but I thought no problem. If drive 4 is an issue I can simply delete the partition and try again.

My NAS will no longer boot in any configuration. I've tried it with all 4 drives (new drive 4), without drive 4, and with the old drive 4 (which is still as it was) - no luck.

I keep getting an "alloc magic is broken at xxxxxx" error.

I did not think this was possible. RAID10 should keep running with one failed drive per mirror pair. It's only missing one non-root/non-GRUB drive in bay 4, and GRUB should still be intact.

Any ideas on 1) fixing this, and 2) how this can happen?

Thanks

Steve
 
Anybody?

How about I reduce it to question 2)?

Why is this RAID10 system so fragile? Can anyone explain what happened? How does replacing one disk trash my entire system?

Steve
 
My 2 cents ....

I am not keen on ZFS; it has given me enough grief in a few tests on Proxmox that I simply do my own thing with (unsupported) Linux SW RAID, which is cleaner, simpler, just works, and has less drama (plus bcache is easy and awesome, yay). Or deploy on hardware RAID for an extremely low-drama config. For me, life is too short for filesystem-induced instability.

Are you able to boot with a recovery CD, something that lets you do basic ZFS admin tasks, such as examining the attached ZFS storage/pools/etc., and try to gain insight into what is up?
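Roughly what I'd try from a live environment that has the ZFS tools on it - just a sketch, and the pool name rpool is an assumption (substitute whatever your pool is actually called):

zpool import                                # scan attached disks for importable pools
zpool import -o readonly=on -R /mnt rpool   # import read-only under /mnt so nothing gets written
zpool status -v rpool                       # vdev health and any reported errors
zfs list                                    # confirm the datasets are all still there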

Is it possible that you didn't actually have all the pieces intact after a prior upgrade - that you had (for example) a non-mirrored boot volume, your single-copy boot slice is now gone, and you're left with a non-bootable setup? (I've seen people do a 'failed disk remove/replace' where they forgot to fix up the boot config on the new disk and are left with a system that still boots, but is no longer fully redundant if the 'wrong' disk fails next time.)
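For reference, the fix-up step that often gets skipped after a disk swap looks roughly like this - a sketch only, where /dev/sda (a surviving boot disk) and /dev/sdd (the replacement) are placeholders for whatever your disks really are:

sgdisk --replicate=/dev/sdd /dev/sda    # copy the partition layout from the good boot disk onto the new one
sgdisk --randomize-guids /dev/sdd       # give the copied table its own GUIDs
grub-install /dev/sdd                   # put a boot loader on the new disk as well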

Worst case scenario? Can you access the data on the underlying ZFS storage when booted from a live CD and rescue your VM images? Then you can at least get the VMs onto other storage, clobber / clean install your Proxmox host (hey, good time to upgrade to latest! :) and then copy your VM disk files back in and manually migrate the VMs into the new Proxmox install?
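With the pool imported read-only as in the earlier sketch, the rescue itself is mostly just copying files off - again assuming a default layout where the images live under /var/lib/vz, and /backup is wherever your spare storage happens to be mounted:

rsync -avP /mnt/var/lib/vz/ /backup/vz/           # VM disk images, ISOs, dumps
cp /mnt/var/lib/pve-cluster/config.db /backup/    # backing store for the /etc/pve VM configs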

Not an easy solution in terms of time, but fairly straightforward, and you get your VMs back at the end of it.

Tim
 
Hi Tim,

Thanks for the response.

"unsupported Linux SW raid" - yeah, I used software raid1 when I had just 2 disks, but when I moved to 4 - I went over to ZFS because I was led to believe it was fairly bullet-proof.

"is it possible that you didn't actually have all the pieces intact after a prior upgrade, and you had (for example) non-mirrored boot volume, and then your single copy boot slice is gone, and now you have a non-bootable setup? (I've seen people do a 'failed disk remove replace' where they forgot to fix up the boot config on the new disk, and are left with a system that is still bootable, but is no longer fully redundant in case the 'wrong' disk fails next time)."

I was paranoid about this scenario, so each time I've done a disk replace I've shut down, ejected the disk, and then rebooted in order to identify the disk in the ZFS pool prior to replacing it. The previous upgrade was performed on the same mirror (3 & 4). I even went to the length of viewing the partitions for all disks in parted: disks 1 & 2 had 3 partitions, 3 & 4 had 2. Also, the disks in each mirror are from different manufacturers - mirror 1 is Toshiba and mirror 2 is Western Digital. I will not replace disks 1 & 2 without doing a complete reinstall.
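For what it's worth, the checks I ran each time were along these lines (pool and device names here are placeholders):

zpool status -v rpool       # which by-id member belongs to which mirror
ls -l /dev/disk/by-id/      # map those names back to the physical drives
parted /dev/sda print       # confirm the layout: 3 partitions on disks 1 & 2
parted /dev/sdc print       # 2 partitions on disks 3 & 4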

"hey, good time to upgrade to latest!" -
I've wanted to for a long time, but had no luck with v4.x (maybe 5.x is better?). It would not auto-create my network interfaces, and with zero experience of Linux "containers" I was reluctant to switch and have to manually create NICs for each new VM every time. Also, the migration path was a nightmare. I may re-examine it once I've fixed this issue.

"Are you able to boot with a recovery CD, something that has ability to let you do basic ZFS admin tasks, such as examine attached ZFS storage/pools/etc. and try to gain insight into what is up?" - Yes, I was resigned to having to do this, but wanted to try and understand what had happened before any attempt.

"Worst case scenario? Can you access the data on the underlying ZFS storage / when booted from a live CD, and rescue your VM images? Then you can at least get VMs to other storage; clobber / clean install your Proxmox host (hey, good time to upgrade to latest! :) - and then copy your VM disk files back in / etc - manual migrate the VMs into the proxmox new install ?"

I will attempt a grub reinstall first, and then proceed as you suggest.
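My rough plan for that is the usual live-CD chroot approach (pool name and device names are assumptions based on a default install like mine):

zpool import -f -R /mnt rpool     # -f because the pool was last used by the broken host
mount --bind /dev /mnt/dev        # bring the pseudo filesystems into the chroot
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt /bin/bash
grub-install /dev/sda             # reinstall the boot loader on each bootable disk
grub-install /dev/sdb
update-grub                       # regenerate grub.cfg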

It would be helpful if the Proxmox boot had either 1) a grub repair/reinstall option, or 2) an additional confirmation prior to partition creation/filesystem format, so you could abort after the grub install - if that is the correct order for installation.

I will have a go over the weekend.

Thanks again Tim.
 
Hi Seppo,

Glad to help, and thanks for your clarification follow-up posting. I quite understand it is frustrating when things go sideways like this. At the end of the day, I think the Proxmox dev team have limited time to do customizations for edge-use-case configs such as grub repair. In my experience, if you are using regular grub then any suitably current Linux live CD can be used as a grub rescue assistant, i.e. it does not have to be precisely the distro/installer you used to prep the system, as long as that distro has suitable support for SW RAID, ZFS, or whatever your core requirements are to see accurately what is going on. And generally speaking, I am not sure there is such a thing as an "easy and automated grub repair option" which does not also have the ability to make some situations worse, so I can see why TeamProxmox might be reluctant to ever put such a feature into the boot/installer CD. Rather, if you are sophisticated enough to be doing these things, then there are other tools which let you do it...

Other things:

I'm not quite sure what you mean about Proxmox not 'auto-configuring' network interfaces (maybe relating to you having OpenVZ VMs and the latest Proxmox requiring you to move over to LXC?).

Otherwise, for clarity, when I say "good time to upgrade" what I more accurately mean is "good time to back up the VMs to a different disk, clobber the old install, and do a clean install from scratch using the latest installer version you want to use". Maybe try the v5 release if you are keen and optimistic that it will indeed allow an in-place upgrade from "beta" to "final"; or if you want to be more conservative, stick with v4.latest. Certainly I've deployed plenty of v4.x machines in the last ~year and it "just works nicely" in my experience, and all the extra features over 3.x Proxmox are really very nice.

And also for clarity, my general experience from ZFS testing is that I am happier not using it. Maybe I'm just lazy, but at the end of the day, for me, having a simple config that is easier to support and runs very smoothly outweighs the potential benefits of ZFS in some use-case scenarios. I realize that this may not be the case for everyone, though.

:)

Anyhow. Hope it gets sorted sufficiently. With relatively not-too-much-pain.

Tim
 
Hi Steve, for sure this is outside of what I know - I only did a basic test install of Proxmox 2-disk ZFS RAID about a year ago; I tested it a bit and was not happy with how things behaved (i.e. I had to tweak settings on some VM configs for things to work as expected, and found performance was not as good as vanilla Linux software RAID on comparable, arguably low-end, hardware), so I just gave up my testing at that point.

Hopefully someone else with a bit of ZFS Proxmox experience will be able to comment further on possible next steps!

Tim
 
I've installed v3.4 on a different box and it comes up with a grub partition on the same boundary.

I've decided to copy all the data/VMs off and reinstall and copy it back - it will take days, but at least it will be sorted.

Thanks for your help.
 
