[SOLVED] ZFS pool fails after power loss and kernel panic

JPVIANOR

Nov 6, 2022
EDIT: I don't have the knowledge to find the real source of the trouble. After many reboots, hours of attempts, and different ways of trying to get the data back, the server seems stable again with nothing done other than reboots.

Hi everyone,
I'm using Proxmox to host all my personal and professional data, and after a power loss my SMEserver won't boot.

And I'm just realizing that I seriously messed up my backups and have no backup of the most valuable and critical data.

I'm a metal worker who has used Linux for many years, but I'm not a pro.

I tried to recover as much as I could myself, but things are getting worse and it's time to admit I need help.

I will, in separate posts to keep things clear:
1 - explain what I'm asking for help with
2 - give as much data as I can to help
3 - explain what I tried to fix it

Thank you for your help,
Kind regards,
Jean-Philippe.
 
Booting a ZFS-capable live Linux like Ubuntu might help to see what the problem is. With that you could try to import the pool (zpool import -f rpool) and see what error message you get. The PVE ISO also has a rescue mode in case there is just a problem with the bootloader partition.
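As a rough sketch of that suggestion, the helper below only prints a cautious import sequence so it can be reviewed before anything is typed on the live system. The read-only import and the /mnt alternate root are assumptions added here, not part of the original advice. The pool name is an argument because the data pool in this thread is called RAID5_4To rather than rpool.

```shell
#!/bin/sh
# Print (do not run) a cautious import sequence for use from a live system.
# Assumptions of this sketch: read-only import and an alternate root of /mnt.
import_plan() {
    pool="${1:-rpool}"
    altroot="${2:-/mnt}"
    # 1. List the pools that are visible for import, without importing anything.
    echo "zpool import"
    # 2. Force the import, but read-only and under an alternate root.
    echo "zpool import -f -o readonly=on -R $altroot $pool"
    # 3. Show health and per-device error counters after the import.
    echo "zpool status -v $pool"
}

# Example: print the plan for the pool name used in the suggestion.
import_plan rpool
```

Calling `import_plan RAID5_4To` prints the same three steps for the data pool. Any error message from the second step is the thing worth posting back in the thread.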
 
First, it prompts "Reading all physical volumes. This may take a while...", then it boots quickly, after 20 seconds. No error, and it displays the normal shell login.
I'm writing a detailed post to give you the situation.
After your message I rebooted to take a video, and 30 seconds after the end of boot this prompt appeared; it did not the previous time:
 

Attachments

  • prompts after correct boot.jpg
I'm asking for your help to recover data.

Situation in brief: Proxmox now runs a pfSense VM and the SMEserver. No critical VMs. Only the files shared by the SMEserver need to be recovered.

Hardware seems OK, zpool status seems OK (status in the next posts, from GUI and shell).
Boot: one HDD
ZFS: 3x 4 TB NAS HDDs carrying the VM disks only. 7 TB available, 4.5 TB used.
The kernel panic was probably caused by a faulty video card; I figured it out when I plugged in a display and found glitches and artifacts. No kernel panic since; all hardware, CPU, and RAM seem fine in the BIOS. I did not keep the kernel panic messages.
The best outcome would be to get the VM back, allowing me to back up correctly and then upgrade to 7.1.

Simple is best; any straightforward solution is welcome. I can also attach the precious VM disk to a new VM just to access the data and copy it to external HDDs.
Booting a ZFS-capable live Linux like Ubuntu might help to see what the problem is. With that you could try to import the pool (zpool import -f rpool) and see what error message you get. The PVE ISO also has a rescue mode in case there is just a problem with the bootloader partition.
If I can access the VM disk files through a live boot, I will.
More difficult: a fresh install of PVE 7.2 on a new disk, keeping the current boot HDD unplugged to preserve the config files, and then re-importing the ZFS pool? Just an idea, just to say I can try.
It's time for details on the situation:
 
Problem: all VM disks look unattached; no VM with a disk in the zpool can start.

First error: starting a VM prompts: pvedaemon[443]: timeout waiting on systemd
Error after reboot: TASK ERROR: timeout: no zvol device link for 'vm-110-disk-0' found after 300 sec found

To get the error, I just "started" the VM and, surprisingly, it started but did nothing (the console view was black, with just the blinking cursor). I asked PVE to stop the VM; restarting then gave "timeout waiting on systemd" again.
When "timeout waiting on systemd" occurs, PVE shutdown takes very long (10 minutes).
And here, after reboot, I'm back to the problem that no VM with data on the zpool can boot (the 300-second error).
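The 300-second error refers to the device link that normally appears under /dev/zvol/<pool>/<volume> for each zvol. A minimal sketch for checking which links are missing is shown below. The function name and the ZVOL_ROOT override are made up for this example; the override exists only so the check can be tried on a machine without ZFS.

```shell
#!/bin/sh
# List zvols whose device link is missing.
# Reads dataset names on stdin, one per line, for example from:
#   zfs list -H -o name -t volume
# ZVOL_ROOT is a hypothetical override used only for trying the function out.
missing_zvol_links() {
    root="${ZVOL_ROOT:-/dev/zvol}"
    while IFS= read -r dataset; do
        [ -n "$dataset" ] || continue
        # -e follows symlinks, so a dangling link also counts as missing.
        [ -e "$root/$dataset" ] || echo "$dataset"
    done
}
```

On the host this would be run as `zfs list -H -o name -t volume | missing_zvol_links`. If a volume such as RAID5_4To/vm-110-disk-0 is printed, the zvol exists in the pool but udev has not created its link, which is where the "udevadm trigger" attempt mentioned later in the thread comes in.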

EDIT: now, starting the machines proceeds with no error but has done nothing for 15 minutes.
Shutdown prompts this: 1667779120606.png

root@noeuda:~# zfs list
NAME                      USED  AVAIL  REFER  MOUNTPOINT
RAID5_4To                5.36T  1.68T   128K  /RAID5_4To
RAID5_4To/vm-105-disk-0   132G  1.78T  21.9G  -
RAID5_4To/vm-106-disk-0  33.0G  1.68T  31.9G  -
RAID5_4To/vm-110-disk-0  5.07T  1.68T  5.07T  -
RAID5_4To/vm-200-disk-0   132G  1.73T  76.7G  -

root@noeuda:~# zpool status
  pool: RAID5_4To
 state: ONLINE
  scan: scrub in progress since Sun Oct  9 00:24:02 2022
        209G scanned at 462M/s, 71.5G issued at 158M/s, 7.81T total
        0B repaired, 0.90% done, 0 days 14:16:16 to go
config:

        NAME        STATE     READ WRITE CKSUM
        RAID5_4To   ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors
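The status above shows the pool ONLINE with a scrub still in progress. As a small illustration, the hypothetical helper below condenses `zpool status` text into one line. It reads from stdin, so it also works on saved output and needs no access to the pool.

```shell
#!/bin/sh
# Summarize "zpool status" text from stdin: pool state and whether a scrub
# is currently running. pool_summary is a name made up for this sketch.
pool_summary() {
    awk '
        $1 == "state:"            { state = $2 }
        /scan: scrub in progress/ { scrub = "running" }
        END { printf "state=%s scrub=%s\n", state, (scrub ? scrub : "idle") }
    '
}
```

Typical use would be `zpool status RAID5_4To | pool_summary`. A running scrub competes with the VMs for disk I/O, and `zpool scrub -p RAID5_4To` pauses it. Whether pausing would have changed anything here is an open question that the thread does not settle.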

root@noeuda:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

See the attached errors during poweroff, and the prompt during boot after that long shutdown.
 

Attachments

  • powerOff errors 01.jpg
  • powerOff errors 02.jpg
  • powerOff errors 03.jpg
  • powerOff errors 04 during reboot after long time taking poweroff.jpg
I tried booting and shutting down after a few minutes, doing nothing in between (no attempt to boot a VM on the zpool) -> poweroff is fast and with NO error.
 
What I tried:

Many reboots.
Upgrade: cannot "apt-get dist-upgrade" (I followed the procedure to add the pve-no-subscription repo but got stuck on a missing PGP key).
Export / import of the ZFS pool: failed. It hung doing nothing (the shell command kept executing with nothing visible; Ctrl+C after 10 minutes). I found an error in the GUI ZFS "status", so I then tried zfs clear -> still executing with nothing visible; Ctrl+C did nothing and the GUI and console froze, BUT the pfSense VM kept working (it was downloading true firewall and Ubuntu ISOs; the GUI and console stayed frozen for 40 minutes during the download).

"udevadm trigger" commands - sorry, I can't find the page; I will search for it later.
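For the missing key in the upgrade attempt above, here is a sketch of the steps as I understand them for PVE 6.x on Debian Buster. The key file name, the download URL, and the pve-no-subscription.list file name are assumptions to verify against the official Proxmox documentation before use. The helper only prints the commands and runs none of them.

```shell
#!/bin/sh
# Print (do not run) the repository and key steps for PVE 6.x on Debian Buster.
# Assumptions of this sketch: key file name and URL as published by Proxmox
# for the 6.x series, and a sources.list.d file name chosen for illustration.
pve6_repo_steps() {
    key="proxmox-ve-release-6.x.gpg"
    echo "wget http://download.proxmox.com/debian/$key -O /etc/apt/trusted.gpg.d/$key"
    echo "echo 'deb http://download.proxmox.com/debian/pve buster pve-no-subscription' > /etc/apt/sources.list.d/pve-no-subscription.list"
    echo "apt-get update"
    echo "apt-get dist-upgrade"
}

# Example: show the plan.
pve6_repo_steps
```

With no backup of the important data, an upgrade like this is probably better left until the data has been copied off the pool.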
 
Currently backing up: I succeeded in restarting the machines with this procedure -> poweroff stuck for 30 minutes waiting for ZFS links, everything stopped, hardware reset -> boot, do nothing, shut down again, with no error. Reboot, no error, wait 5 minutes, start VM, success. Backing up all data to prepare a full reset (and going for TrueNAS instead of SME, managing the RAID directly with TrueNAS and not with ZFS in PVE).
 
Or not: it crashed when trying to copy.
I will do it again one last time tonight.
I hope these elements help you pinpoint the problem.
Thank you for your help; Jean-Philippe
 
The VMs stayed up for many hours last night and backed up many GB, then froze during backup.
After hours of attempts and careful reboots, the VMs can boot, but PVE freezes on a kernel panic after a few minutes.
I found the exact same type of errors in the SMEserver console.
I decided to change my approach to fixing it.
I booted the VMs on a Kubuntu live image and learned how to mount RAID/LVM2 partitions.
It took me 3 hours to complete on the faulty VM: no errors, fast PVE reboots -> so could the SMEserver VM be asking PVE for something wrong and causing the crash? I will see when I read the data while copying it from the mounted VM disk to a USB HDD.
If the data is recovered, I will close the thread.
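For anyone repeating the live-boot approach, mounting RAID/LVM2 partitions from inside the live guest could be sketched as below. The exact commands used were not posted, so this is an assumption about the usual mdadm and LVM sequence. The volume group and logical volume names (main, root) and the mount point are hypothetical placeholders.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command.
# Set DRY_RUN=0 to actually execute them, as root, inside the live system.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assemble software RAID, activate LVM, and mount one logical volume read-only.
mount_guest_lv() {
    vg="$1"
    lv="$2"
    target="${3:-/mnt/recovery}"
    run mdadm --assemble --scan                # assemble md arrays found on the disks
    run vgscan                                 # look for LVM volume groups on top of them
    run vgchange -ay "$vg"                     # activate the logical volumes of that group
    run mkdir -p "$target"
    run mount -o ro "/dev/$vg/$lv" "$target"   # read-only mount of the guest volume
}

# Example with placeholder names: print the plan without touching any disk.
DRY_RUN=1 mount_guest_lv main root
```

The real names can be read from the output of `vgs` and `lvs` in the live session before running anything with DRY_RUN=0.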
 
