Help! Proxmox not booting (ZFS: "a start job is running for import ZFS pools by cache")

entilza

Well-Known Member
Jan 6, 2021
156
37
48
49
After doing some disk benchmarks I started losing VMs... one by one, and then the entire server rebooted. I can't believe this.

Now Proxmox won't boot.

"proxmox a start job is running for import zfs pools by cache"
(Timeout)

Then a "tainted" message.

I am about to boot with the install CD and see if I can get anywhere.

I am trying to find info on this and can't find much. Thanks.
 
I can't get anywhere... I am trying to reinstall Proxmox.

I was able to boot the install CD, mount my ZFS pools into an alternate directory, and back up everything to a separate drive.

When I import that second ZFS pool it hangs again with the same errors. I am trying to see if I can tell it not to import that second pool, but at this point I am out of time.
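For anyone following along, the alternate-directory import from the install CD was roughly this (the pool name and the /mnt/recovery path below are just placeholders):

# from the install CD / rescue shell -- import under an alternate root
zpool import -f -R /mnt/recovery <poolname>
zfs list -r <poolname>       # datasets should now be mounted under /mnt/recovery
# ...copy the data off to the separate drive, then release the pool again:
zpool export <poolname>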
 
How did you do those benchmarks? I hope you didn't use dd to test write performance on a working ZFS pool. I've seen that before.
 
I reinstalled Proxmox and copied back the pve-cluster database (/var/lib/pve-cluster/config.db).

I was able to get my main VMs going! Thankfully.
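Roughly what the restore looked like (writing this from memory, so double-check the paths on your own box):

# stop the cluster filesystem, drop the backed-up database in, start it again
systemctl stop pve-cluster
cp /path/to/backup/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster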

I have 3 pools:

1. rpool (Proxmox)
2. tank0 (fast SSDs)
3. tank1 (large HDDs)

tank1 is causing the hang on import. At least I am back up... now I have to find out how to fix the pool that isn't importing... is it just the cache file? I will try.
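If it really is just the cache file, my (untested) idea is to rebuild /etc/zfs/zpool.cache from only the pools that are currently imported, so the boot-time import job stops touching tank1:

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
zpool set cachefile=/etc/zfs/zpool.cache rpool
zpool set cachefile=/etc/zfs/zpool.cache tank0
# once tank1 imports cleanly again, add it back:
# zpool set cachefile=/etc/zfs/zpool.cache tank1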

I will try to fix this in the morning.
 
How did you do those benchmarks? I hope you didn't use dd to test write performance on a working ZFS pool. I've seen that before.

Thanks for responding. Re-tracing my steps:

I had a Windows VM running with its disk in IDE mode.

I added a second drive on VirtIO to get the VirtIO drivers recognized. I ran a CrystalDiskMark benchmark in the VM on the new drive and it hung the process after a few runs.

At that moment I noticed a different VM had gone down. I saw a bunch of kernel errors. I tried to restart it and couldn't, then the other VM went down, then the Proxmox server rebooted! I was stuck with one of my ZFS pools not wanting to import.
 
The logs are pointing to:

kvm invoked oom-killer

How can I be out of memory with 128G... unless the hung benchmark test made ZFS go crazy?
 
I'm still terrified to attempt to import the second pool, tank1. I read about some command like zpool import -R or something... (rough command after the output below).

root@pve-art:~# zpool import
   pool: tank1
     id: 12700964996125508439
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
         the '-f' flag.
    see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

        tank1                                 ONLINE
          mirror-0                            ONLINE
            ata-ST6000VN0033-2EE110_ZADB78ZR  ONLINE
            ata-ST6000VN0033-2EE110_ZADB6LHW  ONLINE
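So per that status message, the command would just be the forced import; adding -N should keep the datasets from mounting so nothing gets written while I look around (I have not run this on tank1 yet):

zpool import -f -N tank1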
 
So I've been doing some reading, and I am certain the OOM killer caused this. For whatever reason ZFS went crazy during my CrystalDiskMark tests on the VirtIO drive and crashed.

I am going to limit the ARC size in the hope of preventing this (sketch below).

I still have not attempted to import my second mirror pool.
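The ARC cap I'm planning is the usual module option; the 16 GiB figure below is just a number I picked for this 128G box, not a recommendation:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184   # 16 GiB

# rebuild the initramfs so it applies at the next boot
update-initramfs -u -k all

# or change it on the running system right away
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max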
 
OK, I was able to import read-only with no problem, but as soon as I import read-write, I get a few messages in dmesg and all zpool commands freeze while it's happening (rough commands after the trace below).

The drives are still spinning, maybe I can just wait?

[dmu_objset_find] is going nuts in iotop... maybe I can wait this out!


Jan 18 16:32:37 pve-art kernel: [59332.309350] Call Trace:
Jan 18 16:32:37 pve-art kernel: [59332.309351] __schedule+0x2e6/0x6f0
Jan 18 16:32:37 pve-art kernel: [59332.309352] ? _cond_resched+0x19/0x30
Jan 18 16:32:37 pve-art kernel: [59332.309353] schedule+0x33/0xa0
Jan 18 16:32:37 pve-art kernel: [59332.309353] schedule_preempt_disabled+0xe/0x10
Jan 18 16:32:37 pve-art kernel: [59332.309354] __mutex_lock.isra.10+0x2c9/0x4c0
Jan 18 16:32:37 pve-art kernel: [59332.309356] __mutex_lock_slowpath+0x13/0x20
Jan 18 16:32:37 pve-art kernel: [59332.309356] mutex_lock+0x2c/0x30
Jan 18 16:32:37 pve-art kernel: [59332.309387] spa_all_configs+0x3b/0x120 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309418] zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309449] zfsdev_ioctl+0x6db/0x8f0 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309450] ? lru_cache_add_active_or_unevictable+0x39/0xb0
Jan 18 16:32:37 pve-art kernel: [59332.309451] do_vfs_ioctl+0xa9/0x640
Jan 18 16:32:37 pve-art kernel: [59332.309453] ? handle_mm_fault+0xc9/0x1f0
Jan 18 16:32:37 pve-art kernel: [59332.309454] ksys_ioctl+0x67/0x90
Jan 18 16:32:37 pve-art kernel: [59332.309455] __x64_sys_ioctl+0x1a/0x20
Jan 18 16:32:37 pve-art kernel: [59332.309457] do_syscall_64+0x57/0x190
Jan 18 16:32:37 pve-art kernel: [59332.309459] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 18 16:32:37 pve-art kernel: [59332.309460] RIP: 0033:0x7efda23da427
Jan 18 16:32:37 pve-art kernel: [59332.309462] Code: Bad RIP value.
Jan 18 16:32:37 pve-art kernel: [59332.309462] RSP: 002b:00007ffdeb3c9288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 18 16:32:37 pve-art kernel: [59332.309464] RAX: ffffffffffffffda RBX: 000055fe3b3bc430 RCX: 00007efda23da427
Jan 18 16:32:37 pve-art kernel: [59332.309464] RDX: 00007ffdeb3c92c0 RSI: 0000000000005a04 RDI: 0000000000000003
Jan 18 16:32:37 pve-art kernel: [59332.309465] RBP: 00007ffdeb3cc8a0 R08: 00007efda1c29010 R09: 0000000000000000
Jan 18 16:32:37 pve-art kernel: [59332.309466] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
Jan 18 16:32:37 pve-art kernel: [59332.309466] R13: 00007ffdeb3c92c0 R14: 000055fe39622740 R15: 000055fe3b3be3a0
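For reference, the read-only attempt that worked and the read-write one that triggered the trace above were roughly:

# this imported fine:
zpool import -f -o readonly=on tank1
# exporting and re-importing read-write is what hangs:
zpool export tank1
zpool import -f tank1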
 
OMG, it's up!! So all I had to do was wait it out?! This is scary at boot time when it's your first big issue and there are no warnings, especially on a clean boot. A 'please wait' would have been awesome!

[Screenshot attached: import_hung.png]
 
root@pve-art:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:04 with 0 errors on Mon Jan 18 00:31:37 2021
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_256GB_S5GANE0N100537J-part3  ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_256GB_S5GANA0MA17355T-part3  ONLINE       0     0     0

errors: No known data errors

  pool: tank0
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:42 with 0 errors on Mon Jan 18 00:47:07 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank0                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF014000CL960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF0140005M960CGN  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF029201GM960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF0354048D960CGN  ONLINE       0     0     0

errors: No known data errors

  pool: tank1
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:17:02 with 0 errors on Sun Jan 10 00:41:05 2021
config:

        NAME                                  STATE     READ WRITE CKSUM
        tank1                                 ONLINE       0     0     0
          mirror-0                            ONLINE       0     0     0
            ata-ST6000VN0033-2EE110_ZADB78ZR  ONLINE       0     0     0
            ata-ST6000VN0033-2EE110_ZADB6LHW  ONLINE       0     0     0

errors: No known data errors
 
Mine is doing the same thing... but I waited over 8 hours... This is after a recent upgrade. I can boot into the previous version, but I can only import the zpool READ ONLY.
 

[Attachment: pool no import.JPG]
Can you keep it running? It may just take longer. Honestly, this sort of thing makes me re-evaluate ZFS in production. Slow hard disks? Or does it just get slower with more TB? When it happened to me it only took about 20-30 minutes for 2TB, but even after 10 minutes I had no idea what was happening and kept rebooting.
 