Help! Proxmox not booting (ZFS: "a start job is running for Import ZFS pools by cache")

entilza

Active Member
Jan 6, 2021
After doing some disk benchmarks, I started losing VMs... one by one, and then the entire server rebooted. I can't believe this.

Now Proxmox won't boot.

"proxmox a start job is running for import zfs pools by cache"
(Timeout)

Then a "tainted" message.

I am about to boot with the install CD and see if I can get anywhere.

I am trying to find info on this and can't find much. Thanks.
 
I can't get anywhere... I am trying to reinstall Proxmox.

I was able to boot the install CD, import my ZFS pool into an alternate directory, and back everything up to a separate drive (rescue steps sketched below).

When I import that second ZFS pool it hangs again with the same errors. I am trying to see if I can tell it not to import that second pool, but at this point I am out of time.
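
For anyone following along, a rescue import from the installer shell looks roughly like this (the pool name and mount point are placeholders, not the exact commands from my session):

# import under an alternate root so nothing touches the live system paths
zpool import -f -R /mnt/rescue tank0     # -R prefixes all mountpoints with /mnt/rescue
zfs list -r tank0                        # check which datasets came up
cp -a /mnt/rescue/tank0/important /media/backupdisk/   # copy data off to another drive
zpool export tank0                       # export cleanly before rebooting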
 
How did you make those benchmarks? I hope you didn't use dd to test write performance on a working ZFS pool. I have seen that before.
 
I reinstalled Proxmox and copied back the pve-cluster database (/var/lib/pve-cluster/config.db).

I was able to get my main VMs going! Thankfully.
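
For anyone who lands here later, the restore is roughly this (the backup path is just a placeholder for wherever you saved the old config.db):

systemctl stop pve-cluster
cp /path/to/backup/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster
ls /etc/pve/qemu-server/      # VM configs should reappear here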

I have 3 pools:

1. rpool (Proxmox)
2. tank0 (fast SSDs)
3. tank1 (large HDDs)

tank1 is causing the hang on import. At least I am back up... I've got to find out how to fix the pool that isn't importing... is it just the cache file? I will try.

I will try to fix this in the morning.
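
If it does turn out to be the cachefile, the usual way to refresh it once the pool imports looks like this (standard zpool usage, not something I've tried yet):

zpool set cachefile=/etc/zfs/zpool.cache tank1   # rewrite the cachefile with current pool state
update-initramfs -u -k all                       # refresh the initramfs copy, as commonly recommended after changing it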
 
Last edited:
How did you make those benchmarks? I hope you didn't use dd to test write performance on a working ZFS pool. I have seen that before.

Thanks for responding. Re-tracing my steps:

I had a Windows VM running in IDE mode.

I added a VirtIO disk as a 2nd drive to get the VirtIO drivers recognized. I ran a CrystalDiskMark benchmark in the VM on the new drive, and it caused the process to hang after a few runs.

At that moment I noticed a different VM had gone down and saw a bunch of kernel errors. I tried to restart it and couldn't, then the other VM went down, and then the Proxmox server rebooted! I was left with one of my ZFS pools not wanting to import.
 
The logs are pointing to:

kvm invoked oom-killer

How can I be out of memory with 128 GB... unless the hung benchmark test made ZFS go crazy?
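
To check, something like this should show whether the ARC was the culprit (standard OpenZFS locations; just a sanity check, nothing more):

journalctl -k | grep -i "oom-killer"                    # confirm the OOM kills in the kernel log
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats    # current ARC size and its ceiling, in bytes
free -h                                                 # overall memory picture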
 
Last edited:
I'm still terrified to attempt to import the second pool, tank1. I read about some command, zpool import -R or something...

root@pve-art:~# zpool import
   pool: tank1
     id: 12700964996125508439
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
         the '-f' flag.
    see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

        tank1                                 ONLINE
          mirror-0                            ONLINE
            ata-ST6000VN0033-2EE110_ZADB78ZR  ONLINE
            ata-ST6000VN0033-2EE110_ZADB6LHW  ONLINE
 
So I've been doing some reading, and I am certain the OOM killer caused this. For whatever reason ZFS went crazy during my CrystalDiskMark tests over VirtIO and crashed.

I am going to limit the ARC size in the hope of preventing this.
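
Something like this is the plan (the 16 GiB value is purely an example, not a recommendation):

echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf   # cap ARC at 16 GiB
update-initramfs -u -k all                                              # takes effect after a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max               # or apply it live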

I still have not attempted to import my second mirror tank.
 
OK, I was able to import it read-only with no problem, but as soon as I import it read-write I get a few messages in dmesg, and all zpool commands freeze while this is happening.
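
For reference, the read-only vs. normal import is basically (standard zpool options):

zpool import -o readonly=on -f tank1   # safe look, nothing gets written
zpool export tank1
zpool import -f tank1                  # full read-write import - this is the step that hangs here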

The drives are still spinning, so maybe I can just wait?

[dmu_objset_find] is going nuts in iotop... maybe I can wait this out!


Jan 18 16:32:37 pve-art kernel: [59332.309350] Call Trace:
Jan 18 16:32:37 pve-art kernel: [59332.309351] __schedule+0x2e6/0x6f0
Jan 18 16:32:37 pve-art kernel: [59332.309352] ? _cond_resched+0x19/0x30
Jan 18 16:32:37 pve-art kernel: [59332.309353] schedule+0x33/0xa0
Jan 18 16:32:37 pve-art kernel: [59332.309353] schedule_preempt_disabled+0xe/0x10
Jan 18 16:32:37 pve-art kernel: [59332.309354] __mutex_lock.isra.10+0x2c9/0x4c0
Jan 18 16:32:37 pve-art kernel: [59332.309356] __mutex_lock_slowpath+0x13/0x20
Jan 18 16:32:37 pve-art kernel: [59332.309356] mutex_lock+0x2c/0x30
Jan 18 16:32:37 pve-art kernel: [59332.309387] spa_all_configs+0x3b/0x120 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309418] zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309449] zfsdev_ioctl+0x6db/0x8f0 [zfs]
Jan 18 16:32:37 pve-art kernel: [59332.309450] ? lru_cache_add_active_or_unevictable+0x39/0xb0
Jan 18 16:32:37 pve-art kernel: [59332.309451] do_vfs_ioctl+0xa9/0x640
Jan 18 16:32:37 pve-art kernel: [59332.309453] ? handle_mm_fault+0xc9/0x1f0
Jan 18 16:32:37 pve-art kernel: [59332.309454] ksys_ioctl+0x67/0x90
Jan 18 16:32:37 pve-art kernel: [59332.309455] __x64_sys_ioctl+0x1a/0x20
Jan 18 16:32:37 pve-art kernel: [59332.309457] do_syscall_64+0x57/0x190
Jan 18 16:32:37 pve-art kernel: [59332.309459] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 18 16:32:37 pve-art kernel: [59332.309460] RIP: 0033:0x7efda23da427
Jan 18 16:32:37 pve-art kernel: [59332.309462] Code: Bad RIP value.
Jan 18 16:32:37 pve-art kernel: [59332.309462] RSP: 002b:00007ffdeb3c9288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 18 16:32:37 pve-art kernel: [59332.309464] RAX: ffffffffffffffda RBX: 000055fe3b3bc430 RCX: 00007efda23da427
Jan 18 16:32:37 pve-art kernel: [59332.309464] RDX: 00007ffdeb3c92c0 RSI: 0000000000005a04 RDI: 0000000000000003
Jan 18 16:32:37 pve-art kernel: [59332.309465] RBP: 00007ffdeb3cc8a0 R08: 00007efda1c29010 R09: 0000000000000000
Jan 18 16:32:37 pve-art kernel: [59332.309466] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
Jan 18 16:32:37 pve-art kernel: [59332.309466] R13: 00007ffdeb3c92c0 R14: 000055fe39622740 R15: 000055fe3b3be3a0
 
Last edited:
OMG, it's up!! So all I had to do was wait this out?! This is scary at boot time when it's your first big issue and there are no warnings, especially on a clean boot. A 'please wait' would have been awesome!

import_hung.png
 
Last edited:
root@pve-art:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:04 with 0 errors on Mon Jan 18 00:31:37 2021
config:

        NAME                                                      STATE     READ WRITE CKSUM
        rpool                                                     ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_256GB_S5GANE0N100537J-part3   ONLINE       0     0     0
            ata-Samsung_SSD_860_PRO_256GB_S5GANA0MA17355T-part3   ONLINE       0     0     0

errors: No known data errors

  pool: tank0
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:42 with 0 errors on Mon Jan 18 00:47:07 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank0                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF014000CL960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF0140005M960CGN  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF029201GM960CGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KB960G8_PHYF0354048D960CGN  ONLINE       0     0     0

errors: No known data errors

  pool: tank1
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:17:02 with 0 errors on Sun Jan 10 00:41:05 2021
config:

        NAME                                  STATE     READ WRITE CKSUM
        tank1                                 ONLINE       0     0     0
          mirror-0                            ONLINE       0     0     0
            ata-ST6000VN0033-2EE110_ZADB78ZR  ONLINE       0     0     0
            ata-ST6000VN0033-2EE110_ZADB6LHW  ONLINE       0     0     0

errors: No known data errors
 
Mine is doing the same thing... but I waited over 8 hours... This is after a recent upgrade. I can boot into the previous version, but can only import the zpool in READ ONLY.
 

Attachments

  • pool no import.JPG (213.7 KB)
Can you keep it running? It may take longer. Honestly, this sort of thing makes me re-evaluate ZFS in production. Slow hard disks? Or does it just get slower with more TB? When this happened to me it only took about 20-30 minutes for my 2 TB, but even after 10 minutes I had no idea what was happening and kept rebooting.
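
One way to tell a slow import from a dead one is to watch the disks from a second console while it runs (just a rough idea; iostat comes from the sysstat package and may need installing first):

iostat -x 5    # steady reads/writes on the pool's disks suggest the import is still working
dmesg -w       # blocked-task traces like the one earlier in this thread can appear even while it progresses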
 
Last edited:
