[SOLVED] PVE not starting (ZFS problem?)

pjuecat

Hi,

[I also posted this in the German forum, but maybe more people will see it here in English.]

I had a strange situation with my PVE. All virtual machines were running, but logging in via the browser was not possible. So I did a restart with a power off, and then I got some errors on the shell. I don't remember the exact error, but it looked like something about a ZFS pool.

I found this thread https://forum.proxmox.com/threads/second-zfs-pool-failed-to-import-on-boot.102409/ and tried the command:

Code:
systemctl status zfs-import.service zfs-import-cache.service
● zfs-import.service
     Loaded: masked (Reason: Unit zfs-import.service is masked.)
     Active: inactive (dead)

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2022-01-03 13:31:29 CET; 1h 26min ago
       Docs: man:zpool(8)
   Main PID: 1641 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 76969)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/zfs-import-cache.service

Jan 03 13:31:28 pve systemd[1]: Starting Import ZFS pools by cache file...
Jan 03 13:31:29 pve systemd[1]: Finished Import ZFS pools by cache file.

Then I tried to enable the ZFS pool with
Code:
systemctl enable zfs-import@POOLNAME
using "zpool" as POOLNAME; I think this created a new symlink.
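For reference, one way to check what that enable actually created (just a sketch, assuming the standard OpenZFS unit layout where zfs-import@.service is wanted by zfs-import.target):
Code:
systemctl status zfs-import@zpool.service          # is the instance enabled / failing?
ls -l /etc/systemd/system/zfs-import.target.wants/ # the symlink that "enable" creates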

After rebooting the server, nothing works anymore and I get a lot of FAIL messages.

Does anybody have an idea how to get my system running again?
Did I make a mistake by setting POOLNAME = zpool? What is the correct POOLNAME?

Thanks and best regards
 

Attachments

  • Screenshot_20221026_193359.jpg
Hi,
From what I can tell, you copy-pasted that output from the linked thread; it is not your own output. Could you show your output of
systemctl status zfs-import.service zfs-import-cache.service?

To find the correct pool name, check the output of sudo cat /etc/pve/storage.cfg. Next to zfspool you will see the name.
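For illustration only (the storage and dataset names below are made up), a zfspool entry in storage.cfg looks roughly like this:
Code:
zfspool: local-zfs
        pool rpool/data
        content images,rootdir

The pool name that zfs-import@ expects is everything before the first "/" of the pool line (rpool in this made-up example). zpool list -H -o name shows the pools that are currently imported, and a plain zpool import lists pools that exist but are not imported yet.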
 
Hello,

That's my output from systemctl status ...
Code:
root@s36-pve:~# systemctl status zfs-import.service zfs-import-cache.service
● zfs-import.service
     Loaded: masked (Reason: Unit zfs-import.service is masked.)
     Active: inactive (dead)

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 2022-10-27 10:44:08 CEST; 11min ago
       Docs: man:zpool(8)
    Process: 679 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN $ZPOOL_IMPORT_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 679 (code=exited, status=0/SUCCESS)
        CPU: 46ms

Oct 27 10:44:05 s36-pve systemd[1]: Starting Import ZFS pools by cache file...
Oct 27 10:44:08 s36-pve systemd[1]: Finished Import ZFS pools by cache file.

BUT: there's no storage.cfg file in the /etc/pve folder; it contains NO files at all?!

Thx
Jürgen
 
Proxmox stores the information in those files in a database; /etc/pve is only populated once the cluster filesystem service (pve-cluster / pmxcfs) is running. If that directory is empty, then Proxmox has indeed not started fully or has encountered problems.
I don't think you have a ZFS pool named zpool, so systemctl enable zfs-import@zpool was a mistake, but not a problematic one.
What is the output of systemctl --failed? Can you figure out why the pve services are not starting correctly?
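A quick sanity check for that (just a sketch; the config.db path is the usual pmxcfs location and an assumption on my part):
Code:
findmnt /etc/pve                          # should show a FUSE mount once pmxcfs is up
ls -l /var/lib/pve-cluster/config.db      # the database backing /etc/pve
journalctl -u pve-cluster -b 0 --no-pager | tail -n 50   # why pve-cluster is not starting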
 
I think the PVE services did start (whether without problems I don't know, but at least some of them started) until I enabled the wrong(?) ZFS pool.

Code:
root@s36-pve:~# systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● corosync.service         loaded failed failed Corosync Cluster Engine
● logrotate.service        loaded failed failed Rotate log files
● man-db.service           loaded failed failed Daily man-db regeneration
● postfix@-.service        loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service      loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon
● zfs-import@disk.service  loaded failed failed Import ZFS pool disk
● zfs-import@zpool.service loaded failed failed Import ZFS pool zpool

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
13 loaded units listed.
 
Try systemctl disable zfs-import@zpool to undo that and see if it helps.
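A minimal sketch of the undo plus a check of which pools actually exist (whether the disk instance was also enabled by hand is only a guess on my part):
Code:
systemctl disable zfs-import@zpool.service
# systemctl disable zfs-import@disk.service   # only if this instance was enabled by hand as well
zpool list      # pools that are currently imported
zpool import    # pools that exist on the disks but are not imported yet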
Looks like almost everything needed for Proxmox failed. You can look up more details using systemctl status pve-cluster (and likewise for each failed service).
There has to be a clue somewhere in journalctl -b 0 about what is going wrong.
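If going through them one by one gets tedious, something like this should print the status of every failed unit in one go (just a sketch, assuming bash):
Code:
for u in $(systemctl --failed --no-legend --plain | awk '{print $1}'); do
    echo "===== $u ====="
    systemctl status "$u" --no-pager --full
done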
 
After the undo and a restart, the systemctl --failed list is three items shorter.

Code:
root@s36-pve:~# systemctl --failed
  UNIT                    LOAD   ACTIVE SUB    DESCRIPTION
● corosync.service        loaded failed failed Corosync Cluster Engine
● postfix@-.service       loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service     loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service    loaded failed failed Proxmox VE firewall
● pve-guests.service      loaded failed failed PVE guests
● pve-ha-crm.service      loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service      loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service    loaded failed failed Proxmox VE scheduler
● pvestatd.service        loaded failed failed PVE Status Daemon
● zfs-import@disk.service loaded failed failed Import ZFS pool disk

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
10 loaded units listed.

Code:
root@s36-pve:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:44 CEST; 4min 16s ago
    Process: 2163 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 5ms

Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Oct 27 13:35:44 s36-pve systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 27 13:35:44 s36-pve systemd[1]: Failed to start The Proxmox VE cluster filesystem.
root@s36-pve:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:44 CEST; 6min ago
    Process: 2137 ExecStart=/usr/bin/pvestatd start (code=exited, status=111)
        CPU: 400ms

Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[1] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[2] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[3] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: Unable to load access control list: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[1] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[2] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[3] failed: Connection refused
Oct 27 13:35:44 s36-pve systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Oct 27 13:35:44 s36-pve systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Oct 27 13:35:44 s36-pve systemd[1]: Failed to start PVE Status Daemon.
root@s36-pve:~# systemctl status zfs-import@disk
● zfs-import@disk.service - Import ZFS pool disk
     Loaded: loaded (/lib/systemd/system/zfs-import@.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:41 CEST; 7min ago
       Docs: man:zpool(8)
    Process: 704 ExecStart=/sbin/zpool import -N -d /dev/disk/by-id -o cachefile=none disk (code=exited, status=1/FAILURE)
   Main PID: 704 (code=exited, status=1/FAILURE)
        CPU: 17ms

Oct 27 13:35:41 s36-pve systemd[1]: Starting Import ZFS pool disk...
Oct 27 13:35:41 s36-pve zpool[704]: cannot import 'disk': no such pool available
Oct 27 13:35:41 s36-pve systemd[1]: zfs-import@disk.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 13:35:41 s36-pve systemd[1]: zfs-import@disk.service: Failed with result 'exit-code'.
Oct 27 13:35:41 s36-pve systemd[1]: Failed to start Import ZFS pool disk.

Code:
root@s36-pve:~# journalctl -b 0
Journal file /var/log/journal/c8df15b121b046a590c4f77598e234b8/system.journal is truncated, ignoring file.
-- Journal begins at Thu 2021-10-07 16:58:14 CEST, ends at Thu 2022-10-27 13:43:39 CEST. --
Oct 27 13:35:41 s36-pve kernel: Linux version 5.15.60-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.>
Oct 27 13:35:41 s36-pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro quiet
Oct 27 13:35:41 s36-pve kernel: KERNEL supported cpus:
Oct 27 13:35:41 s36-pve kernel:   Intel GenuineIntel
Oct 27 13:35:41 s36-pve kernel:   AMD AuthenticAMD
Oct 27 13:35:41 s36-pve kernel:   Hygon HygonGenuine
Oct 27 13:35:41 s36-pve kernel:   Centaur CentaurHauls
Oct 27 13:35:41 s36-pve kernel:   zhaoxin   Shanghai
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Oct 27 13:35:41 s36-pve kernel: signal: max sigframe size: 2032
Oct 27 13:35:41 s36-pve kernel: BIOS-provided physical RAM map:
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000000100000-0x000000002cb91fff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002cb92000-0x000000002ec95fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002ec96000-0x000000002ed18fff] ACPI data
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002ed19000-0x000000002f1b2fff] ACPI NVS
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002f1b3000-0x000000002fba1fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fba2000-0x000000002fc4dfff] type 20
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fc4e000-0x000000002fc4efff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fc4f000-0x000000003cffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000100000000-0x00000010c0ffffff] usable
Oct 27 13:35:41 s36-pve kernel: NX (Execute Disable) protection: active
Oct 27 13:35:41 s36-pve kernel: efi: EFI v2.70 by American Megatrends
Oct 27 13:35:41 s36-pve kernel: efi: ACPI=0x2f119000 ACPI 2.0=0x2f119014 TPMFinalLog=0x2f121000 SMBIOS=0x2f9fe000 SMBIOS 3.0=0x2f9fd000 MEMATTR=0x2654b018 ESRT=0x28e5c918
Oct 27 13:35:41 s36-pve kernel: secureboot: Secure boot could not be determined (mode 0)
Oct 27 13:35:41 s36-pve kernel: SMBIOS 3.3.0 present.
Oct 27 13:35:41 s36-pve kernel: DMI: Intel(R) Client Systems NUC10i7FNH/NUC10i7FNB, BIOS FNCML357.0058.2022.0720.1011 07/20/2022
Oct 27 13:35:41 s36-pve kernel: tsc: Detected 1600.000 MHz processor
Oct 27 13:35:41 s36-pve kernel: tsc: Detected 1599.960 MHz TSC
Oct 27 13:35:41 s36-pve kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Oct 27 13:35:41 s36-pve kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Oct 27 13:35:41 s36-pve kernel: last_pfn = 0x10c1000 max_arch_pfn = 0x400000000
Oct 27 13:35:41 s36-pve kernel: x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
Oct 27 13:35:41 s36-pve kernel: last_pfn = 0x2fc4f max_arch_pfn = 0x400000000
Oct 27 13:35:41 s36-pve kernel: esrt: Reserving ESRT space from 0x0000000028e5c918 to 0x0000000028e5c950.

o_O
 
This does not seem to be the full journal. Could you run journalctl -b 0 > mylog.txt and attach that file?
 
Thanks for your help, the system is running again. Reading the full journal helped to identify the problem: the disk had run out of free space. Deleting some ISOs and old dump files solved all the problems.

:)
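For anyone landing here with the same symptoms, a quick way to spot a full root disk (the ISO and dump paths below are the defaults of the "local" storage and may differ on other setups):
Code:
df -h /                                      # how full is the root filesystem
du -xh --max-depth=1 /var/lib/vz | sort -h   # where the space goes on local storage
ls -lh /var/lib/vz/template/iso              # uploaded ISO images
ls -lh /var/lib/vz/dump                      # old vzdump backup files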
 
