[SOLVED] PVE not starting (ZFS problem?)

pjuecat

Member
Aug 24, 2021
Hi,

[I also posted this in the German forum, but maybe there are more people here in the English one.]

I had a strange situation with my PVE. All virtual machines were running, but logging in via the browser was not possible. So I did a restart via POWER OFF, and then I got some failures on the shell. I don't remember the exact error, but it seemed to be something about a ZFS pool.

I found this thread https://forum.proxmox.com/threads/second-zfs-pool-failed-to-import-on-boot.102409/ and tried the command:

Code:
systemctl status zfs-import.service zfs-import-cache.service
● zfs-import.service
     Loaded: masked (Reason: Unit zfs-import.service is masked.)
     Active: inactive (dead)

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2022-01-03 13:31:29 CET; 1h 26min ago
       Docs: man:zpool(8)
   Main PID: 1641 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 76969)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/zfs-import-cache.service

Jan 03 13:31:28 pve systemd[1]: Starting Import ZFS pools by cache file...
Jan 03 13:31:29 pve systemd[1]: Finished Import ZFS pools by cache file.

Then I tried to enable the ZFS pool with
Code:
systemctl enable zfs-import@POOLNAME
using "zpool" as POOLNAME. I think this created a new symlink.
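This is exactly what I ran, and how I looked for the symlink afterwards (the wants directory path below is from memory, so take it as an assumption):
Code:
systemctl enable zfs-import@zpool.service
# check which symlinks the enable created (path may differ)
ls -l /etc/systemd/system/zfs-import.target.wants/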

After rebooting the server, nothing works and I get a lot of FAIL messages.

Does anybody have an idea how to get my system running again?
Did I make a mistake by setting POOLNAME = zpool? What is the correct POOLNAME?

Thanks and best regards
 

Attachments

  • Screenshot_20221026_193359.jpg (604 KB)
I found this thread https://forum.proxmox.com/threads/second-zfs-pool-failed-to-import-on-boot.102409/ and tried the command:

Code:
systemctl status zfs-import.service zfs-import-cache.service
● zfs-import.service
     Loaded: masked (Reason: Unit zfs-import.service is masked.)
     Active: inactive (dead)

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2022-01-03 13:31:29 CET; 1h 26min ago
       Docs: man:zpool(8)
   Main PID: 1641 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 76969)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/zfs-import-cache.service

Jan 03 13:31:28 pve systemd[1]: Starting Import ZFS pools by cache file...
Jan 03 13:31:29 pve systemd[1]: Finished Import ZFS pools by cache file.
Hi,
From what I can tell, you copy-pasted the output from that thread and this is not your own output. Could you show your output of
systemctl status zfs-import.service zfs-import-cache.service?

To find the correct pool name, check the output of sudo cat /etc/pve/storage.cfg. In the zfspool section you will see the pool name.
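For reference, a zfspool entry usually looks roughly like this (the names below are only examples, yours will differ). The pool line shows which pool or dataset the storage uses; the part before the first / is the actual pool name that zfs-import@ would expect. Alternatively, zpool list prints the names of the pools that are currently imported.
Code:
zfspool: local-zfs
        pool rpool/data
        content images,rootdir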
 
Hello,

that's my output from systemctl status ...
Code:
root@s36-pve:~# systemctl status zfs-import.service zfs-import-cache.service
● zfs-import.service
     Loaded: masked (Reason: Unit zfs-import.service is masked.)
     Active: inactive (dead)

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 2022-10-27 10:44:08 CEST; 11min ago
       Docs: man:zpool(8)
    Process: 679 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN $ZPOOL_IMPORT_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 679 (code=exited, status=0/SUCCESS)
        CPU: 46ms

Oct 27 10:44:05 s36-pve systemd[1]: Starting Import ZFS pools by cache file...
Oct 27 10:44:08 s36-pve systemd[1]: Finished Import ZFS pools by cache file.

BUT: there is no storage.cfg file in the /etc/pve folder; it contains NO files at all?!

Thx
Jürgen
 
BUT: there is no storage.cfg file in the /etc/pve folder; it contains NO files at all?!
Proxmox stores the information in those files in a database and mounts it at /etc/pve via the pve-cluster service (pmxcfs). If that directory is empty, then Proxmox has indeed not started fully or has encountered problems.
I don't think you have a ZFS pool named zpool, so systemctl enable zfs-import@zpool was a mistake but not problematic.
What is the output of systemctl --failed? Can you figure out why the pve services are not starting correctly?
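A quick way to check whether the cluster filesystem came up at all: when pve-cluster is running, /etc/pve is a FUSE mount, so findmnt should show it.
Code:
systemctl status pve-cluster.service
findmnt /etc/pve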
 
I think the PVE services were starting (whether without problems I don't know, but at least some of them started) until I enabled the wrong(?) ZFS pool.

Code:
root@s36-pve:~# systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● corosync.service         loaded failed failed Corosync Cluster Engine
● logrotate.service        loaded failed failed Rotate log files
● man-db.service           loaded failed failed Daily man-db regeneration
● postfix@-.service        loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service      loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon
● zfs-import@disk.service  loaded failed failed Import ZFS pool disk
● zfs-import@zpool.service loaded failed failed Import ZFS pool zpool

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
13 loaded units listed.
 
I think the PVE services were starting (whether without problems I don't know, but at least some of them started) until I enabled the wrong(?) ZFS pool.
Try systemctl disable zfs-import@zpool to undo that and see if it helps.
Code:
root@s36-pve:~# systemctl --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
● corosync.service         loaded failed failed Corosync Cluster Engine
● logrotate.service        loaded failed failed Rotate log files
● man-db.service           loaded failed failed Daily man-db regeneration
● postfix@-.service        loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service      loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service     loaded failed failed Proxmox VE firewall
● pve-guests.service       loaded failed failed PVE guests
● pve-ha-crm.service       loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service       loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service     loaded failed failed Proxmox VE scheduler
● pvestatd.service         loaded failed failed PVE Status Daemon
● zfs-import@disk.service  loaded failed failed Import ZFS pool disk
● zfs-import@zpool.service loaded failed failed Import ZFS pool zpool
Looks like almost everything needed for Proxmox failed. You can look up more details using systemctl status pve-cluster (and the same for each of the other services).
There has to be a clue somewhere in journalctl -b 0 about what is going wrong.
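Something along these lines usually narrows it down (unit names taken from your --failed list; the -p err filter just limits the journal to error messages):
Code:
systemctl status pve-cluster.service pvestatd.service
journalctl -b 0 -u pve-cluster.service --no-pager
journalctl -b 0 -p err --no-pager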
 
After undoing that and restarting, the systemctl --failed list is three items shorter:

Code:
root@s36-pve:~# systemctl --failed
  UNIT                    LOAD   ACTIVE SUB    DESCRIPTION
● corosync.service        loaded failed failed Corosync Cluster Engine
● postfix@-.service       loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service     loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service    loaded failed failed Proxmox VE firewall
● pve-guests.service      loaded failed failed PVE guests
● pve-ha-crm.service      loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service      loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service    loaded failed failed Proxmox VE scheduler
● pvestatd.service        loaded failed failed PVE Status Daemon
● zfs-import@disk.service loaded failed failed Import ZFS pool disk

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
10 loaded units listed.

Code:
root@s36-pve:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:44 CEST; 4min 16s ago
    Process: 2163 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 5ms

Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Oct 27 13:35:44 s36-pve systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 27 13:35:44 s36-pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 27 13:35:44 s36-pve systemd[1]: Failed to start The Proxmox VE cluster filesystem.
root@s36-pve:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:44 CEST; 6min ago
    Process: 2137 ExecStart=/usr/bin/pvestatd start (code=exited, status=111)
        CPU: 400ms

Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[1] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[2] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[3] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: Unable to load access control list: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[1] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[2] failed: Connection refused
Oct 27 13:35:44 s36-pve pvestatd[2137]: ipcc_send_rec[3] failed: Connection refused
Oct 27 13:35:44 s36-pve systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Oct 27 13:35:44 s36-pve systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Oct 27 13:35:44 s36-pve systemd[1]: Failed to start PVE Status Daemon.
root@s36-pve:~# systemctl status zfs-import@disk
● zfs-import@disk.service - Import ZFS pool disk
     Loaded: loaded (/lib/systemd/system/zfs-import@.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-10-27 13:35:41 CEST; 7min ago
       Docs: man:zpool(8)
    Process: 704 ExecStart=/sbin/zpool import -N -d /dev/disk/by-id -o cachefile=none disk (code=exited, status=1/FAILURE)
   Main PID: 704 (code=exited, status=1/FAILURE)
        CPU: 17ms

Oct 27 13:35:41 s36-pve systemd[1]: Starting Import ZFS pool disk...
Oct 27 13:35:41 s36-pve zpool[704]: cannot import 'disk': no such pool available
Oct 27 13:35:41 s36-pve systemd[1]: zfs-import@disk.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 13:35:41 s36-pve systemd[1]: zfs-import@disk.service: Failed with result 'exit-code'.
Oct 27 13:35:41 s36-pve systemd[1]: Failed to start Import ZFS pool disk.

Code:
root@s36-pve:~# journalctl -b 0
Journal file /var/log/journal/c8df15b121b046a590c4f77598e234b8/system.journal is truncated, ignoring file.
-- Journal begins at Thu 2021-10-07 16:58:14 CEST, ends at Thu 2022-10-27 13:43:39 CEST. --
Oct 27 13:35:41 s36-pve kernel: Linux version 5.15.60-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.>
Oct 27 13:35:41 s36-pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro quiet
Oct 27 13:35:41 s36-pve kernel: KERNEL supported cpus:
Oct 27 13:35:41 s36-pve kernel:   Intel GenuineIntel
Oct 27 13:35:41 s36-pve kernel:   AMD AuthenticAMD
Oct 27 13:35:41 s36-pve kernel:   Hygon HygonGenuine
Oct 27 13:35:41 s36-pve kernel:   Centaur CentaurHauls
Oct 27 13:35:41 s36-pve kernel:   zhaoxin   Shanghai
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Oct 27 13:35:41 s36-pve kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Oct 27 13:35:41 s36-pve kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Oct 27 13:35:41 s36-pve kernel: signal: max sigframe size: 2032
Oct 27 13:35:41 s36-pve kernel: BIOS-provided physical RAM map:
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000000100000-0x000000002cb91fff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002cb92000-0x000000002ec95fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002ec96000-0x000000002ed18fff] ACPI data
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002ed19000-0x000000002f1b2fff] ACPI NVS
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002f1b3000-0x000000002fba1fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fba2000-0x000000002fc4dfff] type 20
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fc4e000-0x000000002fc4efff] usable
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x000000002fc4f000-0x000000003cffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Oct 27 13:35:41 s36-pve kernel: BIOS-e820: [mem 0x0000000100000000-0x00000010c0ffffff] usable
Oct 27 13:35:41 s36-pve kernel: NX (Execute Disable) protection: active
Oct 27 13:35:41 s36-pve kernel: efi: EFI v2.70 by American Megatrends
Oct 27 13:35:41 s36-pve kernel: efi: ACPI=0x2f119000 ACPI 2.0=0x2f119014 TPMFinalLog=0x2f121000 SMBIOS=0x2f9fe000 SMBIOS 3.0=0x2f9fd000 MEMATTR=0x2654b018 ESRT=0x28e5c918
Oct 27 13:35:41 s36-pve kernel: secureboot: Secure boot could not be determined (mode 0)
Oct 27 13:35:41 s36-pve kernel: SMBIOS 3.3.0 present.
Oct 27 13:35:41 s36-pve kernel: DMI: Intel(R) Client Systems NUC10i7FNH/NUC10i7FNB, BIOS FNCML357.0058.2022.0720.1011 07/20/2022
Oct 27 13:35:41 s36-pve kernel: tsc: Detected 1600.000 MHz processor
Oct 27 13:35:41 s36-pve kernel: tsc: Detected 1599.960 MHz TSC
Oct 27 13:35:41 s36-pve kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Oct 27 13:35:41 s36-pve kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Oct 27 13:35:41 s36-pve kernel: last_pfn = 0x10c1000 max_arch_pfn = 0x400000000
Oct 27 13:35:41 s36-pve kernel: x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
Oct 27 13:35:41 s36-pve kernel: last_pfn = 0x2fc4f max_arch_pfn = 0x400000000
Oct 27 13:35:41 s36-pve kernel: esrt: Reserving ESRT space from 0x0000000028e5c918 to 0x0000000028e5c950.

o_O
 
This does not seem to be the full journal. Could you run journalctl -b 0 > mylog.txt and attach that file?
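If the file turns out to be too large to attach, compressing it first should help:
Code:
journalctl -b 0 --no-pager > mylog.txt
gzip mylog.txt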
 
Thanks for your help, the system is running again. Reading the full journal helped to identify the problem: the disk had run out of free space. Deleting some ISOs and old dump files solved all the problems.
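In case someone else runs into this, these are roughly the commands I used to find and clear the space (the paths are the defaults for the local storage, adjust if yours is configured differently):
Code:
df -h /
du -sh /var/lib/vz/template/iso /var/lib/vz/dump
# then delete the ISOs / old vzdump backups that are no longer needed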

:)