Current fence status: FENCE, with wrong HD space usage

Adi M

I have a cluster with four nodes and ZFS. Each node has 3 drives for the system and 5 drives for VMs/CTs.

Today one node went into the fence state.
What I see: HD space is shown in red at 99.97% (26.54 GiB of 26.55 GiB).
But something must be wrong, because the ZFS pool looks like this (rpool Size = 438 GB; Free = 13.6 GB):
[screenshot: rpool storage summary]

It is still possible to access the web portal without any restriction or slowdown.

I don't know if it's a good idea to just reboot the server, or whether that would cause more problems.
View from an OK node:
[screenshot: storage view from an OK node]

View from the NOK (affected) node:
[screenshot: storage view from the NOK node]

Please let me know what more specific info you need.
 
are these using ZFS as / ? please provide the output of "zfs list" and "df -h" on "pve-yps". thanks!
 
Yes, I'm using ZFS as /.
zfs list:
Code:
NAME                            USED  AVAIL     REFER  MOUNTPOINT
pbpool                          493G  44.6G      493G  /mnt/datastore/pbpool
rpool                           263G  9.41M      139K  /rpool
rpool/ROOT                     26.5G  9.41M      128K  /rpool/ROOT
rpool/ROOT/pve-1               26.5G  9.41M     26.5G  /
rpool/data                      237G  9.41M      128K  /rpool/data
rpool/data/vm-72250-disk-1     50.6G  9.41M     50.5G  -
rpool/data/vm-72250-disk-3     9.18G  9.41M     9.18G  -
rpool/data/vm-72252-disk-0     48.9G  9.41M     46.2G  -
rpool/data/vm-72252-disk-1      208K  9.41M      208K  -
rpool/data/vm-72252-disk-2      123K  9.41M      112K  -
rpool/data/vm-72252-state-tt1  8.71G  9.41M     8.71G  -
rpool/data/vm-72254-disk-0      119G  9.41M      119G  -

df -h:
Code:
Filesystem                Size  Used Avail Use% Mounted on
udev                       24G     0   24G   0% /dev
tmpfs                     4.8G   26M  4.7G   1% /run
rpool/ROOT/pve-1           27G   27G  9.4M 100% /
tmpfs                      24G   46M   24G   1% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
rpool                     9.7M  256K  9.4M   3% /rpool
pbpool                    538G  493G   45G  92% /mnt/datastore/pbpool
rpool/ROOT                9.5M  128K  9.4M   2% /rpool/ROOT
rpool/data                9.5M  128K  9.4M   2% /rpool/data
/dev/fuse                 128M   48K  128M   1% /etc/pve
192.168.0.2:/pve-storage  1.4T  872G  500G  64% /mnt/pve/nas-storage
192.168.0.2:/pve-lvm      1.4T  872G  500G  64% /mnt/pve/nas-lvm
tmpfs                     4.8G     0  4.8G   0% /run/user/0

Thank you
 
your rpool is only 263G, not 438G.. is it possible you underestimated the overhead of raidz? anyhow, you need to (re)move some data, e.g. by cleaning out old backups or similar things (snapshots?), and then re-evaluate your storage concept.
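
to see where the space is actually going, something like this should help (adjust the pool name if it differs on your side):

Code:
# per-dataset breakdown, including how much space is held by snapshots
zfs list -o space -r rpool

# list all snapshots on the pool and the space each one holds
zfs list -t snapshot -r rpool

# compare raw pool size (zpool) with usable space (zfs) to see the raidz overhead
zpool list rpool
zfs list rpool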
 
Hmm, yes, interesting:
[screenshot: rpool usage on pve-yps]

How can I get the space back? All VMs are now running on another node and I have turned off replication.
So can I remove all VMs on "pve-yps"?
When I remove a snapshot, I get "Permission denied".
 
I thought you moved all VMs? then no VM config can be on the "full" node, only volumes..

could you post the output of

- qm list
- pct list
- pvecm status

on pve-yps?
 
I thought you moved all VMs? then no VM config can be on the "full" node, only volumes..
Sorry, they were moved by replication and HA.

could you post the output of


- qm list
- pct list
- pvecm status

on pve-yps?
qm list:
Code:
VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
     72250 Windows10pro         stopped    8192              50.00 0
     72252 Windows11Pro         stopped    16384             50.00 0

pct list: -> empty

pvecm status:
Code:
Cluster information
-------------------
Name:             pve-apnw
Config Version:   6
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
 
okay, so your node is not part of the cluster at the moment and hasn't realized yet that those VMs were stolen by HA..
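
for example, from one of the quorate nodes something like this should show where the configs actually live now:

Code:
# on a quorate node: HA recovery moves the VM config files into the directory
# of whichever node took the VMs over
ls /etc/pve/nodes/*/qemu-server/

# on pve-yps the local, out-of-sync pmxcfs copy will still show them under its
# own node directory until it rejoins the cluster
ls /etc/pve/nodes/pve-yps/qemu-server/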

what about systemctl status pve-cluster corosync and journalctl -b -u pve-cluster | head -n 100 on pve-yps?
 
systemctl status pve-cluster corosync
Code:
pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-01-12 12:17:18 CET; 23h ago
    Process: 3695 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 3786 (pmxcfs)
      Tasks: 6 (limit: 57840)
     Memory: 60.1M
        CPU: 52.025s
     CGroup: /system.slice/pve-cluster.service
             └─3786 /usr/bin/pmxcfs

Jan 13 11:49:21 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 13 11:49:21 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 13 11:49:27 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 13 11:49:27 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 13 11:49:27 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 13 11:49:27 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 13 11:49:33 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 13 11:49:33 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 13 11:49:33 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 13 11:49:33 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 13 11:49:39 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 13 11:49:39 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 13 11:49:39 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 13 11:49:39 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-01-12 12:17:19 CET; 23h ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 3874 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=21)
   Main PID: 3874 (code=exited, status=21)
        CPU: 125ms

Jan 12 12:17:19 pve-yps corosync[3874]:   [KNET  ] host: host: 2 has no active links
Jan 12 12:17:19 pve-yps corosync[3874]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 12 12:17:19 pve-yps corosync[3874]:   [KNET  ] host: host: 2 has no active links
Jan 12 12:17:19 pve-yps corosync[3874]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 12 12:17:19 pve-yps corosync[3874]:   [KNET  ] host: host: 2 has no active links
Jan 12 12:17:19 pve-yps corosync[3874]:   [MAIN  ] Couldn't store new ring id 455 to stable storage: No space left on device (28)
Jan 12 12:17:19 pve-yps corosync[3874]:   [MAIN  ] Corosync Cluster Engine exiting with status 21 at main.c:707.
Jan 12 12:17:19 pve-yps systemd[1]: corosync.service: Main process exited, code=exited, status=21/n/a
Jan 12 12:17:19 pve-yps systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 12 12:17:19 pve-yps systemd[1]: Failed to start Corosync Cluster Engine.

journalctl -b -u pve-cluster | head -n 100
Code:
-- Journal begins at Wed 2022-05-11 19:54:01 CEST, ends at Fri 2023-01-13 11:51:43 CET. --
Jan 12 12:17:16 pve-yps systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [quorum] crit: can't initialize service
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [confdb] crit: can't initialize service
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [dcdb] crit: can't initialize service
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 12 12:17:17 pve-yps pmxcfs[3786]: [status] crit: can't initialize service
Jan 12 12:17:18 pve-yps systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 12 12:17:23 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 12 12:17:23 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 12 12:17:23 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 12 12:17:23 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 12 12:17:29 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 12 12:17:29 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 12 12:17:29 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 12 12:17:29 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 12 12:17:35 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 12 12:17:35 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 12 12:17:35 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
Jan 12 12:17:35 pve-yps pmxcfs[3786]: [status] crit: cpg_initialize failed: 2
Jan 12 12:17:41 pve-yps pmxcfs[3786]: [quorum] crit: quorum_initialize failed: 2
Jan 12 12:17:41 pve-yps pmxcfs[3786]: [confdb] crit: cmap_initialize failed: 2
Jan 12 12:17:41 pve-yps pmxcfs[3786]: [dcdb] crit: cpg_initialize failed: 2
...
 
okay. so one possible way forward (again, if you are sure you don't need those volumes anymore on pve-yps!) would be to delete the volumes using 'zfs destroy' and then reboot the node. hopefully it will be able to rejoin the cluster and re-sync the configuration.
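
roughly like this (the volume names below are just taken from your earlier zfs list output - double-check them before destroying anything, it cannot be undone):

Code:
# list the leftover zvols on pve-yps and verify none of them are still needed
zfs list -t volume -r rpool/data

# destroy a leftover replicated volume; -r also removes its replication snapshots
# (names taken from the earlier zfs list output -- adjust as needed)
zfs destroy -r rpool/data/vm-72250-disk-1
zfs destroy -r rpool/data/vm-72252-disk-0
zfs destroy -r rpool/data/vm-72252-state-tt1

# once enough space is free, reboot so pmxcfs and corosync can start cleanly
reboot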

what I meant by "reconsider" is explained in the link I gave you - basically, raidz with VMs has a high overhead, you might be better off using mirrors instead (or tuning your volblocksize and recreating the volumes, but that has other downsides)..
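
if you go the volblocksize route, it would look roughly like this ("local-zfs" is just an assumed storage name here - check /etc/pve/storage.cfg for yours; the setting only affects newly created volumes):

Code:
# check the volblocksize of an existing volume
zfs get volblocksize rpool/data/vm-72254-disk-0

# raise the blocksize used for new zvols on the PVE zfspool storage
# ("local-zfs" is an assumed storage name -- see /etc/pve/storage.cfg)
pvesm set local-zfs --blocksize 16k

# existing volumes keep their old volblocksize and have to be recreated,
# e.g. via "Move Disk" to another storage and back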
 
yes, you should only delete the volumes, and only if you don't need them!
 
OK, thank you for your help. Somehow the space came back, and after a reboot pve-yps is back in the cluster :)
 
