PVE7: VMs stopping because host ZFS is full (rpool/ROOT/pve-1)

Hello,

for a few days now we have had a strange issue on a standalone PVE7 host: VMs get stuck and stop responding (host unreachable). The root cause seems to be that the host "/" is filling up:

Environment:
  • Debian Bullseye, pve-manager/7.4-17/513c62be

Code:
Filesystem           Size  Used Avail Use% Mounted on
udev                  32G     0   32G   0% /dev
tmpfs                6.3G  1.4M  6.3G   1% /run
rpool/ROOT/pve-1     6.4G  3.3G  3.1G  52% /
tmpfs                 32G   46M   32G   1% /dev/shm
tmpfs                5.0M     0  5.0M   0% /run/lock
rpool                1.4G  256K  1.4G   1% /rpool
rpool/data           1.4G  256K  1.4G   1% /rpool/data
rpool/ROOT           1.4G  256K  1.4G   1% /rpool/ROOT
rpool/pve-container  1.4G  256K  1.4G   1% /rpool/pve-container
/dev/fuse            128M   24K  128M   1% /etc/pve
tmpfs                6.3G     0  6.3G   0% /run/user/1003
tmpfs                6.3G     0  6.3G   0% /run/user/1024

Code:
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 01:24:24 with 0 errors on Sun Dec 10 01:48:25 2023
config:

    NAME                              STATE     READ WRITE CKSUM
    rpool                             ONLINE       0     0     0
      raidz1-0                        ONLINE       0     0     0
        wwn-0x5002538f4121a940-part2  ONLINE       0     0     0
        wwn-0x5002538f4121a93d-part2  ONLINE       0     0     0
        wwn-0x5002538f4121a947-part2  ONLINE       0     0     0
        wwn-0x5002538f31332a6f-part2  ONLINE       0     0     0
      raidz1-1                        ONLINE       0     0     0
        wwn-0x5002538f4121a945-part2  ONLINE       0     0     0
        wwn-0x5002538f4121a94b-part2  ONLINE       0     0     0
        wwn-0x5002538f41203130-part2  ONLINE       0     0     0
        wwn-0x5002538f43464634-part2  ONLINE       0     0     0

errors: No known data errors


# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  7.25T  7.02T   239G        -         -    82%    96%  1.00x    ONLINE  -

# zfs list
NAME                                USED  AVAIL     REFER  MOUNTPOINT
rpool                              5.14T  1.39G      140K  /rpool
rpool/ROOT                         5.00G  1.39G      140K  /rpool/ROOT
rpool/ROOT/pve-1                   3.29G  3.10G     3.29G  /
rpool/data                          140K  1.39G      140K  /rpool/data
rpool/pve-container                5.13T  1.39G      140K  /rpool/pve-container
rpool/pve-container/vm-100-disk-1  10.3G  2.49G     9.21G  -
rpool/pve-container/vm-100-disk-2  1.07T  1.39G     1.07T  -
rpool/pve-container/vm-101-disk-0  11.4G  1.39G     11.4G  -
rpool/pve-container/vm-101-disk-1  10.3G  7.39G     4.32G  -
rpool/pve-container/vm-103-disk-0  48.9G  21.5G     22.8G  -
rpool/pve-container/vm-103-disk-1  52.0G  5.75G     32.1G  -
rpool/pve-container/vm-104-disk-0  25.8G  5.52G     21.6G  -
rpool/pve-container/vm-105-disk-0  25.8G  8.28G     18.9G  -
rpool/pve-container/vm-105-disk-1  3.88T  1.39G     3.88T  -
rpool/swap                         9.45G  1.39G     9.45G  -
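
For a more detailed breakdown of where the pool space actually goes (dataset data, snapshots, refreservations, children), a space-accounting query like the following can be used; this is a generic ZFS sketch, not output from this host:

Code:
# per-dataset space accounting: AVAIL, USED, USEDSNAP, USEDDS, USEDREFRESERV, USEDCHILD
zfs list -o space -r rpool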

I really have no idea why this happens.

I deleted an unused VM, which gave us 12 GB back, which I could still see yesterday.

I also checked the monitoring graphs and can see that the free space drops from ~400 GB on January 1, 2024 to less than 10 GB. Maybe our syslog VM did something, or our Debian repository, but the real question for me is: how can that be? The VM disks have a limited size assigned, so how can they fill up the host's root filesystem?

I've tried to set a reservation on rpool/ROOT/pve-1:

Code:
root@fc-r02-pmox-06:[/etc/cron.daily]: zfs get reservation rpool/ROOT/pve-1
NAME              PROPERTY     VALUE   SOURCE
rpool/ROOT/pve-1  reservation  5G      local

but without success: the VMs were down again in the morning (07:35). A small note: not all of them, only the bigger, busier VMs such as the Debian repo, syslog, and Puppet master hosts. Our jump host was always fine.
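
For reference, the reservation shown above would have been set with something along these lines (the exact command we used is not quoted here):

Code:
# reserve 5 GiB of pool space for the root dataset so the host itself
# cannot be starved when the pool runs full
zfs set reservation=5G rpool/ROOT/pve-1
zfs get reservation rpool/ROOT/pve-1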


So I'm a bit out of ideas.

Any suggestions?
 
Hi @aaron,

I've checked our syslog host: in that time range a log cleaner ran and cleaned up 300 GB. That is exactly the same 300 GB that is now "lost" on the rpool and for rpool/ROOT/pve-1.

For the syslog host:

Code:
zfs get volsize,refreservation,used rpool/pve-container/vm-105-disk-1
NAME                               PROPERTY        VALUE      SOURCE
rpool/pve-container/vm-105-disk-1  volsize         2.66T      local
rpool/pve-container/vm-105-disk-1  refreservation  2.24T      local
rpool/pve-container/vm-105-disk-1  used            3.88T      -

For the PVE host:

Code:
 zfs get volsize,refreservation,used rpool/ROOT/pve-1
NAME              PROPERTY        VALUE      SOURCE
rpool/ROOT/pve-1  volsize         -          -
rpool/ROOT/pve-1  refreservation  none       default
rpool/ROOT/pve-1  used            3.29G      -
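
To compare these values for all VM disks in one go, a query like this could be used (a sketch based on the dataset layout of this host):

Code:
# volsize vs. space actually consumed for every zvol below rpool/pve-container
zfs get -r -t volume volsize,refreservation,used,logicalused rpool/pve-container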

I understand the overhead you are talking about, but this started on January 1, 2024. The system had been unchanged for years (apart from system upgrades). So it's pretty confusing that a VM can bring down the host's root filesystem.

Update

We started our log cleaner on the VM, which compresses logs and removes old ones. The VM stopped responding, but is still listed as running.

I also saw that:

Code:
rpool                541M  256K  541M   1% /rpool
rpool/data           541M  256K  541M   1% /rpool/data
rpool/ROOT           541M  256K  541M   1% /rpool/ROOT
rpool/pve-container  541M  256K  541M   1% /rpool/pve-container

the free space here kept decreasing while the cleaner was running.
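
To watch this live on the host while the cleaner runs, something simple like this works (just a sketch):

Code:
# refresh pool and root dataset usage every 10 seconds
watch -n 10 'zfs list -o name,used,avail rpool rpool/ROOT/pve-1 rpool/pve-container'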
 
We had issues in the past where the syslog host killed all VMs with I/O load, so we limited it:

Code:
ostype: l26
scsi0: pve-container:vm-105-disk-0,size=25G,ssd=1
scsi1: pve-container:vm-105-disk-1,discard=on,mbps_wr=40,size=2789377M,ssd=1

After that we had no more issues for years, but the OS was upgraded in November from Buster to Bullseye...
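
For completeness, such drive options (write limit, discard, SSD emulation) can also be changed later via qm set; this is a sketch matching the config above, not necessarily the command we originally used:

Code:
# re-apply the scsi1 drive options for VM 105
qm set 105 --scsi1 pve-container:vm-105-disk-1,discard=on,mbps_wr=40,ssd=1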

Update
We found something: we executed fstrim -v /data/syslog on the syslog VM.

The syslog VM was killed again (not responding, meaning we had to qm stop / qm start it). The command only ran for a few seconds before it got stuck, but on the PVE host we now see **much** more free space!

Code:
rpool/ROOT/pve-1      15G  3.4G   12G  23% /
tmpfs                 32G   46M   32G   1% /dev/shm
tmpfs                5.0M     0  5.0M   0% /run/lock
rpool                115G  256K  115G   1% /rpool
rpool/data           115G  256K  115G   1% /rpool/data
rpool/ROOT           115G  256K  115G   1% /rpool/ROOT
rpool/pve-container  115G  256K  115G   1% /rpool/pve-container
/dev/fuse            128M   24K  128M   1% /etc/pve
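
For anyone following along, the sequence that freed the space was roughly this (a sketch; the device and mount point are the ones from our setup):

Code:
# inside the syslog VM: check that the virtual disk advertises discard support
lsblk --discard
# inside the VM: release unused blocks back to the host storage
fstrim -v /data/syslog
# on the PVE host afterwards: verify that the zvol shrank
zfs get used,referenced rpool/pve-container/vm-105-disk-1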
 
Hmm. Besides the parity overhead due to RAIDZ, which you can check by looking at the properties of a disk dataset (zfs get all rpool/pve-container/vm-105-disk-1, for example), are there maybe snapshots that still take up space? But looking through the zfs list output you posted earlier, I don't think so.
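
To double-check the snapshot angle, you can list snapshot usage like this (generic commands, nothing specific to your setup):

Code:
# list all snapshots in the pool and how much space they hold
zfs list -t snapshot -r rpool
# per-dataset snapshot accounting
zfs get usedbysnapshots rpool/pve-container/vm-105-disk-1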

What are the two disk images that take up all the space?
Code:
rpool/pve-container/vm-100-disk-2  1.07T  1.39G     1.07T  -
rpool/pve-container/vm-105-disk-1  3.88T  1.39G     3.88T  -
Can you run the zfs get all command against them and post the output?
 
Hi,

we don't have any snapshots. We delete them after the risky work is done:

Code:
root@fc-r02-pmox-06:[~]: zfs get all rpool/pve-container/vm-100-disk-2
NAME                               PROPERTY              VALUE                  SOURCE
rpool/pve-container/vm-100-disk-2  type                  volume                 -
rpool/pve-container/vm-100-disk-2  creation              Wed Nov  7 10:04 2018  -
rpool/pve-container/vm-100-disk-2  used                  1.07T                  -
rpool/pve-container/vm-100-disk-2  available             115G                   -
rpool/pve-container/vm-100-disk-2  referenced            1.07T                  -
rpool/pve-container/vm-100-disk-2  compressratio         1.01x                  -
rpool/pve-container/vm-100-disk-2  reservation           none                   default
rpool/pve-container/vm-100-disk-2  volsize               820G                   local
rpool/pve-container/vm-100-disk-2  volblocksize          8K                     default
rpool/pve-container/vm-100-disk-2  checksum              on                     default
rpool/pve-container/vm-100-disk-2  compression           on                     inherited from rpool
rpool/pve-container/vm-100-disk-2  readonly              off                    default
rpool/pve-container/vm-100-disk-2  createtxg             2985                   -
rpool/pve-container/vm-100-disk-2  copies                1                      default
rpool/pve-container/vm-100-disk-2  refreservation        227G                   local
rpool/pve-container/vm-100-disk-2  guid                  11084895740725821716   -
rpool/pve-container/vm-100-disk-2  primarycache          all                    default
rpool/pve-container/vm-100-disk-2  secondarycache        all                    default
rpool/pve-container/vm-100-disk-2  usedbysnapshots       0B                     -
rpool/pve-container/vm-100-disk-2  usedbydataset         1.07T                  -
rpool/pve-container/vm-100-disk-2  usedbychildren        0B                     -
rpool/pve-container/vm-100-disk-2  usedbyrefreservation  0B                     -
rpool/pve-container/vm-100-disk-2  logbias               latency                default
rpool/pve-container/vm-100-disk-2  objsetid              89                     -
rpool/pve-container/vm-100-disk-2  dedup                 off                    default
rpool/pve-container/vm-100-disk-2  mlslabel              none                   default
rpool/pve-container/vm-100-disk-2  sync                  standard               inherited from rpool
rpool/pve-container/vm-100-disk-2  refcompressratio      1.01x                  -
rpool/pve-container/vm-100-disk-2  written               1.07T                  -
rpool/pve-container/vm-100-disk-2  logicalused           758G                   -
rpool/pve-container/vm-100-disk-2  logicalreferenced     758G                   -
rpool/pve-container/vm-100-disk-2  volmode               default                default
rpool/pve-container/vm-100-disk-2  snapshot_limit        none                   default
rpool/pve-container/vm-100-disk-2  snapshot_count        none                   default
rpool/pve-container/vm-100-disk-2  snapdev               hidden                 default
rpool/pve-container/vm-100-disk-2  context               none                   default
rpool/pve-container/vm-100-disk-2  fscontext             none                   default
rpool/pve-container/vm-100-disk-2  defcontext            none                   default
rpool/pve-container/vm-100-disk-2  rootcontext           none                   default
rpool/pve-container/vm-100-disk-2  redundant_metadata    all                    default
rpool/pve-container/vm-100-disk-2  encryption            off                    default
rpool/pve-container/vm-100-disk-2  keylocation           none                   default
rpool/pve-container/vm-100-disk-2  keyformat             none                   default
rpool/pve-container/vm-100-disk-2  pbkdf2iters           0                      default
root@fc-r02-pmox-06:[~]:

root@fc-r02-pmox-06:[~]:
root@fc-r02-pmox-06:[~]: zfs get all rpool/pve-container/vm-105-disk-1
NAME                               PROPERTY              VALUE                  SOURCE
rpool/pve-container/vm-105-disk-1  type                  volume                 -
rpool/pve-container/vm-105-disk-1  creation              Fri Nov  9  8:52 2018  -
rpool/pve-container/vm-105-disk-1  used                  3.76T                  -
rpool/pve-container/vm-105-disk-1  available             115G                   -
rpool/pve-container/vm-105-disk-1  referenced            3.76T                  -
rpool/pve-container/vm-105-disk-1  compressratio         1.05x                  -
rpool/pve-container/vm-105-disk-1  reservation           none                   default
rpool/pve-container/vm-105-disk-1  volsize               2.66T                  local
rpool/pve-container/vm-105-disk-1  volblocksize          8K                     default
rpool/pve-container/vm-105-disk-1  checksum              on                     default
rpool/pve-container/vm-105-disk-1  compression           on                     inherited from rpool
rpool/pve-container/vm-105-disk-1  readonly              off                    default
rpool/pve-container/vm-105-disk-1  createtxg             41610                  -
rpool/pve-container/vm-105-disk-1  copies                1                      default
rpool/pve-container/vm-105-disk-1  refreservation        2.24T                  local
rpool/pve-container/vm-105-disk-1  guid                  17497535538946831200   -
rpool/pve-container/vm-105-disk-1  primarycache          all                    default
rpool/pve-container/vm-105-disk-1  secondarycache        all                    default
rpool/pve-container/vm-105-disk-1  usedbysnapshots       0B                     -
rpool/pve-container/vm-105-disk-1  usedbydataset         3.76T                  -
rpool/pve-container/vm-105-disk-1  usedbychildren        0B                     -
rpool/pve-container/vm-105-disk-1  usedbyrefreservation  0B                     -
rpool/pve-container/vm-105-disk-1  logbias               latency                default
rpool/pve-container/vm-105-disk-1  objsetid              791                    -
rpool/pve-container/vm-105-disk-1  dedup                 off                    default
rpool/pve-container/vm-105-disk-1  mlslabel              none                   default
rpool/pve-container/vm-105-disk-1  sync                  standard               inherited from rpool
rpool/pve-container/vm-105-disk-1  refcompressratio      1.05x                  -
rpool/pve-container/vm-105-disk-1  written               3.76T                  -
rpool/pve-container/vm-105-disk-1  logicalused           2.55T                  -
rpool/pve-container/vm-105-disk-1  logicalreferenced     2.55T                  -
rpool/pve-container/vm-105-disk-1  volmode               default                default
rpool/pve-container/vm-105-disk-1  snapshot_limit        none                   default
rpool/pve-container/vm-105-disk-1  snapshot_count        none                   default
rpool/pve-container/vm-105-disk-1  snapdev               hidden                 default
rpool/pve-container/vm-105-disk-1  context               none                   default
rpool/pve-container/vm-105-disk-1  fscontext             none                   default
rpool/pve-container/vm-105-disk-1  defcontext            none                   default
rpool/pve-container/vm-105-disk-1  rootcontext           none                   default
rpool/pve-container/vm-105-disk-1  redundant_metadata    all                    default
rpool/pve-container/vm-105-disk-1  encryption            off                    default
rpool/pve-container/vm-105-disk-1  keylocation           none                   default
rpool/pve-container/vm-105-disk-1  keyformat             none                   default
rpool/pve-container/vm-105-disk-1  pbkdf2iters           0                      default

We will try the 6.2 kernel and check whether it happens again when running the log-cleaner script and fstrim inside the VM.
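
For reference, the opt-in 6.2 kernel on PVE 7 is installed roughly like this (assuming a configured Proxmox repository; package name from memory):

Code:
apt update
apt install pve-kernel-6.2
# reboot afterwards to boot into the new kernel
reboot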

* 105 is the syslog
* 100 is the Debian repo mirror

And maybe good to know:

  • MB: Supermicro X11SSH-TF
  • Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
  • 7x Samsung SSD 870 EVO 1TB
  • 1x Samsung SSD 870 EVO 2TB (a replacement for a broken 1TB)
 
If you take a look at the "referenced" and "volsize" lines for both disks, you will see that "referenced" is always larger than the size of the disk image. That is because RAIDZ comes with additional parity overhead for volume datasets (as explained in the earlier linked docs, or if you search the forums). Therefore, the used space of the pool grows larger than initially expected.

The pool has an ashift of 12, right? The smallest possible allocation is therefore 4k. So for each 8k data block in the disk image, another 4k is used to store the parity. 2.66 TiB (volsize) * 1.5 = 3.99 TiB, and we see a "referenced" of 3.76 TiB. I assume the disk images themselves were full or almost full inside the VMs? In real life the referenced value might be a bit lower due to compression and the like.
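
The two relevant properties can be verified like this (a sketch using the dataset names from this thread):

Code:
# ashift=12 means the smallest allocation is 4 KiB
zpool get ashift rpool
# with volblocksize=8K on RAIDZ1, every 8 KiB data block carries an extra 4 KiB of parity (~50% overhead)
zfs get volblocksize rpool/pve-container/vm-105-disk-1
# rough expectation: 2.66 TiB volsize * 1.5 = 3.99 TiB on-pool, close to the observed 3.76 TiB referenced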

If you run fstrim in the VMs and you see some space being freed, the "discard" option must already be enabled for the VM disk images, right?
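
Whether discard is set can be checked quickly in the VM config, for example (sketch for VM 105):

Code:
# show the drive lines of VM 105 and look for discard=on
qm config 105 | grep -E '^(scsi|virtio|sata|ide)'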

Removing more data is probably the only quick workaround. Maybe limit the components for the Debian repo mirror to only the ones you need?
 
Here comes the fancy stuff:

Syslog VM:

Code:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg2-data  2.7T  927G  1.7T  37% /data/syslog

Debian VM

Code:
 df -h /opt/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg--data-optfs  777G  625G  121G  84% /opt


We installed kernel 6.2 and were able to execute the fstrim command without killing any VMs.

I will see tomorrow whether the VMs get killed again or not.
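
One more thought for later: to keep freed space from piling up again, the standard fstrim timer that Debian ships could be enabled inside the VMs (assuming util-linux provides fstrim.timer, as it does on Bullseye):

Code:
# inside each VM: trim all mounted filesystems periodically
systemctl enable --now fstrim.timer
systemctl status fstrim.timer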
 
