VM ZFS dataset consumes all available space

slonick81

Member
Sep 24, 2019
Hi!

I'm running a simple 1-node Proxmox 6 setup with three 4TB drives in RAIDZ for the system and VMs. Actually, there is only one VM right now: an FTP server built on Debian 9 + ProFTPD + Webmin. It has two virtual disks: a small 32G disk for the system and a 6.4TB disk for FTP content. This is the current layout:

Code:
root@ftp:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 09:02:28 with 0 errors on Sun Sep  8 09:26:29 2019
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          raidz1-0                                            ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK1334PBGZUA9S-part3     ONLINE       0     0     0
            ata-WDC_WD4000FYYZ-01UL1B1_WD-WCC131934012-part3  ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK1334PBGYJV3S-part3     ONLINE       0     0     0

errors: No known data errors
root@ftp:~# zfs list
NAME                       USED  AVAIL     REFER  MOUNTPOINT
rpool                     7.04T   605M      139K  /rpool
rpool/ROOT                1.60G   605M      128K  /rpool/ROOT
rpool/ROOT/pve-1          1.60G   605M     1.60G  /
rpool/data                7.03T   605M      128K  /rpool/data
rpool/data/vm-100-disk-0  22.0G   605M     22.0G  -
rpool/data/vm-100-disk-1  7.01T   605M     7.01T  -
root@ftp:~# df -h
Filesystem        Size  Used Avail Use% Mounted on
udev               16G     0   16G   0% /dev
tmpfs             3.2G  8.9M  3.2G   1% /run
rpool/ROOT/pve-1  2.2G  1.6G  605M  74% /
tmpfs              16G   43M   16G   1% /dev/shm
tmpfs             5.0M     0  5.0M   0% /run/lock
tmpfs              16G     0   16G   0% /sys/fs/cgroup
rpool             605M  256K  605M   1% /rpool
rpool/data        605M  128K  605M   1% /rpool/data
rpool/ROOT        605M  128K  605M   1% /rpool/ROOT
/dev/fuse          30M   16K   30M   1% /etc/pve
tmpfs             3.2G     0  3.2G   0% /run/user/0

As you can see, vm-100-disk-1 has grown far beyond the expected 6.4TB and is now actively consuming all the space. Yesterday the whole system got stuck because there was no space left on rpool/ROOT/pve-1 (size 2.5G, used 2.5G, available 0), and it was not accessible via the web interface or SSH. The only way to get it running again was to log in over IPMI-KVM, free some space (I cleaned the apt-get cache) and reboot. There were some unused VM disks; I deleted them, got about 120GB of free space on rpool and switched to other tasks. Last night there were some uploads and downloads by my colleagues and the system got stuck again, with rpool/ROOT/pve-1 now shrunk to 2.2G. I had to delete some ISOs to free some space (605M) and stop the FTP service after the reboot, so for now the situation is stable.

The funniest part is that the FTP volume itself is not full. It has never had less than 700GB of free space.

Code:
root@debftp:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           798M   11M  788M   2% /run
/dev/sda1        28G  4.5G   22G  18% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sdb1       6.3T  5.0T  1.1T  83% /ftp
tmpfs           798M  4.0K  798M   1% /run/user/113
tmpfs           798M     0  798M   0% /run/user/0

root@debftp:~# fdisk -l
Disk /dev/sdb: 6.4 TiB, 6979321856000 bytes, 13631488000 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 1F5B5652-F825-7442-92D5-3378EED2CCE4

Device     Start         End     Sectors  Size Type
/dev/sdb1   2048 13631487966 13631485919  6.4T Linux filesystem

Disk /dev/sda: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6e25444e

Device     Boot    Start      End  Sectors Size Id Type
/dev/sda1  *        2048 58720255 58718208  28G 83 Linux
/dev/sda2       58722302 67106815  8384514   4G  5 Extended
/dev/sda5       58722304 67106815  8384512   4G 82 Linux swap / Solaris

FS is ext4 for both sda and sdb linux partitions.

Now I have to understand what's going on, how I can shrink this FTP volume down, and how to prevent it from overgrowing in the future. Any help is greatly appreciated!
 
On RAIDZ, depending on the volblocksize, you can have more used disk space than logical space, due to padding. See https://www.mail-archive.com/freebsd-virtualization@freebsd.org/msg05685.html for example.

Check with zfs list -o name,used,lused,refer,ratio to get a better picture of the allocated space versus the logical space.

I'd recommend adding a fourth drive and reinstalling as RAID10 (two mirror vdevs), as mirrors don't have the padding effect you can get with RAIDZ (plus, you'll get better performance).
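
As a back-of-the-envelope illustration of where the padding goes (a sketch assuming ashift=12, i.e. 4K sectors, a 3-disk RAIDZ1 and the default 8K volblocksize; adjust the values to your own pool):

Code:
#!/bin/sh
# rough RAIDZ1 padding estimate -- all values below are assumptions, not read from the pool
SECTOR=4096; DISKS=3; PARITY=1; VOLBLOCK=8192
DATA=$(( VOLBLOCK / SECTOR ))                              # 2 data sectors per 8K block
PAR=$(( (DATA + DISKS - PARITY - 1) / (DISKS - PARITY) ))  # 1 parity sector
RAW=$(( DATA + PAR ))                                      # 3 sectors needed
PADDED=$(( (RAW + PARITY) / (PARITY + 1) * (PARITY + 1) )) # padded to a multiple of parity+1 -> 4
echo "sectors allocated per 8K block: $PADDED (ideal: $RAW), ~$(( (PADDED - RAW) * 100 / RAW ))% extra"

With these defaults it prints roughly 33% extra, which is in the same ballpark as the overhead reported later in this thread.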
 

Thank you! Well, that's the case, I guess.

Code:
root@ftp:~# zfs list -o name,used,lused,refer,ratio
NAME                       USED  LUSED     REFER  RATIO
rpool                     7.04T  5.30T      139K  1.00x
rpool/ROOT                1.60G  2.59G      128K  1.87x
rpool/ROOT/pve-1          1.60G  2.59G     1.60G  1.87x
rpool/data                7.03T  5.29T      128K  1.00x
rpool/data/vm-100-disk-0  22.0G  28.3G     22.0G  1.72x
rpool/data/vm-100-disk-1  7.01T  5.27T     7.01T  1.00x

USED vs. LUSED is actually allocated size vs. logical size, right?

The next question is to find the easiest way to solve this. Speed isn't an issue in my case: we use this FTP for media file exchange, so I/O is limited by the 1Gb internal link anyway. Raw RAID capacity isn't a problem either, because it's more than enough for any single project we have but totally insufficient for all of them. And the server itself is 1U with only 3 drive bays, so I'd like to avoid moving all the data and reassembling the RAID. My wish is to reduce the disk-1 size somehow to get 100-200GB of free space for updates and maintenance, then set up quotas of some sort to prevent future overexpansion. Is that possible at all? I deleted around 30GB of old files and it had no effect on the dataset size.
 
USED vs. LUSED is actually allocated size vs. logical size, right?
Exactly

My wish is to reduce the disk-1 size somehow to get 100-200GB of free space for updates and maintenance, then set up quotas of some sort to prevent future overexpansion. Is that possible at all? I deleted around 30GB of old files and it had no effect on the dataset size.
Reducing a volume is always tricky, and risky. You first need to reduce all the layers inside the guest, and there's no simple how-to because setups can be very different. First, shrink the FS (if that's possible at all; some, like XFS or ZFS, can't be shrunk). Then the partition. If using LVM, you also need to resize the LV and the PV. Of course, be careful not to shrink a lower layer more than the one above it (the partition must not be shrunk to a smaller size than the FS, or you'll lose your data). Only then can you shut down the guest and reduce its volume. The GUI won't let you do it, for obvious safety reasons, but you can do it with
Code:
zfs set volsize=<new size> rpool/data/vm-100-disk-1
Here again, be careful not to shrink the volume more than its container (the PV or partition in the guest), or you'll lose your data too. In any case, you shouldn't set a quota on a zvol; you only set its volsize.
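
To make the order of operations concrete, here is a rough sketch for this particular guest (ext4 on a plain partition, no LVM). The sizes are placeholders only; each layer has to stay a bit larger than the one above it, and you should have a backup before touching any of this:

Code:
# inside the guest, with the FS unmounted (e.g. from a rescue environment):
umount /ftp
e2fsck -f /dev/sdb1
resize2fs /dev/sdb1 6000G                        # 1) shrink the ext4 filesystem first
parted /dev/sdb resizepart 1 6100GiB             # 2) then the partition, kept larger than the FS
# 3) shut the guest down, then on the Proxmox host:
zfs set volsize=6200G rpool/data/vm-100-disk-1   #    the zvol stays larger than the partition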
 
Well, I tried to shrink the ext4 inside the VM to begin with. It simply ate the rest of the free space in the process. So anyone who's in the same situation and has important data in the VM: back it up first! We keep only temporary, non-unique data on this FTP, so I simply destroyed the ZFS dataset of the VM's disk. I decided to test another way of keeping this drive: I created a directory storage on the Proxmox node and then a plain raw image file for the VM disk. So far so good: copy speed over the network is the same and this image doesn't show such excessive space overhead, but I'll keep tracking the situation:

Code:
root@ftp:~# zfs list -o name,used,lused,refer,ratio
NAME                       USED  LUSED     REFER  RATIO
rpool                      923G   937G      139K  1.02x
rpool/ROOT                1.60G  2.59G      128K  1.87x
rpool/ROOT/pve-1          1.60G  2.59G     1.60G  1.87x
rpool/data                22.0G  28.3G      128K  1.72x
rpool/data/vm-100-disk-0  22.0G  28.3G     22.0G  1.72x
rpool/ftp                  899G   906G      899G  1.00x
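
For reference, this is roughly how such a directory-storage setup can be put together (the storage ID "ftpdir" and the quota value are made-up examples; the dataset name matches the output above):

Code:
zfs create rpool/ftp                                      # plain dataset, mounted at /rpool/ftp
pvesm add dir ftpdir --path /rpool/ftp --content images   # register it as a directory storage
zfs set quota=6.5T rpool/ftp                              # optional: cap it so images can't fill the pool

Since the raw image is just a file on a dataset, the dataset's USED tracks the file size, and a quota behaves the way you'd expect.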

Anyway, should I start some kind of ticket for this case?
I mean, it's too much: (7.01-5.27)/5.27 ≈ 0.33, i.e. 33% overhead over the actually used space! I don't think my setup is very exotic; it's attractive to use ZFS RAIDs with parity for "warm" tasks when you don't care about IOPS or throughput and just need them to keep running. And the files were not small or fragmented at all: fewer than 500 files under 512KB, most of them 8-64MB (DPX and audio files). Besides, the whole behaviour is strange and non-intuitive: when you allocate 1TB for a disk image you expect it to take, well, 1TB. We use ZFS RAIDZ and RAIDZ2 a lot for media storage and it has always worked out well in terms of disk space planning and usage predictability. So in this case I was hoping for the same; at worst I expected to see noticeably less usable space inside the VM, not an outwardly overgrown dataset that makes the node non-functional.
 
If you use RAIDZ you need to calculate the best volblocksize for your pool and set it instead of the default 8K (it might be 16K, 32K, 64K or 128K depending on your ashift and number of drives) before creating your first zvol. Also make sure ZFS never uses more than 80-90% of your pool or it will become slow and run into problems. A good approach is to limit the pool to 80% with ZFS's quota option so it can't exceed that.
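
In practice that means setting the block size in the ZFS storage configuration (or on the zvol at creation time) and putting a quota on the pool's root dataset. A hedged example; the 64K value, the zvol name and the 5.6T figure are illustrative, not a recommendation for this exact pool:

Code:
# block size for new zvols: Datacenter -> Storage -> <ZFS storage> -> Block Size,
# or when creating a volume by hand (Proxmox normally creates these itself):
zfs create -s -V 6400G -o volblocksize=64K rpool/data/vm-100-disk-2
# keep roughly 20% of the pool free:
zfs set quota=5.6T rpool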
 
I had the same problem on a 5x8TB disk RAIDZ2: 25% of the space was lost. It would have been the same as having RAID10 instead (the cost of one extra drive not considered). That was with the default 8K volblocksize and 128K recordsize.

If I used a 64K volblocksize I got away with about 10% of wasted space, and bumping the volblocksize to 128K would reduce the wasted space to 0%.
It's file storage with files from 200K to 10M, so this seems "correct" for my application. I did move the Postgres DB off the volume to be sure, but since the Postgres workload is quite linear on writes, it might still have been fine (it would mostly rely on the RAM cache anyway).
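
To check what an existing setup uses (a generic command, not taken from the posts above):

Code:
zfs get -r volblocksize,recordsize rpool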
 
