PVE 5.2: ZFS "No space left on device" despite 4.55T avail

hi,

we are seeing very strange behavior on one of our Proxmox 5.x hosts. We had very bad I/O performance with PVE 5.1 on 8 x 1TB WD Red in a RAIDZ-2 installation. After an upgrade to the latest 5.2 the bad performance was gone, but then something very strange happened: one of our VMs was stopped. This VM is a Galera MariaDB 10.2 node, and the moment the (r)sync starts, the VM gets stopped.
After a while we noticed that the daemon.log on the hypervisor contains a bunch of "No space left on device" errors from all kinds of programs, like Icinga2, Telegraf and ... qm.
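For anyone hitting the same thing: the quickest way to spot these messages is simply to grep the logs; the path and pattern below are just the obvious ones, nothing special:

Code:
# all ENOSPC complaints in the classic syslog file
grep "No space left on device" /var/log/daemon.log
# or via the journal
journalctl --since today | grep "No space left on device"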

Our pool:

Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 6h2m with 0 errors on Tue May 22 18:26:34 2018
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            wwn-0x50014ee6b2018ccf-part2  ONLINE       0     0     0
            wwn-0x50014ee6b20172e8-part2  ONLINE       0     0     0
            wwn-0x50014ee60756e8c2-part2  ONLINE       0     0     0
            wwn-0x50014ee059d11375-part2  ONLINE       0     0     0
            wwn-0x50014ee607571855-part2  ONLINE       0     0     0
            wwn-0x50014ee6b2017ac3-part2  ONLINE       0     0     0
            wwn-0x50014ee6080042d1-part2  ONLINE       0     0     0
            wwn-0x50014ee0047bfc8d-part2  ONLINE       0     0     0

errors: No known data errors
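
For completeness, pool-level allocation and fragmentation can be checked with the standard zpool properties (shown here only as a pointer):

Code:
# pool capacity and fragmentation at a glance
zpool list -o name,size,allocated,free,fragmentation,capacity rpool
# per-vdev breakdown
zpool list -v rpool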

I decided to drop rpool/pve-container after exporting all VMs, and then tried to restore them again. The first VM I chose (104, the Galera node) was restored successfully, but the next restore (103) failed with "No space left on device".

After a few minutes, I tried again, without ANY changes, just fired up the same command ... and it was O.K.
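For reference, "the same command" was simply the qmrestore call for VM 103, something along these lines (archive name and storage ID are placeholders, not the exact ones used):

Code:
# restore VM 103 from its vzdump archive onto the ZFS storage
qmrestore /var/lib/vz/dump/vzdump-qemu-103.vma.lzo 103 --storage local-zfs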

Code:
 # zfs list -o space
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                              4.53T   475G        0B    205K             0B       475G
rpool/ROOT                         4.53T   113G        0B    205K             0B       113G
rpool/ROOT/pve-1                   4.53T   113G        0B    113G             0B         0B
rpool/pve-container                4.53T   352G        0B    205K             0B       352G
rpool/pve-container/vm-103-disk-1  4.53T  11.9G        0B   11.9G             0B         0B
rpool/pve-container/vm-103-disk-2  4.53T  10.3G        0B   5.80G          4.52G         0B
rpool/pve-container/vm-104-disk-1  4.54T  20.6G        0B   12.0G          8.64G         0B
rpool/pve-container/vm-104-disk-2  4.56T   309G        0B    274G          35.1G         0B
rpool/swap                         4.54T  8.50G        0B    257M          8.25G         0B
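
The USEDREFRESERV column is the space held back by the refreservation on the (non-sparse) zvols, not data that is actually written. The per-volume reservations can be checked with a plain zfs get, e.g.:

Code:
# reservation settings on the container dataset and its volumes
zfs get -r refreservation,volsize,used rpool/pve-container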

In the beginning, we had a few snapshots, but deleted them:

Code:
# zfs list -t snapshot
no datasets available

I was able to restore all VMs in the end, but that first failure indicates that the problem still exists. I have no clue what is happening with ZFS or whatever else is involved.

Any suggestions?
 
hi,

now it has happened again .. my MariaDB node was killed, and on the Proxmox host I have:

Code:
icinga2[4025]: [2018-05-22 23:16:04 +0200] warning/PluginCheckTask: Check command for object 'qh-a07-pmox-06.inatec.com' (PID: -1, arguments: '/usr/lib/nagios/plugins/check_load' '-c' '8,6.4,4.8' '-w' '7.2,5.6,4') terminated with exit code 128, output: Fork failed with error code 28 (No space left on device)

very strange.

Maybe I hit: https://github.com/zfsonlinux/zfs/issues/7401
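To compare against that report, this is how I'd read off the kernel and ZFS module versions actually running on the box (generic commands, nothing PVE-specific):

Code:
# running kernel and loaded ZFS module version
uname -r
cat /sys/module/zfs/version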
 
what is the output of
Code:
zfs list
df -h
?

edit: also pveversion -v
 
As for the bad I/O: your RAIDZ2 itself is the problem. Look at the attached picture for the allocation overhead.

Are you running out of space inside the VM (KVM?) or on the host?
 

Attachments

  • raidz2.png (allocation overhead)
Hello,

I have to say sorry: the VMs had in sum less than 1TB of used space. df -h showed plenty (about 4TB) free for usage. I also had a closer look at RAM, /tmp, /run ... and some more ... nothing. All snapshots (there were only two) were deleted .... still nothing.
But this host was/is so important that I had to decide what to do next: I chose to export the VMs and reinstall the host with PVE 5.2 via IPMI. I also chose a different RAIDZ layout: 2 x RAIDZ-1 in one rpool. I think the system is a bit faster now ...
The important point in this setup is that we have the exact same host / setup in our second datacenter. The only difference could be an older PVE version (I have to take a look later), maybe with an older ZFS kernel module.
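Just to illustrate the new layout (disk names below are placeholders, and this is only a sketch of the topology, not the exact commands used): the pool is simply two 4-disk RAIDZ-1 vdevs striped together:

Code:
# one pool built from two raidz1 vdevs (data is striped across both)
zpool create rpool \
    raidz1 /dev/disk/by-id/diskA /dev/disk/by-id/diskB /dev/disk/by-id/diskC /dev/disk/by-id/diskD \
    raidz1 /dev/disk/by-id/diskE /dev/disk/by-id/diskF /dev/disk/by-id/diskG /dev/disk/by-id/diskH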

@Nemesiz Running out of space on the host, not in the VM; the VM was killed because of that, just like other processes.

cu denny
 
The issue you think you hit is more related to Linux kernel 3.x and maybe older versions.

Do this test:

Code:
# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i; done; find SRC | wc -l
# for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
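The idea is just to hammer the pool with lots of small files and copies: if the ENOSPC problem is still there, the echo or cp calls should start failing (and the find counts come up short) even though zfs list still shows terabytes of free space.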
 