PVE 5.2: ZFS "No space left on device" despite 4.55T avail

hi,

we are seeing very strange behavior on one of our Proxmox 5.x hosts. We had very bad I/O performance with PVE 5.1 on 8 x 1TB WD Red in a RAIDZ-2 installation. After an upgrade to the latest 5.2 the bad performance was gone, but then something very strange happened: one of our VMs was stopped. This VM is a Galera MariaDB 10.2 node, and the moment the (r)sync starts, the VM gets stopped.
After a while we noticed that the daemon.log on the hypervisor contains a bunch of "No space left on device" errors from all kinds of programs, like Icinga2, Telegraf and ... qm.
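For anyone hitting the same thing: the quickest way to spot these messages is simply to grep the logs; the path and pattern below are just the obvious ones, nothing special:

Code:
# all ENOSPC complaints in the classic syslog file
grep "No space left on device" /var/log/daemon.log
# or via the journal
journalctl --since today | grep "No space left on device"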

Our pool:

Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 6h2m with 0 errors on Tue May 22 18:26:34 2018
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            wwn-0x50014ee6b2018ccf-part2  ONLINE       0     0     0
            wwn-0x50014ee6b20172e8-part2  ONLINE       0     0     0
            wwn-0x50014ee60756e8c2-part2  ONLINE       0     0     0
            wwn-0x50014ee059d11375-part2  ONLINE       0     0     0
            wwn-0x50014ee607571855-part2  ONLINE       0     0     0
            wwn-0x50014ee6b2017ac3-part2  ONLINE       0     0     0
            wwn-0x50014ee6080042d1-part2  ONLINE       0     0     0
            wwn-0x50014ee0047bfc8d-part2  ONLINE       0     0     0

errors: No known data errors
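
For completeness, pool-level allocation and fragmentation can be checked with the standard zpool properties (shown here only as a pointer):

Code:
# pool capacity and fragmentation at a glance
zpool list -o name,size,allocated,free,fragmentation,capacity rpool
# per-vdev breakdown
zpool list -v rpool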

I decided to drop rpool/pve-container after exporting all VMs, and then tried to restore them again. The first VM I chose (104, the Galera node) was restored successfully, but the next restore (103) failed with "No space left on device".

After a few minutes, I tried again, without ANY changes, just fired up the same command ... and it was O.K.
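For reference, "the same command" was simply the qmrestore call for VM 103, something along these lines (archive name and storage ID are placeholders, not the exact ones used):

Code:
# restore VM 103 from its vzdump archive onto the ZFS storage
qmrestore /var/lib/vz/dump/vzdump-qemu-103.vma.lzo 103 --storage local-zfs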

Code:
 # zfs list -o space
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                              4.53T   475G        0B    205K             0B       475G
rpool/ROOT                         4.53T   113G        0B    205K             0B       113G
rpool/ROOT/pve-1                   4.53T   113G        0B    113G             0B         0B
rpool/pve-container                4.53T   352G        0B    205K             0B       352G
rpool/pve-container/vm-103-disk-1  4.53T  11.9G        0B   11.9G             0B         0B
rpool/pve-container/vm-103-disk-2  4.53T  10.3G        0B   5.80G          4.52G         0B
rpool/pve-container/vm-104-disk-1  4.54T  20.6G        0B   12.0G          8.64G         0B
rpool/pve-container/vm-104-disk-2  4.56T   309G        0B    274G          35.1G         0B
rpool/swap                         4.54T  8.50G        0B    257M          8.25G         0B
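
The USEDREFRESERV column is the space held back by the refreservation on the (non-sparse) zvols, not data that is actually written. The per-volume reservations can be checked with a plain zfs get, e.g.:

Code:
# reservation settings on the container dataset and its volumes
zfs get -r refreservation,volsize,used rpool/pve-container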

In the beginning, we had a few snapshots, but deleted them:

Code:
# zfs list -t snapshot
no datasets available

I was able to restore all VMs in the end, but that first failure indicates that the problem still exists. I have no clue what is happening with ZFS or whatever else is involved.

Any suggestions?
 
hi,

now it has happened again .. my MariaDB node was killed, and on the Proxmox host I have:

Code:
icinga2[4025]: [2018-05-22 23:16:04 +0200] warning/PluginCheckTask: Check command for object 'qh-a07-pmox-06.inatec.com' (PID: -1, arguments: '/usr/lib/nagios/plugins/check_load' '-c' '8,6.4,4.8' '-w' '7.2,5.6,4') terminated with exit code 128, output: Fork failed with error code 28 (No space left on device)

very strange.

Maybe I hit: https://github.com/zfsonlinux/zfs/issues/7401
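To compare against that report, this is how I'd read off the kernel and ZFS module versions actually running on the box (generic commands, nothing PVE-specific):

Code:
# running kernel and loaded ZFS module version
uname -r
cat /sys/module/zfs/version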
 
what is the output of
Code:
zfs list
df -h
?

edit: also pveversion -v
 
As for the bad I/O: your RAIDZ2 itself is the problem. Look at the attached picture for the allocation overhead.

Are you running out of space inside the VM (KVM?) or on the host?
 

Attachments

  • raidz2.png (allocation overhead)
Hello,

I have to say sorry: the VMs had in sum less than 1TB of used space. df -h showed plenty (about 4TB) free for usage. I also had a closer look at RAM, /tmp, /run ... and some more ... nothing. All snapshots (there were only two) were deleted .... still nothing.
But this host was/is so important that I had to decide what to do next: I chose to export the VMs and reinstall the host with PVE 5.2 via IPMI. I also chose a different RAIDZ layout: 2 x RAIDZ-1 in one rpool. I think the system is a bit faster now ...
The important point in this setup is that we have the exact same host / setup in our second datacenter. The only difference could be an older PVE version (I have to take a look later), maybe with an older ZFS kernel module.
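Just to illustrate the new layout (disk names below are placeholders, and this is only a sketch of the topology, not the exact commands used): the pool is simply two 4-disk RAIDZ-1 vdevs striped together:

Code:
# one pool built from two raidz1 vdevs (data is striped across both)
zpool create rpool \
    raidz1 /dev/disk/by-id/diskA /dev/disk/by-id/diskB /dev/disk/by-id/diskC /dev/disk/by-id/diskD \
    raidz1 /dev/disk/by-id/diskE /dev/disk/by-id/diskF /dev/disk/by-id/diskG /dev/disk/by-id/diskH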

@Nemesiz Running out of space on the host, not in the VM; the VM was killed because of that, just like other processes.

cu denny
 
The issue you think you hit is more related to Linux kernel 3.x and maybe older versions.

Do this test:

Code:
# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i; done; find SRC | wc -l
# for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
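The idea is just to hammer the pool with lots of small files and copies: if the ENOSPC problem is still there, the echo or cp calls should start failing (and the find counts come up short) even though zfs list still shows terabytes of free space.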
 