hi,
we're seeing very strange behavior on one of our Proxmox 5.x hosts. We had very bad I/O performance with PVE 5.1 on 8 x 1 TB WD Red drives in a RAIDZ-2 setup. After upgrading to the latest 5.2 the bad performance was gone, but we ran into something very strange: one of our VMs gets stopped. The VM is a Galera MariaDB 10.2 node, and the moment the (r)sync starts, the VM is stopped.
After a while we noticed that daemon.log on the hypervisor contains a bunch of "No space left on device" errors from various programs, such as Icinga2, Telegraf and ... qm.
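To narrow down which filesystem is actually reporting "No space left on device" (it isn't obvious from daemon.log alone), a generic first check would be something like the following; nothing here is specific to our setup:
Code:
# which mountpoint claims to be full, and whether inodes are exhausted
df -h
df -i
# quick pool-level view of allocated vs. free space
zpool list rpool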
Our pool:
Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 6h2m with 0 errors on Tue May 22 18:26:34 2018
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            wwn-0x50014ee6b2018ccf-part2  ONLINE       0     0     0
            wwn-0x50014ee6b20172e8-part2  ONLINE       0     0     0
            wwn-0x50014ee60756e8c2-part2  ONLINE       0     0     0
            wwn-0x50014ee059d11375-part2  ONLINE       0     0     0
            wwn-0x50014ee607571855-part2  ONLINE       0     0     0
            wwn-0x50014ee6b2017ac3-part2  ONLINE       0     0     0
            wwn-0x50014ee6080042d1-part2  ONLINE       0     0     0
            wwn-0x50014ee0047bfc8d-part2  ONLINE       0     0     0

errors: No known data errors
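One thing that confused me here: on RAIDZ-2 the pool-level numbers and the dataset-level numbers measure different things. As far as I understand, zpool list reports raw space including parity, while zfs list shows usable space, so comparing both might help (the column selection below is just an example):
Code:
# raw pool space (includes RAIDZ parity), capacity and fragmentation
zpool list -o name,size,alloc,free,cap,frag,health rpool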
I decided to destroy rpool/pve-container after exporting all VMs, and then tried to restore them. The first VM I chose (104, the Galera node) was restored successfully, but the next restore (103) failed with "No space left on device".
A few minutes later I tried again, without ANY changes, just fired off the same command ... and it was OK.
Code:
# zfs list -o space
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                              4.53T   475G        0B    205K             0B       475G
rpool/ROOT                         4.53T   113G        0B    205K             0B       113G
rpool/ROOT/pve-1                   4.53T   113G        0B    113G             0B         0B
rpool/pve-container                4.53T   352G        0B    205K             0B       352G
rpool/pve-container/vm-103-disk-1  4.53T  11.9G        0B   11.9G             0B         0B
rpool/pve-container/vm-103-disk-2  4.53T  10.3G        0B   5.80G          4.52G         0B
rpool/pve-container/vm-104-disk-1  4.54T  20.6G        0B   12.0G          8.64G         0B
rpool/pve-container/vm-104-disk-2  4.56T   309G        0B    274G          35.1G         0B
rpool/swap                         4.54T  8.50G        0B    257M          8.25G         0B
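The USEDREFRESERV column makes me suspect the zvol refreservations: the VM disks are thick-provisioned zvols, and if I understand it correctly, creating a zvol on RAIDZ-2 has to guarantee the whole refreservation including parity/padding overhead, so a restore could fail with ENOSPC even though AVAIL looks huge. Checking the reservation-related properties of a disk would look something like this (vm-104-disk-2 is just taken as an example from our pool):
Code:
# reservation and volume geometry of one VM disk
zfs get refreservation,usedbyrefreservation,volsize,volblocksize rpool/pve-container/vm-104-disk-2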
In the beginning, we had a few snapshots, but deleted them:
Code:
# zfs list -t snapshot
no datasets available
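Since the retry worked a few minutes later without any changes, I wonder whether the space from the destroyed rpool/pve-container datasets was still being released asynchronously when the first restore ran. If I read the documentation right, the pool's freeing property shows how many bytes are still pending after a destroy, so this is what I would watch during the next attempt:
Code:
# bytes still being freed asynchronously after dataset/snapshot destroys
zpool get freeing rpool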
I was able to restore all VMs, but that first failure indicates the problem still exists. I have no clue what is going on, whether it is ZFS or something else.
Any suggestions?