Issue with large ZFS volume, CIFS share and VM storage

cinematheque

New Member
Feb 16, 2021
We have a working 4-host PVE 6.3-3 cluster installed with ZFS. All of these machines are enterprise gear in working condition. With them we archive, digitize and produce very large video files in different formats. These are often intensive processes that require a lot of resources and generate heavy disk usage.

Three of the hosts have a 2 TB zpool made of 2 mirrors (2 x 2 mirrored disks), without a separate log disk. The 4th host is special because it contains 3 zpools:

  • zpool10, a 54 TiB zpool made of 9 mirrors + 1 NVMe mirror for log (see the sketch after this list)
  • zpool11, a 130 TiB zpool made of 7 raidz1 vdevs + 1 NVMe mirror for log
  • zpool12, a 29 TiB zpool made of 3 mirrors with no log disk
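
For reference, a pool with the zpool10 layout would typically have been created with something along these lines (only two of the nine data mirrors are shown, and the device names are placeholders rather than our actual disks):

Code:
# Striped mirrors (only 2 of 9 shown) plus a mirrored NVMe log device.
zpool create zpool10 \
    mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
    mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 \
    log mirror /dev/disk/by-id/NVME1-part1 /dev/disk/by-id/NVME2-part1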

The 4th host is the central piece of this setup because it serves different purposes. zpool10 contains 2 datasets for VM and LXC storage, and 4 datasets which are shared over the network via net usershare (Samba), accessible from our office and mounted on different VMs. zpool11 contains 4 other datasets which are also shared via net usershare. zpool12 contains a dedicated dataset for Borg backups but isn't used at the moment. The shared folders are used by CentOS 7 Linux and macOS High Sierra machines.
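
For context, these shares are plain Samba usershares; registering one looks roughly like this (the share name, comment and ACL below are only illustrative, not our exact configuration):

Code:
# Export a dataset mountpoint as a Samba usershare (illustrative values).
net usershare add AV /srv/AV "AV projects" "Everyone:F" guest_ok=n

# List the usershares currently defined on the host:
net usershare list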

For a while now, we have been observing different problems. The first thing we noticed is that when a machine performs an intensive task, for example transcoding a large video file or even deleting a large ZFS snapshot, the network shares disconnect and become inaccessible. More recently we noticed that some VMs and containers hang for a while during intensive tasks. In those cases we can stop the VM, but when we try to start it again we get a systemd timeout. The "solution" is often to wait for the operation to finish, then try to start the VM again / restart smbd.service. We have tried moving the LXC disks to a separate LVM volume, and are considering doing the same for our VMs.

When this happens, all datasets and zpools are affected, regardless of which dataset the intensive operation runs on.

These problems are very annoying because it means that when we perform a demanding task we must stop using the network shares and potentially some of the VMs. For now, we just want to know whether these problems are caused by a known bug or limitation in Proxmox itself (which I doubt), or whether we set some things up wrong and have to rethink our infrastructure. The zpools aren't full, we do not reach 100% IO delay and we do not run out of RAM. It just looks like everything freezes even though we do not reach the limits of our machines. Most importantly, I can't find any relevant logs when these problems happen.

Do any of you have similar issues with a similar setup? What would you do differently from us?
Thanks!
 
Hi @cinematheque

Some ideas:

  • It is not ideal to have many different zpools on the same host/server.
  • If a big IO load hits one zpool, it will impact all datasets on that pool.
  • If you use a Samba share for hosting your VMs/CTs, the performance is not great.
The first thing we noticed is that when a machine performs an intensive task, for example transcoding a large video file or even deleting a large ZFS snapshot, the network shares disconnect and become inaccessible.

Absolutely normal. A big IO load will affect the ZFS pool and then any CT/VM that uses it (whether through Samba or anything else).


Without any other details (zpool status -v, arc_summary, your VM or CT configs, and so on) it is hard to say what you could do better. And your environment seems very complicated to me, to say the least (I like to use simple setups = KISS)! With a complicated setup it is very difficult to optimise anything. Going only on intuition, I would say that your storage/cluster has a big IOPS problem (for your own workload).
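
For reference, the kind of details I mean can be gathered with commands like these (the VMIDs below are just examples):

Code:
zpool status -v     # pool layout and health
arc_summary         # ARC statistics
qm config 100       # configuration of an example VM
pct config 101      # configuration of an example CT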


Good luck / Bafta !
 
Hi @guletz , thanks for your feedback !

To clarify: VMs / CTs are on local storage, on separate datasets. There are also Samba-shared datasets on the same zpools. I'll post my zpool status.

Basically, what you're saying is that we would be better off putting everything on the same zpool instead of segregating things onto multiple zpools?

zpool status -v : https://pb.cfav.fr/?16da4bf321b14ce8#GSphj4LFu54QHvQt437Nm8cQDNuXhYkgN7YQCuzYeMhL

arc_summary : https://pb.cfav.fr/?729d42004576a451#7WRPxb8WFLXbprjShhRL3jH6BtVfRxQBmC5NzWsddGKZ
 
Hi,

Can you show the output for this dataset?
No problem!

zfs get all zpool11/AV:

Code:
NAME        PROPERTY              VALUE                                                                               SOURCE
zpool11/AV  type                  filesystem                                                                          -
zpool11/AV  creation              dim. juil. 12 13:18 2020                                                            -
zpool11/AV  used                  10,2T                                                                               -
zpool11/AV  available             9,76T                                                                               -
zpool11/AV  referenced            9,02T                                                                               -
zpool11/AV  compressratio         1.05x                                                                               -
zpool11/AV  mounted               yes                                                                                 -
zpool11/AV  quota                 20T                                                                                 local
zpool11/AV  reservation           none                                                                                default
zpool11/AV  recordsize            128K                                                                                default
zpool11/AV  mountpoint            /srv/AV                                                                             local
zpool11/AV  sharenfs              rw=XX.XX.XX.XX/24:XX.XX.XX.XX/16,insecure,all_squash,anonuid=100001,anongid=100001  local
zpool11/AV  checksum              on                                                                                  default
zpool11/AV  compression           lz4                                                                                 inherited from zpool11
zpool11/AV  atime                 off                                                                                 inherited from zpool11
zpool11/AV  devices               on                                                                                  default
zpool11/AV  exec                  on                                                                                  default
zpool11/AV  setuid                on                                                                                  default
zpool11/AV  readonly              off                                                                                 default
zpool11/AV  zoned                 off                                                                                 default
zpool11/AV  snapdir               hidden                                                                              default
zpool11/AV  aclinherit            restricted                                                                          default
zpool11/AV  createtxg             58068                                                                               -
zpool11/AV  canmount              on                                                                                  default
zpool11/AV  xattr                 sa                                                                                  local
zpool11/AV  copies                1                                                                                   default
zpool11/AV  version               5                                                                                   -
zpool11/AV  utf8only              off                                                                                 -
zpool11/AV  normalization         none                                                                                -
zpool11/AV  casesensitivity       sensitive                                                                           -
zpool11/AV  vscan                 off                                                                                 default
zpool11/AV  nbmand                off                                                                                 default
zpool11/AV  sharesmb              off                                                                                 local
zpool11/AV  refquota              none                                                                                default
zpool11/AV  refreservation        none                                                                                default
zpool11/AV  guid                  15673761124630707355                                                                -
zpool11/AV  primarycache          all                                                                                 default
zpool11/AV  secondarycache        all                                                                                 default
zpool11/AV  usedbysnapshots       1,22T                                                                               -
zpool11/AV  usedbydataset         9,02T                                                                               -
zpool11/AV  usedbychildren        0B                                                                                  -
zpool11/AV  usedbyrefreservation  0B                                                                                  -
zpool11/AV  logbias               latency                                                                             default
zpool11/AV  objsetid              33923                                                                               -
zpool11/AV  dedup                 on                                                                                  inherited from zpool11
zpool11/AV  mlslabel              none                                                                                default
zpool11/AV  sync                  standard                                                                            inherited from zpool11
zpool11/AV  dnodesize             legacy                                                                              default
zpool11/AV  refcompressratio      1.05x                                                                               -
zpool11/AV  written               0                                                                                   -
zpool11/AV  logicalused           10,8T                                                                               -
zpool11/AV  logicalreferenced     9,52T                                                                               -
zpool11/AV  volmode               default                                                                             default
zpool11/AV  filesystem_limit      none                                                                                default
zpool11/AV  snapshot_limit        none                                                                                default
zpool11/AV  filesystem_count      none                                                                                default
zpool11/AV  snapshot_count        none                                                                                default
zpool11/AV  snapdev               hidden                                                                              default
zpool11/AV  acltype               off                                                                                 default
zpool11/AV  context               none                                                                                default
zpool11/AV  fscontext             none                                                                                default
zpool11/AV  defcontext            none                                                                                default
zpool11/AV  rootcontext           none                                                                                default
zpool11/AV  relatime              off                                                                                 default
zpool11/AV  redundant_metadata    all                                                                                 default
zpool11/AV  overlay               off                                                                                 default
zpool11/AV  encryption            off                                                                                 default
zpool11/AV  keylocation           none                                                                                default
zpool11/AV  keyformat             none                                                                                default
zpool11/AV  pbkdf2iters           0                                                                                   default
zpool11/AV  special_small_blocks  0                                                                                   default
 
....
zpool11/AV recordsize 128K

I think this is not ideal for this use case. Because you have huge video files on it (my guess), I would try to set recordsize=[4-16] MB. It makes no sense to use a very small record size (=> a huge number of metadata blocks).
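
As a rough sketch, using the dataset name from your output: the change only applies to data written afterwards, and record sizes above 1M may require raising the zfs_max_recordsize module parameter first, depending on the OpenZFS version.

Code:
# Raise the record size for the large-video dataset (affects new writes only).
zfs set recordsize=1M zpool11/AV

# For record sizes above 1M (version dependent):
#   echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize
#   zfs set recordsize=4M zpool11/AV

# Verify the setting:
zfs get recordsize zpool11/AV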
 
Now, I see this data from your pool:

Cache hits by data type:
Demand data: 2.1 % 390.9M
Demand prefetch data: < 0.1 % 4.7M
Demand metadata: 97.8 % 18.1G
Demand prefetch metadata: < 0.1 % 7.8M


Cache misses by data type:
Demand data: 38.6 % 648.8M
Demand prefetch data: 3.7 % 62.3M
Demand metadata: 56.9 % 956.6M
Demand prefetch metadata: 0.8 % 13.9M

Cache hits by cache type:
Most frequently used (MFU): 89.0 % 16.4G
Most recently used (MRU): 11.0 % 2.0G
Most frequently used (MFU) ghost: 2.0 % 360.7M
Most recently used (MRU) ghost: 0.9 % 169.9M



So, your ARC is mostly used for metadata! In that case, why not set up your pool to cache ONLY metadata? The problem you see with your pool is that it spends a lot of time reading/writing metadata (partly from the ARC, the rest from disk), so during high IO tasks the pool appears to hang (when you remove snapshots, a lot of metadata needs to be read/written).
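
A minimal sketch of that, again using the dataset name from your output. Data reads would then bypass the ARC entirely, so I would test it on one dataset before applying it more widely.

Code:
# Keep only metadata in the ARC for this dataset.
zfs set primarycache=metadata zpool11/AV

# Check the current value:
zfs get primarycache zpool11/AV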

The MFU/MRU ghost lists show how much data could be cached on a ZFS cache device (you do not have any cache disk, as far as I can see)!

nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NG0M902301K-part1
- this is huge (2 TB) for a dedicated SLOG device
- maybe you could also try to use a small part of it as a zpool cache (L2ARC); see the sketch below
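
Something along these lines, assuming a second partition is carved out of the NVMe for the cache (the "-part2" name is hypothetical), and using zpool11 only as an example pool name:

Code:
# Add part of the NVMe as an L2ARC cache device (hypothetical partition name).
zpool add zpool11 cache /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NG0M902301K-part2

# A cache device can be removed again at any time without data loss:
#   zpool remove zpool11 nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NG0M902301K-part2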

A special device for metadata only would be a better setup for your pool; see details here:

https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
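
For illustration, attaching a special vdev would look roughly like this (device names are placeholders): it must be mirrored, because losing the special vdev means losing the pool, and only metadata written after it is added will land on it.

Code:
# Add a mirrored special vdev for metadata (placeholder device names).
zpool add zpool11 special mirror /dev/disk/by-id/SSD_A /dev/disk/by-id/SSD_B

# Optionally let small records land on the special vdev as well:
#   zfs set special_small_blocks=64K zpool11/AV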

Good luck / Bafta !
 
The MFU/MRU ghost lists show how much data could be cached on a ZFS cache device (you do not have any cache disk, as far as I can see)!

nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NG0M902301K-part1
- this is huge (2 TB) for a dedicated SLOG device
- maybe you could also try to use a small part of it as a zpool cache (L2ARC)

A special device for metadata only would be a better setup for your pool; see details here:

https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

Thanks for your input, that's interesting; I think it can be part of the solution. I had never heard of a special device, I'm going to read up on it.
 
