Hi, I have a strange case to share.
We have three clusters (production, contingency and develop), all of them running PVE 7.4 and Ceph 16.2.15.
We use the contingency cluster to share ISO images for VM installations, mounting its CephFS as a storage object from the PVE GUI.
Recently, the VM-creation wizard on the develop cluster started to hang with any ISO image. Only on that cluster.
We also tried copying files from the CephFS mounted at /mnt/pve/Instaladores/template/iso, and the behavior is as follows.
Error shown when aborting the VM creation after several minutes of "working":
Code:
command '/usr/bin/qemu-img info '--output=json' /mnt/pve/Instaladores/template/iso/debian-12.11.0-amd64-netinst.iso' failed: received interrupt
could not parse qemu-img info command output for '/mnt/pve/Instaladores/template/iso/debian-12.11.0-amd64-netinst.iso' - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Storage/Plugin.pm line 946.
TASK ERROR: unable to create VM 3048 - volume Instaladores:iso/debian-12.11.0-amd64-netinst.iso does not exist
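The failing qemu-img call from the task error can also be run by hand (same path as in the error above); presumably it blocks the same way the plain cp below does:
Code:
[root@PVE-DEV /]# /usr/bin/qemu-img info --output=json /mnt/pve/Instaladores/template/iso/debian-12.11.0-amd64-netinst.iso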
Copying from the console also hangs:
Code:
[root@PVE-DEV /]# cp /mnt/pve/Instaladores/template/iso/debian-12.11.0-amd64-netinst.iso ~/
Copying files of 998 KiB or smaller works just fine:
Code:
[root@PVE-DEV /]# dd if=/dev/zero of=998K.txt bs=998K count=1
1+0 records in
1+0 records out
1021952 bytes (1.0 MB, 998 KiB) copied, 0.00361033 s, 283 MB/s
[root@PVE-DEV /]# cp 998K.txt /mnt/pve/Instaladores/template/iso/
Copying files of 999 KiB or greater fails:
Code:
[root@PVE-DEV /]# dd if=/dev/zero of=999K.txt bs=999K count=1
1+0 records in
1+0 records out
1022976 bytes (1.0 MB, 999 KiB) copied, 0.0154679 s, 66.1 MB/s
[root@PVE-DEV /]# cp 999K.txt /mnt/pve/Instaladores/template/iso/
On another console, the copy process shows up in uninterruptible sleep (D state):
Code:
[root@PVE-DEV /]# ps ax | grep 999K
4130433 pts/0 D+ 0:00 cp 999K.txt /mnt/pve/Instaladores/template/iso/
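To see where the stuck cp is blocked inside the kernel, something like this can be checked (PID taken from the ps output above; any hung-task or libceph messages would also land in dmesg):
Code:
[root@PVE-DEV /]# cat /proc/4130433/stack
[root@PVE-DEV /]# dmesg | grep -iE 'libceph|ceph|hung task'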
We have tried upgrading Ceph to the latest version, restarting the MDS services, restarting each node, removing and recreating the storage object... everything. No errors were logged.
/etc/pve/storage.cfg
Code:
cephfs: Instaladores
        path /mnt/pve/Instaladores
        content iso
        fs-name cephfs
        monhost 10.x.x.x 10.x.x.x 10.x.x.x
        prune-backups keep-all=1
        username admin
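For reference, the effective mount on the client can be checked like this, to see whether the kernel CephFS client or ceph-fuse is being used for this storage:
Code:
[root@PVE-DEV /]# mount | grep /mnt/pve/Instaladores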
Versions:
Bash:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 16.2.15-pve1
ceph-fuse: 16.2.15-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
On the server side...
Code:
[root@node1 ~]# ceph status
  cluster:
    id:     cdab5d3f-42a0-4cba-8e91-7a79c10404ed
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum node1,node2,node3,node4 (age 4d)
    mgr: node3(active, since 4d), standbys: node2, node1
    mds: 1/1 daemons up, 2 standby
    osd: 24 osds: 24 up (since 4d), 24 in (since 3M)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 169 pgs
    objects: 491.38k objects, 1.7 TiB
    usage:   5.0 TiB used, 15 TiB / 20 TiB avail
    pgs:     169 active+clean

  io:
    client:   37 KiB/s rd, 2.9 MiB/s wr, 9 op/s rd, 379 op/s wr
Code:
[root@PVE-CONT /]# ceph fs get cephfs
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 264
flags 12
created 2020-12-02T11:21:08.352970-0300
modified 2025-05-29T14:16:11.172011-0300
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 46897
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=277564840}
failed
damaged
stopped
data_pools [7]
metadata_pool 8
inline_data disabled
balancer
standby_count_wanted 1
[mds.node1{0:277564840} state up:active seq 7 addr [v2:10.6.25.4:6800/1338729611,v1:10.6.25.4:6801/1338729611] compat {c=[1],r=[1],i=[7ff]}]
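Since the MDS stays up:active while the client hangs, a possible next check on the server side would be to look for stuck requests on the active MDS (run on node1, assuming its admin socket is reachable):
Code:
[root@node1 ~]# ceph health detail
[root@node1 ~]# ceph daemon mds.node1 dump_ops_in_flight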