Input / output error

Thanx for the hint - I'll check that as soon as I can access the GUI again. Needs a restart ...
 
Hi, guys,

so, back again after a while - it is a server in production, so we can't restart it whenever we want ... yesterday evening we rebooted the server. After that everything looked fine for a while - both VMs running, the Web-GUI accessible.

Jan 13 19:23:38 pve systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.

root@pve:/etc/pve/nodes/pve# qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
100 server running 16384 782.00 1756
101 server running 16384 315.00 1891

But shortly after midnight:

Jan 14 00:35:48 pve pmxcfs[1603]: [database] crit: commit transaction failed: database or disk is full#010
Jan 14 00:35:48 pve pmxcfs[1603]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

And, of course, the second VM (101) is no longer reachable via RDP, and these error messages appear in the logs:

2025-01-14T06:31:47.888316+01:00 pve pve-ha-lrm[1737]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1737' - Input/output error
2025-01-14T06:31:52.889111+01:00 pve pve-ha-lrm[1737]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1737' - Input/output error
client_loop: send disconnect: Broken pipe

Maybe the hint about a backup job temporarily filling up space is a good one - but without a Web-GUI I can't check that, can I?! I already stumbled over the fact (?!) that it does not seem possible to back up VMs to an external USB-drive; at least, I found no solution that works without the Web-GUI. IIRC that was far easier under Xen or raw KVM, even under Hyper-V .... OK, another discussion.
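
Side note: vzdump itself can be run from the shell, so a backup to a USB-drive does not strictly need the Web-GUI. A rough sketch only - the mount point /mnt/usb and the storage name "usb" are placeholders here:

Code:
# back up VM 101 to a configured storage called "usb"
vzdump 101 --storage usb --mode snapshot --compress zstd

# or dump straight into a directory, without any storage definition
vzdump 101 --dumpdir /mnt/usb/dump --mode snapshot --compress zstd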

/tmp is not full at the moment:

root@pve:/# cd /tmp/
root@pve:/tmp# df -h .
Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/pve-1 1.2T 869G 302G 75% /

Another check:

root@pve:/# cd /etc/pve/nodes/pve/
root@pve:/etc/pve/nodes/pve# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/fuse 128M 16K 128M 1% /etc/pve

root@pve:/etc/pve/nodes/pve# ls -la
-rw-r----- 1 root www-data 83 Jan 14 00:35 lrm_status
drwxr-xr-x 2 root www-data 0 Sep 20 13:30 lxc
drwxr-xr-x 2 root www-data 0 Sep 20 13:30 openvz
drwx------ 2 root www-data 0 Sep 20 13:30 priv
-rw-r----- 1 root www-data 1704 Sep 20 13:30 pve-ssl.key
-rw-r----- 1 root www-data 1793 Sep 20 13:30 pve-ssl.pem
drwxr-xr-x 2 root www-data 0 Sep 20 13:30 qemu-server
-rw-r----- 1 root www-data 556 Jan 13 19:23 ssh_known_hosts

root@pve:/etc/pve/nodes/pve# touch testdatei
touch: cannot touch 'testdatei': Input/output error

What the hell is going on here? One VM (100) is up, running and reachable, the other (101) is not accessible, and neither is the Web-GUI ... My first thought was "overprovisioning of a VM / virtual disk too big", but as far as I can see that should be OK:

root@pve:/# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 3.75T 302G 166K /rpool
rpool/ROOT 868G 302G 153K /rpool/ROOT
rpool/ROOT/pve-1 868G 302G 868G /
rpool/data 2.90T 302G 153K /rpool/data
rpool/data/vm-100-disk-0 313G 302G 313G -
rpool/data/vm-100-disk-1 985G 302G 985G -
rpool/data/vm-101-disk-0 303G 302G 303G -
rpool/data/vm-101-disk-1 1.34T 302G 1.34T -
rpool/var-lib-vz 204K 302G 204K /var/lib/vz

I will ask the customer if we can restart the Proxmox later today, and then I will have a look at the backup jobs in the Web-GUI. Furthermore, I will detach vm-101-disk-1 (the VM that is not reachable), just to find out if that changes anything.

Best regards, Lars
 
P.S.: with the help of this forum, I found "pvesh" :)

So:

root@pve:/# pvesh ls /cluster/backup
-rw-d backup-42d365f8-fa32

root@pve:/# pvesh get /cluster/backup/backup-42d365f8-fa32
┌────────────────┬──────────────────────┐
│ key │ value │
╞════════════════╪══════════════════════╡
│ compress │ zstd │
├────────────────┼──────────────────────┤
│ enabled │ 1 │
├────────────────┼──────────────────────┤
│ fleecing │ {"enabled":"0"} │
├────────────────┼──────────────────────┤
│ id │ backup-42d365f8-fa32 │
├────────────────┼──────────────────────┤
│ mode │ snapshot │
├────────────────┼──────────────────────┤
│ node │ pve │
├────────────────┼──────────────────────┤
│ notes-template │ {{guestname}} │
├────────────────┼──────────────────────┤
│ repeat-missed │ 0 │
├────────────────┼──────────────────────┤
│ schedule │ mon..fri 00:00 │
├────────────────┼──────────────────────┤
│ storage │ usb │
├────────────────┼──────────────────────┤
│ type │ vzdump │
├────────────────┼──────────────────────┤
│ vmid │ 100,101 │
└────────────────┴──────────────────────┘

But (I almost expected it ;)):

root@pve:/# pvesh delete /cluster/backup/backup-42d365f8-fa32
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
trying to acquire cfs lock 'file-vzdump_cron' ...
cfs-lock 'file-vzdump_cron' error: got lock request timeout

So, it is not possible to delete the backup job. Maybe because of the "disk full" error, even though the disks are far from full:

root@pve:/# df -h

Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 2.9M 6.3G 1% /run
rpool/ROOT/pve-1 1.2T 869G 302G 75% /
tmpfs 32G 46M 32G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 512K 84K 424K 17% /sys/firmware/efi/efivars
rpool/var-lib-vz 302G 256K 302G 1% /var/lib/vz
rpool 302G 256K 302G 1% /rpool
rpool/ROOT 302G 256K 302G 1% /rpool/ROOT
rpool/data 302G 256K 302G 1% /rpool/data
/dev/fuse 128M 16K 128M 1% /etc/pve
tmpfs 6.3G 0 6.3G 0% /run/user/1000

Is it safe to edit /etc/pve/jobs.cfg, and/or to just remove the VMID of the VM with the "big" disk?

nano /etc/pve/jobs.cfg

vzdump: backup-42d365f8-fa32
    schedule mon..fri 00:00
    compress zstd
    enabled 1
    fleecing 0
    mode snapshot
    node pve
    notes-template {{guestname}}
    repeat-missed 0
    storage usb
    vmid 100,101

Thanx in advance!
 
you'll need to restart pve-cluster first to get /etc/pve RW again.. then editing the jobs over the API/UI/pvesh should work again.. but you should really find out why your disk keeps getting full, that sounds like you have a wrong storage.cfg entry and a corresponding backup job..
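
in concrete terms, that would be roughly the following (a sketch, reusing the job ID from the pvesh output above):

Code:
# make /etc/pve writable again
systemctl restart pve-cluster

# then the job can be removed via the API
pvesh delete /cluster/backup/backup-42d365f8-fa32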
 
Hi,

and thanx to you all for your input and hints. The server is up and running smoothly now, after I deleted the backup job in the GUI. The problem was as follows:

- the backup job had "usb" as destination
- actually, there should always be an external USB-drive attached, with the storage "usb" assigned to that disk
- apparently, the disk vanished now and then, and then the backup job just filled up / (root) with the backup data until it realized that there was not enough space left at all ...

That is similar to the behaviour of, e.g., an unreachable mount point in "normal" Linux: the free space gets filled up, and the system shows "100 % usage - 0 % free space". But we have almost never experienced a system more or less getting stuck on that :\

However: I think we have to find a reliable way to assign the backup storage to an external USB-drive and prevent Proxmox from filling up its own root space when the external disk is absent.
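
One way to guard against that from the backup side would be a vzdump hook script that refuses to start the job when the target is not actually mounted. Just a sketch - /mnt/usb is a placeholder, and the script would have to be referenced via the script option in /etc/vzdump.conf (or --script on the command line):

Code:
#!/bin/sh
# vzdump hook script (sketch): abort the backup job if the USB
# target directory is not really a mounted filesystem.
PHASE="$1"               # vzdump passes the phase as the first argument
MOUNTPOINT="/mnt/usb"    # placeholder - adjust to the real mount point

if [ "$PHASE" = "job-start" ]; then
    if ! mountpoint -q "$MOUNTPOINT"; then
        echo "backup target $MOUNTPOINT is not mounted - aborting" >&2
        exit 1           # a non-zero exit here aborts the whole job
    fi
fi
exit 0

(The is_mountpoint option on a directory storage, which comes up later in this thread, addresses the same problem on the storage side.)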
 
So - back to square one :((((((

- the backup job is deleted
- yesterday, everything seemed fine
- but this morning:

root@pve:/# df -h

Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 1.5M 6.3G 1% /run
rpool/ROOT/pve-1 869G 869G 0 100% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 512K 84K 424K 17% /sys/firmware/efi/efivars
rpool/var-lib-vz 256K 256K 0 100% /var/lib/vz
rpool 256K 256K 0 100% /rpool
rpool/ROOT 256K 256K 0 100% /rpool/ROOT
rpool/data 256K 256K 0 100% /rpool/data
tmpfs 6.3G 0 6.3G 0% /run/user/1000

- so the system is not usable at the moment. The VMs are not running (showing "i/o error"); the GUI was reachable, but no longer after a reboot.
Last time we had this, it was an overprovisioning of one VM (more space assigned to the virtual disks than available in the storage). But this time that cannot be the cause, because we made the virtual disks smaller.

HELP!!!!
 
Here are the error messages from the cluster log:

Jan 16 09:10:49 pve systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jan 16 09:10:49 pve pmxcfs[1542]: [database] crit: chmod failed: No space left on device
Jan 16 09:10:49 pve pmxcfs[1542]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jan 16 09:10:49 pve pmxcfs[1542]: [main] notice: exit proxmox configuration filesystem (-1)
Jan 16 09:10:49 pve systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jan 16 09:10:49 pve systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jan 16 09:10:49 pve systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
 
qm list

ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
 
please post
- storage.cfg
- jobs.cfg
- /etc/vzdump.conf

something must still be misconfigured...
 
Thank you very much for your fast reply :)

1) cat /etc/vzdump.conf

# vzdump default settings

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#performance: [max-workers=N][,pbs-entries-max=N]
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#prune-backups: keep-INTERVAL=N[,...]
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N
#notes-template: {{guestname}}
#pbs-change-detection-mode: legacy|data|metadata
#fleecing: enabled=BOOLEAN,storage=STORAGE_ID

But nothing under /etc/pve/:

root@pve:/# ls -ltra /etc/pve/
total 14
drwxr-xr-x 2 root root 2 Sep 20 13:28 .
drwxr-xr-x 91 root root 186 Dec 12 12:09 ..

I assume that the problem is here:

root@pve:/# df -h

=>

rpool/ROOT/pve-1 869G 869G 0 100% /

tmpfs 32G 0 32G 0% /dev/shm

=>

rpool 256K 256K 0 100% /rpool
rpool/ROOT 256K 256K 0 100% /rpool/ROOT
rpool/data 256K 256K 0 100% /rpool/data
tmpfs 6.3G 0 6.3G 0% /run/user/1000

The storage seems to be full, and there is no "special" volume / device / partition reserved for /.
 
well.. you need to make some space, and then restart pve-cluster for /etc/pve to reappear ;)

anyway, maybe if you look at what is taking up the space you will already find the culprit:

Code:
du -sm /* | sort -n

and then rinse repeat for the dirs with high usage, e.g.

Code:
du -sm /mnt/* | sort -n

until you find one that contains a lot of data in an unexpected place..
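
if you end up doing that level-by-level descent a few times, a small throwaway loop can automate it - just a sketch, not a Proxmox tool:

Code:
#!/bin/sh
# Walk down the tree, always descending into the largest directory,
# to see where the space went. -x keeps du on the root filesystem.
target=/
while :; do
    echo "--- $target ---"
    sizes=$(du -xsm "$target"/* 2>/dev/null | sort -n)
    echo "$sizes" | tail -n 5
    biggest=$(echo "$sizes" | tail -n 1 | cut -f2)
    if [ -z "$biggest" ] || [ ! -d "$biggest" ]; then
        break   # largest entry is a plain file (or nothing left) - stop
    fi
    target="$biggest"
done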
 
Thanx again!

du -sm /* | sort -n

du: cannot access '/proc/1632/task/1632/fd/4': No such file or directory
du: cannot access '/proc/1632/task/1632/fdinfo/4': No such file or directory
du: cannot access '/proc/1632/fd/3': No such file or directory
du: cannot access '/proc/1632/fdinfo/3': No such file or directory

but with the help of your hint, I found the solution :)

0 /dev
(...) bla bla
6 /etc
225 /boot
1040 /var
1971 /usr
=>
885959 /usb

= why was there something under /usb/ at all??

root@pve:/# ll /usb/dump/

-rw-r--r-- 1 root root 10825 Nov 22 01:34 vzdump-qemu-100-2024_11_22-00_00_02.log
-rw-r--r-- 1 root root 930820206030 Nov 22 01:34 vzdump-qemu-100-2024_11_22-00_00_02.vma.zst
-rw-r--r-- 1 root root 6 Nov 22 01:34 vzdump-qemu-100-2024_11_22-00_00_02.vma.zst.notes
-rw-r--r-- 1 root root 3771 Nov 22 02:04 vzdump-qemu-101-2024_11_22-01_34_01.log

so, there were plenty of old backup files (from November 2024) that filled up /usb/dump, and that filled up /. After a

root@pve:/# rm -rf /usb/dump/

the system runs again:

root@pve:/# df -h
Filesystem Size Used Avail Use% Mounted on

udev 32G 0 32G 0% /dev
tmpfs 6.3G 1.5M 6.3G 1% /run
rpool/ROOT/pve-1 869G 298G 571G 35% /
rpool/var-lib-vz 571G 256K 571G 1% /var/lib/vz
rpool 571G 256K 571G 1% /rpool
rpool/ROOT 571G 256K 571G 1% /rpool/ROOT
rpool/data 571G 256K 571G 1% /rpool/data

But now the questions:

1) There were *no* backup jobs running!! Why did Proxmox fill up / out of the blue?

2) Another idea: we had an external USB-drive mounted under /mnt/usb, but there was no backup job using that disk. It seems to us that Proxmox only filled up / (root) with the contents of /usb/dump after the external USB-disk was connected?! We had no problems with the server for the last two days (without the disk). Yesterday we attached the external USB drive, and last night it suddenly stopped. There seems to be a relation between the two incidents.

Best regards, Lars
 
please post the config files I asked for!
 
please post
- storage.cfg
- jobs.cfg
- /etc/vzdump.conf

something must still be misconfigured...
Sorry, I was out of action the last few days .... at least, the Proxmox is running without problems now (I hope so):

1) egrep -v '#|^ *$' /etc/pve/storage.cfg

dir: local
    path /var/lib/vz
    content iso,vztmpl,backup
zfspool: local-zfs
    pool rpool/data
    content images,rootdir
    sparse 1
dir: backup
    path /var/lib/vz/dump
    content backup,images
    prune-backups keep-all=1
    shared 0
dir: usb1
    path /pladde/usb1
    content backup
    is_mountpoint /pladde/usb1
    prune-backups keep-last=2
    shared 0

2) egrep -v '#|^ *$' /etc/pve/jobs.cfg
vzdump: backup-380e378e-5bed
    schedule 21:00
    compress gzip
    enabled 1
    fleecing 0
    mailnotification always
    mailto support@somewhere
    mode snapshot
    node pve
    notes-template {{guestname}}
    notification-mode legacy-sendmail
    repeat-missed 0
    storage usb1
    vmid 100

3)
egrep -v '#|^ *$' /etc/vzdump.conf

= nothing, because:

root@pve:/# cat /etc/vzdump.conf

# vzdump default settings
#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#performance: [max-workers=N][,pbs-entries-max=N]
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#prune-backups: keep-INTERVAL=N[,...]
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N
#notes-template: {{guestname}}
#pbs-change-detection-mode: legacy|data|metadata
#fleecing: enabled=BOOLEAN,storage=STORAGE_ID

I am sure that the former problems came from:

dir: usb
    path /usb
    content backup
    is_mountpoint /pladde/usb1

= because there was a 4-TB USB-disk mounted for the backups. But that disk vanished, and so Proxmox started its backups anyway, filled up its own / space and "forgot" to free it afterwards. At least that is my guess. Since we have now deleted the storage in the Proxmox GUI and re-created it with

pvesm set /usb1 --is_mountpoint /pladde/usb1

I hope that Proxmox will only start its backups when there really is a volume mounted (and not just the folder under /). When we use rsync for backups to remote storage (iSCSI or other), we usually have something like this in the scripts:

# Check for enough free disk space
GETPERCENTAGE='s/.* \([0-9]\{1,3\}\)%.*/\1/'
if $CHECK_HDMINFREE ; then
    # extract the "Use%" column (percentage used) for blocks and inodes
    KBISFREE=`df /$DATA_PATH | tail -n1 | sed -e "$GETPERCENTAGE"`
    INODEISFREE=`df -i /$DATA_PATH | tail -n1 | sed -e "$GETPERCENTAGE"`
    # abort if usage is already at or above the configured limit
    if [ $KBISFREE -ge $HDMINFREE -o $INODEISFREE -ge $HDMINFREE ] ; then
        logger "Fatal: Not enough space left for rotating backups!"
        exit
    fi
fi

(= stolen from Peer Heinlein's rsync pages ;)).

Best regards
 
Code:
dir: usb
    path /usb
    content backup
    is_mountpoint /pladde/usb1

yeah, that probably didn't do what you expected it to, because is_mountpoint and path pointed to different paths: as long as /pladde/usb1 was mounted, the storage was considered active, and vzdump would happily write to /usb even if nothing was mounted there - which means it was actually filling up the / partition. if path and the mountpoint are identical, you can just set is_mountpoint to 1.
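
applied to the "usb1" storage posted above, that would look roughly like this (the rest of the entry unchanged):

Code:
dir: usb1
    path /pladde/usb1
    content backup
    is_mountpoint 1
    prune-backups keep-last=2
    shared 0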
 