Input / output error

larsb · Dec 12, 2024

Hi, folx,

we have some starnge issues with a stand alone Proxmox-server.
- VM 1 is running without problems
- VM 2 is "running", but not reachable via RDP
- Proxmox-WebGui is not reachable, too. When logging in with valid credentials, there is an error "wrong password". The same credentials are valid when logging in via SSH.

The error logs are saying:

"unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1774' - Input/output error"

- so basically the reason is/are that error(s):

unable to write lrm status file
unable to open file '/etc/pve/nodes/pve/ Input/output error

We can´t even do a

touch /etc/pve/testfile

- same error (input / output error). We had the same problem a while ago when we overprovisoned a VM, i.e. we assigned 2 TB to a virtual disk, but there was only 1,8 TB left in then ZFS. Interesting experience, by the way - the whole Proxmox stopped with all VMs, we could not use the server for a couple of days because we had to restore the backups on a second machine and then install thenwhole system from scratch :\ The fact that it is possible to overprovision a VM is very poor IMHO, that was never possible with VMWare, Xen or Hyper-V! But that´s another discussion ....

Now, back to our problem. We could restart the whole Proxmox, then we are able for a short time to reach the GUI and then second VM. After some hours / half a day, the same problem will occur.

# zfs list

NAME USED AVAIL REFER MOUNTPOINT
rpool 3.73T 325G 166K /rpool
rpool/ROOT 868G 325G 153K /rpool/ROOT
rpool/ROOT/pve-1 868G 325G 868G /
rpool/data 2.88T 325G 153K /rpool/data
rpool/data/vm-100-disk-0 310G 325G 310G -
rpool/data/vm-100-disk-1 985G 325G 985G -
rpool/data/vm-101-disk-0 303G 325G 303G -
rpool/data/vm-101-disk-1 1.32T 325G 1.32T -
rpool/var-lib-vz 204K 325G 204K /var/lib/vz

# zpool list

NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 5.22T 4.67T 567G - - 19% 89% 1.00x ONLINE -

# zpool status -v

pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:42:56 with 0 errors on Sun Dec 8 01:06:57 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717623-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717626-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717625-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717622-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717631-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7L3960HCJR-00A07_S662NN0W717630-part3 ONLINE 0 0 0

errors: No known data errors

When we are trying "df -h" or "df -i", there is space enough, I guess. When changing to /etc/pve, I am in the Fuse-environment as usual, but I cannot do anything there:

root@pve:/etc# df -h .

Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/pve-1 1.2T 869G 325G 73% /

root@pve:/etc# cd /etc/pve

root@pve:/etc/pve# df -h .

Filesystem Size Used Avail Use% Mounted on
/dev/fuse 128M 16K 128M 1% /etc/pve

root@pve:/etc/pve# touch small-file
touch: cannot touch 'small-file': Input/output error

Any ideas how this could happen?

fabian · Dec 12, 2024

check the logs of the pve-cluster systemd unit responsible for /etc/pve

larsb · Dec 12, 2024

I would do this as soon as you tell me where to find them

tail -f /var/log/pve/tasks/ ?

journalctl ?

fabian · Dec 12, 2024

"journalctl -b -u pve-cluster"

larsb · Dec 12, 2024

fabian said:
"journalctl -b -u pve-cluster"

We got a cluster?

root@pve:/var/log# journalctl -b -u pve-cluster
Nov 24 21:02:10 pve systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...

Nov 24 21:02:10 pve pmxcfs[1620]: [main] notice: resolved node name 'pve' to '10.195.195.220' for default node IP address

Nov 24 21:02:10 pve pmxcfs[1620]: [main] notice: resolved node name 'pve' to '10.195.195.220' for default node IP address

Nov 24 21:02:11 pve systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Nov 25 00:35:57 pve pmxcfs[1640]: [database] crit: commit transaction failed: database or disk is full#010

Nov 25 00:35:57 pve pmxcfs[1640]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

But those lines are more than two weeks old ...

fabian · Dec 12, 2024

yeah, so your rootfs got full and the DB that is stored there which is backing /etc/pve didn't like that

you can try restarting the pve-cluster service, hopefully the DB itself is not corrupt..

larsb · Dec 12, 2024

Thanx for the fast reply and the hint. But I do not understand how this could happen. Should we assign more space to the rootfs?

larsb · Dec 12, 2024

fabian said:
you can try restarting the pve-cluster service, hopefully the DB itself is not corrupt..

I actually found some posts stating problems with the DB. Solution was to export and import the DB or so ...
Is it generally spoken better to use one or two disks exclusively for the Proxmox / rootfs?

fabian · Dec 12, 2024

I assume 2 weeks back *something* happened that made the rootfs fill up. often this is a storage that is not annotated properly as being a mountpoint, if the storage is missing for whatever reason a backup might accidentally end up on the rootfs for example..

of course redundant storage is usually better, but it doesn't prevent filling it up either

larsb · Dec 12, 2024

fabian said:
I assume 2 weeks back *something* happened that made the rootfs fill up. often this is a storage that is not annotated properly as being a mountpoint, if the storage is missing for whatever reason a backup might accidentally end up on the rootfs for example..

of course redundant storage is usually better, but it doesn't prevent filling it up either

Yeah, I agree with the latter - but we have had those problems now for a while, more than two weeks ago. But because the VM 100 is running and 101 is not that impoortant and because we can reach the Proxmox via ssh, we left the system for a while.

I will reboot the server later this evening and see, what happens

larsb · Dec 12, 2024

I found this post and will follow the steops adviced there:

look here

LnxBil · Dec 12, 2024

larsb said:
Should we assign more space to the rootfs?

You need to set a refreservation to the rootfs so that another dataset cannot write your disk full until the pool halts.

waltar · Dec 12, 2024

LnxBil said:
You need to set a refreservation to the rootfs so that another dataset cannot write your disk full until the pool halts.

That would not solve the problem ... as next dataset is the zvol data where if vm/lxc cannot write anymore get corrupted by itself, if running replication the snapshots and replications maybe in 10min interval would destroy next host too otherwise just useless vm snapshots until then. In /var/lib/vz backups fail and last but not least defect hardware or any hacker attacks could fill up your system logs which in the end still end with same error of cannot update the pve database anymore as logs are in the / which is just protected to not fill up from other dataset.
So checking disk usage from time to time is always good or spend lots of space as small space with regulations still any day run into a problem.

LnxBil · Dec 12, 2024

waltar said:
That would not solve the problem ...

it would solve the problem of the corrupted pve internal database, yet not the others, sure. It wasn't meant as a solution for everything. Thin provisioning needs constant monitoring and is a recipe for distaster if not monitored properly.

larsb · Friday at 08:16

LnxBil said:
it would solve the problem of the corrupted pve internal database, yet not the others, sure. It wasn't meant as a solution for everything. Thin provisioning needs constant monitoring and is a recipe for distaster if not monitored properly.

Thanx to both of you for your input, and sorry for the late reply :\
I would like to begin with the reservation of space for the rootfs. At the moment, we got:

root@pve:/# df -h

udev 32G 0 32G 0% /dev
tmpfs 6.3G 2.5M 6.3G 1% /run
rpool/ROOT/pve-1 1.2T 869G 325G 73% /
tmpfs 32G 37M 32G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 512K 84K 424K 17% /sys/firmware/efi/efivars
rpool/var-lib-vz 325G 256K 325G 1% /var/lib/vz
rpool 325G 256K 325G 1% /rpool
rpool/ROOT 325G 256K 325G 1% /rpool/ROOT
rpool/data 325G 256K 325G 1% /rpool/data
/dev/fuse 128M 16K 128M 1% /etc/pve
/dev/sdg1 3.6T 28K 3.4T 1% /mnt/usb
tmpfs 6.3G 0 6.3G 0% /run/user/1000

So, far from 100 % usage IMHO. But we still have the strange problem that we can not login via the GUI ("Login failed", though the credentials are correct) and that we can not run the second VM :\

As said before, if we reboot the Proxmoxserver it is for a short time possible to login via the GUI as well as reach the second VM. But after a day or so it is the same again, only one VM up and the Proxmox only accessible via SSH.

LnxBil · Friday at 08:23

First, this is NOT normal but without proper error messages, we cannot even guess.

larsb · Friday at 08:28

Of course you´re right

But we had them already, I think:

root@pve:/# journalctl -b -u pve-cluster

Nov 24 21:02:10 pve systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Nov 24 21:02:10 pve pmxcfs[1620]: [main] notice: resolved node name 'pve' to '10.195.195.220' for default node IP address
Nov 24 21:02:10 pve pmxcfs[1620]: [main] notice: resolved node name 'pve' to '10.195.195.220' for default node IP address
Nov 24 21:02:11 pve systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Nov 25 00:35:57 pve pmxcfs[1640]: [database] crit: commit transaction failed: database or disk is full#010
Nov 25 00:35:57 pve pmxcfs[1640]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

- so since november 25th the disk or databse is announced as "full". But whiuch disk? Which database? I know such problems only from SQL- and/or Exchange-databases where it is necessary to clear transaction logs from time to time.

waltar · Friday at 08:46

This is the sqlite pve db : /var/lib/pve-cluster/config.db and "/" filesystem was full.

larsb · Friday at 18:14

Hi, waltar,

I may be a fool, but I am used to at least *trying* to understand the system and infrastructure of a server - but at this point, I am lost :\

root@pve:/# du -sch /var/lib/pve-cluster/config.db
20K /var/lib/pve-cluster/config.db

root@pve:/var/lib/pve-cluster# df -h .

Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/pve-1 1.2T 869G 325G 73% /

root@pve:/var/lib/pve-cluster# df -h /

Filesystem Size Used Avail Use% Mounted on
rpool/ROOT/pve-1 1.2T 869G 325G 73% /

= why does it state "disk full" with a 20k database on a /-mountpoint of 1.2 Terabyte with roughly 27 percent unused??
And the pve-cluster-log did not change since the 25th of november ... is this typically for Proxmox?!

waltar · Saturday at 09:05

Maybe there was a (pve/pbs) backup job who filled /tmp temporäry which results in that error messages to db full ?!

Input / output error

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

New Member

Proxmox Staff Member

New Member

New Member

Distinguished Member

Renowned Member

Distinguished Member

New Member

Distinguished Member

New Member

Renowned Member

New Member

Renowned Member