[SOLVED] PVE Standalone - Losing WebUI & IO problems on rpool

Pifouney

Member
Oct 17, 2021
Hello there!


I am writing about an issue I have been running into for several months now.
Since it is not completely blocking, I have so far put off dealing with it properly by simply restarting the machine.

I manage a standalone server installed natively on ZFS.
The server has 64 GB of RAM and an Intel Core i7-7700K.

When the server has been running for four or five days, I end up in this state:
- I lose access to the web interface,
- the host OS still accepts some input, but not all, and in particular refuses writes to the /etc directory.

From that point on, my syslog starts to look like the attached screenshot "bug_IO_srv.png".

I have already run the PVE update tools and set the internal clock to UTC...

At this point I really have no idea how to solve this... Any idea is welcome :)
 

Attachments

  • bug_IO_srv.png (180.8 KB)
Hi,
At first glance, this seems like an issue with the cluster file system. Please check the logs with journalctl -u pve-cluster.service (the service is also relevant when the node is standalone). Please also post the output of pveversion -v.

Did you configure some HA services? The HA stack in PVE is designed for clusters. Does writing to all of /etc/ fail or just to /etc/pve?
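For convenience, both checks are plain shell commands run directly on the node:
Code:
# log of the cluster file system service (pmxcfs), which also runs on standalone nodes
journalctl -u pve-cluster.service
# installed Proxmox package versions
pveversion -v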
 
Hello :)

Thanks for your time :)

Output of pveversion -v:
Code:
pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Did you configure some HA services? The HA stack in PVE is designed for clusters.

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?

Does writing to all of /etc/ fail or just to /etc/pve?

I'm sure the I/O fails when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event),
but I remember that the problem also exists in /etc.

Best regards,
 
Please check the logs with journalctl -u pve-cluster.service.
And also /var/log/syslog from around the time the issue occurred.

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?
I was wondering because pve-ha-lrm shows up in the screenshot, but apparently it is always running and just idles if no HA is configured. You can check with ha-manager status. If it just says quorum OK and nothing else, you don't have HA configured.
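For example, on a node without HA configured the check returns nothing but the quorum line:
Code:
root@ns3855022:~# ha-manager status
quorum OK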

I'm sure the I/O fails when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event),
but I remember that the problem also exists in /etc.
Is /etc/ its own disk/partition/file system by any chance, or is it just part of the root file system?
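One quick way to check is to ask df which file system each path lives on (a simple sketch; any equivalent tool works):
Code:
# prints the backing file system for /, /etc and the /etc/pve fuse mount
df -h / /etc /etc/pve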
 
Hey :)

My system is acting up again ^^

Output of journalctl -u pve-cluster.service:
Code:
-- Logs begin at Mon 2021-10-18 16:18:37 CEST, end at Thu 2021-10-21 14:40:10 CEST. --
oct. 18 16:18:39 ns3855022 systemd[1]: Starting The Proxmox VE cluster filesystem...
oct. 18 16:18:40 ns3855022 systemd[1]: Started The Proxmox VE cluster filesystem.
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

My syslog while the issue is occurring:
Code:
Oct 21 14:41:00 ns3855022 pvestatd[1890]: status update time (9.079 seconds)
Oct 21 14:41:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:41:01 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:03 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:04 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:05 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:06 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:08 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:09 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:10 ns3855022 pvesr[12709]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:41:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:41:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:10 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:20 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:20 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:22 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:27 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:30 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:30 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:32 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:37 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:40 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:40 ns3855022 pvestatd[1890]: status update time (9.077 seconds)
Oct 21 14:41:42 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:47 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:50 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:50 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:52 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:57 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:42:00 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:00 ns3855022 pvestatd[1890]: status update time (9.073 seconds)
Oct 21 14:42:01 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:03 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:04 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:05 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:06 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:08 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:09 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:10 ns3855022 pvesr[13899]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:42:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:42:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:10 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:42:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:21 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:21 ns3855022 pvestatd[1890]: status update time (9.080 seconds)

Output of ha-manager status:
Code:
quorum OK

After looking at the result of journalctl -u pve-cluster.service, I see that there is a disk space concern. I went into more detail on my syslog file, and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010

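(For context: pmxcfs keeps its backing database at /var/lib/pve-cluster/config.db on the root file system, so a "database or disk is full" error refers to the space available to /, not to the small /etc/pve fuse mount. Assuming that default path, a quick check is:)
Code:
# free space on the file system holding the pmxcfs database
df -h /var/lib/pve-cluster
# size of the database itself
du -sh /var/lib/pve-cluster/config.db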
This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Output of df -h:
Code:
Sys. de fichiers             Taille Utilisé Dispo Uti% Monté sur
udev                            32G       0   32G   0% /dev
tmpfs                          6,3G     58M  6,3G   1% /run
rpool/ROOT/pve-1               164G     43G  121G  27% /
tmpfs                           32G     46M   32G   1% /dev/shm
tmpfs                          5,0M       0  5,0M   0% /run/lock
tmpfs                           32G       0   32G   0% /sys/fs/cgroup
/dev/sda1                      3,7T    1,1T  2,6T  30% /srv/HDD_DATAS
rpool                          121G    128K  121G   1% /rpool
rpool/ROOT                     121G    128K  121G   1% /rpool/ROOT
rpool/data                     121G    128K  121G   1% /rpool/data
rpool/GamingPool               121G    128K  121G   1% /rpool/GamingPool
rpool/datas-ct                 121G    128K  121G   1% /rpool/datas-ct
rpool/data/subvol-101-disk-0    25G    681M   25G   3% /rpool/data/subvol-101-disk-0
rpool/data/subvol-103-disk-0    25G    486M   25G   2% /rpool/data/subvol-103-disk-0
/dev/fuse                       30M     20K   30M   1% /etc/pve

Is a temp_save directory created when a backup task runs? If not, I really don't understand this situation :s

My backup job saves a VM disk to my 4 TB HDD. The VM uses 1 TB of disk space, and its backup has failed every time, from what I last saw...

Any idea please? :D

Thanks for your time :)
 
After looking at the result of journalctl -u pve-cluster.service, I see that there is a disk space concern. I went into more detail on my syslog file, and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).

This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Please also share the output of df -ih and the log of the (presumably failed) backup task. Which disk is the VM on and which disk is the backup target?

Is a temp_save directory created when a backup task runs? If not, I really don't understand this situation :s
IIRC a temporary directory is only used for non-"stop mode" container backups if the storage doesn't support snapshots. And the default is not in /etc, but you can look at the tmpdir setting of your /etc/vzdump.conf to be sure.
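A quick way to see whether a custom tmpdir (or any other non-default setting) is active is to print only the non-comment lines of that file:
Code:
# show active (non-comment) settings in the vzdump configuration
grep -v '^#' /etc/vzdump.conf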

 
Hello :)
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).
Backups are scheduled at a time of very light load :)

Output of df -ih:
Code:
root@ns3855022:~# df -ih
Sys. de fichiers             Inœuds IUtil. ILibre IUti% Monté sur
udev                           7,9M    541   7,9M    1% /dev
tmpfs                          7,9M    869   7,9M    1% /run
rpool/ROOT/pve-1               243M    96K   242M    1% /
tmpfs                          7,9M    113   7,9M    1% /dev/shm
tmpfs                          7,9M     15   7,9M    1% /run/lock
tmpfs                          7,9M     18   7,9M    1% /sys/fs/cgroup
/dev/sda1                      373M     50   373M    1% /srv/HDD_DATAS
rpool                          242M     10   242M    1% /rpool
rpool/data                     242M      8   242M    1% /rpool/data
rpool/ROOT                     242M      7   242M    1% /rpool/ROOT
rpool/GamingPool               242M      6   242M    1% /rpool/GamingPool
rpool/datas-ct                 242M      6   242M    1% /rpool/datas-ct
rpool/data/subvol-103-disk-0    50M    27K    50M    1% /rpool/data/subvol-103-disk-0
rpool/data/subvol-101-disk-0    49M    46K    49M    1% /rpool/data/subvol-101-disk-0
/dev/fuse                      9,8K     34   9,8K    1% /etc/pve

The log of the failed backup is in the attached file :D

In the case of this container, the virtual disk and the backup destination are on the same disk (the 4 TB disk). It's a Nextcloud container.

Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D

Thanks for your answers :)
 

Attachments

  • failed_save.png (110.5 KB)
Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D
Yes. You can also just comment out the line. By default vzdump should use the backup storage itself (except for PBS).
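For example, the file shown above would then simply become the following, and vzdump falls back to its default behaviour of writing directly to the backup storage:
Code:
# vzdump default settings

#tmpdir: /var/lib/vz/tmp_backup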
 
