[SOLVED] PVE Standalone - Losing WebUI & IO problems on rpool

Pifouney

Active Member
Oct 17, 2021
Hello there!


I am coming to you about an issue I have been running into for several months now...
Since it is only mildly blocking, I have so far worked around it by restarting the machine.

I manage a standalone server installed natively on ZFS.
The server has 64 GB of RAM and an Intel Core i7-7700K.

When the server has been running for four or five days, I end up with:
- loss of access to the web interface,
- a host OS that seems to accept some inputs, but not all, and in particular refuses to write to the /etc directory.

At that point, and from then on, my syslog starts to look like the attached file "bug_IO_srv.png".

I have already used the PVE update tools and set the internal clock to UTC...

I really have no idea how to solve this... Any idea is welcome :)
 

Attachments

  • bug_IO_srv.png (180.8 KB)
Hi,
At first glance, this seems like an issue with the cluster file system. Please check the logs with journalctl -u pve-cluster.service (the service is also relevant when the node is standalone). Please also post the output of pveversion -v.

Did you configure some HA services? The HA stack in PVE is designed for clusters. Does writing to all of /etc/ fail or just to /etc/pve?
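
For reference, the checks could look something like this (run as root on the PVE host; the write test is only an illustrative way to tell which path rejects writes, the file name is arbitrary):
Code:
journalctl -u pve-cluster.service -b
pveversion -v
# quick write test to tell /etc and /etc/pve apart
touch /etc/write_test && rm /etc/write_test
touch /etc/pve/write_test && rm /etc/pve/write_test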
 
Hello :)

Thanks for your time :)

Output of pveversion -v:
Code:
pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Did you configure some HA services? The HA stack in PVE is designed for clusters

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?

Does writing to all of /etc/ fail or just to /etc/pve?

I'm sure the I/Os fail when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event).
But I remember that the problem also exists in /etc.

Best regards,
 
Please check the logs with journalctl -u pve-cluster.service.
And also /var/log/syslog from around the time the issue occurred.

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?
I was wondering because pve-ha-lrm is in the screenshot, but apparently it's always running, just idling around if no HA is configured. You can check with ha-manager status. If it just says quorum OK and nothing else, you don't have HA configured.

I'm sure the I/Os fail when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event).
But I remember that the problem also exists in /etc.
Is /etc/ its own disk/partition/file system by any chance, or is it just part of the root file system?
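
A quick way to check could be, for example (findmnt and df are standard tools on Debian/PVE):
Code:
findmnt -T /etc
findmnt -T /etc/pve
df -h / /etc /etc/pve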
 
Hey :)

My system is acting up again ^^

Output of journalctl -u pve-cluster.service:
Code:
-- Logs begin at Mon 2021-10-18 16:18:37 CEST, end at Thu 2021-10-21 14:40:10 CEST. --
oct. 18 16:18:39 ns3855022 systemd[1]: Starting The Proxmox VE cluster filesystem...
oct. 18 16:18:40 ns3855022 systemd[1]: Started The Proxmox VE cluster filesystem.
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

My syslog while the issue is occurring:
Code:
Oct 21 14:41:00 ns3855022 pvestatd[1890]: status update time (9.079 seconds)
Oct 21 14:41:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:41:01 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:03 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:04 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:05 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:06 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:08 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:09 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:10 ns3855022 pvesr[12709]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:41:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:41:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:10 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:20 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:20 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:22 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:27 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:30 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:30 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:32 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:37 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:40 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:40 ns3855022 pvestatd[1890]: status update time (9.077 seconds)
Oct 21 14:41:42 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:47 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:50 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:50 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:52 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:57 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:42:00 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:00 ns3855022 pvestatd[1890]: status update time (9.073 seconds)
Oct 21 14:42:01 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:03 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:04 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:05 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:06 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:08 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:09 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:10 ns3855022 pvesr[13899]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:42:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:42:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:10 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:42:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:21 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:21 ns3855022 pvestatd[1890]: status update time (9.080 seconds)

Output of ha-manager status:
Code:
quorum OK

After looking at the output of journalctl -u pve-cluster.service, I see that there is a disk space issue. I dug deeper into my syslog file and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010

This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Output of df -h:
Code:
Sys. de fichiers             Taille Utilisé Dispo Uti% Monté sur
udev                            32G       0   32G   0% /dev
tmpfs                          6,3G     58M  6,3G   1% /run
rpool/ROOT/pve-1               164G     43G  121G  27% /
tmpfs                           32G     46M   32G   1% /dev/shm
tmpfs                          5,0M       0  5,0M   0% /run/lock
tmpfs                           32G       0   32G   0% /sys/fs/cgroup
/dev/sda1                      3,7T    1,1T  2,6T  30% /srv/HDD_DATAS
rpool                          121G    128K  121G   1% /rpool
rpool/ROOT                     121G    128K  121G   1% /rpool/ROOT
rpool/data                     121G    128K  121G   1% /rpool/data
rpool/GamingPool               121G    128K  121G   1% /rpool/GamingPool
rpool/datas-ct                 121G    128K  121G   1% /rpool/datas-ct
rpool/data/subvol-101-disk-0    25G    681M   25G   3% /rpool/data/subvol-101-disk-0
rpool/data/subvol-103-disk-0    25G    486M   25G   2% /rpool/data/subvol-103-disk-0
/dev/fuse                       30M     20K   30M   1% /etc/pve

Is a temporary save directory created when running a backup task? If not, I really don't understand this situation :s

My backup saves a VM disk to my 4 TB HDD. The VM uses 1 TB of disk space, and lately its backup fails every time, from what I have seen....

Any idea please? :D

Thanks for your time :)
 
Code:
-- Logs begin at Mon 2021-10-18 16:18:37 CEST, end at Thu 2021-10-21 14:40:10 CEST. --
oct. 18 16:18:39 ns3855022 systemd[1]: Starting The Proxmox VE cluster filesystem...
oct. 18 16:18:40 ns3855022 systemd[1]: Started The Proxmox VE cluster filesystem.
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010


After looking at the output of journalctl -u pve-cluster.service, I see that there is a disk space issue. I dug deeper into my syslog file and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).
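
If you want to get a feel for the I/O pressure during the backup window, something like the following could help (zpool iostat ships with zfsutils, iostat with the sysstat package):
Code:
# pool-level I/O statistics, refreshed every 5 seconds
zpool iostat -v rpool 5
# per-device utilisation
iostat -x 5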

This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Output of df -h:
Code:
Sys. de fichiers             Taille Utilisé Dispo Uti% Monté sur
udev                            32G       0   32G   0% /dev
tmpfs                          6,3G     58M  6,3G   1% /run
rpool/ROOT/pve-1               164G     43G  121G  27% /
tmpfs                           32G     46M   32G   1% /dev/shm
tmpfs                          5,0M       0  5,0M   0% /run/lock
tmpfs                           32G       0   32G   0% /sys/fs/cgroup
/dev/sda1                      3,7T    1,1T  2,6T  30% /srv/HDD_DATAS
rpool                          121G    128K  121G   1% /rpool
rpool/ROOT                     121G    128K  121G   1% /rpool/ROOT
rpool/data                     121G    128K  121G   1% /rpool/data
rpool/GamingPool               121G    128K  121G   1% /rpool/GamingPool
rpool/datas-ct                 121G    128K  121G   1% /rpool/datas-ct
rpool/data/subvol-101-disk-0    25G    681M   25G   3% /rpool/data/subvol-101-disk-0
rpool/data/subvol-103-disk-0    25G    486M   25G   2% /rpool/data/subvol-103-disk-0
/dev/fuse                       30M     20K   30M   1% /etc/pve
Please also share the output of df -ih and the log of the (presumably failed) backup task. Which disk is the VM on and which disk is the backup target?
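
Since all ZFS datasets share the pool's free space, df alone can be misleading; a pool-level view might also help, for example:
Code:
zpool list rpool
zfs list -o space rpool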

Is a temporary save directory created when running a backup task? If not, I really don't understand this situation :s
IIRC a temporary directory is only used for non-"stop mode" container backups if the storage doesn't support snapshots. And the default is not in /etc, but you can look at the tmpdir setting of your /etc/vzdump.conf to be sure.

My backup saves a VM disk to my 4 TB HDD. The VM uses 1 TB of disk space, and lately its backup fails every time, from what I have seen....

Any idea please? :D

Thanks for your time :)
 
Hello :)
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).
Backups are scheduled at a time with very light load :)

Output of df -ih:
Code:
root@ns3855022:~# df -ih
Sys. de fichiers             Inœuds IUtil. ILibre IUti% Monté sur
udev                           7,9M    541   7,9M    1% /dev
tmpfs                          7,9M    869   7,9M    1% /run
rpool/ROOT/pve-1               243M    96K   242M    1% /
tmpfs                          7,9M    113   7,9M    1% /dev/shm
tmpfs                          7,9M     15   7,9M    1% /run/lock
tmpfs                          7,9M     18   7,9M    1% /sys/fs/cgroup
/dev/sda1                      373M     50   373M    1% /srv/HDD_DATAS
rpool                          242M     10   242M    1% /rpool
rpool/data                     242M      8   242M    1% /rpool/data
rpool/ROOT                     242M      7   242M    1% /rpool/ROOT
rpool/GamingPool               242M      6   242M    1% /rpool/GamingPool
rpool/datas-ct                 242M      6   242M    1% /rpool/datas-ct
rpool/data/subvol-103-disk-0    50M    27K    50M    1% /rpool/data/subvol-103-disk-0
rpool/data/subvol-101-disk-0    49M    46K    49M    1% /rpool/data/subvol-101-disk-0
/dev/fuse                      9,8K     34   9,8K    1% /etc/pve

The log of the failed backup is in the attached file :D

In the case of this container, the virtual disk and the share repository are on the same disk (the 4 TB one). It's a Nextcloud container.

Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D

Thanks for your answers :)
 

Attachments

  • failed_save.png (110.5 KB)
Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D
Yes. You can also just comment out the line. By default vzdump should use the backup storage itself (except for PBS).
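
As a sketch, /etc/vzdump.conf could then look like this (the /srv/HDD_DATAS path is only taken from the df output above; any file system with enough free space would do):
Code:
# vzdump default settings

# either comment out the custom tmpdir so vzdump falls back to its default...
#tmpdir: /var/lib/vz/tmp_backup
# ...or point it at a file system with plenty of free space, e.g. the 4 TB HDD
# (create the directory first: mkdir -p /srv/HDD_DATAS/tmp_backup)
tmpdir: /srv/HDD_DATAS/tmp_backup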
 