[SOLVED] PVE Standalone - Losing WebUI & IO problems on rpool

Pifouney

Active Member
Oct 17, 2021
Hello there!


I am coming to you about an issue I have been running into for several months now...
Since it is only mildly blocking, I have so far worked around it by restarting the machine.

I manage a standalone server installed natively on ZFS.
The server has 64 GB of RAM and an Intel Core i7-7700K.

When the server has been running for four or five days, I end up with:
- loss of access to the web interface,
- a host OS that seems to accept some inputs, but not all, and in particular refuses to write to the /etc directory.

At that point, and from then on, my syslog starts to look like the attached file "bug_IO_srv.png".

I have already used the PVE update tools and set the internal clock to UTC...

I really have no idea how to solve this... Any idea is welcome :)
 

Attachments

  • bug_IO_srv.png (180.8 KB)
Hi,
At first glance, this seems like an issue with the cluster file system. Please check the logs with journalctl -u pve-cluster.service (the service is also relevant when the node is standalone). Please also post the output of pveversion -v.

Did you configure some HA services? The HA stack in PVE is designed for clusters. Does writing to all of /etc/ fail or just to /etc/pve?
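
For reference, the checks could look something like this (run as root on the PVE host; the write test is only an illustrative way to tell which path rejects writes, the file name is arbitrary):
Code:
journalctl -u pve-cluster.service -b
pveversion -v
# quick write test to tell /etc and /etc/pve apart
touch /etc/write_test && rm /etc/write_test
touch /etc/pve/write_test && rm /etc/pve/write_test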
 
Hello :)

Thanks for your time :)

Output of pveversion -v:
Code:
pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Did you configure some HA services? The HA stack in PVE is designed for clusters

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?

Does writing to all of /etc/ fail or just to /etc/pve?

I'm sure the I/Os fail when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event).
But I remember that the problem also exists in /etc.

Best regards,
 
Please check the logs with journalctl -u pve-cluster.service.
And also /var/log/syslog from around the time the issue occurred.

I really don't believe I made any HA-related modifications on this node ...
But is there a command to verify this?
I was wondering because pve-ha-lrm is in the screenshot, but apparently it's always running, just idling around if no HA is configured. You can check with ha-manager status. If it just says quorum OK and nothing else, you don't have HA configured.

I'm sure the I/Os fail when trying to access /etc/pve.
I cannot check immediately (I restarted the server 48 hours ago for a community event).
But I remember that the problem also exists in /etc.
Is /etc/ its own disk/partition/file system by any chance, or is it just part of the root file system?
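
A quick way to check could be, for example (findmnt and df are standard tools on Debian/PVE):
Code:
findmnt -T /etc
findmnt -T /etc/pve
df -h / /etc /etc/pve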
 
Hey :)

My system is acting up again ^^

Output of journalctl -u pve-cluster.service:
Code:
-- Logs begin at Mon 2021-10-18 16:18:37 CEST, end at Thu 2021-10-21 14:40:10 CEST. --
oct. 18 16:18:39 ns3855022 systemd[1]: Starting The Proxmox VE cluster filesystem...
oct. 18 16:18:40 ns3855022 systemd[1]: Started The Proxmox VE cluster filesystem.
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

My syslog while the issue is occurring:
Code:
Oct 21 14:41:00 ns3855022 pvestatd[1890]: status update time (9.079 seconds)
Oct 21 14:41:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:41:01 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:03 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:04 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:05 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:06 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:08 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:09 ns3855022 pvesr[12709]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:41:10 ns3855022 pvesr[12709]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:41:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:41:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:41:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:10 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:20 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:20 ns3855022 pvestatd[1890]: status update time (9.081 seconds)
Oct 21 14:41:22 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:27 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:30 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:30 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:32 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:37 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:40 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:40 ns3855022 pvestatd[1890]: status update time (9.077 seconds)
Oct 21 14:41:42 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:47 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:50 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:41:50 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:41:52 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:41:57 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:00 ns3855022 systemd[1]: Starting Proxmox VE replication runner...
Oct 21 14:42:00 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:00 ns3855022 pvestatd[1890]: status update time (9.073 seconds)
Oct 21 14:42:01 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:02 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:03 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:04 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:05 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:06 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:07 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:08 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:09 ns3855022 pvesr[13899]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 21 14:42:10 ns3855022 pvesr[13899]: cfs-lock 'file-replication_cfg' error: got lock request timeout
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 21 14:42:10 ns3855022 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 21 14:42:10 ns3855022 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 21 14:42:10 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:10 ns3855022 pvestatd[1890]: status update time (9.080 seconds)
Oct 21 14:42:12 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:17 ns3855022 pve-ha-lrm[2083]: unable to write lrm status file - unable to delete old temp file: Input/output error
Oct 21 14:42:21 ns3855022 pvestatd[1890]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Oct 21 14:42:21 ns3855022 pvestatd[1890]: status update time (9.080 seconds)

Output of ha-manager status:
Code:
quorum OK

After looking at the output of journalctl -u pve-cluster.service, I see that there is a disk space issue. I dug deeper into my syslog file and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010

This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Output of df -h:
Code:
Sys. de fichiers             Taille Utilisé Dispo Uti% Monté sur
udev                            32G       0   32G   0% /dev
tmpfs                          6,3G     58M  6,3G   1% /run
rpool/ROOT/pve-1               164G     43G  121G  27% /
tmpfs                           32G     46M   32G   1% /dev/shm
tmpfs                          5,0M       0  5,0M   0% /run/lock
tmpfs                           32G       0   32G   0% /sys/fs/cgroup
/dev/sda1                      3,7T    1,1T  2,6T  30% /srv/HDD_DATAS
rpool                          121G    128K  121G   1% /rpool
rpool/ROOT                     121G    128K  121G   1% /rpool/ROOT
rpool/data                     121G    128K  121G   1% /rpool/data
rpool/GamingPool               121G    128K  121G   1% /rpool/GamingPool
rpool/datas-ct                 121G    128K  121G   1% /rpool/datas-ct
rpool/data/subvol-101-disk-0    25G    681M   25G   3% /rpool/data/subvol-101-disk-0
rpool/data/subvol-103-disk-0    25G    486M   25G   2% /rpool/data/subvol-103-disk-0
/dev/fuse                       30M     20K   30M   1% /etc/pve

Is a temporary save directory created when running a backup task? If not, I really don't understand this situation :s

My backup saves a VM disk to my 4 TB HDD. The VM uses 1 TB of disk space, and lately its backup fails every time, from what I have seen....

Any idea please? :D

Thanks for your time :)
 
Code:
-- Logs begin at Mon 2021-10-18 16:18:37 CEST, end at Thu 2021-10-21 14:40:10 CEST. --
oct. 18 16:18:39 ns3855022 systemd[1]: Starting The Proxmox VE cluster filesystem...
oct. 18 16:18:40 ns3855022 systemd[1]: Started The Proxmox VE cluster filesystem.
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
oct. 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010


After looking at the output of journalctl -u pve-cluster.service, I see that there is a disk space issue. I dug deeper into my syslog file and ended up finding this:
Code:
Oct 20 14:47:01 ns3855022 systemd[1]: Started Proxmox VE replication runner.
Oct 20 14:47:17 ns3855022 pmxcfs[1756]: [database] crit: commit transaction failed: database or disk is full#010
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).
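
If you want to get a feel for the I/O pressure during the backup window, something like the following could help (zpool iostat ships with zfsutils, iostat with the sysstat package):
Code:
# pool-level I/O statistics, refreshed every 5 seconds
zpool iostat -v rpool 5
# per-device utilisation
iostat -x 5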

This coincides with the machine's backup schedule, and it is the backup of one VM that puts me in this situation... However, the machine still has free disk space.
Output of df -h:
Code:
Sys. de fichiers             Taille Utilisé Dispo Uti% Monté sur
udev                            32G       0   32G   0% /dev
tmpfs                          6,3G     58M  6,3G   1% /run
rpool/ROOT/pve-1               164G     43G  121G  27% /
tmpfs                           32G     46M   32G   1% /dev/shm
tmpfs                          5,0M       0  5,0M   0% /run/lock
tmpfs                           32G       0   32G   0% /sys/fs/cgroup
/dev/sda1                      3,7T    1,1T  2,6T  30% /srv/HDD_DATAS
rpool                          121G    128K  121G   1% /rpool
rpool/ROOT                     121G    128K  121G   1% /rpool/ROOT
rpool/data                     121G    128K  121G   1% /rpool/data
rpool/GamingPool               121G    128K  121G   1% /rpool/GamingPool
rpool/datas-ct                 121G    128K  121G   1% /rpool/datas-ct
rpool/data/subvol-101-disk-0    25G    681M   25G   3% /rpool/data/subvol-101-disk-0
rpool/data/subvol-103-disk-0    25G    486M   25G   2% /rpool/data/subvol-103-disk-0
/dev/fuse                       30M     20K   30M   1% /etc/pve
Please also share the output of df -ih and the log of the (presumably failed) backup task. Which disk is the VM on and which disk is the backup target?
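
Since all ZFS datasets share the pool's free space, df alone can be misleading; a pool-level view might also help, for example:
Code:
zpool list rpool
zfs list -o space rpool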

Is a temporary save directory created when running a backup task? If not, I really don't understand this situation :s
IIRC a temporary directory is only used for non-"stop mode" container backups if the storage doesn't support snapshots. And the default is not in /etc, but you can look at the tmpdir setting of your /etc/vzdump.conf to be sure.

My backup saves a VM disk to my 4 TB HDD. The VM uses 1 TB of disk space, and lately its backup fails every time, from what I have seen....

Any idea please? :D

Thanks for your time :)
 
Hello :)
Since you say this coincides with scheduled backups, I wouldn't rule out that the commit actually fails for a different reason (maybe too much load?).
Backups are scheduled at a time with very light load :)

Output of df -ih:
Code:
root@ns3855022:~# df -ih
Sys. de fichiers             Inœuds IUtil. ILibre IUti% Monté sur
udev                           7,9M    541   7,9M    1% /dev
tmpfs                          7,9M    869   7,9M    1% /run
rpool/ROOT/pve-1               243M    96K   242M    1% /
tmpfs                          7,9M    113   7,9M    1% /dev/shm
tmpfs                          7,9M     15   7,9M    1% /run/lock
tmpfs                          7,9M     18   7,9M    1% /sys/fs/cgroup
/dev/sda1                      373M     50   373M    1% /srv/HDD_DATAS
rpool                          242M     10   242M    1% /rpool
rpool/data                     242M      8   242M    1% /rpool/data
rpool/ROOT                     242M      7   242M    1% /rpool/ROOT
rpool/GamingPool               242M      6   242M    1% /rpool/GamingPool
rpool/datas-ct                 242M      6   242M    1% /rpool/datas-ct
rpool/data/subvol-103-disk-0    50M    27K    50M    1% /rpool/data/subvol-103-disk-0
rpool/data/subvol-101-disk-0    49M    46K    49M    1% /rpool/data/subvol-101-disk-0
/dev/fuse                      9,8K     34   9,8K    1% /etc/pve

The log of the failed backup is in the attached file :D

In the case of this container, the virtual disk and the share repository are on the same disk (the 4 TB one). It's a Nextcloud container.

Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D

Thanks for your answers :)
 

Attachments

  • failed_save.png (110.5 KB)
Code:
cat /etc/vzdump.conf
# vzdump default settings

tmpdir: /var/lib/vz/tmp_backup

Can I move this directory? Or mount a larger file system on it? :D
Yes. You can also just comment out the line. By default vzdump should use the backup storage itself (except for PBS).
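
As a sketch, /etc/vzdump.conf could then look like this (the /srv/HDD_DATAS path is only taken from the df output above; any file system with enough free space would do):
Code:
# vzdump default settings

# either comment out the custom tmpdir so vzdump falls back to its default...
#tmpdir: /var/lib/vz/tmp_backup
# ...or point it at a file system with plenty of free space, e.g. the 4 TB HDD
# (create the directory first: mkdir -p /srv/HDD_DATAS/tmp_backup)
tmpdir: /srv/HDD_DATAS/tmp_backup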
 