Proxmox reporting empty disk reads/writes for LXC containers

For my case, I can add that the configs for the old and new LXCs I've created are basically identical, but somehow the older one doesn't log any activity produced by fio, while the new one does.

Not sure where to go from here; let me know if there's something else I can check.
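For anyone else reproducing this: if fio isn't installed in the CT, a short dd burst with a flush generates writes that the Disk IO graph should pick up. This is just a sketch; the target path is illustrative, so point it at the mountpoint whose stats you are checking.

```shell
# Generate ~16 MiB of flushed writes so pvestatd has IO activity to record.
# TARGET is illustrative; set it to a file on the filesystem under test.
TARGET="${TARGET:-/tmp/io-probe.bin}"
dd if=/dev/zero of="$TARGET" bs=1M count=16 conv=fsync 2>/dev/null
wc -c "$TARGET"   # confirm the probe file was actually written
rm -f "$TARGET"
```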
 
Sorry for hijacking your thread @freak12techno; since our issue was very similar, I was hoping that adding my experience would spark an idea and help them find where a bug could be lying in the system. If you followed my experience, it stopped all of a sudden after physically upgrading the RAM, which has puzzled me since nothing changed at the software level. Hopefully they will be able to find the bug and push an update.

Cheers!
 
Actually we have somewhat similar issues, so I am totally ok with you providing more context.

One thing that came to mind: I also had a RAM upgrade at some point (I have 3 servers, all of which have the same problem; on some of the servers I added regular DDR4 RAM, while on others I did not). Not sure if it might be the problem, as I do not remember exactly when I upgraded the RAM and whether it matches the disk IO metrics disappearing. @Chris do you think it might be possible, and if yes, how can I debug it further?
 
Thanks @freak12techno, appreciate that. To add a little bit of context on my end about the RAM for this thread, I've replaced mine as follows:

Old
16GB Registered ECC RDIMM, DDR4-2400MHz
New
128GB Registered ECC RDIMM, DDR4-2400MHz

Edit: typo
 
For context on my side: I did not replace RAM, I added more.
Before: 2x32GB DDR4
After: 4x32GB DDR4

Also I installed some NVMe drives into these servers; not sure if that may cause issues as well.

I am not yet sure if that is the reason for this problem, though, and not sure yet how to verify it.
 
Hi all,
Jumping in here to try to resolve a similar issue on my setup. I recently upgraded my server from PVE7 to PVE8 and my Disk IO graphs are no longer showing real numbers...

What is special with my CT is that I am using a CT mount. Not sure if that matters or if others are using CT mounts as well.
When I change the time range from "Week" to either "Hour" or "Day" I do get some very low stats - these may be my root disk stats. But why the CT mount stats are no longer present after the upgrade from PVE 7 to 8 - I have no idea... I do miss the stats though, they were useful.
 

Attachments

  • pve7to8.jpg (33.7 KB)
Also, the same has happened on all of my CTs. Some are Debian 12, some are Debian 11 (they were like that before upgrading to PVE 8, and everything was shown properly until the upgrade).
 
Digging a bit deeper - I just checked a completely different server in my office that I also upgraded from PVE 7 to PVE 8... This server I use to store backups.

There too, wherever I have a CT with a mountpoint -> Disk IO doesn't include the stats from the mountpoints.
I have 2 LXC CTs there. Both are for backups. I should be seeing spikes because I'm receiving backups at a rate of 100 MB/s many times during the day, but the Disk IO graph is not showing that IO at all.
 
Here you can see more detail. This is one of the CTs in my office (both screenshots). The first shows the network receiving a lot of data at rates of 100 MB/s. This data is all written to a disk mounted inside this same CT. And look at the second screenshot - some reads are shown (these could perhaps be backup verify jobs), but no writes at all?
 

Attachments

  • pve8-net.jpg (35.8 KB)
  • pve8-disk.jpg (33.2 KB)
Hi,
please also share the output of pct config <VMID> for one of the affected LXCs and your storage config (cat /etc/pve/storage.cfg), this might help in finding a common pattern and in trying to reproduce the issue.

Also, please try to check if the issue is related to the kernel you are running, by e.g. booting an older one and seeing if the issue persists.

Edit: Please also have a look at https://bugzilla.proxmox.com/show_bug.cgi?id=2135 in case you are using ZFS.
 
UPDATE: I came back from a short vacation last night, and this morning, when I sat down at my desk to look at the dashboard, I noticed that all of them were reporting again. I checked to see if anything happened while I was away, but there was no report in the log. I can't explain it...
 
please also share your pct config <VMID> for one of the affected LXCs and your storage config cat /etc/pve/storage.cfg, this might help in finding a common pattern and in trying to reproduce the issue.

Here you go

CT config:
Code:
arch: amd64
cores: 4
features: nesting=1
hostname: veeam-repo
memory: 4096
mp0: pool16:subvol-102-disk-0,mp=/mnt/veeam-repo1,mountoptions=lazytime,size=8T
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=<redacted>,ip=dhcp,type=veth
onboot: 1
ostype: debian
rootfs: local-lvm:vm-102-disk-0,mountoptions=noatime,size=16G
startup: order=2
swap: 512
tags: Linux;debian12
unprivileged: 1

storage.cfg
Code:
dir: local
        path /var/lib/vz
        content iso,backup,vztmpl

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

pbs: pbs-ibm-local
        datastore backups-local
        server 127.0.0.1
        content backup
        fingerprint <redacted>
        namespace PVE-IBM
        nodes pve-ibm
        prune-backups keep-all=1
        username <redacted>

zfspool: pool16
        pool pool16
        content images,rootdir
        mountpoint /pool16
        nodes pve-ibm

dir: local-stuff
        path /pool16/stuff
        content vztmpl,snippets,iso,backup
        prune-backups keep-all=1
        shared 0
 
Also, please try to check if the issue is related to the kernel you are running on, by e.g. booting an older one and see if the issue persists.

I'm afraid I removed all previous kernel versions. My upgrade was from PVE7 (latest) to PVE8 (latest).

Please also have a look at https://bugzilla.proxmox.com/show_bug.cgi?id=2135 in case you are using ZFS.

I am using ZFS here. Not for rootfs, but for storage (the CT mountpoint). And it's a separate ZFS dataset, so not a block device (it's not a VM).
Reading through the link I found this:
The reason for this is that we do get the diskwrite/diskread (PVE::LXC::vmstatus) values for containers from the blkio cgroup controller - and a container on ZFS does not have its own block-device (it's a ZFS-dataset not a zvol).

But this is a normal use case: an LXC CT with a mounted ZFS dataset. The dataset was even created through the PVE GUI - that is what PVE actually does when you attach a new disk from a ZFS pool to a CT. So, a perfectly normal configuration. Can this not be fixed?
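To see what the quoted mechanism actually reads, you can look at the container cgroup's io.stat file on the PVE host. The sketch below sums rbytes/wbytes from it; on a ZFS dataset there is no backing block device, so nothing will be accounted there. The cgroup path and VMID are assumptions (cgroup v2 unified layout), so adjust for your setup.

```shell
# Sum rbytes/wbytes across all devices listed in a cgroup-v2 io.stat file.
# Each line looks like: "253:0 rbytes=12345 wbytes=67890 rios=4 wios=8 ..."
parse_io_stat() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "rbytes") r += kv[2]
      else if (kv[1] == "wbytes") w += kv[2]
    }
  } END { printf "rbytes=%d wbytes=%d\n", r, w }' "$1"
}

# Assumed path for CT 102 under the unified cgroup hierarchy; adjust VMID.
STAT=/sys/fs/cgroup/lxc/102/io.stat
[ -r "$STAT" ] && parse_io_stat "$STAT" || echo "no io.stat at $STAT"
```

If the file exists but stays at zero while the CT is clearly writing, that would match the dataset-without-block-device explanation from the bug report.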

...
The above info is for my office backup server. I upgraded that one to PVE 8 a long time ago, so I can't say with certainty whether the Disk IO graph was ok before the upgrade to PVE 8, because I don't remember.

But, in my home environment I did the upgrade just recently, so in this case I have a screenshot with graphs showing proper data before the upgrade to PVE 8. I don't have ZFS there (well, actually I do have ZFS on that machine also, but the CT where I first noticed the issue doesn't use ZFS). This is what I have (on my home "server"):

storage:
Code:
dir: local
        path /var/lib/vz
        content iso,vztmpl
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

zfspool: local-zfs
        pool zfs_mirror
        content rootdir,images
        mountpoint /zfs_mirror
        sparse 1

pbs: local-pbs
        datastore opric-datastore
        server 127.0.0.1
        content backup
        fingerprint <redacted>
        namespace local-pve
        nodes pve
        prune-backups keep-all=1
        username local-bu@pbs

dir: wd6000-1
        path /mnt/pve/wd6000-1
        content rootdir,backup,snippets,vztmpl,iso,images
        is_mountpoint 1
        nodes pve

The CT has Frigate NVR in it and writes to disk at a rate of 1.67 MB/s all the time, 24/7.
Code:
arch: amd64
cores: 2
features: nesting=1
hostname: frigate
memory: 2048
mp1: /mnt/pve/wd6000-1/ct-mounts/nvr-space,mp=/mnt/nvr-space,mountoptions=lazytime
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.10.62.254,hwaddr=96:A6:30:F6:E7:43,ip=10.10.62.8/24,type=veth
net1: name=eth1,bridge=vmbr0,firewall=1,hwaddr=3A:45:37:F6:77:2C,ip=192.168.40.8/24,tag=40,type=veth
onboot: 1
ostype: debian
rootfs: local-lvm:vm-103-disk-0,mountoptions=noatime;lazytime,size=8G
swap: 512
tags: debian12
unprivileged: 1
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 189:* rwm # usb (coral)
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/bus/usb/002 dev/bus/usb/002 none bind,optional,create=dir
lxc.mount.entry: /dev/bus/usb/003 dev/bus/usb/003 none bind,optional,create=dir
lxc.mount.entry: /dev/bus/usb/004 dev/bus/usb/004 none bind,optional,create=dir
lxc.idmap: g 0 100000 44
lxc.idmap: u 0 100000 44
lxc.idmap: g 44 44 1
lxc.idmap: u 44 44 1
lxc.idmap: g 45 100045 60
lxc.idmap: u 45 100045 60
lxc.idmap: g 105 103 1
lxc.idmap: u 105 103 1
lxc.idmap: g 106 100106 65430
lxc.idmap: u 106 100106 65430

A bit more complex config, because I have some devices passed through and the CT is unprivileged.

This CT does have docker in it, because that's how Frigate (the software NVR) is supposed to be run. I know PVE doesn't support or advise running docker in LXC, but I'm not running it on ZFS and haven't really had any issues in... almost 2 years, I think.

In any case, before the upgrade from PVE7 to PVE8 I was seeing writes in the graph, and now they no longer show.
The storage backing the NVR, as you can see, is a simple directory (ext4 on a single 6 TB HDD). No ZFS.

(edited typos)
 
I've experienced a lack of (or rather lower than expected) disk IO stats for any CT that was running docker, be it newly created on PVE8 or carried over from a previous install. I thought perhaps pvestatd was seeing the read/write activity for the CT root mount but not in the docker overlay fs.
 
How many of the containers (not having the rootfs or mountpoints on ZFS) in this thread are running a docker runtime within them? Can it be excluded that the issue is related to that?
As stated in the thread already linked in the first post, that might cause such issues (linking it here again for reference: https://forum.proxmox.com/threads/disk-i-o-graph-is-empty.134728/).
 
None of mine are running docker inside.
 
There seem to be mixed issues in this thread, leading to missing graphs.

I am able to reproduce the issue with a docker runtime running within the LXC. I suspect that the docker runtime interfering with the container is the root cause of most of the reported IO graphing issues here.

None of mine are running docker inside.
Your particular case seems to be different as you stated you see part of the IO statistics, please verify the status of systemctl status pvestad and check your systemd journal on the Proxmox VE host for errors, journalctl -b -r.
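One way the docker interference could plausibly show up: docker re-parents the container's processes into nested cgroups, so IO may be accounted below the level where the top-level io.stat is read. A hedged sketch to cross-check this is to sum wbytes over the whole cgroup subtree instead of one file. The path and VMID are assumptions; adjust to your host's cgroup layout.

```shell
# Sum wbytes across every io.stat in a cgroup subtree, so writes accounted
# in docker's nested child cgroups are included in the total.
sum_wbytes() {
  find "$1" -name io.stat -exec cat {} + 2>/dev/null |
    tr ' ' '\n' | grep '^wbytes=' | cut -d= -f2 |
    awk '{ total += $1 } END { printf "total wbytes=%d\n", total }'
}

# Assumed path for CT 103; prints "total wbytes=0" if the path is absent.
sum_wbytes /sys/fs/cgroup/lxc/103
```

If the subtree total is non-zero while the graph stays flat, the IO exists but is simply being accounted where pvestatd doesn't look.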
 
Your particular case seems to be different as you stated you see part of the IO statistics, please verify the status of systemctl status pvestad and check your systemd journal on the Proxmox VE host for errors, journalctl -b -r.
Hi Chris,

I'm assuming you meant pvestatd? Here's the output

systemctl status pvestatd
Code:
pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Wed 2024-02-28 13:46:40 AST; 1 week 1 day ago
    Process: 3934184 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
   Main PID: 6764 (pvestatd)
      Tasks: 1 (limit: 154358)
     Memory: 92.8M
        CPU: 4h 32min 38.045s
     CGroup: /system.slice/pvestatd.service
             └─6764 pvestatd

Mar 04 08:49:04 GLRPVESRV pvestatd[6764]: auth key pair too old, rotating..
Mar 05 08:49:06 GLRPVESRV pvestatd[6764]: auth key pair too old, rotating..
Mar 05 11:48:55 GLRPVESRV pvestatd[6764]: modified cpu set for lxc/101: 1
Mar 06 12:26:16 GLRPVESRV pvestatd[6764]: modified cpu set for lxc/103: 4,7
Mar 07 10:21:08 GLRPVESRV systemd[1]: Reloading pvestatd.service - PVE Status Daemon...
Mar 07 10:21:09 GLRPVESRV pvestatd[3934184]: send HUP to 6764
Mar 07 10:21:09 GLRPVESRV pvestatd[6764]: received signal HUP
Mar 07 10:21:09 GLRPVESRV pvestatd[6764]: server shutdown (restart)
Mar 07 10:21:09 GLRPVESRV systemd[1]: Reloaded pvestatd.service - PVE Status Daemon.
Mar 07 10:21:10 GLRPVESRV pvestatd[6764]: restarting server

The following is the only error I could find going back 24 hrs... It has been covered by some threads on this forum (i.e. https://forum.proxmox.com/threads/lxc-apparmor-denied-operation-mount-error-13.98441/ & https://forum.proxmox.com/threads/apparmor-denied-operation-mount.34340/), but none of them reported this particular error, from what I've read.

journalctl -b -r
Code:
Mar 08 00:00:18 GLRPVESRV kernel: audit: type=1400 audit(1709870418.481:127): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name=">
Mar 08 00:00:18 GLRPVESRV audit[2723]: AVC apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2723 c>
 
Docker might be interfering, but only on PVE8 - it wasn't interfering on PVE7.

Also, on my backup server in the office there is no docker. I posted my config above. Root is on LVM-thin and there is a ZFS mountpoint; that's where most writes go.
 
