[SOLVED] "Short read on command socket" error when running backup of LXC containers

avggeek

Member
Jul 12, 2020
Hi,

After some recent updates to my Proxmox nodes, I'm finding that all backup jobs for LXC containers are failing with the error "short read on command socket (16 != 0)". Here is an example output:

Code:
INFO: starting new backup job: vzdump 102 --mode snapshot --storage shared-storage --compress zstd
INFO: Starting Backup of VM 102 (lxc)
INFO: Backup started at 2021-01-25 10:20:03
INFO: status = running
INFO: CT Name: server.lab
INFO: including mount point rootfs ('/') in backup
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: server.lab
INFO: including mount point rootfs ('/') in backup
INFO: starting first sync /proc/1411/root/ to /var/tmp/vzdumptmp12173_102/
INFO: first sync finished - transferred 3.17G bytes in 93s
INFO: suspending guest
ERROR: Backup of VM 102 failed - short read on command socket (16 != 0)
INFO: Failed at 2021-01-25 10:21:37
INFO: Backup job finished with errors

TASK ERROR: job errors

The LXC drives are on an NFS volume mounted to the host, and the same NFS volume is the backup target.

I have tried restarting the host as well as the NFS server, but the error persists.

Output of pveversion:

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.16-1~bpo10+1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 8.0-2~bpo10+1
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-2~bpo10+1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.2-1~bpo10+1
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Any help on this would be appreciated as all my LXC backups are failing :-(
 

avggeek

Hello,

Could you try restoring one LXC to a different storage?

Hi Moayad,

I did the following steps:

- Shut down the running LXC.
- Restored the last backup available to me (from Jan 23rd) onto different storage. In this case, it was a partition on the NVMe boot drive on the same host.
- Restarted the LXC and verified it is working correctly.
- Reran the same "vzdump" command from before. It again failed with the same "short read on command socket (16 != 0)" error.

Do you have any suggestions on how to debug this further?
 

wbumiller

Proxmox Staff Member
Staff member
Jun 23, 2015
The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

-) Can you post the output of the following commands (assuming `102` is a running container you're trying to back up)?

Check container's cgroup info:
Code:
cat /proc/$(lxc-info -p 102 | awk '{print $2}')/cgroup

Check if lxc can successfully query the cgroup directories:
Code:
lxc-cgroup 102 freezer.state

Check if our query succeeds:
Code:
perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'

Edit:
If all but the last one work, run the last one with `strace -f ` prepended to the command.
 

avggeek

The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

I have not made any cgroup-specific changes.

-) Can you post the output of the following commands (assuming `102` is a running container you're trying to back up)?

Check container's cgroup info:
Code:
cat /proc/$(lxc-info -p 102 | awk '{print $2}')/cgroup
Output is:

Code:
0::/lxc/102/ns/init.scope

Check if lxc can successfully query the cgroup directories:
Code:
lxc-cgroup 102 freezer.state

This command failed :-(

Output:
Code:
lxc-cgroup: 102: tools/lxc_cgroup.c: main: 125 Failed to retrieve value of 'freezer.state' for '/var/lib/lxc:102'

Check if our query succeeds:
Code:
perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'

This failed with the same "short read on command socket" error:

Code:
short read on command socket (16 != 0)
 

avggeek

The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

I decided to check this again and so I ran the following command:

Code:
# grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2

From what I've read, if cgroup v2 is not enabled on the host, that command should return only one entry. In addition, when I ran "mount", I could see this entry in the output:

Code:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
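Another quick way to confirm which hierarchy is actually in effect (a sketch; `stat -fc %T` prints the filesystem type of the standard mount point):

```shell
# Print the filesystem type mounted at /sys/fs/cgroup.
# "cgroup2fs" indicates a pure cgroup v2 (unified) hierarchy;
# "tmpfs" usually indicates the legacy/hybrid layout with
# per-controller cgroup v1 mounts beneath it.
stat -fc %T /sys/fs/cgroup
```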

After some searching, I came across this Debian Wiki page, which mentions a systemd parameter. That helped me narrow down the problem. I use Debian Backports, and last week I got an apt-listchanges alert which included the following message:

Code:
systemd (247.2-2) unstable; urgency=medium

  systemd now defaults to the "unified" cgroup hierarchy (i.e. cgroupv2).
  This change reflects the fact that cgroupsv2 support has matured
  substantially in both systemd and in the kernel.

There's another paragraph which states:

Code:
If you run into problems with cgroupv2, you can switch back to the previous,
  hybrid setup by adding "systemd.unified_cgroup_hierarchy=false" to the
  kernel command line.

I immediately added that parameter on the Proxmox host I was having the problem on, ran update-grub, and rebooted. Once I did that, I got the following output:

Code:
# lxc-cgroup 102 freezer.state
THAWED

Code:
# perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'
//lxc/102

The vzdump command also runs successfully. Time to go make the change on my other Proxmox nodes!
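For anyone following along, the change itself looks roughly like this (a sketch, assuming the host boots via GRUB with the stock /etc/default/grub; the sed one-liner is just an illustration, and editing the file by hand works equally well):

```shell
# Back up the config, then append the parameter to the default
# kernel command line inside the opening quote.
cp /etc/default/grub /etc/default/grub.bak
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=false /' /etc/default/grub

# Regenerate the GRUB config; the change takes effect after a reboot.
update-grub
reboot
```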
 

wbumiller

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.
 

avggeek

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.
Thanks for the info, Wolfgang! I will keep an eye out for the update to pve-container and test it without the kernel parameter.
 

avggeek

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.

I can confirm that as of "pve-container: 3.3-3" I could remove the additional boot parameter and still successfully run backups.
 
