[SOLVED] "Short read on command socket" error when running backup of LXC containers

avggeek

Member
Jul 12, 2020
Hi,

After some recent updates to my Proxmox nodes, I'm finding that all backup jobs for LXC containers are failing with the error "short read on command socket (16 != 0)". Here is an example output:

Code:
INFO: starting new backup job: vzdump 102 --mode snapshot --storage shared-storage --compress zstd
INFO: Starting Backup of VM 102 (lxc)
INFO: Backup started at 2021-01-25 10:20:03
INFO: status = running
INFO: CT Name: server.lab
INFO: including mount point rootfs ('/') in backup
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: server.lab
INFO: including mount point rootfs ('/') in backup
INFO: starting first sync /proc/1411/root/ to /var/tmp/vzdumptmp12173_102/
INFO: first sync finished - transferred 3.17G bytes in 93s
INFO: suspending guest
ERROR: Backup of VM 102 failed - short read on command socket (16 != 0)
INFO: Failed at 2021-01-25 10:21:37
INFO: Backup job finished with errors

TASK ERROR: job errors

The LXC drives are on an NFS volume mounted to the host, and the same NFS volume is the backup target.

I have tried restarting the host as well as the NFS server, but the error persists.

Output of pveversion:

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.16-1~bpo10+1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 8.0-2~bpo10+1
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-2~bpo10+1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.2-1~bpo10+1
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Any help on this would be appreciated as all my LXC backups are failing :-(
 

avggeek

Hello,

Could you try restoring one LXC to a different storage?

Hi Moayad,

I did the following steps:

- Shut down the running LXC.
- Restored the last backup available to me (from Jan 23rd) onto different storage. In this case, it was a partition on the NVMe boot drive on the same host.
- Restarted the LXC and verified it is working correctly.
- Reran the same "vzdump" command from before. It again failed with the same "short read on command socket (16 != 0)" error.

Do you have any suggestions on how to debug this further?
 

wbumiller

Proxmox Staff Member
Staff member
Jun 23, 2015
The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

-) Can you post the output of the following commands (assuming `102` is a running container you're trying to back up)?

Check container's cgroup info:
Code:
cat /proc/$(lxc-info -p 102 | awk '{print $2}')/cgroup

Check if lxc can successfully query the cgroup directories:
Code:
lxc-cgroup 102 freezer.state

Check if our query succeeds:
Code:
perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'

Edit:
If all but the last one work, run the last one with `strace -f ` prepended to the command.
 

avggeek

The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

I have not made any cgroup-specific changes.

-) Can you post the output of the following commands (assuming `102` is a running container you're trying to back up)?

Check container's cgroup info:
Code:
cat /proc/$(lxc-info -p 102 | awk '{print $2}')/cgroup
Output is:

Code:
0::/lxc/102/ns/init.scope

Check if lxc can successfully query the cgroup directories:
Code:
lxc-cgroup 102 freezer.state

This command failed :-(

Output:
Code:
lxc-cgroup: 102: tools/lxc_cgroup.c: main: 125 Failed to retrieve value of 'freezer.state' for '/var/lib/lxc:102'

Check if our query succeeds:
Code:
perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'

This failed with the same "short read on command socket" error:

Code:
short read on command socket (16 != 0)
 

avggeek

The step it fails at is the freeze step, which happens via cgroups, where we first connect to the container's monitor to query the exact cgroup paths.

-) Have you made any cgroup-specific changes on your host (e.g. switched to cgroup v2)?

I decided to check this again and so I ran the following command:

Code:
# grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2

From what I've read, if cgroup v2 is not enabled on the host, that command should return only one entry. In addition, when I ran "mount", I could see this entry in the output:

Code:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
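Another quick way to confirm which hierarchy is actually in effect (a sketch; `stat -fc %T` prints the filesystem type of the standard mount point):

```shell
# Print the filesystem type mounted at /sys/fs/cgroup.
# "cgroup2fs" indicates a pure cgroup v2 (unified) hierarchy;
# "tmpfs" usually indicates the legacy/hybrid layout with
# per-controller cgroup v1 mounts beneath it.
stat -fc %T /sys/fs/cgroup
```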

After some searching, I came across this Debian Wiki page, which mentions a systemd parameter. That helped me narrow down the problem. I use Debian Backports, and last week I got an apt-listchanges alert which included the following message:

Code:
systemd (247.2-2) unstable; urgency=medium

  systemd now defaults to the "unified" cgroup hierarchy (i.e. cgroupv2).
  This change reflects the fact that cgroupsv2 support has matured
  substantially in both systemd and in the kernel.

There's another paragraph which states:

Code:
If you run into problems with cgroupv2, you can switch back to the previous,
  hybrid setup by adding "systemd.unified_cgroup_hierarchy=false" to the
  kernel command line.

I immediately added that parameter on the Proxmox host I was having the problem on, ran update-grub, and rebooted. Once I did that, I got the following output:

Code:
# lxc-cgroup 102 freezer.state
THAWED

Code:
# perl -e 'use PVE::LXC::Command; print PVE::LXC::Command::get_cgroup_path(102, "freezer", 1),"\n";'
//lxc/102

The vzdump command also runs successfully. Time to go make the change on my other Proxmox nodes!
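For anyone following along, the change itself looks roughly like this (a sketch, assuming the host boots via GRUB with the stock /etc/default/grub; the sed one-liner is just an illustration, and editing the file by hand works equally well):

```shell
# Back up the config, then append the parameter to the default
# kernel command line inside the opening quote.
cp /etc/default/grub /etc/default/grub.bak
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=false /' /etc/default/grub

# Regenerate the GRUB config; the change takes effect after a reboot.
update-grub
reboot
```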
 

wbumiller

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.
 

avggeek

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.
Thanks for the info, Wolfgang! I will keep an eye out for the update to pve-container and test it without the kernel parameter.
 

avggeek

There's already code for the v2 freezer, but it's currently not being used, and apparently lxc doesn't provide any path at all when the host is already on a pure cgroup v2 setup and is queried explicitly for the "unified" cgroup. This will be fixed with the next pve-container update.

I can confirm that as of "pve-container: 3.3-3" I could remove the additional boot parameter and still successfully run backups.
 
