[SOLVED] New Server works but replication is not - .bashrc was the culprit

liszca

I tried to replicate one of my big LXC containers to the new Proxmox host, and it refuses to do so with an error message:

Code:
2025-07-13 00:39:01 110-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=node3' -o 'UserKnownHostsFile=/etc/pve/nodes/node3/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.0.0.3 -- pvesr prepare-local-job 110-0 local-zfs-data1:subvol-110-disk-0 local-zfs-data1:subvol-110-disk-1 local-zfs-data1:subvol-110-disk-4 --last_sync 0' failed: malformed number (leading zero must not be followed by another digit), at character offset 2 (before "0:39:01 up 37 min,  ...") at /usr/share/perl5/PVE/Replication.pm line 128.

What I did so far:
- Is replication working between the old hosts? - yes
- chronyc sources - the router is the time server and it looks fine
- Replication schedules can't be deleted in the GUI - the console helps out there

I am confused about "leading zero must not be followed by another digit".

Are the SSH keys not distributed correctly? How can I check? There is a key for the new host in .ssh/authorized_keys.

So I am out of ideas where to look for my mistake.
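
To see exactly what the JSON parser is choking on, the failing remote call can be copied out of the error and run by hand (it is the same call PVE makes on every replication attempt; options and IDs taken verbatim from the log above):
Code:
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=node3' \
    -o 'UserKnownHostsFile=/etc/pve/nodes/node3/ssh_known_hosts' \
    -o 'GlobalKnownHostsFile=none' root@10.0.0.3 -- \
    pvesr prepare-local-job 110-0 local-zfs-data1:subvol-110-disk-0 \
    local-zfs-data1:subvol-110-disk-1 local-zfs-data1:subvol-110-disk-4 \
    --last_sync 0
Anything printed before the JSON object is what trips up the parser.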
 
Are the SSH keys not distributed correctly? How can I check?
Maybe. I've been there when I changed the members of an existing cluster.

Your nodes have names. You did not tell us anything about your cluster, so let's assume there are three nodes named pveh / pvei / pvej. Now this must run without any prompt or error message:
Code:
~# for HOST in pveh pvei pvej ; do ssh root@$HOST whoami; done
root
root
root
Run the above on all three nodes - running on only one node is NOT sufficient.

Check /etc/hosts on all nodes. That file must contain correct and identical information about those hosts. (Assuming there is no full-blown local DNS server in the background.)
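
A quick way to compare that file across the same (assumed) node names:
Code:
for HOST in pveh pvei pvej ; do ssh root@$HOST md5sum /etc/hosts ; done
Identical checksums on all three nodes mean the files match.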
 
I think the problem isn't coming from SSH. I logged into the server without my .ssh/config, i.e. with ssh -F /dev/null - that's how I logged into PVE.

From there I checked SSH accessibility and found no issue.

I looked at this message several times, and the "character offset" can also be 3:
Code:
at character offset 2 (before "0:39:01 up 37 min,  ...") at

Another example:
Code:
failed: garbage after JSON object, at character offset 3 (before ":36:02 up 4 days, 18...") at
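
Since the "garbage" in front of the JSON ("0:39:01 up 37 min", ":36:02 up 4 days") looks like something the remote side prints on its own, a quick test is to run a no-op over SSH against the target (node3 here); it should produce no output at all:
Code:
ssh -o BatchMode=yes root@node3 /bin/true
ssh -o BatchMode=yes root@node3 /bin/true | od -c   # makes any stray bytes visible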

Looking at locale, it looks like this:
Code:
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So date and time format is the same on all nodes.
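
For completeness, a loop to compare the locale on all nodes (node names as they appear later in this thread; note that a non-interactive SSH session may report a different locale than an interactive login):
Code:
for HOST in node1 node2 node3 ; do echo "== $HOST" ; ssh root@$HOST locale ; done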

Checked for ZFS pools having the expected name - Looks good
 
It's weird: I can migrate a container/VM just fine, but not replicate it.
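
For more detail than the GUI shows, the job can also be triggered from the CLI on the source node - a sketch using the job ID 110-0 from the error log (see pvesr help for the exact options):
Code:
pvesr status                     # list replication jobs and their last state
pvesr run --id 110-0 --verbose   # run the failing job now and print its output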
 
Can you please post the storage configuration of both nodes?

For node{1..3}:
Code:
zfspool: local-zfs
        disable
        pool rpool/data
        content images,rootdir
        sparse 0

zfspool: local-zfs-data1
        pool data1
        content rootdir,images
        mountpoint /data1
        sparse 1

dir: local-backup-data1
        disable
        path /data1
        content backup
        prune-backups keep-last=4
        shared 0

dir: local
        disable
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

dir: local-zfs-dir
        path /data1/nfs/proxmox
        content vztmpl,iso,backup
        prune-backups keep-all=1
        shared 0

zfspool: dpool
        pool dpool
        content images,rootdir
        mountpoint /dpool
        nodes node2,node1

Everything is the same except for dpool (diskpool, a horrible thing from the past ;)).
 
How much space is available in each pool on each node? Please paste the results of zpool status and zfs list
 
zfspool: local-zfs-data1
I'm assuming you are using this storage ONLY for replication - since it is the only active zfspool that is local & available on all nodes (unlike zfspool: dpool, which is restricted to node1/node2, & the other local-zfs pool, which is disabled).

So this seems unclear to me:
Checked for ZFS pools having the expected name - Looks good
 
How much space is available in each pool on each node? Please paste the results of zpool status and zfs list
node3 is the target, and it has sufficient space.

Code:
node3 # zpool status
  pool: data1
 state: ONLINE
  scan: scrub repaired 0B in 00:01:00 with 0 errors on Sun Jul 13 00:25:01 2025
config:

        NAME                                                      STATE     READ WRITE CKSUM
        data1                                                     ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            nvme-Patriot_P400L_1000GB_P400LWCBB25010902147-part4  ONLINE       0     0     0
            nvme-Patriot_P400L_1000GB_P400LWCBB25010901194-part4  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Sun Jul 13 00:24:04 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0

errors: No known data errors

Code:
node3 # zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
data1                42.7G   841G    27K  /data1
data1/vm-131-disk-0  83.5K   841G  83.5K  -
data1/vm-131-disk-1    19K   841G    19K  -
data1/vm-131-disk-2  42.6G   841G  42.6G  -
rpool                2.67G  11.4G   104K  /rpool
rpool/ROOT           2.66G  11.4G    96K  /rpool/ROOT
rpool/ROOT/pve-1     2.66G  11.4G  2.66G  /
rpool/data             96K  11.4G    96K  /rpool/data
rpool/var-lib-vz       96K  11.4G    96K  /var/lib/vz
 
Follow this guide for Storage Replication.

Quoting from the guide's possible issues:
  • Storage with the same storage ID is not available on the target node.

So the VM you are trying to replicate must have all of its disks stored on a zfspool whose storage ID exists under the same name on both the source & target nodes in Proxmox (& is active on both).

If you provide the output of cat /etc/pve/storage.cfg from ALL three nodes plus the <vmid>.conf file of the VM you want to replicate, you will probably get more help.
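
A quick sanity check that the storage ID is also active on the target node (storage ID and node name taken from this thread):
Code:
pvesm status | grep local-zfs-data1                  # on the source node
ssh root@node3 pvesm status | grep local-zfs-data1   # on the target node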
 
lxc config:
Code:
arch: amd64
cores: 4
features: nesting=1
hostname: nextcloud.der-space.prod
memory: 12288
mp0: local-zfs-data1:subvol-110-disk-1,mp=/var/lib/nextcloud-data,backup=1,size=446G
mp1: local-zfs-data1:subvol-110-disk-4,mp=/tmp,backup=1,size=12G
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=7E:CF:EB:43:42:5F,ip=dhcp,type=veth
onboot: 1
ostype: debian
rootfs: local-zfs-data1:subvol-110-disk-0,size=12G
swap: 0
unprivileged: 1


The nodes' storage.cfg, without the Proxmox Backup Server storage:
Code:
root@node3.der-space.prod:/root
 # awk '/^pbs:/ {flag = 1; next; }/^$/ {flag = 0;} !flag' /etc/pve/storage.cfg
zfspool: local-zfs
        disable
        pool rpool/data
        content images,rootdir
        sparse 0

zfspool: local-zfs-data1
        pool data1
        content rootdir,images
        mountpoint /data1
        sparse 0

dir: local-backup-data1
        disable
        path /data1
        content backup
        prune-backups keep-last=4
        shared 0

dir: local
        disable
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

dir: local-zfs-dir
        path /data1/nfs/proxmox
        content vztmpl,import,rootdir,iso,backup,images
        prune-backups keep-all=1
        shared 1

zfspool: dpool
        pool dpool
        content rootdir,images
        mountpoint /dpool
        nodes node2,node1




root@node2.der-space.prod:/root
 # awk '/^pbs:/ {flag = 1; next; }/^$/ {flag = 0;} !flag' /etc/pve/storage.cfg
zfspool: local-zfs
        disable
        pool rpool/data
        content images,rootdir
        sparse 0

zfspool: local-zfs-data1
        pool data1
        content rootdir,images
        mountpoint /data1
        sparse 0

dir: local-backup-data1
        disable
        path /data1
        content backup
        prune-backups keep-last=4
        shared 0

dir: local
        disable
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

dir: local-zfs-dir
        path /data1/nfs/proxmox
        content vztmpl,import,rootdir,iso,backup,images
        prune-backups keep-all=1
        shared 1

zfspool: dpool
        pool dpool
        content rootdir,images
        mountpoint /dpool
        nodes node2,node1



root@node1.der-space.prod:/root
 # awk '/^pbs:/ {flag = 1; next; }/^$/ {flag = 0;} !flag' /etc/pve/storage.cfg
zfspool: local-zfs
        disable
        pool rpool/data
        content images,rootdir
        sparse 0

zfspool: local-zfs-data1
        pool data1
        content rootdir,images
        mountpoint /data1
        sparse 0

dir: local-backup-data1
        disable
        path /data1
        content backup
        prune-backups keep-last=4
        shared 0

dir: local
        disable
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

dir: local-zfs-dir
        path /data1/nfs/proxmox
        content vztmpl,import,rootdir,iso,backup,images
        prune-backups keep-all=1
        shared 1

zfspool: dpool
        pool dpool
        content rootdir,images
        mountpoint /dpool
        nodes node2,node1
 
Your storage config seems good.

Looking at the original error you've shown:
failed: malformed number (leading zero must not be followed by another digit)
which in essence is a JSON parser error, I'm starting to think that maybe you have some .bashrc scripting/modification going on on your system (your subsequent posts, specifically the command prompts not looking standard, may also suggest so).

Would you mind sharing the output of cat /root/.bashrc from the affected node?
 
Edit: In an effort to support my above (wild?) theory, I found an almost identical experience - here!
That did it!

Renamed .bashrc and the problem was gone ...
Code:
# ~/.bashrc: executed by bash(1) for non-login shells.

# Note: PS1 and umask are already set in /etc/profile. You should not
# need this unless you want different defaults for root.
FQDN=$(hostname -f)
PS1='\e]0;\u@$FQDN $PWD\a\n $? \[\033[01;31m\]\u@$FQDN\[\033[00m\]:\[\033[01;34m\]$PWD \n\[\033[31m\]$([[ -d "$HOME/.git" ]] && git rev-parse --abbrev-ref HEAD 2>/dev/null)\[\033[01;34m\] \$\[\033[00m\] '

# umask 022

# You may uncomment the following lines if you want `ls' to be colorized:
 export LS_OPTIONS='--color=auto'
 eval "$(dircolors)"
 alias ls='ls $LS_OPTIONS'
 alias ll='ls $LS_OPTIONS -l'
 alias l='ls $LS_OPTIONS -lA'

# Some more alias to avoid making mistakes:
# alias rm='rm -i'
# alias cp='cp -i'
# alias mv='mv -i'
uptime

Is it the FQDN variable? A simple check for whether the shell is running interactively should do!?

The problem was solved as soon as .bashrc was renamed and I got back onto the server.

EDIT: the "uptime" command was the problem, never expected that after using it like that all the years before.
 