iSCSI Disk as LXC Mount Point does not reconnect after outage

jonnewman

New Member
May 29, 2024
Hello all. I've been tinkering with Proxmox for a few months now, and most of the time I can find solutions to my config 'panics' on these forums (thanks all!). However, there is an issue I'm having with the particular setup I have for PBS, which I am running in an LXC with an iSCSI-backed disk for its backup store.
Brief setup overview:
  • I have two non-clustered PVE nodes on the same network (primary and secondary), one on a NUC and the other on a Mac Mini, both in the same rack.
  • I have two PBS instances.
    • One PBS instance is installed as a VM on a Synology NAS (RS1221+), with its datastore being a normal VM disk also on the NAS, managed by Synology VMM. This PBS instance is working well, and both nodes back up various VMs and CTs to this datastore under their own namespaces.
    • The second PBS instance (#110) is an unprivileged LXC on the secondary PVE node, and is set up simply to 'sync' the first PBS datastore's contents to its own datastore, which I have configured as a Mount Point disk in the second PVE node's UI. This MP disk is stored on an iSCSI LUN on another Synology NAS (DS918+) in another room.
This setup usually works well; however, the UPS attached to the DS918+ is failing, and until I can get a replacement battery (I'm based in East Africa, so genuine APC parts are not easy to come by), every time we have a power cut lasting more than 5 minutes the DS918+ powers down unsafely. All other network gear is in the main rack and runs off a 200 Ah backup UPS, so typically the main rack, both PVE nodes and the primary PBS instance will survive the power cut.

The issue is that the second PBS instance (in the LXC) loses its connection to the iSCSI LUN and then fails to reconnect or restore the connection once the NAS/LUN is back up.

It can be resolved easily by rebooting the LXC; however, I'd like this to be automated if possible, similar to how Samba shares can be reconnected via fstab.

Example error when I try to run a GC task on the second PBS instance after the iSCSI LUN connection was lost and the NAS is back up:
Code:
2024-05-29T13:35:42+03:00: starting garbage collection on store backups
2024-05-29T13:35:42+03:00: Start GC phase1 (mark used chunks)
2024-05-29T13:35:42+03:00: TASK ERROR: can't open index /mnt/backups/ns/macmini/ct/104/2024-04-20T23:02:26Z/catalog.pcat1.didx - Input/output error (os error 5)

And when I try to cd to the mount point within the PBS LXC shell:
Bash:
root@pbs-macmini:~# cd /mnt/backups/
-bash: cd: /mnt/backups/: Input/output error

Some other config:
  • Everything is on the same subnet.
  • The LXC is Debian-based, created using the Proxmox Helper Scripts.
Content of /etc/pve/lxc/110.conf: (PBS LXC)
mp0: Syno_918_LUN1:vm-110-disk-0,mp=/mnt/backups,size=1000G

Extract from /etc/pve/storage.cfg showing the iSCSI connection: (IP and targets obscured)
Code:
iscsi: Syno_918_iSCSI
        portal 10.0.x.x
        target iqn.2000-01.com.synology:Synology-918.default-target.xxxxxx
        content none

lvm: Syno_918_LUN1
        vgname Synology_918
        base Syno_918_iSCSI:0.0.1.scsi-xxxxxx
        content images,rootdir
        saferemove 0
        shared 0

EDIT: I should note that I am on PVE 8.2.2 with all updates installed; the PBS instances are also fully up to date. All network connections between PVE and PBS are wired.

I'm really hoping this is a simple config problem, otherwise I will have to re-think how I store the secondary backups outside of the current LUN.

Thanks in advance!
 
Hello, I have a similar problem to @jonnewman.

My iSCSI LUN is on a dedicated machine.
The container with the attached iSCSI LVM disk works fine until the connection to the iSCSI server is lost. After the server comes back, the iSCSI initiator does not reconnect properly, and on access I get an input/output error.

The only way to resolve it is a CT reboot.

PVE is up to date (8.3.3).

Any ideas? Thanks for your help.
 
Several layers are involved in this scenario: iSCSI, the file system, and the kernel mount.

The iSCSI protocol has built-in mechanisms to retry a broken connection, but the number of retries and the timeout between them depend on the initial connection options and the specific failure.
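
As a hedged illustration: with the open-iscsi initiator underlying PVE's iscsi storage type, those retry parameters can be inspected and adjusted per node record with iscsiadm (the target and portal below are the obscured values from the original post; whether a longer replacement timeout actually rides out your outages is something you would need to test):
Bash:
# On the PVE host. Show the stored node record, including timeout/retry settings:
iscsiadm -m node -T iqn.2000-01.com.synology:Synology-918.default-target.xxxxxx -p 10.0.x.x

# node.session.timeo.replacement_timeout (default 120s) is how long the block
# layer queues I/O for a broken session before failing it back with I/O errors.
# Raising it gives the initiator more time to reconnect; takes effect on the
# next login:
iscsiadm -m node -T iqn.2000-01.com.synology:Synology-918.default-target.xxxxxx \
    -p 10.0.x.x -o update -n node.session.timeo.replacement_timeout -v 600

The same setting can also be placed in /etc/iscsi/iscsid.conf so that newly discovered targets inherit it.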

If an I/O error occurs on a locally mounted iSCSI disk, the file system and kernel mount typically have limited recovery options.

However, using the _netdev option during mount signals the kernel that the filesystem is network-based, which can adjust the recovery profile.
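
For reference, this is what _netdev looks like in practice; a hypothetical /etc/fstab entry for a filesystem on an iSCSI-backed logical volume mounted directly on the host (the device path and mount point here are illustrative, not part of the setup in this thread, where PVE manages the mount for the container):
Code:
# Hypothetical host-side mount of an iSCSI-backed LV. _netdev delays the mount
# until the network is up and marks it as network-dependent; nofail prevents
# boot from hanging if the LUN is unreachable.
/dev/Synology_918/backups-lv  /mnt/iscsi-backups  ext4  defaults,_netdev,nofail  0  2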

How this applies to container-mounted filesystems is more complex, and the behavior may depend on the specific container runtime and storage settings. You might need to experiment with different timeouts and mount options to find a suitable configuration.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for the explanation.

Why is the _netdev option not set by default when adding an iSCSI storage? Or do you mean the LVM mount?

Where can I add the _netdev option? There is no entry in /etc/fstab on PVE. :(
 
Why is the _netdev option not set by default when adding an iSCSI storage?
The option is for mounting a file system. PVE normally uses the raw disk to build upon, so in most cases _netdev is not applicable.

Where can i add the _netdev option?
When it comes to containers, the situation is more complex. You have an iSCSI raw disk with its own set of recovery options, a file system that is mounted on the host/hypervisor with its own timeouts, then bind-mounted into the container with another set of rules, and finally mounted inside the container.

Frankly, I have not spent time trying to untangle these various layers. If your iSCSI is prone to outages, you have the option of researching this and reporting back to the community, or investing in a more reliable solution.

Another option is to automate the recovery, which may be simpler and less time-consuming.
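
As a minimal sketch of that approach (assumptions: the VMID 110 and mount path /mnt/backups from the first post, and that a container reboot is the accepted recovery, as reported above), a small script on the PVE host can probe the mount point inside the container and reboot the CT when it returns I/O errors:
Bash:
#!/bin/bash
# Hypothetical watchdog, run on the PVE host from cron or a systemd timer.
CTID=110           # container ID of the PBS LXC
MP=/mnt/backups    # mount point inside the container

# Probe the mount point from inside the container; a stale iSCSI-backed
# mount typically fails this with "Input/output error".
if ! pct exec "$CTID" -- stat "$MP" >/dev/null 2>&1; then
    logger -t iscsi-watchdog "CT $CTID: $MP unreachable, rebooting container"
    pct reboot "$CTID"
fi

Scheduled every few minutes (for example via a file in /etc/cron.d), this would restore the datastore shortly after the NAS comes back, at the cost of a container reboot rather than a clean in-place reconnect.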


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 