In-place upgrade 8 to 9: (unsupported) OCFS2 issues

Upgraded one of 3 nodes in my setup with OCFS2. YES, I know, it's not supported by Proxmox, but I ask and report anyway, as others here might hit the same issue or already have a solution.

The node booted OK after the upgrade, with the pinned kernel 6.2.16-20-pve (which, I think, is still needed).

Mounting the OCFS2 storage fails at boot time with "mount.ocfs2: Cluster name is invalid while trying to join the group"; in turn, several cluster- and PVE-related services fail.

If I mount the storage manually, it works, the other services come up, the node joins the cluster, and VMs can be migrated there and started.

I assume some race condition, as if the mount is attempted too early.
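To verify that, I would compare the cluster name OCFS2 is configured with against what the o2cb stack has actually registered at mount time. A sketch, assuming the usual o2cb configfs layout:

```
# cluster name(s) the tools are configured with
cat /etc/ocfs2/cluster.conf

# cluster(s) currently registered with the o2cb stack (configfs);
# empty output here would mean the mount was attempted before
# o2cb registered the cluster -> ordering problem
ls /sys/kernel/config/cluster/
```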

The mount is currently generated by systemd from this fstab-line:

`/dev/mapper/msa2060_lun1 /mnt/ocfs2/PVE001 ocfs2 _netdev,defaults 0 0`

AFAIK `_netdev` should make it wait for network connectivity, right?

Should I create a specific mount-unit with some extra dependency maybe?
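Something like this is what I have in mind; a minimal, untested sketch, assuming the missing ordering is on o2cb.service (the file must be named after the mount path, i.e. /etc/systemd/system/mnt-ocfs2-PVE001.mount, and the fstab line would then go away so the generator doesn't create a competing unit):

```
[Unit]
Description=OCFS2 mount /mnt/ocfs2/PVE001
# make sure the o2cb cluster stack is up and the network is online first
Requires=o2cb.service
After=o2cb.service network-online.target
Wants=network-online.target

[Mount]
What=/dev/mapper/msa2060_lun1
Where=/mnt/ocfs2/PVE001
Type=ocfs2
Options=_netdev,defaults

[Install]
WantedBy=remote-fs.target
```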

thanks for any pointers!
 
Same issue again: the OCFS2 filesystem does not mount.

```
Sep 16 11:32:43 srv1 mount[1307]: mount.ocfs2: Cluster name is invalid while trying to join the group
Sep 16 11:32:43 srv1 systemd[1]: mnt-ocfs2-PVE001.mount: Mount process exited, code=exited, status=1/FAILURE
Sep 16 11:32:43 srv1 systemd[1]: mnt-ocfs2-PVE001.mount: Failed with result 'exit-code'.
Sep 16 11:32:43 srv1 systemd[1]: Failed to mount mnt-ocfs2-PVE001.mount - /mnt/ocfs2/PVE001.
Sep 16 11:32:43 srv1 systemd[1]: Dependency failed for remote-fs.target - Remote File Systems.
Sep 16 11:32:43 srv1 systemd[1]: Dependency failed for pve-cluster.service - The Proxmox VE cluster filesystem.
Sep 16 11:32:43 srv1 systemd[1]: pve-cluster.service: Job pve-cluster.service/start failed with result 'dependency'.
Sep 16 11:32:43 srv1 systemd[1]: Dependency failed for pve-guests.service - PVE guests.
Sep 16 11:32:43 srv1 systemd[1]: pve-guests.service: Job pve-guests.service/start failed with result 'dependency'.
Sep 16 11:32:43 srv1 systemd[1]: remote-fs.target: Job remote-fs.target/start failed with result 'dependency'.
Sep 16 11:32:43 srv1 systemd[1]: Reached target pve-storage.target - PVE Storage Target.
```

Workaround:

```
systemctl restart o2cb.service
systemctl start mnt-ocfs2-PVE001.mount # mount works then
systemctl restart pvescheduler.service pvestatd.service pve-ha-lrm.service pve-firewall.service
```

The journal for o2cb.service doesn't show any errors:

```
-- Boot 805cc209a5e24c15a520d728a8282cb2 --
Sep 16 11:43:36 srv1 systemd[1]: Starting o2cb.service - Load o2cb Modules...
Sep 16 11:43:36 srv1 o2cb.init[2246]: Writing O2CB configuration: OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: checking debugfs...
Sep 16 11:43:36 srv1 o2cb.init[2246]: Loading stack plugin "o2cb": OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: Loading filesystem "ocfs2_dlmfs": OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: Mounting ocfs2_dlmfs filesystem at /dlm: OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: Setting cluster stack "o2cb": OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: Registering O2CB cluster "OCFS4PVE": OK
Sep 16 11:43:36 srv1 o2cb.init[2246]: Setting O2CB cluster timeouts : OK
Sep 16 11:43:36 srv1 o2hbmonitor[2294]: Starting
Sep 16 11:43:36 srv1 systemd[1]: Finished o2cb.service - Load o2cb Modules.
```

BUT the service is disabled. I enabled it and rebooted; the service then starts OK, but the mount still fails:

```
journalctl -b0 -u mnt-ocfs2-PVE001.mount
Sep 16 11:49:18 srv1 systemd[1]: mnt-ocfs2-PVE001.mount: Directory /mnt/ocfs2/PVE001 to mount over is not empty, mounting anyway.
Sep 16 11:49:18 srv1 systemd[1]: Mounting mnt-ocfs2-PVE001.mount - /mnt/ocfs2/PVE001...
Sep 16 11:49:19 srv1 mount[1306]: mount.ocfs2: Cluster name is invalid while trying to join the group
Sep 16 11:49:19 srv1 systemd[1]: mnt-ocfs2-PVE001.mount: Mount process exited, code=exited, status=1/FAILURE
Sep 16 11:49:19 srv1 systemd[1]: mnt-ocfs2-PVE001.mount: Failed with result 'exit-code'.
Sep 16 11:49:19 srv1 systemd[1]: Failed to mount mnt-ocfs2-PVE001.mount - /mnt/ocfs2/PVE001.
```

(I have now cleaned out the directory to get rid of the "Directory ... to mount over is not empty" warning.)

Same result after rebooting.

I wonder if `mnt-ocfs2-PVE001.mount` is started too early even with option `_netdev` in fstab.
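Standard systemd tooling should show what the generated mount unit actually waits for and where it sits in the boot sequence:

```
# ordering/requirement dependencies the fstab generator produced
systemctl show mnt-ocfs2-PVE001.mount -p After -p Requires -p Wants

# where the mount sits in the boot critical chain
systemd-analyze critical-chain mnt-ocfs2-PVE001.mount
```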

I see a bond interface coming up after the OCFS2 mount has already failed ... but I am quite sure that the cluster-related network traffic does not run over the bonding interface(s). Will check that.

And maybe try to write an explicit systemd mount unit instead of the fstab entry (suggestions welcome).

So far I can't proceed with upgrading the other two PVE nodes to 9.x while the boot process doesn't work reliably.

And YES, I know perfectly well that OCFS2 is not officially supported by PVE ... unfortunately I am stuck with it at this site for now.
 
Checked the underlying multipath config: the multipath LUN comes up fine and the LVM looks good.
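The checks were roughly along these lines (illustrative; device name taken from my fstab):

```
# path states of the multipath LUN
multipath -ll

# the mapped device the fstab entry points at
ls -l /dev/mapper/msa2060_lun1

# LVM state on the node
pvs
vgs
lvs
```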

But I always see these services fail:

```
systemctl --failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION
● mnt-ocfs2-PVE001.mount loaded failed failed /mnt/ocfs2/PVE001
● pve-firewall.service   loaded failed failed Proxmox VE firewall
● pve-ha-crm.service     loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service     loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service   loaded failed failed Proxmox VE scheduler
● pvestatd.service       loaded failed failed PVE Status Daemon
```

I wonder if an explicit dependency or timeout in a mount unit would help.
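Before writing a full unit, the dependency could also go directly into the fstab options; if I read systemd.mount(5) correctly, `x-systemd.requires=` makes the generated mount unit both require and order itself after the given unit:

```
/dev/mapper/msa2060_lun1 /mnt/ocfs2/PVE001 ocfs2 _netdev,defaults,x-systemd.requires=o2cb.service 0 0
```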

Maybe I should consult a Debian forum ...
 
Dug further. Maybe I am barking up the wrong tree, maybe not ...

On my PVE 8 node, o2cb.service is generated from the init script `/etc/init.d/o2cb`, which results in "After=network-online.target".

On PVE 9, o2cb.service ships as a native unit in `/usr/lib/systemd/system/o2cb.service` and contains only "After=network.target".

I will try to set this via an override and test a reboot, as sketched below.
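Roughly like this; a drop-in created with `systemctl edit o2cb.service`, so the packaged unit stays untouched (the added Wants=/After= merge with the existing ones):

```
# /etc/systemd/system/o2cb.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target
```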