[TUTORIAL] PVE 7.x Cluster Setup of shared LVM/LV with MSA2040 SAS [partial howto]

Glowsome

Renowned Member
Jul 25, 2017
176
44
68
51
The Netherlands
www.comsolve.nl
Hi all,

I've been working on setting up my MSA2040 (SAS) to be available as more than just a raw LVM volume, because a raw LVM storage only supports disk images and containers.
The idea was to also have (shared) storage presented to PVE as a directory for backups/snippets/templates (all of which is unavailable when using raw LVM).

So I set out on this journey and started reading about shared LVM. The first thing that came into view was 'clvm', however that package is unavailable on Debian Buster and has been replaced by lvmlockd.

Up until now I have devised the following approach, but there are still ToDo points before I can implement this in a production environment:

What needs to be installed (ON ALL NODES!), see the example commands just below:
  • lvmlockd (package lvm2-lockd) - this will add the option use_lvmlockd = 1 to /etc/lvm/lvm.conf (verify that it is actually set to 1)
  • dlm (package dlm-controld)
  • gfs2 (package gfs2-utils) - I used a GFS2 filesystem on the shared LV
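For reference, on Debian Buster the installation itself boils down to something like this (just a sketch; the lvm.conf change can of course also be made by hand):
Code:
apt install lvm2-lockd dlm-controld gfs2-utils
# afterwards make sure /etc/lvm/lvm.conf (global section) really contains:
#   use_lvmlockd = 1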
Next:

Create /etc/dlm/dlm.conf with the following content:
Code:
# Enable debugging
log_debug=1
# Use tcp as protocol
protocol=tcp
# Delay at join
post_join_delay=10
# Disable fencing (for now)
enable_fencing=0

Then :
  • Start lvmlockd on all nodes (systemctl start lvmlockd) and let it come up (I waited ~20 seconds before continuing, see the ToDo list)
  • Start dlm on all nodes (systemctl start dlm)
Check with "dlm_tool ls" whether the shared lock is active on lvm_global (this holds your global lock).

If all is well, create a shared VG on your storage:
  • pvcreate /dev/sdX
  • vgcreate --shared vgname /dev/sdX
Check with "vgs" that the 'shared' bit is present (example below); this should be visible on all nodes!

Code:
VG #PV #LV #SN Attr VSize VFree
cluster01 1 42 0 wz--n- <7.09t <5.16t
cluster02 1 1 0 wz--ns 2.00t 0
                     ^-----Shared

Start the lock for this VG on all nodes
Code:
vgchange --lock-start

recheck "dlm_tool ls" , a lockspace should've been added

Create a Logical Volume in the shared VG the regular way:
Code:
lvcreate -n lvname -l 100%FREE vgname

Activate the LV with a shared lock (needs to be done on all nodes):
Code:
lvchange -asy /dev/vgname/lvname

Create a filesystem on it (only once, on one node). The -j option specifies the number of journals (one for each node, in my case 4):
Code:
mkfs.gfs2 -t <YOUR CLUSTERNAME>:backups -j 4 -J 64 /dev/vgname/lvname

Mount the LV somewhere (on all nodes):
Code:
mount -t gfs2 /dev/vgname/lvname /your/mountpoint

Do some testing by writing data to it, and verify you can see it on the other nodes.
A simple touch <filename> will do.
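For example (using the mountpoint from above), run from two different nodes:
Code:
# on node 1
touch /your/mountpoint/test-$(hostname)
# on node 2
ls -l /your/mountpoint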

ToDo list :
  • when I reboot a node and it comes up, it seems that lvmlockd isn't fully up when dlm starts, so the global lockspace is not obtained; killing both lvmlockd and dlm_controld and restarting them about 20 seconds apart seems to solve this => timing issue?
  • automate the activation of the shared VG and shared LV on all nodes

Regarding the ToDo list: if anyone has input it is greatly appreciated, because at the moment I feel like I am (re-)inventing the wheel here.

Glowsome
 
Made some progress: basically I found out that lvmlockd (by running it in debug mode) tries to retrieve lock information from dlm, and if dlm is not available it 'assumes' it is the master and starts its own lockspace, ignoring an already existing lockspace on the other remaining cluster nodes.

While finding this out I noticed that the lvmlockd manpage describes the startup order the wrong way around.

The manpage describes:
Code:
• start lvmlockd
• start lock manager
• vgchange --lock-start
• activate LVs in shared VGs

After seeing this, my conclusion (via testing, see below) is that the order should be:

Code:
• start lock manager
• start lvmlockd
• vgchange --lock-start
• activate LVs in shared VGs

Analysis :

To simulate a 'clean startup' I killed both the lvmlockd and dlm_controld processes, keeping two SSH sessions open.

Then, as described in the manpage, I started lvmlockd first (not via systemctl but manually in debug mode, via the command lvmlockd -D). With the debug info available I noticed that lvmlockd started to create its own lockspace (as no dlm was running, it could not know of the lock already present on the other 3 nodes).

When dlm was started afterwards, this resulted in a 'split lock', where the single node that did not know of the already existing lock erected its own global lockspace with the same name as the global lockspace already present on the other 3 nodes.

The result was not a merge but two separate locks, where the single node was waiting for quorum while successfully communicating with the other nodes over dlm. So we got ourselves an issue that also showed in the PVE UI, where all LVM-managed storage went into a (?) unqueryable state on all nodes except the single node running the 2nd global lockspace.

Conclusion on this part: lvmlockd NEEDS the information from dlm to be aware of already existing (global and other) locks from the other nodes.

Further analysis :

I killed both the lvmlockd and dlm_controld processes again, as if a node had gone down. As soon as I killed lvmlockd, all storage on all nodes in the PVE UI came back into the 'green' queryable state; when I killed dlm the node went to 'failed' status.

Then I started dlm (via systemctl) as the first process and observed the node status in the PVE UI going back to online.
Then I started lvmlockd in debug mode; it read dlm's info and joined the existing lockspace without any issues at all.

Having observed this I then moved on to changing the startup order of the systemd units (specifically lvmlockd), adding the 'After=' directive with corosync and dlm as service dependencies.

I rebooted the machine afterwards to test it, but again I got the 'split lock', as if lvmlockd didn't obtain the info in time.
I then changed the config a bit more to be absolutely sure that dlm was up and had the info lvmlockd required to successfully start (and join the existing lock).

To do that I made two changes. One in /etc/dlm/dlm.conf, commenting out the post_join_delay line so it would not delay:

Code:
# Enable debugging
log_debug=1
# Use tcp as protocol
protocol=tcp
# Delay at join
# post_join_delay=10
# Disable fencing (for now)
enable_fencing=0

And I changed the lvmlockd systemd unit file to include a sleep as a prestart step, forcing a delay:

Code:
[Unit]
Description=LVM lock daemon
Documentation=man:lvmlockd(8)
After=corosync.service dlm.service

[Service]
Type=notify
ExecStartPre=/bin/sleep 20
ExecStart=/sbin/lvmlockd --foreground
PIDFile=/run/lvmlockd.pid
SendSIGKILL=no

[Install]
WantedBy=multi-user.target

I then restarted the node again to see how it would come up ...

Success! The node came up and successfully joined the existing lockspace(s)!

Thanks to the lvmlocks systemd unit file (shipped with the lvm2-lockd package) it also successfully joined the other available lockspaces.

Still, I am not at the end:

ToDo :

- activate the shared LV
- mount the LV on the filesystem





 
Another bit of progress...

To define the shared LVs and their mountpoints I created /etc/lvm/lvmshared.conf:
Code:
/dev/cluster02/backups:/data/backups
The one thing that (at the moment) is still a catch: the file must not contain a carriage return!
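A quick way to sanitize the file (a sketch, assuming GNU sed) would be:
Code:
# strip carriage returns and blank lines from the definition file
sed -i -e 's/\r$//' -e '/^[[:space:]]*$/d' /etc/lvm/lvmshared.conf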

After that I started working on a script to activate the LV(s) and handle the mountpoint(s) accordingly. Slowly testing it, this is the result:
Code:
#!/bin/bash
# Activate/deactivate shared LVs and (un)mount them, based on the
# lvname:mountpoint pairs defined in /etc/lvm/lvmshared.conf
file="/etc/lvm/lvmshared.conf"
case "$1" in
    mount)
        while IFS=: read -r lvname mountpoint
        do
            printf "Activating: %s\n" "$lvname"
            lvchange -asy "$lvname"
            printf "Adding mountpoint: %s\n" "$mountpoint"
            mount "$lvname" "$mountpoint"
        done < "$file"
        ;;
    unmount)
        while IFS=: read -r lvname mountpoint
        do
            printf "Removing mountpoint: %s\n" "$mountpoint"
            umount "$mountpoint"
            printf "Deactivating: %s\n" "$lvname"
            lvchange -an "$lvname"
        done < "$file"
        ;;
    *)
        while IFS=: read -r lvname mountpoint
        do
            printf "Shared LVname: %s Mountpoint: %s\n" "$lvname" "$mountpoint"
        done < "$file"
        ;;
esac

Testing it manually works flawlessly, both in activating the LV and mounting it where I wanted it to be, and in reverse: removing the mountpoint and deactivating the shared LV.
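For reference, the manual test amounts to (with the script saved as /usr/local/share/lvmmount.sh, as referenced in the unit file below):
Code:
/usr/local/share/lvmmount.sh            # no argument: just list the configured LVs and mountpoints
/usr/local/share/lvmmount.sh mount      # activate the shared LVs and mount them
/usr/local/share/lvmmount.sh unmount    # unmount and deactivate them again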

After this I created a systemd unit file to automate this process:
Code:
[Unit]
Description=LV shared activation and mounting
Documentation=man:lvmlockd(8)
After=lvmlocks.service lvmlockd.service sanlock.service dlm.service

[Service]
Type=oneshot
RemainAfterExit=yes

# start activating shared LV's  and mount them to their mountpoints
ExecStart=/usr/local/share/lvmmount.sh mount

# stop added mountpoints and deactivate shared LV's
ExecStop=/usr/local/share/lvmmount.sh unmount

[Install]
WantedBy=multi-user.target

I tested this a few times manually and it seems to do its job correctly, but at some point I got a kernel panic, so I rebooted the machine (the node I am testing on is empty, so no damage done to any running VMs or LXCs here).

After resetting the node I tested it again, and the LV is activated and mounted correctly, so I added it to startup (systemctl enable).
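In other words (with the unit name I use, lvshared.service):
Code:
systemctl daemon-reload
systemctl enable lvshared.service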

It seems I am again facing timing issues, so I added an ExecStartPre=/bin/sleep 20 to this unit file as well and rebooted (after reloading the systemd daemon).

Again a kernel panic happened ... unknown why, as the only thing I did after reloading the daemon was issue a 'reboot'.
So I reset the machine again.

After the reset I checked whether the LV and the mountpoint came up ... SUCCESS!

Will repeat rebooting the box tomorrow to verify.

ToDo :
  • need to get rid of / filter out blank lines and carriage returns in the definition file /etc/lvm/lvmshared.conf
 
Update :

It seems that this configuration somehow upsets dlm.
Every time I now reboot the node (as I did a few times) it generates a panic:

Code:
Sep  2 22:39:39 node04 kernel: [ 1828.032673] dlm_controld[11954]: segfault at 0 ip 00007f13d2006206 sp 00007ffe68835a08 error 4 in libc-2.28.so[7f13d1f90000+148000]
Sep  2 22:44:38 node04 kernel: [ 2126.551103] dlm_controld[13902]: segfault at 0 ip 00007f8f7b2d7206 sp 00007fff6679dd98 error 4 in libc-2.28.so[7f8f7b261000+148000]

The first time I experienced it was after an update, and even though it said everything completed successfully, it left me with a damaged drive, dropping me on boot to a 'grub' prompt.

I solved this by booting into a rescue ISO and following the procedure to repair the bootloader.
After that I checked for updates again and all was OK. So what went wrong? No idea, the whole update was handled/exited with no errors.

I've tested commenting out the 'ExecStop=' line from the activation-and-mounting unit file I added, but I still experience the same.

Will need to test further, but it seems reproducible.
I might test unloading/killing dlm beforehand (the default systemd unit of dlm has a commented-out 'ExecStop=' definition, but I have not found out why).
 
Strange behaviour found:

As in my previous post, I am now facing kernel panics when issuing a 'reboot' on the one node where I tested/went through the whole setup.
Whereas I am facing a different issue on a 2nd node whilst implementing this: it hangs on shutdown/reboot while shutting down the lock manager (dlm).

Details:
  • The 1st (panicking) one is a fresh node, set up directly with Debian Buster and then moved to PVE 6.x (as described in the docs)
  • The 2nd one is a node migrated from PVE 5.x to 6.x (following the docs on upgrading to 6.x)
So another thing to add to my ToDo list ...

If I have the time I will reinstall the upgraded node as a fresh one so I can compare behaviour.
 
Been playing around a bit more; the newly installed node04 (the one I installed directly with Buster and switched to PVE 6 afterwards) is still kernel-panicking when issuing a reboot....

I have freed up a 2nd node to play with (node03), which is an upgraded node (Jessie/PVE 5 to Buster/PVE 6), and have implemented the changes as written above (timeout settings added, startup sequence, and the added script/unit file for handling activation of the shared LV and mounting it).

It also showed the 'waiting for dlm to shut down' issue on shutdown, so I changed the line SendSIGKILL=no to SendSIGKILL=yes in the unit file.

Rebooting the node again now shows a clean shutdown and a full startup. I had just forgotten to enable the LV-activation/mountpoint systemd service, so I enabled it, rebooted the node once more and saw it shut down cleanly and reboot correctly, bringing the shared LV up and assigning it to the correct mountpoint.

Then I implemented the change on node04 and tried a reboot, but it still drops into a panic .. so no go on a direct Buster/PVE 6 install.

This needs further investigation, as my plan was to reinstall all nodes (with different local disks) after having set up the shared storage correctly.
 
It seems I have solved my kernel panics on node04. I still had the Debian kernel available on the system; I removed it, as mentioned as an optional step in https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Buster.
I then restarted it (of course a panic came up) and wanted to record the screen over iLO (with Open Broadcaster Software) to share it here, however rebooting it again did not lead to the 'expected' panic ..

It shut down cleanly and is booting up as wanted .. so I think it's time to reinstall node03 with other disks (removing the 4x 600GB disks and switching to 4x 146GB disks: 2 in RAID 0+1 for the OS, the rest as local storage for emergencies).

Will keep you all posted on my progress
 
I have reinstalled node #1 according to the docs at https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Buster (after I had pulled out the SAS connection to the MSA2040, to be absolutely sure I would not overwrite anything on the shared storage), then set up my dlm, lvmlockd, unit files and the shared-LV script/unit file as described above.

The only thing was that while the reinstalled node (which I had taken out with pvecm delnode node01) was joining, my workstation ended up with the infamous internal screensaver from Windows (or better said, a BSOD), so I had a cluster node half hanging in the cluster but not completely joined...

I solved this by separating the node as described in https://pve.proxmox.com/wiki/Cluster_Manager, following the section 'Separate a node without reinstalling', which again worked flawlessly. Big thanks to the wiki makers!

After having re-added my cluster node I noticed I had missed one item: the config parameter in /etc/lvm/lvm.conf still had use_lvmlockd set to 0. After changing that and rebooting, it all fell into place ... no kernel panics, no issues whatsoever, it all just worked/started up as wanted/expected.
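A quick way to verify this setting on a node is, for example:
Code:
grep use_lvmlockd /etc/lvm/lvm.conf
# the active (uncommented) line should read: use_lvmlockd = 1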

I will migrate my resources to this new node and go for another node reinstall.

And of course I will again keep you posted on my experiences.
 
Again I reinstalled another node (node03) whilst changing out the local disks. The install went fine and it joined the cluster, but somehow PVE kept complaining about waiting on quorum; it kept thinking it needed 4 votes. I ended up setting the expected votes to 1 on the newly added node (pvecm expected 1).
Then the join appeared to have completed successfully, but it still showed up as (x) in the UI, and when selecting it, it complained about certificates.

So again I separated the node the same way as in the prior post and re-added it afterwards (without a reboot in between). This worked: the node now showed the correct status. I then rebooted the machine and waited to let it come up with all the MSA2040-specific changes I described before.

When it came up it joined the cluster, joined the lockspace, and mounted the shared LV volume as intended ... so in the end it is working as wanted, but I'm still missing the 'why' of the initial (re-)join of the cluster failing.

I have one cluster node left to reinstall, on which I will change the sequence of when I install dlm and lvmlockd, and join the cluster without it having access to the shared storage, to see whether this changes/corrects the issue with the first (re-)join.

will keep you posted on it....
 
OK, I tested the reinstall of the last node (node04) and changed the order in which I performed it a bit.

Just to be sure, I always pull the connection to the shared storage, so I can never make the mistake of overwriting it; I always perform a local installation!

Basically, as soon as I install / switch to PVE I also let it join the cluster, so WITHOUT having either dlm or lvmlockd installed (so it thinks it doesn't have shared storage). This completes successfully and the node comes up as online in the UI.

Then I remove the original Buster kernel (described as optional in the installation of PVE on Buster) and reboot to verify a clean startup.

After the startup I followed the order below to install the rest (a condensed command sketch follows after the list):
  • install lvmlockd (lvm2-lockd)
  • change the parameter in /etc/lvm/lvm.conf (use_lvmlockd = 1)
  • install dlm (dlm-controld)
  • add the /etc/dlm/dlm.conf
  • edit the unit files for both lvmlockd and dlm (add dependencies and stagger loading)
  • add the lvmshared.conf (which contains the info on the LV volume(s) and mountpoint(s) I use) to /etc/lvm/
  • add the script to handle the activation and mounting (/usr/local/share/lvmmount.sh) and make it executable
  • add the unit file which uses the script (/lib/systemd/system/lvshared.service) and enable it
  • reconnect the shared storage
  • reboot the machine to let it come up fully configured
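Condensed into commands, the list above looks roughly like this (a sketch; file names/paths as used earlier in this thread):
Code:
apt install lvm2-lockd
# edit /etc/lvm/lvm.conf: use_lvmlockd = 1
apt install dlm-controld gfs2-utils
# create /etc/dlm/dlm.conf (content as shown at the top of this thread)
# edit the lvmlockd/dlm unit files (After=corosync.service dlm.service, ExecStartPre sleep, SendSIGKILL=yes)
cp lvmshared.conf /etc/lvm/lvmshared.conf
install -m 0755 lvmmount.sh /usr/local/share/lvmmount.sh
cp lvshared.service /lib/systemd/system/lvshared.service
systemctl daemon-reload
systemctl enable lvshared.service
# reconnect the shared storage, then reboot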
This gave me a flawless result in terms of reinstalling PVE 6, adding it to the cluster and then configuring it to use the MSA2040 shared storage.
 
When I have more time I will rewrite the whole post so it reflects all experiences and gives full documentation which (should) allow a one-shot install without issues, including the setup of the shared LVM volumes.
 
Just an update regarding how it's working:

I have been updating the PVE installation (regular software updates) on all nodes without any issues; basically they restarted flawlessly, joined the global lockspace and then mounted the shared LV volumes.

So I would say the final configuration is stable and withstands upgrades.
 
Another update after a while.
I extended my storage, so I now have enough for the future.

One of the things I had to do was to migrate all VM/CT storage devices from the original 'raw' LVM device I offered to Proxmox to the GFS2-backed volume (offered as a directory).

When I had completed the transfer I removed the raw LVM device from Proxmox and then:
  1. reconfigured it as a 'shared' LVM
  2. created the GFS2 filesystem and mountpoint
  3. updated /etc/lvm/lvmshared.conf to reflect the added GFS2 volume (command sketch below)
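In command form this is the same sequence as at the top of this thread; roughly (device, VG/LV names and mountpoint here are just examples):
Code:
pvcreate /dev/sdY
vgcreate --shared cluster03 /dev/sdY
vgchange --lock-start                                                 # on all nodes
lvcreate -n vmdata -l 100%FREE cluster03
lvchange -asy /dev/cluster03/vmdata                                   # on all nodes
mkfs.gfs2 -t <YOUR CLUSTERNAME>:vmdata -j 4 /dev/cluster03/vmdata     # once, on one node
echo '/dev/cluster03/vmdata:/data/vmdata' >> /etc/lvm/lvmshared.conf  # on all nodes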
Then I rebooted the 1st (test) node and found that it rebooted fine, but it was attempting to start the CTs/VMs before the mountpoints were present.
So I was facing a new 'timing' issue.

I solved this by adding a dependency to pve-guests.service:

After=lvshared.service

As documented above, the lvshared service facilitates the activation and mounting of the volumes I created with GFS2.

I have yet to find out whether this holds when Proxmox is updated.
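If it does not, a systemd drop-in override (which lives under /etc/ and is not touched by package updates) would be an alternative to editing the unit file directly, for example:
Code:
mkdir -p /etc/systemd/system/pve-guests.service.d
cat > /etc/systemd/system/pve-guests.service.d/lvshared.conf <<'EOF'
[Unit]
After=lvshared.service
EOF
systemctl daemon-reload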
 
Something I had forgotten to mention in all of the previous posts is that the directory being offered to Proxmox is not set to shared.
As the GFS2 filesystem takes care of this by itself, it is not needed to set the directory to 'shared'.
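For example, adding the GFS2 mountpoint as a plain (non-shared) directory storage; the storage name and content types here are just an example:
Code:
pvesm add dir backups --path /data/backups --content backup,iso,vztmpl,snippets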
 
Offering raw LVM (or LVM-thin) does not give me the capability of taking snapshots, which in my environment is very much required (I test customer cases/issues).
So prior to starting to test a case I take a snapshot of the 'base' product, then tune further into the situation, test, report back on my findings, and for the next request I want to roll back to the starting position.

From a historical perspective I have always offered (single/standalone) storage as a directory, with an LVM volume configured underneath.
Then I went to a cluster, and I sort of handled it the same way as I was used to with local storage.
 
Hi all,

It has been a while, and we have gone through a set of updates on the services.

After running the updates I noticed that my (manual) additions to some of the PVE unit files were overwritten.
Especially the one in pve-guests.service, where a line was added to make the startup of VMs dependent on the availability of the underlying storage.
This very line needs to be present to guarantee error-free startup of the VMs; even though there are several built-in timeouts to let everything start up smoothly, pve-guests.service still needs the entry.

As I am also managing things via Puppet, which makes it easy to manage the same settings on multiple Proxmox nodes, I have solved it by creating a class which ensures the entry is present in the systemd unit file.

It's a crude approach for now, but as I am also in the midst of a house rebuild/redecoration everything is out of order, so it will be refined in the future:

init.pp
class prxservices { include prxservices::config }

config.pp
class prxservices::config inherits prxservices {
  file { '/lib/systemd/system/pve-guests.service':
    mode   => '0644',
    owner  => 'root',
    group  => 'root',
    source => 'puppet:///modules/prxservices/pve-guests.service',
  }~>
  exec { 'prxservices-systemd-reload':
    command     => 'systemctl daemon-reload',
    path        => [ '/usr/bin', '/bin', '/usr/sbin' ],
    refreshonly => true,
  }
}

content of pve-guests.service
[Unit]
Description=PVE guests
ConditionPathExists=/usr/bin/pvesh
RefuseManualStart=true
RefuseManualStop=true
Wants=pvestatd.service
Wants=pveproxy.service
Wants=spiceproxy.service
Wants=pve-firewall.service
Wants=lxc.service
After=pveproxy.service
After=pvestatd.service
After=spiceproxy.service
After=pve-firewall.service
After=lxc.service
After=pve-ha-crm.service pve-ha-lrm.service
After=lvshared.service

[Service]
Environment="PVE_LOG_ID=pve-guests"
ExecStartPre=-/usr/share/pve-manager/helpers/pve-startall-delay
ExecStart=/usr/bin/pvesh --nooutput create /nodes/localhost/startall
ExecStop=-/usr/bin/vzdump -stop
ExecStop=/usr/bin/pvesh --nooutput create /nodes/localhost/stopall
Type=oneshot
RemainAfterExit=yes
TimeoutSec=infinity

[Install]
WantedBy=multi-user.target
Alias=pve-manager.service

As you can see, the added line is After=lvshared.service, which on my setup controls the activation and mounting of the MSA shared storage.


Hope it helps .....
 
Another update: after an update I noticed that the manual additions I had made to the lvmlockd unit file had vanished.
This led to the issue where the node no longer successfully joined the other lockspaces.

For this I also added a class to bring it under Puppet control (basically the same kind of simple class as the prxservices one above):

init.pp
# == Class: lvmlockdsvc
#
# Manage lvmlockd service unit file
#
class lvmlockdsvc (
  $lvmlockd_description   = 'USE_DEFAULTS',
  $lvmlockd_prestartsleep = 'USE_DEFAULTS',
  $lvmlockd_timeout       = 'USE_DEFAULTS',
  $lvmlockd_sendkillsig   = 'USE_DEFAULTS',
  $lvmlockd_path          = '/lib/systemd/system/lvmlockd.service',
  $lvmlockd_owner         = 'root',
  $lvmlockd_group         = 'root',
  $lvmlockd_mode          = '0644',
  $lvmlockd_template      = 'lvmlockdsvc/lvmlockdsvc.erb',
) {
  # $default_description   = 'LVM lock daemon',
  # $default_prestartsleep = undef,
  # $default_timeout       = '120',
  # $default_sendkillsig   = 'no',

  # parameter definitions
  if $lvmlockd_description == 'USE_DEFAULTS' {
    $lvmlockd_real_description = 'LVM lock daemon'
  } else {
    $lvmlockd_real_description = $lvmlockd_description
  }

  if $lvmlockd_prestartsleep == 'USE_DEFAULTS' {
    $lvmlockd_real_prestartsleep = undef
  } else {
    $lvmlockd_real_prestartsleep = $lvmlockd_prestartsleep
  }

  if $lvmlockd_timeout == 'USE_DEFAULTS' {
    $lvmlockd_real_timeout = '120'
  } else {
    $lvmlockd_real_timeout = $lvmlockd_timeout
  }

  if $lvmlockd_sendkillsig == 'USE_DEFAULTS' {
    $lvmlockd_real_sendkillsig = 'no'
  } else {
    $lvmlockd_real_sendkillsig = $lvmlockd_sendkillsig
  }

  # Parameter checks
  if $lvmlockd_real_prestartsleep != undef {
    validate_numeric($lvmlockd_real_prestartsleep, 120, 5)
  }
  if $lvmlockd_real_timeout != $lvmlockd_timeout {
    validate_numeric($lvmlockd_real_timeout, 600, 1)
  }
  if $lvmlockd_real_sendkillsig != $lvmlockd_sendkillsig {
    validate_re($lvmlockd_real_sendkillsig, '^(yes|no)$',
      "lvmlockdsvc::lvmlockd_sendkillsig may be either 'yes' or 'no' and is set to <${lvmlockd_real_sendkillsig}>.")
  }

  file { 'lvmlockd.service':
    ensure  => file,
    owner   => $lvmlockd_owner,
    group   => $lvmlockd_group,
    path    => $lvmlockd_path,
    content => template($lvmlockd_template),
  }~>
  exec { 'lvmlockdsvc-systemd-reload':
    command     => 'systemctl daemon-reload',
    path        => [ '/usr/bin', '/bin', '/usr/sbin' ],
    refreshonly => true,
  }
}

template
# This file is being maintained by Puppet.
# DO NOT EDIT
# $OpenBSD: lvmlockd.service,v 1.01 2020/02/15 13:24:27 reyk Exp $
# This is the lvmlockd service systemd unit configuration file. See
# lvmlockd(8) for more information.

[Unit]
<% if @lvmlockd_real_description -%>
Description=<%= @lvmlockd_real_description %>
<% end -%>
Documentation=man:lvmlockd(8)
After=corosync.service dlm.service

[Service]
Type=notify
<% if @lvmlockd_real_prestartsleep != nil -%>
ExecStartPre=/bin/sleep <%= @lvmlockd_real_prestartsleep %>
<% end -%>
ExecStart=/sbin/lvmlockd --foreground
PIDFile=/run/lvmlockd.pid
<% if @lvmlockd_timeout -%>
TimeoutStopSec=<%= @lvmlockd_real_timeout %>
<% end -%>
<% if @lvmlockd_sendkillsig -%>
SendSIGKILL=<%= @lvmlockd_real_sendkillsig %>
<% end -%>

[Install]
WantedBy=multi-user.target


This is a rework of the original class I quickly wrote; this one works with a template and has become a lot more dynamic because of it.

Hope it helps others too.
 
Hi! Good stuff. But I have some questions about it:
1) Why did you turn off fencing?
2) When a node loses its network connection, the cluster goes down (corosync fails, killed by dlm) and only a reboot can bring the node back.
What do you think about that?
3) Sometimes corosync loses its connection to the network.
4) It only works with kernel 5.3.18; anything higher gives a kernel bug and errors.

My hardware: 2x HP GL380 + MSA2040 and Cisco 3560 (LACP aggregation)
 

1. As said, this was a research project trying to bend the behaviour to my needs; fencing gave a lot of issues, so I turned it off and, to be honest, never looked back.
2. I have never had a full cluster/network outage, so I have not reproduced this behaviour.
3. I have not had that issue.
4. I am currently running the latest pve-kernel-5.4 / stable 6.2-6.

My setup at the moment: 4x DL360 Gen7 + MSA2040 SAS + 1 added shelf
 
