HELP NEEDED! HA cluster, a Virtualmin VM stops (but only when using a particular node)

tenfoldinaus · May 6, 2024

HA cluster with identical nodes,
but a Virtualmin instance stops (but only when running on one particular node). It's a VM, not LXC, because I wanted live migration and LXC was causing issues way back.

Weird thing is, I get a 503 Cloudflare error for the WP website at around 2pm every afternoon if I run it on this particular node,
but if it's running on any other node it doesn't happen. I'm at a loss as to what's causing it. I checked scheduled backups etc., but there is nothing.
Also, I can still reach the Virtualmin dashboard through the assigned HTTPS port.

I looked in all the Virtualmin logs: they show no problems at all and the site appears to be running perfectly.
And nothing shows up on Proxmox VE as far as I can see.
But around 2pm without fail, the actual website front end stops, AND a reboot command in the CLI fails, a shutdown command from the CLI fails, and a shutdown command in the Proxmox GUI fails too. The only thing that works is the STOP command from the Proxmox GUI.
After that, a restart gets everything working perfectly again. If I don't do this, the website front end never recovers; however, EVERYTHING else still appears to work flawlessly.
If I move the VM to another node, the issue never happens.

Very, very strange.

Here is the setup.

Each node contains identical kit, as follows:
Lenovo P620, Threadripper 5995WX, 64-core
1024GB RAM
8x 2TB v4 enterprise NVMe SSD, Ceph
1x 2TB v5 enterprise NVMe, XFS, operating disk
1x 2TB v5 enterprise NVMe, XFS, HS Ceph
2x 16TB 10,000 rpm HDD, XFS, Ceph backup storage
1x 8GB host GPU
1x 16GB Tesla M1 GPU for vGPU, shared across all three nodes
2x 25Gb SFP Mellanox cards, both running as bonded 50Gb uplinks to an agg switch (1x 50Gb is dedicated to Ceph)
1x 10Gb RJ45 built-in adapter (used as management port)
|
|
(50Gb link to Ubiquiti agg enterprise switch)
|
|
(Ubiquiti UDM Pro Max 10Gb router)
Connection is 10Gb symmetrical high-CoS internet.

Every node setup has:
swap set to = 10

And the VM setup:
base = Ubuntu 22.04 (tteck script)
SSH enabled
QEMU guest agent enabled
CPU type = host
memory = 32GB
swap = 0GB
storage = CephFS on the large NVMe setup
HDD discard = yes
HDD IO thread = yes
HDD cache = write through
CPU = 1 socket, 24 cores
SCSI controller = VirtIO SCSI single
BIOS = OVMF (UEFI)

network device = vmbr0, MTU = 9000
network multiqueue = 24
VLAN tag = 70

I really have no idea why this happens at the same time every day
on just one node, or why the VM also stops responding to CLI shutdown and reboot commands.
Even worse, I have no idea why nothing at all appears in the Apache or PHP logs in Virtualmin.
I really believe it's to do with something silly in this particular node's settings, but I'll be damned if I can find out what!!!
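
One way to pin down exactly when the front end drops while the dashboard stays up would be to poll both from another machine and log timestamps, then line those up against the node and VM logs. A minimal sketch, assuming placeholder URLs (the real site address and Virtualmin port are not given above) and hypothetical helper names `probe`, `log_line` and `watch`:

```python
import time
import urllib.error
import urllib.request
from datetime import datetime


def probe(url, timeout=10, opener=urllib.request.urlopen):
    """Do one GET and return (status, seconds).

    status is the HTTP code (e.g. 200 or 503), or an error string
    if no HTTP response came back at all (timeout, TLS failure, ...).
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; keep the code.
        status = exc.code
    except Exception as exc:
        status = f"error: {exc.__class__.__name__}"
    return status, time.monotonic() - start


def log_line(when, name, status, seconds):
    """Format one timestamped log entry."""
    stamp = when.isoformat(timespec="seconds")
    return f"{stamp} {name} status={status} time={seconds:.2f}s"


def watch(targets, interval=30, logfile="probe.log"):
    """Poll every target forever, appending one line per probe."""
    while True:
        for name, url in targets.items():
            status, seconds = probe(url)
            with open(logfile, "a") as fh:
                fh.write(log_line(datetime.now(), name, status, seconds) + "\n")
        time.sleep(interval)


# Example call (placeholder addresses, not the real ones):
# watch({
#     "site": "https://example.com/",
#     "dashboard": "https://example.com:10000/",
# })
```

If the log shows response times creeping up before the 503 starts, rather than a sudden cut-off at 2pm, that would hint at something filling up or stalling inside the VM instead of a scheduled job. Note that a dashboard with a self-signed certificate will be logged as an error by this sketch rather than as a status code.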

tenfoldinaus · May 6, 2024

UPDATE: it's happening on the other nodes too...
Must be the VM.
