Angry cluster config, now getting lots of 400 errors from web UI

seanmahrt

New Member
Jul 4, 2023
PVE version 8.1.3

So I finally fixed my 2-node cluster after some LACP bonding issues with corosync (I think). I was able to split one of the nodes out of the cluster and re-add it. That was a painful couple of hours. :)

Now I was going to set HA back up after I cleaned it all out. When I go to add a pool (not a resource pool), I get:

Parameter verification failed. (400)

poolid: type check ('string') failed - got ARRAY

This is the error when I go to create a backup at the cluster level:


Parameter verification failed. (400)

poolid: property is not defined in schema and the schema does not allow additional properties



I get a lot of 400 errors trying to mess with settings in the cluster manager. Is there a hidden database I need to clean out too? I cleaned out some stale "deleting" lines in ha-manager by shutting down the cluster software and cleaning out the HA status file.
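
For reference, the kind of cleanup I mean looked roughly like this (paths from the standard PVE layout; double-check everything on your own install before deleting anything):

# stop the HA services on every node before touching any state
systemctl stop pve-ha-lrm pve-ha-crm
# the HA config and runtime state live on the cluster filesystem
ha-manager status                  # what the HA stack still thinks it manages
cat /etc/pve/ha/resources.cfg      # HA resource definitions
cat /etc/pve/ha/manager_status     # runtime state (the "status file" I cleaned out)
# remove stale resources the proper way where possible (vm:100 is just an example ID)
ha-manager remove vm:100
# then bring the HA services back up
systemctl start pve-ha-lrm pve-ha-crm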

There must be some angry remnants left in there somewhere.
 
Back up the VMs, wipe the nodes, reinstall from scratch, and restore the VMs....
 
Yeah, that'd be the easiest, assuming I had a place to put all that stuff.

How do I completely clear the cluster information out of a node? I'm thinking what I'll do is separate the two nodes, clear the cluster off the tiny one, reinstall Proxmox from scratch, and create a new cluster on the tiny node. Then I can clean the cluster off the big node and join it to the new cluster.

Would that reset all the craziness from when corosync/LACP went berserk?

Maybe I can just break the cluster, clean the cluster data off the big one, and create a new cluster there. The reason I was thinking of the small node is that way it'd be 100% clean and the big node would just clone the config from the small node....
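
For anyone finding this later: the documented way to strip the cluster config from a node without reinstalling looks roughly like the sketch below. Treat it as a sketch, not gospel, and check the current pvecm chapter of the admin guide first; <nodename> is a placeholder.

# on the node being pulled out (guests migrated away or stopped first)
systemctl stop pve-cluster corosync
pmxcfs -l                          # restart the cluster filesystem in local mode
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
# on a node that stays in the old cluster, drop the departed member
pvecm delnode <nodename>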
 
Maybe I was not clear enough. I mean: WIPE both nodes and start from scratch. Create an offline backup of your VMs and reinstall both nodes. Then form your completely new cluster and restore the backups.
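
A rough sketch of that backup-and-restore cycle with the stock tools (the storage name, guest IDs, and archive paths are made up; adjust to your setup):

# offline backups before the wipe ("stop" mode shuts the guest down for a consistent dump)
vzdump 100 --mode stop --storage backup-nfs
vzdump 101 --mode stop --storage backup-nfs
# ...reinstall both nodes and form the new cluster...
# restore on the rebuilt cluster
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-100-<timestamp>.vma.zst 100
pct restore 101 /mnt/pve/backup-nfs/dump/vzdump-lxc-101-<timestamp>.tar.zst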
 
Please post the output of "pveversion -v" for both nodes, and ensure you are using the UI with a clean slate (e.g., clear your browser cache / force-reload the UI).
 
Hey Fabian, I ended up seeing this a day late. :)

But I did confirm they were on the same version (ran apt update on both).

I ended up migrating all the VMs/CTs over to the "big" node, removing the "little" node properly, reinstalling Proxmox on the "little" one, and creating a new cluster. Then I forcefully removed the cluster information from the "big" one (after taking a copy of /etc/pve), added the "big" one to the new cluster, copied over storage.cfg and the VM/CT definitions, and rebooted "big".
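
Roughly, in commands (the backup path is just a placeholder and "big"/"little" stand in for the nodes' actual hostnames; verify before copying anything back):

# join the cleaned node to the new cluster (run on "big", pointing at "little")
pvecm add <ip-of-little>
# restore the datacenter-wide storage definitions from the saved copy of /etc/pve
cp /root/pve-etc-backup/storage.cfg /etc/pve/storage.cfg
# put the guest definitions back under the node's own config directory
cp /root/pve-etc-backup/nodes/big/qemu-server/*.conf /etc/pve/nodes/big/qemu-server/
cp /root/pve-etc-backup/nodes/big/lxc/*.conf /etc/pve/nodes/big/lxc/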

All seems OK. Next time I have issues, I'll try clearing the cache or using private browsing mode to see if it's a web cache issue.

I'm still hesitant to plug in the other LACP link on "big", as I think that's what made corosync go crazy, since my cluster management and general network traffic are on the same port. The "little" node is essentially an Intel NUC.

Have there been any recent issues with newer kernels losing packets on LACP links? The only thing I can think of is that the nodes lost connectivity with each other, only occasionally reached the qdevice, and the config got out of sync somehow. It made for a rough couple of days (home server, but it runs a lot of the home automation stuff).
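
(Side note: the usual mitigation for corosync getting starved by other traffic is to give it its own link, or at least a second fallback link, instead of riding the busy bond. A hypothetical example with made-up addresses and cluster name, using the pvecm link options:)

# create the cluster with a dedicated corosync link plus a fallback over the regular LAN
pvecm create homelab --link0 10.10.10.1 --link1 192.168.1.10
# join the second node with matching links
pvecm add 10.10.10.1 --link0 10.10.10.2 --link1 192.168.1.11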

Thanks everybody (and the proxmox documentation) for the help

Sean
Maybe I was not clear enough. I mean to WIPE both nodes and start from scratch. Create an offline backup of your VMs and reinstall both Nodes. Then form your complete NEW cluster and restore the backups.
I got that, but I was hoping to avoid it.
 
Well, this sucks! I just got my three-node test cluster all set up, with Ceph and the backup server squared away, and then boom! Out of nowhere, the same pool-related schema errors. No clue as to what triggered it.

So now what? Do I have to start over again? This will seriously dent any confidence we had in bringing Proxmox into our datacenter for production workloads.

Bob

UPDATE: "... ensure you are using the UI with a clean slate (e.g., clear the cache of your browser/force-reload the UI)"

Clearing the browser cache for the UI by forcing a reload (Shift+refresh in Chrome) resolved the issue, and I can now create pools and backup jobs.

Super relieved!
 
Holy balls, I can't believe clearing the browser cache fixed this problem for me. THANK YOU FOR POSTING THIS! It's INSANE that this fixed it!

It's just as insane that the default solution in this thread is to wipe all nodes and rebuild... just for something like this... I had never seen this problem until today, and I've been working with PVE for over 12 years!
 
Why is this necessary? What's causing the issue?
I can't speak for fabian, but in my case I _THINK_ it might have been because I added two VMs to a Resource Pool (when they previously were not members of any Resource Pool). The actual reason really wasn't clear, though. Clearing the cache with CTRL+F5 seemed to do the "magical trick" in my case. As much as I don't "like" the solution, that seemed to be it for me.
 
I mostly asked because I would consider it a bug for the JS SPA to be running in some weird state because of anything in the browser cache, in particular after re-login on a healthy node. In that case, a bug report should be filed and assigned priority for fixing. The worst errors of all are the ones that require some "magic" knowledge to fix instead of just acting on a good error message.
 
