I came across it; it did not help.
But I think I found a reason, though I'm still not sure it caused the issue.
All Proxmox nodes run the chrony service and should sync through it.
I have two more dedicated servers: DC and DC backup,
and the backup had a time issue; after that was fixed, most of the issues resolved.
But I don't...
I have a cluster of around 20 nodes, Proxmox 8.0.4 + Ceph (on 4 of them).
All have the same chronyd configuration, and most of the time everything works.
Recently (in the past month) I started to have issues with time sync. It usually comes after a power failure, when some servers are out of...
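After an event like this, a quick per-node sanity check with the standard chrony CLI can show which node drifted (this is a general sketch, not specific to my setup):

```shell
# "Leap status: Normal" and a small "System time" offset mean the node is in sync.
chronyc tracking

# List configured NTP sources and their reachability ("Reach" should be 377).
chronyc sources -v

# Force an immediate correction after a power event (steps the clock if needed).
chronyc makestep
```

Running `chronyc tracking` on all nodes (e.g. via an ansible ad-hoc command) makes the one bad clock stand out quickly.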
Flow:
1. Servers rebooted due to power maintenance.
2. After the reboot I noticed one server had a bad clock sync (fixing the issue and another reboot solved it).
3. After the time sync was fixed, the cluster started to load and rebalance.
4. It hung at an error state (data looks OK and everything is stable and...
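At step 4, the checks I would run to pin down what is holding the error state (a sketch, assuming a standard Ceph deployment):

```shell
# Overall cluster state (HEALTH_OK / HEALTH_WARN / HEALTH_ERR).
ceph -s

# Which PGs or daemons are actually behind the error state.
ceph health detail

# Confirm all OSDs came back up and in after the reboots.
ceph osd tree

# Report clock skew between monitors - the classic post-power-failure symptom.
ceph time-sync-status
```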
We have a setup of around 30 servers, 4 of them with Ceph storage.
Unfortunately we have many power outages in our building, and the backup battery does not last for long periods, causing the entire cluster to crash (servers, switches, storage).
Most of the time the entire cluster comes back up when the...
I am trying to reboot an LXC host as part of an Ansible script, but I cannot make it work with Ansible.
The default reboot module did not work: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
It printed Socket exception: Connection reset by peer (104) and...
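A common workaround when the reboot module trips over the dropped SSH connection: fire the reboot asynchronously so the connection reset (104) is expected, then poll until the host is reachable again. `lxc_hosts` is a hypothetical inventory group; adjust to yours:

```shell
# Fire-and-forget reboot: -B 60 allows 60s for the task, -P 0 means don't poll,
# so Ansible does not wait on the SSH session that the reboot will kill.
ansible lxc_hosts -b -B 60 -P 0 -m ansible.builtin.shell -a "sleep 2 && /sbin/reboot"

# Block until SSH is back before running the next step.
ansible lxc_hosts -m ansible.builtin.wait_for_connection -a "delay=10 timeout=300"
```

The same pattern works inside a playbook (`async: 60`, `poll: 0`, then a `wait_for_connection` task).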
Unfortunately I rolled back to kernel 5.15 on all hosts with VMs (after the rollback, no issues at all).
We use the servers in production, so I cannot risk another downtime.
The effect is only for VMs, not for LXC.
Setting mitigations=off did not solve the issue; it just reduced the occurrence, presumably due to a more efficient kernel.
I have around 10 Ubuntu VMs where the error occurs repeatedly under load (while running all nodes at 70% CPU capacity); in less than an hour I had the error on at least one of the nodes...
Setting mitigations=off reduced the number of errors (still, the bigger the load, the more errors),
but going back to kernel 5.15 removed the issue entirely.
I want to try it on a new node (freshly installed, not upgraded from 7.4).
The node doesn't have kernel 5.15.116-1-pve installed.
Is this the flow:
proxmox-boot-tool kernel add 5.15.116-1-pve
proxmox-boot-tool kernel pin 5.15.116-1-pve
proxmox-boot-tool refresh
After the reboot, does the kernel...
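My understanding of the full flow on a fresh node (an untested sketch; the exact package name for your 5.15 build is an assumption and should be checked with `apt search pve-kernel-5.15`):

```shell
# The 5.15 kernel package has to be installed before it can be pinned.
apt install pve-kernel-5.15.116-1-pve

# Register it, pin it as the default boot entry, and sync the boot partitions.
proxmox-boot-tool kernel add 5.15.116-1-pve
proxmox-boot-tool kernel pin 5.15.116-1-pve
proxmox-boot-tool refresh

# After the reboot, verify which kernel is actually running and what is pinned.
uname -r
proxmox-boot-tool kernel list
```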
I have stability issues on nodes with high CPU load, and I would like to move back to the kernel that was on 7.4.
What is the best approach?
I am on PVE 8.0.4, kernel 6.2.16-14.
Sure, here:
Both host and VM have mitigations=off in GRUB (the error was more frequent before setting this configuration).
My VM hosts a high CPU load when the error occurs.
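For reference, this is the usual way that flag is set and verified on both the host and the guests (the GRUB_CMDLINE_LINUX_DEFAULT value shown is an example; yours may carry extra options):

```shell
# In /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
# then apply and reboot:
update-grub

# Verify the running kernel actually picked it up:
cat /proc/cmdline
grep . /sys/devices/system/cpu/vulnerabilities/*
```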
proxmox host:
version:
proxmox-ve: 8.0.2 (running kernel: 6.2.16-14-pve)
pve-manager: 8.0.4 (running version...