Hi,
We have been running Proxmox + Ceph since 2017 on 15 hosts: AMD Opteron(tm) Processor 6380 (2 sockets) and AMD EPYC 7513.
After the latest update on 8 May 2024, three Opteron hosts locked up: red X, no ping, no SSH, all VMs with a grey (?) mark.
After a reboot everything was OK.
Six hours later two other hosts (Opteron) locked up,
and then one AMD EPYC 7513.
After that reboot the Ceph storage shows grey (?) marks, some VMs no longer start, and a restore failed:
no lock found trying to remove 'create' lock
error before or during data restore, some or all disks were not completely restored. VM 206 state is NOT cleaned up.
ceph health detail:
HEALTH_OK
ceph osd df also looks OK (max 71%)
Found this in the logs:
May 13 15:32:52 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: got timeout
May 13 15:32:57 benno pvestatd[2186]: status update time (15.338 seconds)
May 13 15:33:02 benno pvestatd[2186]: got timeout
May 13 15:33:08 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: got timeout
May 13 15:33:13 benno pvestatd[2186]: status update time (15.351 seconds)
May 13 15:33:18 benno pvestatd[2186]: got timeout
May 13 15:33:18 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:23 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: got timeout
May 13 15:33:28 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:33 benno pvestatd[2186]: got timeout
May 13 15:33:38 benno pvestatd[2186]: got timeout
May 13 15:33:41 benno pmxcfs[467101]: [status] notice: received log
May 13 15:33:43 benno pvestatd[2186]: got timeout
May 13 15:33:43 benno pvestatd[2186]: status update time (15.335 seconds)
May 13 15:33:48 benno pvestatd[2186]: got timeout
May 13 15:33:53 benno pvestatd[2186]: got timeout
ansible cluster -m shell -a "grep microcode /proc/cpuinfo | uniq"
udo.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kalle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
bruno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
felix.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
daniel.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
egon.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
fritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
andre.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
bernd.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
otto.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
moritz.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kulle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
ralf.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
benno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
paul.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
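To see at a glance which hosts share a microcode revision, the output above could be grouped with a small script. This is only a rough sketch: `group_by_microcode` is a made-up helper name, the sample is a shortened copy of the output above, and the parsing assumes exactly the two-line format that ansible printed here.

```python
from collections import defaultdict


def group_by_microcode(ansible_output: str) -> dict[str, list[str]]:
    """Map each microcode revision to the short names of hosts reporting it."""
    groups = defaultdict(list)
    host = None
    for line in ansible_output.splitlines():
        line = line.strip()
        if "| CHANGED |" in line:
            # Host line, e.g. "udo.intern.bfw-dresden.de | CHANGED | rc=0 >>"
            host = line.split("|", 1)[0].strip().split(".", 1)[0]
        elif line.startswith("microcode") and host is not None:
            # Result line, e.g. "microcode : 0x6000822"
            revision = line.split(":", 1)[1].strip()
            groups[revision].append(host)
            host = None
    return dict(groups)


# Shortened sample taken from the ansible output above.
sample = """\
udo.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
kalle.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000832
egon.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa10113e
andre.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0xa001119
bruno.intern.bfw-dresden.de | CHANGED | rc=0 >>
microcode : 0x6000822
"""

for revision, hosts in sorted(group_by_microcode(sample).items()):
    print(revision, ", ".join(hosts))
```

Feeding it the full output would show whether the hosts that locked up all sit in the same revision group.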
What can we do?
Best regards,
Konrad