Node randomly reboots

Jan 4, 2022
Hi everyone.

First, some information about the setup we are running:

• 4 x Proxmox nodes (version 8.3.2) with Ceph installed – cluster without HA

• Separate networks for Ceph (2 x 10 Gbit), Corosync (1 Gbit), and Backup (1 Gbit) - 2 switches (10 Gbit & 1 Gbit)

• 1 x Proxmox Backup Server



Each server is backed up by a separate job.

We have the issue that several servers randomly reboot. Here is the log from the server "node4" that rebooted. I can’t find anything useful there.

Jan 22 21:17:01 node04 CRON[497988]: pam_unix(cron:session): session closed for user root
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 62
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 59
Jan 22 21:19:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 61
Jan 22 21:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 64
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 63
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 62
Jan 22 21:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 63
Jan 22 22:00:04 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 22 22:17:01 node04 CRON[541086]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 22 22:17:01 node04 CRON[541085]: pam_unix(cron:session): session closed for user root
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Jan 22 22:19:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
Jan 22 22:30:02 node04 pmxcfs[1662]: [status] notice: received log
Jan 22 22:40:10 node04 pmxcfs[1662]: [dcdb] notice: data verification successful
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 59
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 58
Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 23:00:05 node04 pmxcfs[1662]: [status] notice: received log
-- Reboot --
Jan 22 23:05:28 node04 kernel: Linux version 6.8.12-5-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) ()
Jan 22 23:05:28 node04 kernel: Command line: initrd=\EFI\proxmox\6.8.12-5-pve\initrd.img-6.8.12-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jan 22 23:05:28 node04 kernel: KERNEL supported cpus:
Jan 22 23:05:28 node04 kernel: Intel GenuineIntel
Jan 22 23:05:28 node04 kernel: AMD AuthenticAMD
Jan 22 23:05:28 node04 kernel: Hygon HygonGenuine
Jan 22 23:05:28 node04 kernel: Centaur CentaurHauls
Jan 22 23:05:28 node04 kernel: zhaoxin Shanghai
Jan 22 23:05:28 node04 kernel: BIOS-provided physical RAM map

There was no backup job running on this server at the time; node3 was being backed up when node4 rebooted. Very strange.

Where can I look for further information, and what can I do about this?

Thanks in advance.
Holger
 
Hello LOGINTechBlog! Just to get an overview of the situation:
  1. Is this a new server? If not, did you change something before the issues started to happen?
  2. Random restarts might be caused by faulty hardware. You can try running memtest to see if you have faulty RAM.
  3. Updating BIOS might help in some situations.
 
Could you please post the server hardware? Do you have VMs running on them? If yes, can you please post their configs as well? Also, please post the output of zpool status.
 
Thank you for your reply!

The server hardware is:

Supermicro AS1015-A MT - 192 GB ECC RAM - 4 x Samsung 4 TB SSD - AMD Ryzen 9 7900X
2 x LAN on board - dual 10 Gbit LAN (Intel X550-T2) & 2 x USB-C network adapters

Three VMs are running on this machine:

Windows SQL Server
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
efidisk0: ceph01:vm-105-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-q35-9.0
memory: 32768
meta: creation-qemu=9.0.2,ctime=1732791155
name: SQL01
net0: virtio=BC:24:11:6F:24:C5,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: ceph01:vm-105-disk-1,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph01:vm-105-disk-3,discard=on,iothread=1,size=500G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0904e55b-5936-4eac-8405-62db0f283c76
sockets: 1
tpmstate0: ceph01:vm-105-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: 091a65ea-1ec6-40de-8a93-353402086da3


SEL Oracle Server
agent: 1
boot: order=scsi0;ide2;net0
cores: 24
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 98308
meta: creation-qemu=9.0.2,ctime=1733906905
name: WAWI
net0: virtio=BC:24:11:E4:11:2F,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ceph01:vm-107-disk-0,discard=on,iothread=1,size=150G,ssd=1
scsi1: ceph01:vm-107-disk-1,discard=on,iothread=1,size=500G,ssd=1
scsi2: ceph01:vm-107-disk-2,discard=on,iothread=1,size=500G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=5373641e-519a-4af3-8391-013af28538d2
sockets: 1
vmgenid: a9ca2e74-5a65-4cc0-970b-be331c8e424a

Windows Server
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES
efidisk0: ceph01:vm-108-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-q35-9.0
memory: 8192
meta: creation-qemu=9.0.2,ctime=1732791155
name: DMS
net0: virtio=BC:24:11:B4:FA:95,bridge=V100,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: ceph01:vm-108-disk-1,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph01:vm-108-disk-3,discard=on,iothread=1,size=1000G,ssd=1
scsi2: ceph01:vm-108-disk-4,discard=on,iothread=1,size=50G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f3ba5161-4aa6-43ea-b14c-878ac3f5a96e
sockets: 1
tpmstate0: ceph01:vm-108-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: 9dad127e-a59a-4a33-8c2b-8d042696fafa


zpool status may not be too helpful because all VMs are on the Ceph storage:

Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Sun Jan 12 00:24:13 2025
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            nvme-eui.002538b141b4c2f4-part3  ONLINE       0     0     0
            nvme-eui.002538b141b4c2e8-part3  ONLINE       0     0     0

errors: No known data errors




 
Thanks for the info.
We have the issue that several servers randomly reboot. Here is the log from the server "node4" that rebooted.
I think I initially misunderstood what you meant. Just to be sure, by "several servers" you mean that the VMs restart, and not the servers that Proxmox VE is installed on?

What I see is that the AMD Ryzen 9 7900X you are using has 12 cores and 24 threads (with SMT). However, you are assigning a total of 8 + 24 + 2 = 34 cores to the VMs. This can work well, but you might want to combine it with a CPU limit to avoid overloading the server. While the server is under load, I recommend using top or htop inside the Linux VMs to monitor CPU usage, or the Task Manager inside the Windows VMs.
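As a quick sanity check, the overcommit can be compared against the thread count; the `qm set` line sketches how a limit would be applied (VMID 107 is the Oracle VM from the configs above, but the limit value is only an illustrative assumption):

```shell
# Total vCPUs assigned to the three VMs (values from the configs above)
ASSIGNED=$((8 + 24 + 2))
THREADS=24   # Ryzen 9 7900X: 12 cores / 24 threads with SMT
echo "assigned=$ASSIGNED threads=$THREADS"

# Cap a VM's CPU time so one guest cannot saturate the whole host (illustrative value):
# qm set 107 --cpulimit 12
```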

Also, it might be useful if you provide a journal of the host as well to see if it reports something unusual.
 
Thanks for your reply. I am talking about the nodes (physical servers). They just reboot without any hint of the cause.

Jan 22 22:49:14 node04 smartd[1264]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 60
Jan 22 23:00:05 node04 pmxcfs[1662]: [status] notice: received log
-- Reboot --

And the reboot happened outside business hours, without any load on this specific server (here node4). I just want to understand the reason and what I can do to prevent it.
 
There are some things that come to mind, but it's difficult to debug further without more information. Such issues can happen due to storage problems (e.g. a damaged SSD/HDD), CPU issues, RAM issues, or wrong voltages. I can recommend trying a few things, but since we have no hints (yet) as to what is happening, I can't promise that this will help.

You can try the following:
  1. Post the output of dmesg.
  2. Check the S.M.A.R.T. values of the disks to see if they report any errors.
  3. Try running a memtest.
  4. Try running a stress test on the CPU.
  5. Try updating the BIOS and other firmware of the motherboard.
  6. Try installing the latest microcode updates for your CPU. For that, you'll need to enable the non-free-firmware Debian repository and install the CPU-vendor specific microcode package (in your case, apt install amd64-microcode).
Depending on how much downtime you can afford, keep in mind that points 3 and 4 take some time, so feel free to try other things first.
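Points 1, 2 and 6 could look roughly like this on the shell (device names are examples; `-b -1` selects the journal of the previous boot, which is often more telling than dmesg after a reboot):

```shell
# Kernel messages and the end of the previous boot's journal:
# dmesg --level=err,warn | tail -n 50
# journalctl -b -1 -p warning
# S.M.A.R.T. health summary for each disk:
# for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do smartctl -H "$d"; done
# AMD microcode (after enabling the non-free-firmware repository):
# apt install amd64-microcode

# The smartctl summary can be filtered down to the verdict line, e.g.:
printf 'SMART overall-health self-assessment test result: PASSED\n' \
    | grep -oE 'PASSED|FAILED'
```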

Sorry for the rather generic answer, but by trying the steps above hopefully you can find more detailed information about what is going wrong.

The hardware is new, so defective hardware is rather unlikely but not impossible. What’s strange is that I’ve seen this issue on some of the other four servers as well.
You are right that it's unusual to have multiple servers that have the same issue. On the other hand, if they all have the same hardware, it might also happen that you got multiple faulty components from a faulty batch. Unlikely, but not impossible.

What can also help are BIOS, firmware and microcode updates, in case you have hardware issues that have already been fixed.
 
Could you please also post the output of pvecm status?

You might also have issues with Ceph/Corosync instability, so it might make sense to check that as well.
 
Here are some of the logs and info you asked for:

Code:
Cluster information
-------------------
Name:             cluster01
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jan 29 13:17:03 2025
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1.141
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.10.11
0x00000002          1 192.168.10.12
0x00000003          1 192.168.10.13
0x00000004          1 192.168.10.14 (local)

Last night another node rebooted during the backup.
 


Thanks for the logs. So, just to be sure...
We have the issue that several servers randomly reboot. Here is the log from the server "node4" that rebooted. I can’t find anything useful there.
Are all servers that you have issues with using Ceph?

The output of dmesg does not show anything unusual except:
Code:
[306714.167850] libceph (f658f337-ebeb-4146-9352-16455366d201 e2197): osd9 down
[306719.564718] libceph (f658f337-ebeb-4146-9352-16455366d201 e2199): osd10 down
[306738.795311] libceph (f658f337-ebeb-4146-9352-16455366d201 e2201): osd8 down
[306779.344698] libceph (f658f337-ebeb-4146-9352-16455366d201 e2203): osd11 down
[307317.082463] libceph (f658f337-ebeb-4146-9352-16455366d201 e2205): osd9 weight 0x0 (out)
[307317.082680] libceph (f658f337-ebeb-4146-9352-16455366d201 e2205): osd10 weight 0x0 (out)
[307342.106113] libceph (f658f337-ebeb-4146-9352-16455366d201 e2208): osd8 weight 0x0 (out)
[307382.121204] libceph (f658f337-ebeb-4146-9352-16455366d201 e2211): osd11 weight 0x0 (out)

My guess at this point is that at some point your server(s) can no longer connect to Ceph, and with no access to storage, the server restarts. This would also explain why we don't see anything useful in the journal, since it can no longer be written.

Could you please also do the following:
  1. Post the output of pveceph status
  2. Post the configuration from /etc/network/interfaces (assuming that your servers are configured similarly; otherwise please post the config of all servers)
It would generally be good to keep an eye on Ceph if you see anything out of the ordinary, especially shortly before a restart.
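One low-effort way to watch for this is to scan the kernel log for libceph OSD state changes, since those already showed up before (the sample line below is taken from the dmesg output above):

```shell
# Pull libceph OSD down/out events out of a kernel log stream.
osd_events() {
    grep -oE 'osd[0-9]+ (down|weight 0x0 \(out\))'
}

# In practice: dmesg --follow | osd_events
# Demo with one of the lines from above:
echo '[306714.167850] libceph (f658f337-ebeb-4146-9352-16455366d201 e2197): osd9 down' \
    | osd_events
```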
 
The problem with our Ceph storage is quite new; it only appeared last night. When I opened this thread, Ceph was okay. Since yesterday, one node has stopped and marked 3 of its 4 disks as out. I tried to bring them back in and start them, but had no luck.

Here is the status:

Code:
pveceph status
  cluster:
    id:     f658f337-ebeb-4146-9352-16455366d201
    health: HEALTH_WARN
            83 daemons have recently crashed
 
  services:
    mon: 4 daemons, quorum node01,node02,node03,node04 (age 12h)
    mgr: node04(active, since 12h), standbys: node03, node02, node01
    osd: 16 osds: 13 up (since 12h), 13 in (since 36h)
 
  data:
    pools:   4 pools, 193 pgs
    objects: 650.14k objects, 2.5 TiB
    usage:   7.3 TiB used, 40 TiB / 47 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   20 KiB/s rd, 1.2 MiB/s wr, 2 op/s rd, 136 op/s

All servers are configured the same; here is the network config:

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
#Management

iface eno2 inet manual
#WAN

iface enxbe3af2b6059f inet manual
#BMC

auto enp1s0f0
iface enp1s0f0 inet manual
        mtu 9000
#Storage

auto enp1s0f1
iface enp1s0f1 inet manual
        mtu 9000
#Storage

iface enxc8a3624fc2e6 inet manual
#BAK

iface enxc8a3624ea7fc inet manual
#DMZ

auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        mtu 9000
#Storage Network

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.11/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#Management

auto vmbr1
iface vmbr1 inet manual
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0
#WAN

auto vmbr2
iface vmbr2 inet static
        address 10.0.0.11/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000
#Storage Network

auto vmbr3
iface vmbr3 inet static
        address 172.16.10.11/24
        bridge-ports enxc8a3624fc2e6
        bridge-stp off
        bridge-fd 0
#Backup Network

auto vmbr4
iface vmbr4 inet manual
        bridge-ports enxc8a3624ea7fc
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#DMZ Network

source /etc/network/interfaces.d/*

I will try to update the firmware of the servers over the weekend. I saw in some posts that the combination of an AMD CPU and Samsung SSDs (870 EVO) might also be an issue.

Any help is appreciated
 
Here is another update to this issue.

During the backup I saw that some OSDs were reported as unavailable but came back within a few seconds. I am wondering what could be causing this issue.

Code:
2025-01-31T21:26:37.709202+0100 mgr.node04 (mgr.11276218) 38817 : cluster [DBG] pgmap v38807: 193 pgs: 12 active+clean+laggy, 181 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 8.0 KiB/s rd, 32 KiB/s wr, 3 op/s
2025-01-31T21:26:38.019180+0100 osd.11 (osd.11) 25 : cluster [WRN] 1 slow requests (by type [ 'waiting for sub ops' : 1 ] most affected pool [ 'ceph01' : 1 ])
2025-01-31T21:26:39.057345+0100 osd.11 (osd.11) 26 : cluster [WRN] 1 slow requests (by type [ 'waiting for sub ops' : 1 ] most affected pool [ 'ceph01' : 1 ])
2025-01-31T21:26:39.364004+0100 mon.node01 (mon.0) 16578 : cluster [DBG] osd.15 reported failed by osd.3
2025-01-31T21:26:39.709538+0100 mgr.node04 (mgr.11276218) 38818 : cluster [DBG] pgmap v38808: 193 pgs: 12 active+clean+laggy, 181 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 8.0 KiB/s rd, 33 KiB/s wr, 3 op/s
2025-01-31T21:26:40.018997+0100 osd.11 (osd.11) 27 : cluster [WRN] 1 slow requests (by type [ 'waiting for sub ops' : 1 ] most affected pool [ 'ceph01' : 1 ])
2025-01-31T21:26:40.097467+0100 osd.7 (osd.7) 29 : cluster [WRN] 2 slow requests (by type [ 'waiting for sub ops' : 2 ] most affected pool [ 'ceph01' : 2 ])
2025-01-31T21:26:40.409966+0100 mon.node01 (mon.0) 16579 : cluster [DBG] osd.15 reported failed by osd.5
2025-01-31T21:26:40.410010+0100 mon.node01 (mon.0) 16580 : cluster [INF] osd.15 failed (root=default,host=node04) (2 reporters from different host after 23.046012 >= grace 20.000000)
2025-01-31T21:26:40.800881+0100 mon.node01 (mon.0) 16581 : cluster [DBG] osd.15 failure report canceled by osd.5
2025-01-31T21:26:40.974014+0100 mon.node01 (mon.0) 16582 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2025-01-31T21:26:40.996927+0100 mon.node01 (mon.0) 16583 : cluster [DBG] osdmap e3339: 16 total, 12 up, 13 in
2025-01-31T21:26:41.709891+0100 mgr.node04 (mgr.11276218) 38819 : cluster [DBG] pgmap v38810: 193 pgs: 7 peering, 10 stale+active+clean, 2 active+clean+laggy, 174 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 24 MiB/s rd, 246 KiB/s wr, 33 op/s
2025-01-31T21:26:41.995162+0100 mon.node01 (mon.0) 16584 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
2025-01-31T21:26:41.995175+0100 mon.node01 (mon.0) 16585 : cluster [WRN] Health check update: 3 slow ops, oldest one blocked for 35 sec, daemons [osd.1,osd.11,osd.2,osd.6,osd.7] have slow ops. (SLOW_OPS)
2025-01-31T21:26:42.018007+0100 mon.node01 (mon.0) 16586 : cluster [DBG] osdmap e3340: 16 total, 12 up, 13 in
2025-01-31T21:26:43.710253+0100 mgr.node04 (mgr.11276218) 38820 : cluster [DBG] pgmap v38812: 193 pgs: 3 active+undersized+degraded+wait, 7 peering, 9 stale+active+clean, 1 active+clean+laggy, 1 active+undersized+wait, 172 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 36 MiB/s rd, 293 KiB/s wr, 43 op/s; 15300/1950537 objects degraded (0.784%)
2025-01-31T21:26:44.028023+0100 mon.node01 (mon.0) 16595 : cluster [WRN] Health check failed: Degraded data redundancy: 15300/1950537 objects degraded (0.784%), 3 pgs degraded (PG_DEGRADED)
2025-01-31T21:26:44.372055+0100 osd.15 (osd.15) 50 : cluster [WRN] Monitor daemon marked osd.15 down, but it is still running
2025-01-31T21:26:44.372058+0100 osd.15 (osd.15) 51 : cluster [DBG] map e3340 wrongly marked me down at e3339
2025-01-31T21:26:44.372114+0100 mon.node01 (mon.0) 16596 : cluster [INF] osd.15 marked itself dead as of e3340
2025-01-31T21:26:45.065870+0100 mon.node01 (mon.0) 16597 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2025-01-31T21:26:45.088906+0100 mon.node01 (mon.0) 16598 : cluster [INF] osd.15 [v2:10.0.0.14:6800/3204135080,v1:10.0.0.14:6802/3204135080] boot
2025-01-31T21:26:45.088922+0100 mon.node01 (mon.0) 16599 : cluster [DBG] osdmap e3341: 16 total, 13 up, 13 in
2025-01-31T21:26:45.710736+0100 mgr.node04 (mgr.11276218) 38821 : cluster [DBG] pgmap v38814: 193 pgs: 3 active+undersized+degraded+wait, 30 peering, 1 active+undersized+wait, 159 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 71 MiB/s rd, 500 KiB/s wr, 78 op/s; 15369/1950537 objects degraded (0.788%)
2025-01-31T21:26:46.110769+0100 mon.node01 (mon.0) 16600 : cluster [DBG] osdmap e3342: 16 total, 13 up, 13 in
2025-01-31T21:26:47.075625+0100 mon.node01 (mon.0) 16601 : cluster [WRN] Health check update: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2025-01-31T21:26:47.075642+0100 mon.node01 (mon.0) 16602 : cluster [WRN] Health check update: 2 slow ops, oldest one blocked for 34 sec, osd.7 has slow ops (SLOW_OPS)
2025-01-31T21:26:47.711060+0100 mgr.node04 (mgr.11276218) 38822 : cluster [DBG] pgmap v38816: 193 pgs: 3 active+undersized+degraded+wait, 31 peering, 1 active+undersized+wait, 158 active+clean; 2.5 TiB data, 7.3 TiB used, 40 TiB / 47 TiB avail; 35 MiB/s rd, 134 KiB/s wr, 30 op/s; 15369/1950537 objects degraded (0.788%)
2025-01-31T21:26:48.138068+0100 osd.15 (osd.15) 52 : cluster [DBG] 4.5 scrub starts
2025-01-31T21:26:48.139072+0100 osd.15 (osd.15) 53 : cluster [DBG] 4.5 scrub ok
 
The problem with our Ceph storage is quite new; it only appeared last night. When I opened this thread, Ceph was okay. Since yesterday, one node has stopped and marked 3 of its 4 disks as out. I tried to bring them back in and start them, but had no luck.
You can try to look at earlier logs in the journal to see if they report any other issues. Until then, I will concentrate on the Ceph issues, as these are the ones causing trouble at the moment. My guess is that all your issues are related to storage, but please check earlier logs to see whether that is true.

Code:
pveceph status
  cluster:
    id:     f658f337-ebeb-4146-9352-16455366d201
    health: HEALTH_WARN
            83 daemons have recently crashed
This does not look good. Can you please confirm that all daemon crashes are due to slow requests?
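Ceph's crash module can answer that; a sketch (the crash ID is a placeholder, and the cluster commands are shown commented since they need the live cluster):

```shell
# ceph crash ls                 # list the 83 recent crashes with timestamps
# ceph crash info <crash-id>    # backtrace / failed assertion of one crash
# ceph crash archive-all        # clear the HEALTH_WARN once reviewed

# Counting distinct crashing daemons from `ceph crash ls`-style output:
printf 'ID ENTITY NEW\nx osd.8 *\ny osd.9 *\nz osd.8 *\n' \
    | awk 'NR > 1 && !seen[$2]++ { n++ } END { print n }'
```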

Code:
io:
    client:   20 KiB/s rd, 1.2 MiB/s wr, 2 op/s rd, 136 op/s
This is very bad. I would definitely try to do some benchmarks (network, storage, etc.) and try to find out why this is happening.
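Some benchmark candidates (pool name and addresses taken from this thread; the cluster commands are commented since they need the live setup). As a side note, 1.2 MiB/s spread over ~136 ops is roughly 9 KiB per op, i.e. many small writes, which is exactly the access pattern that stresses storage latency:

```shell
# Storage-network throughput between two nodes:
# iperf3 -s                        # on one node
# iperf3 -c 10.0.0.12 -P 4 -t 30   # from another node
# Ceph-level write benchmark (creates test objects; clean up afterwards):
# rados bench -p ceph01 60 write --no-cleanup
# rados -p ceph01 cleanup

# Average write size implied by the io line above:
awk 'BEGIN { printf "%.0f KiB/op\n", 1.2 * 1024 / 136 }'
```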

I will try to update the firmware of the servers on the weekend. I saw in some posts that there are also the combination of AMD CPU and the Samsung SSD (870 EVO) might be an issue

My guess at this point is that you might have issues because you're using consumer-grade SSDs instead of enterprise-grade ones. Consumers have rather different requirements than datacenters (including storage platforms like Ceph). In a consumer setting (e.g. a typical desktop PC or laptop), SSDs/HDDs are idle most of the time, but when they are used, they have to be very fast; they don't need to sustain high speeds over longer periods. In a datacenter the situation is very different: while speed also matters, it's even more important to have consistent read/write speeds over long periods of time, as well as a high TBW rating for long-term reliability. Consumer-grade SSDs might degrade quickly, as they have a low TBW. I recommend looking for "Enterprise" or "Datacenter" SSDs.

While you still have to check whether the SSDs are the issue, either way it seems that storage can't keep up with the amount of data, so as soon as you put it under pressure (increasing latency), Ceph starts having issues, causing your servers to restart and losing logs.
 
Thank you for your reply.

I was quite busy over the weekend.


I have done this so far:
  • Updated firmware of every server to the most recent version
  • Updated Proxmox server to the most recent version
  • Wiped the crashed 3 SSD and took them back as OSD to Ceph
  • Deactivated NCQ for the SSDs in GRUB: GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=noncq"
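A cautionary note on that kernel option: the boot log earlier in the thread (`root=ZFS=rpool/ROOT/pve-1`, EFI initrd path) suggests these nodes boot via proxmox-boot-tool/systemd-boot, where the kernel command line comes from /etc/kernel/cmdline rather than /etc/default/grub. A hedged sketch of applying and verifying the option:

```shell
# GRUB boot:               edit /etc/default/grub, then regenerate the config:
# update-grub
# systemd-boot (ZFS root): append the option to /etc/kernel/cmdline, then:
# proxmox-boot-tool refresh
# After rebooting, confirm the option is really active:
# grep -o 'libata.force=noncq' /proc/cmdline

# Demo of the verification grep against a sample command line:
printf '%s\n' 'root=ZFS=rpool/ROOT/pve-1 boot=zfs libata.force=noncq' \
    | grep -o 'libata.force=noncq'
```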
I know that consumer-grade SSDs are not ideal for server use, but these 4 servers have a very low load and there are not that many IOPS. We just have 10 VMs on 4 servers...

So far, it looks stable now:

Code:
2025-02-03T10:38:04.827655+0100 mgr.node01 (mgr.11905275) 72171 : cluster [DBG] pgmap v72378: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 1.8 MiB/s rd, 28 MiB/s wr, 452 op/s
2025-02-03T10:38:06.828029+0100 mgr.node01 (mgr.11905275) 72172 : cluster [DBG] pgmap v72379: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 381 KiB/s rd, 22 MiB/s wr, 290 op/s
2025-02-03T10:38:08.828408+0100 mgr.node01 (mgr.11905275) 72173 : cluster [DBG] pgmap v72380: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 172 KiB/s rd, 21 MiB/s wr, 297 op/s
2025-02-03T10:38:10.828935+0100 mgr.node01 (mgr.11905275) 72174 : cluster [DBG] pgmap v72381: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 192 KiB/s rd, 21 MiB/s wr, 368 op/s
2025-02-03T10:38:12.829264+0100 mgr.node01 (mgr.11905275) 72175 : cluster [DBG] pgmap v72382: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 187 KiB/s rd, 16 MiB/s wr, 303 op/s
2025-02-03T10:38:14.829815+0100 mgr.node01 (mgr.11905275) 72176 : cluster [DBG] pgmap v72383: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 271 KiB/s rd, 16 MiB/s wr, 376 op/s
2025-02-03T10:38:16.830164+0100 mgr.node01 (mgr.11905275) 72177 : cluster [DBG] pgmap v72384: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 169 KiB/s rd, 3.4 MiB/s wr, 230 op/s
2025-02-03T10:38:18.830557+0100 mgr.node01 (mgr.11905275) 72178 : cluster [DBG] pgmap v72385: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 166 KiB/s rd, 2.3 MiB/s wr, 230 op/s
2025-02-03T10:38:20.831097+0100 mgr.node01 (mgr.11905275) 72179 : cluster [DBG] pgmap v72386: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 205 KiB/s rd, 2.3 MiB/s wr, 277 op/s
2025-02-03T10:38:22.831410+0100 mgr.node01 (mgr.11905275) 72180 : cluster [DBG] pgmap v72387: 193 pgs: 193 active+clean; 2.4 TiB data, 7.2 TiB used, 51 TiB / 58 TiB avail; 149 KiB/s rd, 1.4 MiB/s wr, 186 op/s


We have not installed ipmitool (yet), but the IPMI console of the server does not contain any helpful information; everything looks quite normal. As far as I can tell, the server reboots are caused by the Ceph issues.

Tonight we are running a backup again, and I will check the logs during the backup to see if there is anything unusual.

I will keep you updated.
 
After many tests and no real solution to the problem, I have decided to replace the built-in Samsung SSDs with new enterprise SSDs.



So my question is: What is the best approach? Should I create a backup and destroy the existing Ceph setup? Or should I replace all the drives at once? The new drives have 1.9TB, while the old ones have 4TB. Each server has four drives installed.



I would be very grateful for any tips and a more detailed guide.
 
How you do it exactly is your choice, but with Ceph you can replace disks without destroying everything first. The chapter on Replacing OSDs in the documentation explains this well - make sure to wait for the health checks until the OSD is safe to destroy, just to be on the safe side.
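The per-OSD replacement cycle boils down to something like the following. IDs and device names are placeholders, and everything is commented out on purpose since these commands act on the live cluster; this is a sketch, not a definitive procedure:

```shell
# ceph osd out <id>
# ceph osd safe-to-destroy <id>        # repeat until it reports the OSD is safe to destroy
# systemctl stop ceph-osd@<id>.service
# pveceph osd destroy <id> --cleanup
# pveceph osd create /dev/<new-disk>   # the replacement enterprise SSD
# ...then wait for HEALTH_OK before touching the next OSD
```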

Let us know if you have any questions!
 
Thank you for your reply. Due to the long time it would take to replace each of the 16 OSDs one by one, that is not feasible for us. I think we need to get rid of the existing Ceph cluster and create a fresh one. How can we proceed with that?
 
Alright, then I can recommend the following:
  1. Make a backup of all important data from the Ceph cluster you are trying to replace. Destroying the cluster CANNOT be undone!
  2. The command for destroying a Ceph cluster is pveceph purge. This will show you what you need to do in order to remove everything. Make sure to execute that command on each of the nodes of the Ceph cluster. You can either do the requested steps in the web UI, or by using the pveceph CLI tool. In short:
    • Remove all pools
    • Stop and destroy all OSDs
    • Stop and destroy all Metadata Servers (MDS)
    • Stop and destroy all Managers except the last one
    • Stop and destroy all Monitors except the last one
    • On each of the nodes except the last one, execute pveceph purge
    • On the last node, execute systemctl stop ceph-mgr.target and systemctl stop ceph-mon.target.
    • On the last node, execute pveceph purge
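The steps above map roughly to these commands (names in <> are placeholders; everything here is destructive, which is why it is shown commented):

```shell
# pveceph pool destroy <pool>                      # for every pool
# pveceph osd destroy <id> --cleanup               # for every OSD (stop it first)
# pveceph mds destroy <name>                       # if any MDS exist
# pveceph mgr destroy <name>                       # all managers except the last
# pveceph mon destroy <name>                       # all monitors except the last
# pveceph purge                                    # on every node except the last
# systemctl stop ceph-mgr.target ceph-mon.target   # on the last node
# pveceph purge                                    # on the last node
```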
I hope this helps!