Ceph cluster broke after updating Proxmox to the latest 6.4, please help

Xenocit

Member
Jul 7, 2021
I am in serious trouble. I am running a 4-node PVE cluster in the office.
Today I started updating the nodes one by one to the latest 6.4 version in order to prepare for the Proxmox 7 upgrade.
After I updated and restarted 2 of the nodes, Ceph seemed to degrade and started complaining that the other 2 nodes were running older Ceph versions.
At this point everything went south - VMs hung. I rushed to upgrade and restart the remaining 2 nodes.

The PVE cluster is now UP - all nodes are green - but the Ceph cluster is not. I get a timeout (500) in the web interface.

Every PVE node has 4 SSD OSDs in the cluster, and all but 1 of the VMs use Ceph... I have no idea what to do at this point. I am in very, very BIG trouble if I don't recover the cluster - I am positive that the drives are healthy.

Unfortunately I have no subscription, but I am ready to get one ASAP if anyone can please help me!
 
The following might help to get an overview of the situation:
pveversion -v on all nodes
ceph -s
 
ceph -s just hangs on all 4 nodes - no output at all.
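A `ceph -s` that hangs on every node usually means monitor quorum is gone, but each local mon daemon can still be queried through its admin socket, which does not need quorum. A minimal diagnostic sketch, assuming the default Nautilus admin-socket path and that mon IDs match hostnames (adjust both if your setup differs):

```shell
# Query this node's monitor directly over its admin socket.
# Works even when the cluster-wide `ceph -s` hangs, because the admin
# socket bypasses the (broken) cluster connection.
mon_state() {
    host="$(hostname)"
    asok="/var/run/ceph/ceph-mon.${host}.asok"
    if [ ! -S "$asok" ]; then
        echo "no monitor admin socket at $asok (is ceph-mon running here?)" >&2
        return 1
    fi
    # Reports this mon's own view: its state (probing/electing/leader/peon),
    # its monmap, and which peers it currently sees.
    ceph daemon "mon.${host}" mon_status
}
```

Running `mon_state | grep -E '"state"|"quorum"'` on each node would show whether any mon ever leaves the "probing" state; if all four report "probing", none of them can reach a peer.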

We are ready to provide remote access (SSH or IPMI KVM to the nodes), as we are getting desperate here. It was a huge mistake to approach this update without making backups, and all this data is very critical to our operation. We are an ISP and all our customers are without service at the moment :(
I am available by phone at +359885511000 (Ivan). Of course, subscriptions/payments will be made as necessary.

pveversion follows:

### NODE 1 ###
root@pve-n1:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-11 (running version: 6.4-11/28d576c2)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.20-pve1
ceph-fuse: 14.2.20-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

### NODE 2 ###
root@pve-n2:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-11 (running version: 6.4-11/28d576c2)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.20-pve1
ceph-fuse: 14.2.20-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

### NODE 3 ###
root@pve-n3:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-11 (running version: 6.4-11/28d576c2)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.20-pve1
ceph-fuse: 14.2.20-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

### NODE 4 ###
root@pve-n4:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-11 (running version: 6.4-11/28d576c2)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.20-pve1
ceph-fuse: 14.2.20-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Please provide your Ceph config (cat /etc/ceph/ceph.conf) and the output of ip -details a, in addition to the journal @fabian asked for.
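The three items requested above can be collected in one pass per node. A small sketch (the output path and function name are my own choices, nothing Proxmox-specific):

```shell
# Gather the Ceph config, network details, and this boot's Ceph journal
# into a single file for posting. Run once on every node.
collect_ceph_debug() {
    out="/tmp/ceph-debug-$(hostname).txt"
    {
        echo "== /etc/ceph/ceph.conf =="
        cat /etc/ceph/ceph.conf
        echo "== ip -details address =="
        ip -details address
        echo "== journalctl -b -u ceph* =="
        journalctl -b -u "ceph*" --no-pager
    } > "$out" 2>&1
    echo "wrote $out"
}
```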
 
Please recommend which subscription and how many licenses we should purchase so you can start working on this ASAP.

Here is the output of:
journalctl -b -u "ceph*"
from NODE-1:

-- Logs begin at Wed 2021-07-07 12:31:40 EEST, end at Wed 2021-07-07 13:37:17 EEST. --
Jul 07 12:31:42 pve-n1 systemd[1]: Starting Ceph Volume activation: lvm-0-3de32be9-7235-412f-99b1-b039ec2fac6c...
Jul 07 12:31:42 pve-n1 systemd[1]: Starting Ceph Volume activation: lvm-1-06e845e6-8803-4077-b833-a4bb0a238d2f...
Jul 07 12:31:42 pve-n1 systemd[1]: Starting Ceph Volume activation: lvm-2-14a10a3b-24b7-4ad1-b774-8786a9a49242...
Jul 07 12:31:43 pve-n1 systemd[1]: Started Ceph crash dump collector.
Jul 07 12:31:43 pve-n1 systemd[1]: Starting Ceph Volume activation: lvm-3-c68e6a1e-5a7c-493a-a0ed-388571440041...
Jul 07 12:31:43 pve-n1 ceph-crash[1176]: INFO:__main__:monitoring path /var/lib/ceph/crash, delay 600s
Jul 07 12:31:43 pve-n1 sh[1169]: Running command: /usr/sbin/ceph-volume lvm trigger 2-14a10a3b-24b7-4ad1-b774-8786a9a49242
Jul 07 12:31:43 pve-n1 sh[1186]: Running command: /usr/sbin/ceph-volume lvm trigger 3-c68e6a1e-5a7c-493a-a0ed-388571440041
Jul 07 12:31:43 pve-n1 sh[1160]: Running command: /usr/sbin/ceph-volume lvm trigger 0-3de32be9-7235-412f-99b1-b039ec2fac6c
Jul 07 12:31:43 pve-n1 sh[1163]: Running command: /usr/sbin/ceph-volume lvm trigger 1-06e845e6-8803-4077-b833-a4bb0a238d2f
Jul 07 12:31:47 pve-n1 systemd[1]: Started Ceph cluster manager daemon.
Jul 07 12:31:47 pve-n1 systemd[1]: Started Ceph cluster monitor daemon.
Jul 07 12:31:47 pve-n1 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.
Jul 07 12:31:47 pve-n1 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Jul 07 12:31:47 pve-n1 systemd[1]: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Jul 07 12:31:47 pve-n1 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.
Jul 07 12:31:47 pve-n1 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Jul 07 12:31:48 pve-n1 sh[1169]: Running command: /usr/sbin/ceph-volume lvm trigger 2-14a10a3b-24b7-4ad1-b774-8786a9a49242
Jul 07 12:31:48 pve-n1 sh[1163]: Running command: /usr/sbin/ceph-volume lvm trigger 1-06e845e6-8803-4077-b833-a4bb0a238d2f
Jul 07 12:31:48 pve-n1 sh[1186]: Running command: /usr/sbin/ceph-volume lvm trigger 3-c68e6a1e-5a7c-493a-a0ed-388571440041
Jul 07 12:31:48 pve-n1 sh[1160]: Running command: /usr/sbin/ceph-volume lvm trigger 0-3de32be9-7235-412f-99b1-b039ec2fac6c
Jul 07 12:32:22 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:22.379 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 12:32:27 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:27.383 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 12:32:32 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:32.383 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 12:32:37 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:37.383 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 12:32:42 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:42.383 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 12:32:47 pve-n1 ceph-mon[1891]: 2021-07-07 12:32:47.379 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)

#### THIS LINE REPEATS A LOT UNTIL:
Jul 07 13:21:42 pve-n1 ceph-mon[1891]: 2021-07-07 13:21:42.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:21:47 pve-n1 ceph-mgr[1890]: failed to fetch mon config (--no-mon-config to skip)
Jul 07 13:21:47 pve-n1 systemd[1]: ceph-mgr@pve-n1.service: Main process exited, code=exited, status=1/FAILURE
Jul 07 13:21:47 pve-n1 systemd[1]: ceph-mgr@pve-n1.service: Failed with result 'exit-code'.
Jul 07 13:21:47 pve-n1 ceph-mon[1891]: 2021-07-07 13:21:47.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:21:52 pve-n1 ceph-mon[1891]: 2021-07-07 13:21:52.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:21:53 pve-n1 sh[1169]: Running command: /usr/sbin/ceph-volume lvm trigger 2-14a10a3b-24b7-4ad1-b774-8786a9a49242
Jul 07 13:21:54 pve-n1 sh[1186]: Running command: /usr/sbin/ceph-volume lvm trigger 3-c68e6a1e-5a7c-493a-a0ed-388571440041
Jul 07 13:21:54 pve-n1 sh[1163]: Running command: /usr/sbin/ceph-volume lvm trigger 1-06e845e6-8803-4077-b833-a4bb0a238d2f
Jul 07 13:21:54 pve-n1 sh[1160]: Running command: /usr/sbin/ceph-volume lvm trigger 0-3de32be9-7235-412f-99b1-b039ec2fac6c
Jul 07 13:21:57 pve-n1 ceph-mon[1891]: 2021-07-07 13:21:57.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:21:57 pve-n1 systemd[1]: ceph-mgr@pve-n1.service: Service RestartSec=10s expired, scheduling restart.
Jul 07 13:21:57 pve-n1 systemd[1]: ceph-mgr@pve-n1.service: Scheduled restart job, restart counter is at 1.
Jul 07 13:21:57 pve-n1 systemd[1]: Stopped Ceph cluster manager daemon.
Jul 07 13:21:57 pve-n1 systemd[1]: Started Ceph cluster manager daemon.
Jul 07 13:22:02 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:02.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)

### THEN AGAIN A LOT OF THIS LINE:
Jul 07 13:22:07 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:07.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:12 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:12.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:17 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:17.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:22 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:22.496 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:27 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:27.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:32 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:32.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:37 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:37.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:42 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:42.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:47 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:47.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:52 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:52.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Jul 07 13:22:57 pve-n1 ceph-mon[1891]: 2021-07-07 13:22:57.500 7efc93c2b700 -1 mon.pve-n1@0(probing) e4 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
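The log above shows pve-n1's monitor stuck in the "probing" state, i.e. it cannot reach enough peer monitors to hold an election, and the mgr's "failed to fetch mon config" follows from the same missing quorum. One first check is raw TCP reachability of the mon ports between nodes. A sketch with placeholder addresses (substitute the real mon IPs from /etc/ceph/ceph.conf; assumes bash for the /dev/tcp redirection and coreutils `timeout`):

```shell
# Probe the standard Ceph monitor ports on the given addresses.
check_mon_ports() {
    for ip in "$@"; do
        for port in 3300 6789; do   # msgr2 and legacy msgr1 mon ports
            if timeout 2 bash -c "exec 3<>/dev/tcp/${ip}/${port}" 2>/dev/null; then
                echo "OK   ${ip}:${port}"
            else
                echo "FAIL ${ip}:${port}"
            fi
        done
    done
}
# Example with placeholder mon addresses:
#   check_mon_ports 10.0.0.1 10.0.0.2 10.0.0.3
```

If the ports are reachable but the mons still only probe, the problem is more likely at the monmap/address level than the network level.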
 
Please recommend which subscription and how many licenses we should purchase so you can start working on this ASAP.
We sent you all the details in answer to your request to office@proxmox.com, but your email system is not up and running.

=> Host or domain name not found. Name service error for name=_____ type=MX: Host not found, try again

Please use a working email for ordering your support package via https://shop.maurer-it.com
 
Here is the complete output from all 4 nodes for journalctl -b -u "ceph*".
A little background on how things went:
1. As usual I started updating from node-4, but first I live-migrated all VMs from node-4 to node-3, then upgraded and restarted node-4.
2. After node-4 was up, I live-migrated all VMs from node-3 to node-4 and performed the upgrade and restart on node-3.
3. After node-3 came back up, Ceph degraded and started complaining about older versions running on node-1 and node-2.
4. All VMs started slowing down and freezing.
5. At this point I guess I panicked and decided to quickly upgrade and restart both node-1 and node-2 at the same time. The Ceph dashboard was still showing the degraded health and the monitors in the web UI at that point, but after node-1 and node-2 were upgraded and restarted, it only shows a timeout (500).
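For reference, the generally recommended pattern for this kind of rolling maintenance (a hedged sketch of common Ceph practice, not an official Proxmox procedure) is to stop the cluster from rebalancing while a node reboots and to wait for health to recover before touching the next node:

```shell
# Before taking a node down for upgrade/reboot:
pre_node_upgrade() {
    # Keep the node's OSDs marked "in" so no rebalancing storm starts
    # while it is briefly offline.
    ceph osd set noout
}

# After the node is back:
post_node_upgrade() {
    ceph osd unset noout
    # Then check `ceph -s` and wait until health is OK again before
    # moving on - never upgrade two nodes at the same time.
    ceph -s
}
```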

I know how stupid it was of me not to check for fresh backups before attempting all this... But now the only thing that matters is whether I can get out of this without losing all company data for a veeeeery long time :|
 

Attachments

  • n1.txt (170.3 KB)
  • n2.txt (281.9 KB)
I've changed my account e-mail, as the company one is not working for the same reason - our DNS VMs are down, of course.
Can you please resend the email to the address I'm using now?
 
I've just sent a message to office@proxmox.com from my personal Gmail account. Please resend your response.
 
As you have four nodes (each with 2 CPUs), you need a subscription for every node, not just one.

Details on https://www.proxmox.com/en/downloads/item/proxmox-ve-subscription-agreement
I've submitted a ticket and activated the license on NODE-1. I am currently working on getting the money for the other 3 licenses. My humble request is to please get someone to take a look at it - I've provided login details in the ticket for the web UI and SSH on all 4 nodes. Please guys, I am dying here...
 
Oh man. Can we have an update on what happened next? It was like that show "24", where all the action happens in one day.

I'm afraid of upgrading now.
 
I upgraded and everything broke too... been down for 4 months... just too busy to think about it much anymore. I come back every week to see if anyone else has had similar issues I can learn from without being called names... If this got resolved, I would love to hear how.
 
