Proxmox host reboot

dhmusil

Member
Jun 10, 2020
Hey all,

We have stood up a 3-node Proxmox cluster, each node with 1.2T of local storage for guest servers. Each of the nodes is running Debian 10.

We have also stood up a 6-node Ceph cluster with ~400G of storage. Each of these nodes is running CentOS 8.

We have mounted Ceph onto two nodes of the cluster for testing, using RBD. (The end goal is a poor man's fail-over.) Steps used (a rough command-line equivalent is sketched below the list)...

a) On the Proxmox node, created a /etc/pve/priv/ceph directory.

b) Copied /etc/ceph/ceph.client.admin.keyring from one of the monitor servers into the above directory on the Proxmox server.

c) In the GUI of one of the Proxmox cluster nodes, I went to Datacenter > Storage and selected Add. The name of the storage is the same as the name of the keyring I copied over. I put all three monitors in the list.
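
For reference, here is roughly what those steps amount to on the command line. The storage ID "ceph-ext", the "cephmon1" host, the pool name, and the monitor addresses below are placeholders, so take this as a sketch rather than our exact config:

# a) + b): put the admin keyring where Proxmox looks for external RBD storage,
#          named after the storage ID chosen in the GUI
mkdir -p /etc/pve/priv/ceph
scp cephmon1:/etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/ceph-ext.keyring

# c): the resulting entry the GUI writes into /etc/pve/storage.cfg
rbd: ceph-ext
        content images
        krbd 0
        monhost 10.161.100.11 10.161.100.13 10.161.100.15
        pool rbd
        username admin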

The Proxmox servers seem to be happy with that.

This configuration allows us to create a guest machine using the external Ceph storage, but every time we attempt to install any OS on the guest, the Proxmox host reboots. Below is what we were able to capture from the logs just prior to the reboot.

Jul 08 15:38:27 {HOSTNAME} systemd[1]: Started 122.scope.
Jul 08 15:38:27 {HOSTNAME} systemd-udevd[30890]: Using default interface naming scheme 'v240'.
Jul 08 15:38:27 {HOSTNAME} systemd-udevd[30890]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 08 15:38:27 {HOSTNAME} systemd-udevd[30890]: Could not generate persistent MAC address for tap122i0: No such file or directory
Jul 08 15:38:28 {HOSTNAME} kernel: device tap122i0 entered promiscuous mode
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30890]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30890]: Could not generate persistent MAC address for fwbr122i0: No such file or directory
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30890]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30890]: Could not generate persistent MAC address for fwpr122p0: No such file or directory
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30881]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30881]: Using default interface naming scheme 'v240'.
Jul 08 15:38:28 {HOSTNAME} systemd-udevd[30881]: Could not generate persistent MAC address for fwln122i0: No such file or directory
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 1(fwln122i0) entered disabled state
Jul 08 15:38:28 {HOSTNAME} kernel: device fwln122i0 entered promiscuous mode
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 1(fwln122i0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 1(fwln122i0) entered forwarding state
Jul 08 15:38:28 {HOSTNAME} kernel: vmbr0: port 5(fwpr122p0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: vmbr0: port 5(fwpr122p0) entered disabled state
Jul 08 15:38:28 {HOSTNAME} kernel: device fwpr122p0 entered promiscuous mode
Jul 08 15:38:28 {HOSTNAME} kernel: vmbr0: port 5(fwpr122p0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: vmbr0: port 5(fwpr122p0) entered forwarding state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 2(tap122i0) entered disabled state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 2(tap122i0) entered blocking state
Jul 08 15:38:28 {HOSTNAME} kernel: fwbr122i0: port 2(tap122i0) entered forwarding state
Jul 08 15:38:28 {HOSTNAME} pvedaemon[5599]: <USER@pve> end task UPID:HOSTNAME:000078A0:0001260A:5F064AE3:qmstart:122:USER@pve: OK
Jul 08 15:38:29 {HOSTNAME} pvedaemon[30954]: starting vnc proxy UPID:HOSTNAME:000078EA:000126F4:5F064AE5:vncproxy:122:USER@pve:
Jul 08 15:38:29 {HOSTNAME} pvedaemon[5599]: <USER@pve> starting task UPID:HOSTNAME:000078EA:000126F4:5F064AE5:vncproxy:122:USER@pve:
Jul 08 15:39:00 {HOSTNAME} systemd[1]: Starting Proxmox VE replication runner...
Jul 08 15:39:01 {HOSTNAME} systemd[1]: pvesr.service: Succeeded.
Jul 08 15:39:01 {HOSTNAME} systemd[1]: Started Proxmox VE replication runner.
Jul 08 15:39:38 {HOSTNAME} pvedaemon[5600]: <root@pam> successful auth for user 'USER@pve'
Jul 08 15:40:00 {HOSTNAME} systemd[1]: Starting Proxmox VE replication runner...
Jul 08 15:40:01 {HOSTNAME} systemd[1]: pvesr.service: Succeeded.
Jul 08 15:40:01 {HOSTNAME} systemd[1]: Started Proxmox VE replication runner.

There is no evidence of a panic, just a reboot. It would make sense if the guest had issues, but why would it reboot the host?

I am continuing to investigate what the issue might be, but any direction would be greatly appreciated. If any additional configuration details are needed, please let me know and I will append them.
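
In case it helps, these are the commands I have been using to look for traces of the reboot; the journalctl one assumes the journal is persisted to disk, otherwise the previous boot's log is gone:

# reboot/shutdown records
last -x | head

# journal from the previous boot, jump to the end
journalctl -b -1 -e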

Thank you,

DHM
 
There have been several similar posts; a node reboot is often the result of delay on the Ceph network link.

Do you have a separate network for Ceph? (hypervisor NIC, switch, etc.)
 
Thank you for your quick response.

We have dedicated bonded NICs. One pair is for client access to the Proxmox cluster and the second pair is on the storage network back-plane. That said, I will test to verify that the storage traffic is going to the proper network. BTW, both bonds are active/active, i.e. mode=4. Would that cause an issue? Should we change it to active-passive?
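
For what it is worth, this is how I plan to double-check the bonds and the path the storage traffic takes. I am assuming the storage bond is called bond1 here; the names on our hosts may differ:

# LACP (802.3ad / mode=4) state and active slaves
cat /proc/net/bonding/bond1

# confirm traffic to a Ceph monitor leaves via the storage bond, not the client network
ip route get 10.161.100.11

# quick latency check on the storage network
ping -c 10 10.161.100.11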
 
LACP should be fine, provided the switch supports it and is correctly configured as well.

It's important that there is no gateway in between the Proxmox host and the storage. Ping should be less than 1 ms.

Can you post your network config and status ("cat /etc/network/interfaces" and "ip a")?
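
For comparison, a typical 802.3ad storage bond in /etc/network/interfaces looks something like this; the interface names and the address are only an example, not your exact setup:

auto bond1
iface bond1 inet static
        address 10.161.100.21/24
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3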
 
Again, thank you kindly for your response.

I have attached a rough diagram in PDF format of the architecture we have put together. The 10 and 100 networks each have redundant switches that are used for the bonds.

Here is the data you requested. I have included the data for each of the Proxmox servers. Proxmox 2 and Proxmox 3 are the machines I have been using for testing, since Proxmox 1 has production machines on it. (All local storage.)

Sorry, I left one more thing out. These are the traceroutes to the Ceph servers from the Proxmox servers.

root@prox01:~# traceroute 10.161.100.11
traceroute to 10.161.100.11 (10.161.100.11), 30 hops max, 60 byte packets
 1  11-100-161-10-in-addr.tusimple.ai (10.161.100.11)  0.231 ms  *  0.196 ms
root@prox01:~# traceroute 10.161.100.13
traceroute to 10.161.100.13 (10.161.100.13), 30 hops max, 60 byte packets
 1  13-100-161-10-in-addr.tusimple.ai (10.161.100.13)  0.253 ms  *  *
root@prox01:~# traceroute 10.161.100.15
traceroute to 10.161.100.15 (10.161.100.15), 30 hops max, 60 byte packets
 1  15-100-161-10-in-addr.tusimple.ai (10.161.100.15)  0.317 ms  0.308 ms  *

root@prox02:~# traceroute 10.161.100.11
traceroute to 10.161.100.11 (10.161.100.11), 30 hops max, 60 byte packets
 1  11-100-161-10-in-addr.tusimple.ai (10.161.100.11)  0.175 ms  *  0.094 ms
root@prox02:~# traceroute 10.161.100.13
traceroute to 10.161.100.13 (10.161.100.13), 30 hops max, 60 byte packets
 1  13-100-161-10-in-addr.tusimple.ai (10.161.100.13)  0.235 ms  *  0.210 ms
root@prox02:~# traceroute 10.161.100.15
traceroute to 10.161.100.15 (10.161.100.15), 30 hops max, 60 byte packets
 1  15-100-161-10-in-addr.tusimple.ai (10.161.100.15)  0.173 ms  0.175 ms  *

root@prox03:~# traceroute 10.161.100.11
traceroute to 10.161.100.11 (10.161.100.11), 30 hops max, 60 byte packets
 1  11-100-161-10-in-addr.tusimple.ai (10.161.100.11)  0.465 ms  0.465 ms  0.460 ms
root@prox03:~# traceroute 10.161.100.13
traceroute to 10.161.100.13 (10.161.100.13), 30 hops max, 60 byte packets
 1  13-100-161-10-in-addr.tusimple.ai (10.161.100.13)  0.134 ms  0.108 ms  0.105 ms
root@prox03:~# traceroute 10.161.100.15
traceroute to 10.161.100.15 (10.161.100.15), 30 hops max, 60 byte packets
 1  15-100-161-10-in-addr.tusimple.ai (10.161.100.15)  0.177 ms  0.163 ms  0.160 ms
 

Attachments

  • Prox_Ceph.pdf (9.9 KB)
  • proxmox-1.txt (5.6 KB)
  • proxmox-2.txt (5.5 KB)
  • proxmox-3.txt (5.5 KB)
