VM Migration between nodes crashes node

tjk

I'm out of ideas and looking for assistance with this problem. I'm not sure what else I can test, or which logs I should be checking, to get to the bottom of it.

The setup is a new cluster of 7 nodes: Ryzen 5950X ASRock Rack servers, 128 GB ECC RAM, each node with 4x 2 TB SSDs in RAID 10 and a 2x 10G LACP bond to Brocade switches. No shared storage, no HA. Each node has its own ZFS pool (NodeX-Tank) for local VM storage.

When I live-migrate a VM from one node to another, it sometimes works; other times the source node just hard reboots, and other times the target node (yes, the node, not the VM) reboots.

I ran Memtest86 on each node for 48 hours with no errors. I ran iperf between VMs on all the nodes with no crashes, and I can get 9 Gb/s between VMs on different nodes.
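
For reference, the throughput tests were roughly along these lines (a sketch only, assuming iperf3; the IPs are placeholders and the exact flags may have differed):

Bash:
# on a VM on the first node
iperf3 -s
# on a VM on the second node: 4 parallel streams for 30 seconds
iperf3 -c 10.0.0.2 -P 4 -t 30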

I am running Proxmox with the enterprise repository, updated to the latest version on each node.

When I start a migration, it runs and gets to varying points of progress, and then the source or target node just reboots.

Output from migration window:

Bash:
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 40s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 41s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 42s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 43s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 44s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2022-08-14 20:08:26 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled
2022-08-14 20:08:26 aborting phase 2 - cleanup resources
2022-08-14 20:08:26 migrate_cancel
client_loop: send disconnect: Broken pipe

Use of uninitialized value $res in string eq at /usr/share/perl5/PVE/Tunnel.pm line 110, <GEN845> line 2.
Use of uninitialized value $res in concatenation (.) or string at /usr/share/perl5/PVE/Tunnel.pm line 113, <GEN845> line 2.
2022-08-14 20:08:28 ERROR: tunnel replied '' to command 'quit'
2022-08-14 20:08:28 ERROR: migration finished with problems (duration 00:01:49)
TASK ERROR: migration problems

Output from syslog on source node (node3), showing node4 crashed and corosync doing its thing:

Bash:
Aug 14 20:06:38 node3 pvedaemon[3175]: <root@pam> starting task UPID:node3:000058D2:0001DF45:62F98E0E:qmigrate:101:root@pam:
Aug 14 20:06:39 node3 pmxcfs[3011]: [status] notice: received log
Aug 14 20:06:41 node3 pmxcfs[3011]: [status] notice: received log
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] link: host: 4 link: 0 is down
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] link: host: 4 link: 1 is down
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] host: host: 4 has no active links
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 14 20:07:08 node3 corosync[3108]:   [KNET  ] host: host: 4 has no active links
Aug 14 20:07:11 node3 corosync[3108]:   [TOTEM ] Token has not been received in 4687 ms
Aug 14 20:07:20 node3 corosync[3108]:   [QUORUM] Sync members[6]: 1 2 3 5 6 7
Aug 14 20:07:20 node3 corosync[3108]:   [QUORUM] Sync left[1]: 4
Aug 14 20:07:20 node3 corosync[3108]:   [TOTEM ] A new membership (1.3f6) was formed. Members left: 4
Aug 14 20:07:20 node3 corosync[3108]:   [TOTEM ] Failed to receive the leave message. failed: 4
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: starting data syncronisation
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: starting data syncronisation
Aug 14 20:07:20 node3 corosync[3108]:   [QUORUM] Members[6]: 1 2 3 5 6 7
Aug 14 20:07:20 node3 corosync[3108]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: received sync request (epoch 1/2948/0000000A)
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: received sync request (epoch 1/2948/0000000A)
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: received all states
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: leader is 1/2948
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: synced members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: all data is up to date
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: dfsm_deliver_queue: queue length 6
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: received all states
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: all data is up to date
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: dfsm_deliver_queue: queue length 51


Output from syslog on destination node (node4):
Bash:
Aug 14 20:06:38 node4 pmxcfs[3176]: [status] notice: received log
Aug 14 20:06:38 node4 systemd[1]: Started Session 7 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-7.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 8 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-8.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 9 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-9.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 10 of user root.
Aug 14 20:06:39 node4 qm[25178]: <root@pam> starting task UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam:
Aug 14 20:06:39 node4 qm[25179]: start VM 101: UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam:
Aug 14 20:06:40 node4 systemd[1]: Created slice qemu.slice.
Aug 14 20:06:40 node4 systemd[1]: Started 101.scope.
Aug 14 20:06:40 node4 systemd-udevd[25239]: Using default interface naming scheme 'v247'.
Aug 14 20:06:40 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 kernel: [ 1853.566721] device tap101i0 entered promiscuous mode
Aug 14 20:06:41 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 systemd-udevd[25240]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 systemd-udevd[25240]: Using default interface naming scheme 'v247'.
Aug 14 20:06:41 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 kernel: [ 1853.591041] vmbr601: port 2(fwpr101p0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.591062] vmbr601: port 2(fwpr101p0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.591110] device fwpr101p0 entered promiscuous mode
Aug 14 20:06:41 node4 kernel: [ 1853.591155] vmbr601: port 2(fwpr101p0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.591167] vmbr601: port 2(fwpr101p0) entered forwarding state
Aug 14 20:06:41 node4 kernel: [ 1853.593100] fwbr101i0: port 1(fwln101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.593121] fwbr101i0: port 1(fwln101i0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.593626] device fwln101i0 entered promiscuous mode
Aug 14 20:06:41 node4 kernel: [ 1853.593774] fwbr101i0: port 1(fwln101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.593786] fwbr101i0: port 1(fwln101i0) entered forwarding state
Aug 14 20:06:41 node4 kernel: [ 1853.595971] fwbr101i0: port 2(tap101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.595991] fwbr101i0: port 2(tap101i0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.596065] fwbr101i0: port 2(tap101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.596076] fwbr101i0: port 2(tap101i0) entered forwarding state
Aug 14 20:06:41 node4 qm[25178]: <root@pam> end task UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam: OK
Aug 14 20:06:41 node4 systemd[1]: session-10.scope: Succeeded.
Aug 14 20:06:41 node4 systemd[1]: Started Session 11 of user root.
 
This is not spam; I just want to bookmark this thread so that if I hit this problem I can use the solution. My English is not good.

At the top of every thread there is a "Watch" button. With it you can subscribe to a thread without posting "empty" replies.
 
I'm a bit desperate at this point. Short of starting to replace servers, does anyone have ideas or suggestions to narrow this down?
 
Are you sharing the corosync links with the migration network? Your source node seems to drop out of corosync quorum, which, if HA is enabled, will cause fencing.
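
If they are shared, one option (a minimal sketch; check "man datacenter.cfg" for the exact syntax on your version) is to point migration traffic at a dedicated network in /etc/pve/datacenter.cfg:

Code:
# use a dedicated subnet for migration traffic
migration: secure,network=10.1.2.0/24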
 
Are you sharing the corosync links with the migration network? Your source node seems to drop out of corosync quorum, which, if HA is enabled, will cause fencing.
Have you (as in the Proxmox staff) tried configuring QoS in order to share the link? I know that dedicated links are always recommended, but most of the time those links are not saturated and could perfectly well be used for other traffic if one could ensure that the corosync packets get the highest priority.
 
No, we haven't. Such setups might work (or at least work better than without such prioritization).
 
Are you sharing the corosync links with the migration network? Your source node seems to drop out of corosync quorum, which, if HA is enabled, will cause fencing.
Yes, but I don't think that is the problem. The drop you are seeing is when the target node rebooted. I've narrowed it down further to just two nodes now: any storage migration to either of those two nodes causes a reboot, but migrations to the rest of the nodes do not.

Also, HA is NOT enabled for any of the VMs under test. Local ZFS on each node and basic VM migration; it's a pretty simple setup.

The network setup is a 2x 10G LACP bond with VLANs.
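
For what it's worth, whether anything is actually managed by HA can be double-checked with the standard tools (a quick sketch; both commands ship with a stock PVE install):

Bash:
# overall HA stack state per node
ha-manager status
# list of configured HA resources (should be empty if nothing is HA-managed)
ha-manager config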
 
Then the log of the target node is incomplete. Since you are using ZFS: are you sure you are not simply running out of memory because your ARC limits are set too high?
 
Then the log of the target node is incomplete. Since you are using ZFS: are you sure you are not simply running out of memory because your ARC limits are set too high?
I am limiting ARC on each node:

Bash:
root@node1:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592
options zfs zfs_arc_min=4294967296
root@node1:~#

Code:
ARC size (current):                                    23.5 %    1.9 GiB
        Target size (adaptive):                        80.9 %    6.5 GiB
        Min size (hard limit):                         50.0 %    4.0 GiB
        Max size (high water):                            2:1    8.0 GiB
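
For reference, those module options are in bytes (8589934592 = 8 GiB max, 4294967296 = 4 GiB min). A minimal sketch for checking and applying the limit, in case it helps anyone else; on a ZFS-root system the modprobe.d change also needs to be rebuilt into the initramfs to take effect at boot:

Bash:
# current limit in bytes
cat /sys/module/zfs/parameters/zfs_arc_max
# apply a new limit at runtime, without a reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# make sure /etc/modprobe.d/zfs.conf is picked up on the next boot
update-initramfs -u -k all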

The target node log is what is on the system; that is the last line before it rebooted.

If you need additional logs, let me know what you need and I can grab them; I can reproduce this just by doing a VM migration.
 
You could try connecting over SSH and leaving "journalctl -f" running. Alternatively, setting up a serial console or netconsole and dumping the output somewhere would be a more failsafe, but also a bit more involved, option.
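
A minimal netconsole sketch (IPs, interface name, and MAC below are placeholders; netconsole generally wants a plain physical NIC rather than a bond or bridge):

Bash:
# on the node that crashes: stream kernel messages over UDP to another host
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# raise the console log level so as much as possible gets sent
dmesg -n 8
# on the receiving host (192.168.1.20 here): capture whatever arrives
nc -u -l 6666 | tee node-console.log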
 
Can I suggest that you try the pve-edge-kernel, from here: https://github.com/fabianishere/pve-edge-kernel?
A while ago I had a lot of trouble with Lenovo servers and AMD CPUs, and that edge kernel solved my problems.
I'm just trying to help by suggesting it. I know it's not the official Proxmox kernel, but it might solve your problem or at least give us a lead.
Cheers
 
Can I suggest that you try the pve-edge-kernel, from here: https://github.com/fabianishere/pve-edge-kernel?
A while ago I had a lot of trouble with Lenovo servers and AMD CPUs, and that edge kernel solved my problems.
I'm just trying to help by suggesting it. I know it's not the official Proxmox kernel, but it might solve your problem or at least give us a lead.
Cheers
Thanks, I'd be happy to try it if I were having problems on all the nodes. The fact that it is isolated to just two nodes in the cluster, and only during VM migrations, leads me to believe it is something else.
 
OK, I just triggered it and have journalctl logs.

I migrated a VM from node5 to node1, and node5 rebooted during the migration. Here are the logs I've captured:

Node5 journalctl -f output; notice there isn't much, it just hard rebooted:
Code:
Aug 17 09:24:59 node5 pvedaemon[2637684]: <root@pam> starting task UPID:node5:00390D73:01528A42:62FCEC2B:qmigrate:102:root@pam:
Aug 17 09:25:00 node5 pmxcfs[2905]: [status] notice: received log
Aug 17 09:25:01 node5 pmxcfs[2905]: [status] notice: received log

Node1 journalctl -f output:
Code:
Aug 17 09:24:59 node1 pmxcfs[2948]: [status] notice: received log
Aug 17 09:24:59 node1 sshd[3624383]: Accepted publickey for root from 172.19.1.14 port 48884 ssh2: RSA SHA256:EFzJG/pKNiDT216MNkQ6QOnSLkFvNGnMLBR4G7ar4yg
Aug 17 09:24:59 node1 sshd[3624383]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 17 09:24:59 node1 systemd-logind[2444]: New session 97 of user root.
Aug 17 09:24:59 node1 systemd[1]: Started Session 97 of user root.
Aug 17 09:24:59 node1 sshd[3624383]: Received disconnect from 172.19.1.14 port 48884:11: disconnected by user
Aug 17 09:24:59 node1 sshd[3624383]: Disconnected from user root 172.19.1.14 port 48884
Aug 17 09:24:59 node1 sshd[3624383]: pam_unix(sshd:session): session closed for user root
Aug 17 09:24:59 node1 systemd[1]: session-97.scope: Succeeded.
Aug 17 09:24:59 node1 systemd-logind[2444]: Session 97 logged out. Waiting for processes to exit.
Aug 17 09:24:59 node1 systemd-logind[2444]: Removed session 97.
Aug 17 09:24:59 node1 sshd[3624391]: Accepted publickey for root from 172.19.1.14 port 48886 ssh2: RSA SHA256:EFzJG/pKNiDT216MNkQ6QOnSLkFvNGnMLBR4G7ar4yg
Aug 17 09:24:59 node1 sshd[3624391]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 17 09:24:59 node1 systemd-logind[2444]: New session 98 of user root.
Aug 17 09:24:59 node1 systemd[1]: Started Session 98 of user root.
Aug 17 09:24:59 node1 sshd[3624391]: Received disconnect from 172.19.1.14 port 48886:11: disconnected by user
Aug 17 09:24:59 node1 sshd[3624391]: Disconnected from user root 172.19.1.14 port 48886
Aug 17 09:24:59 node1 sshd[3624391]: pam_unix(sshd:session): session closed for user root
Aug 17 09:24:59 node1 systemd[1]: session-98.scope: Succeeded.
Aug 17 09:24:59 node1 systemd-logind[2444]: Session 98 logged out. Waiting for processes to exit.
Aug 17 09:24:59 node1 systemd-logind[2444]: Removed session 98.
Aug 17 09:24:59 node1 sshd[3624399]: Accepted publickey for root from 172.19.1.14 port 48888 ssh2: RSA SHA256:EFzJG/pKNiDT216MNkQ6QOnSLkFvNGnMLBR4G7ar4yg
Aug 17 09:24:59 node1 sshd[3624399]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 17 09:24:59 node1 systemd-logind[2444]: New session 99 of user root.
Aug 17 09:24:59 node1 systemd[1]: Started Session 99 of user root.
Aug 17 09:24:59 node1 sshd[3624399]: Received disconnect from 172.19.1.14 port 48888:11: disconnected by user
Aug 17 09:24:59 node1 sshd[3624399]: Disconnected from user root 172.19.1.14 port 48888
Aug 17 09:24:59 node1 sshd[3624399]: pam_unix(sshd:session): session closed for user root
Aug 17 09:24:59 node1 systemd[1]: session-99.scope: Succeeded.
Aug 17 09:24:59 node1 systemd-logind[2444]: Session 99 logged out. Waiting for processes to exit.
Aug 17 09:24:59 node1 systemd-logind[2444]: Removed session 99.
Aug 17 09:25:00 node1 sshd[3624407]: Accepted publickey for root from 172.19.1.14 port 48890 ssh2: RSA SHA256:EFzJG/pKNiDT216MNkQ6QOnSLkFvNGnMLBR4G7ar4yg
Aug 17 09:25:00 node1 sshd[3624407]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 17 09:25:00 node1 systemd-logind[2444]: New session 100 of user root.
Aug 17 09:25:00 node1 systemd[1]: Started Session 100 of user root.
Aug 17 09:25:00 node1 qm[3624414]: <root@pam> starting task UPID:node1:00374DDF:01531422:62FCEC2C:qmstart:102:root@pam:
Aug 17 09:25:00 node1 qm[3624415]: start VM 102: UPID:node1:00374DDF:01531422:62FCEC2C:qmstart:102:root@pam:
Aug 17 09:25:01 node1 systemd[1]: Started 102.scope.
Aug 17 09:25:01 node1 systemd-udevd[3624479]: Using default interface naming scheme 'v247'.
Aug 17 09:25:01 node1 systemd-udevd[3624479]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 17 09:25:01 node1 kernel: device tap102i0 entered promiscuous mode
Aug 17 09:25:01 node1 systemd-udevd[3624479]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 17 09:25:01 node1 systemd-udevd[3624479]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 17 09:25:01 node1 systemd-udevd[3624480]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 17 09:25:01 node1 systemd-udevd[3624480]: Using default interface naming scheme 'v247'.
Aug 17 09:25:01 node1 kernel: vmbr601: port 2(fwpr102p0) entered blocking state
Aug 17 09:25:01 node1 kernel: vmbr601: port 2(fwpr102p0) entered disabled state
Aug 17 09:25:01 node1 kernel: device fwpr102p0 entered promiscuous mode
Aug 17 09:25:01 node1 kernel: vmbr601: port 2(fwpr102p0) entered blocking state
Aug 17 09:25:01 node1 kernel: vmbr601: port 2(fwpr102p0) entered forwarding state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Aug 17 09:25:01 node1 kernel: device fwln102i0 entered promiscuous mode
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 1(fwln102i0) entered forwarding state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Aug 17 09:25:01 node1 kernel: fwbr102i0: port 2(tap102i0) entered forwarding state
Aug 17 09:25:01 node1 qm[3624414]: <root@pam> end task UPID:node1:00374DDF:01531422:62FCEC2C:qmstart:102:root@pam: OK
Aug 17 09:25:01 node1 sshd[3624407]: Received disconnect from 172.19.1.14 port 48890:11: disconnected by user
Aug 17 09:25:01 node1 sshd[3624407]: Disconnected from user root 172.19.1.14 port 48890
Aug 17 09:25:01 node1 sshd[3624407]: pam_unix(sshd:session): session closed for user root
Aug 17 09:25:01 node1 systemd[1]: session-100.scope: Succeeded.
Aug 17 09:25:01 node1 systemd-logind[2444]: Session 100 logged out. Waiting for processes to exit.
Aug 17 09:25:01 node1 systemd-logind[2444]: Removed session 100.
Aug 17 09:25:01 node1 sshd[3624689]: Accepted publickey for root from 172.19.1.14 port 48892 ssh2: RSA SHA256:EFzJG/pKNiDT216MNkQ6QOnSLkFvNGnMLBR4G7ar4yg
Aug 17 09:25:01 node1 sshd[3624689]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 17 09:25:01 node1 systemd-logind[2444]: New session 101 of user root.
Aug 17 09:25:01 node1 systemd[1]: Started Session 101 of user root.
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] link: host: 5 link: 0 is down
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] link: host: 5 link: 1 is down
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] host: host: 5 has no active links
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 17 09:26:17 node1 corosync[3044]:   [KNET  ] host: host: 5 has no active links
Aug 17 09:26:20 node1 corosync[3044]:   [TOTEM ] Token has not been received in 4687 ms
Aug 17 09:26:22 node1 corosync[3044]:   [TOTEM ] A processor failed, forming new configuration: token timed out (6250ms), waiting 7500ms for consensus.
Aug 17 09:26:29 node1 corosync[3044]:   [QUORUM] Sync members[6]: 1 2 3 4 6 7
Aug 17 09:26:29 node1 corosync[3044]:   [QUORUM] Sync left[1]: 5
Aug 17 09:26:29 node1 corosync[3044]:   [TOTEM ] A new membership (1.411) was formed. Members left: 5
Aug 17 09:26:29 node1 corosync[3044]:   [TOTEM ] Failed to receive the leave message. failed: 5
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: members: 1/2948, 2/3207, 3/3011, 4/7443, 6/3247, 7/3316
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: starting data syncronisation
Aug 17 09:26:29 node1 corosync[3044]:   [QUORUM] Members[6]: 1 2 3 4 6 7
Aug 17 09:26:29 node1 corosync[3044]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: cpg_send_message retried 1 times
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: members: 1/2948, 2/3207, 3/3011, 4/7443, 6/3247, 7/3316
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: starting data syncronisation
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: received sync request (epoch 1/2948/00000010)
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: received sync request (epoch 1/2948/00000010)
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: received all states
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: leader is 1/2948
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: synced members: 1/2948, 2/3207, 3/3011, 4/7443, 6/3247, 7/3316
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: start sending inode updates
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: sent all (0) updates
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: all data is up to date
Aug 17 09:26:29 node1 pmxcfs[2948]: [dcdb] notice: dfsm_deliver_queue: queue length 6
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: received all states
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: all data is up to date
Aug 17 09:26:29 node1 pmxcfs[2948]: [status] notice: dfsm_deliver_queue: queue length 61
Aug 17 09:27:33 node1 corosync[3044]:   [KNET  ] rx: host: 5 link: 1 is up
Aug 17 09:27:33 node1 corosync[3044]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 1)
Aug 17 09:27:33 node1 corosync[3044]:   [KNET  ] rx: host: 5 link: 0 is up
Aug 17 09:27:33 node1 corosync[3044]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 17 09:27:34 node1 corosync[3044]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Aug 17 09:27:34 node1 corosync[3044]:   [QUORUM] Sync joined[1]: 5
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] A new membership (1.416) was formed. Members joined: 5
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] Retransmit List: 4
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] Retransmit List: e
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] Retransmit List: 14
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] Retransmit List: 1c
Aug 17 09:27:34 node1 corosync[3044]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Aug 17 09:27:34 node1 corosync[3044]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 17 09:27:34 node1 corosync[3044]:   [TOTEM ] Retransmit List: 22
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: members: 1/2948, 2/3207, 3/3011, 4/7443, 5/3227, 6/3247, 7/3316
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: starting data syncronisation
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: members: 1/2948, 2/3207, 3/3011, 4/7443, 5/3227, 6/3247, 7/3316
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: starting data syncronisation
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: received sync request (epoch 1/2948/00000011)
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: received sync request (epoch 1/2948/00000011)
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: received all states
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: leader is 1/2948
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: synced members: 1/2948, 2/3207, 3/3011, 4/7443, 6/3247, 7/3316
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: start sending inode updates
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: sent all (8) updates
Aug 17 09:27:35 node1 pmxcfs[2948]: [dcdb] notice: all data is up to date
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: received all states
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: all data is up to date
Aug 17 09:27:35 node1 pmxcfs[2948]: [status] notice: received log
Aug 17 09:27:37 node1 QEMU[3624590]: kvm: Disconnect client, due to: Failed to read CMD_WRITE data: Unable to read from socket: Connection reset by peer

The cluster messages at 09:26:17 are seen across the rest of the nodes as well; they are from node5 hard rebooting.
 
So nothing in the journal -> the next step would be to dump the actual console (via IPMI, netconsole, or a serial console) in the hope that something gets logged there. It's strange that the node reboots on its own though; usually you'd expect a kernel bug to print some sort of trace and hang :-/
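
If the BMCs on those boards are reachable, a serial console over IPMI SOL is one way to do that. A sketch only; serial port, speed, and BMC credentials are placeholders:

Bash:
# on the affected node, add a serial console to the kernel command line in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=ttyS0,115200n8"
# then regenerate the grub config and reboot
update-grub
# from another machine, watch the console via the BMC
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sol activate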
 
Could you maybe also post the VM config? Thanks!
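
For example, dumped on the node that currently owns it (VMID 102 taken from the logs above):

Bash:
qm config 102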
 