Random VM crashes with a SPICE vm: QEMU free(): corrupted unsorted chunks

garbled · Apr 4, 2024

Worth noting, the VM of mine that keeps crashing is a Debian 11 VM, not Windows. So my guess is that it's not guest-related.

DeeMaas · Apr 5, 2024

orion said:
I don't have storage latency and there are still random VMs crashes with messages in logs like yours. I only use SPICE on Windows Server VMs of which I don't have many, but sudden stops of machines are irritating. This started happening after the last updates.

Same. There are no latency problems today, but two VMs crashed.

garbled · Apr 8, 2024

Hrmm... it looks like my coredumps timeout while being taken, for some wacky reason... The coredump program gets sigkilled...?

garbled · Apr 8, 2024

Got one

fiona · Apr 11, 2024

Unfortunately, all the provided backtraces show crashes in different locations in the code. So my best guess is that it is a heap corruption, which is notoriously difficult to debug. What you might still try is to gather debug messages by doing

Code:

export G_MESSAGES_DEBUG=all
qm start <ID>

There will be some initial messages in the terminal, the rest should be logged to the system journal.
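
A minimal sketch of how the steps above could be combined with watching the journal. The VM ID 100 and the grep filter are assumptions for illustration only; the actual debug messages may use other keywords.

```shell
# Enable all GLib debug messages for processes started from this shell.
export G_MESSAGES_DEBUG=all

# Hypothetical VM ID; replace it with the ID of the crashing VM.
VMID=100
qm start "$VMID"

# Follow the system journal and keep only lines that mention SPICE or QXL.
# The filter is an assumption; drop the grep to see everything.
journalctl -f | grep -i -E 'spice|qxl'
```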

You might also want to report the issue to Debian or to the SPICE project upstream, maybe they have an idea for how to debug this further: https://gitlab.freedesktop.org/groups/spice/-/issues

garbled · Apr 11, 2024

https://gitlab.freedesktop.org/spice/spice/-/issues/87

Throwing this link here for future reference

DeeMaas · Apr 11, 2024

garbled said:
https://gitlab.freedesktop.org/spice/spice/-/issues/87

Throwing this link here for future reference

It's me)

qwaz · Apr 23, 2024

what I noticed in the logs after the crash of the virtual machine (debug mode is enabled, as mentioned above)

grusniy_kotik · Apr 23, 2024

We have the same thing

MadMike86 · Apr 23, 2024

Good afternoon!
A similar problem appeared just last week. I see the same errors in the logs as the people above.

garbled · Apr 23, 2024

I have fewer problems on my slower machines. When I ran my VMs on an X5670, I had zero crashes ever. When I moved them to a 6140, it would sometimes crash 4-5 times a day. Currently I have the VM sitting on an e5-2690v2 and it's been happy for a week now. Previously it did crash once each on an e5-2670v0 and an e5-2670v2.

Heavy graphics seems to be what triggers it. I often trigger it in one of three ways:

1) I preview mkv files in thunar. On the 6140 I would usually crash it in about an hour of fiddling like this.
2) Firefox. Switching tabs, scrolling on a web page, loading new web page. The web page doesn't need to be graphically intense. I've crashed it while scrolling down in netbox just filling out a form.
3) pulling up a VNC from another box and fiddling around in there.

I've never had the VM crash overnight, or when idle, even with Firefox running. I've never triggered it by just moving windows around or fiddling about in emacs. It's always heavy graphics use. I haven't tried something like running a demo overnight on it. I have 4 nodes of 6140s, and it crashes the same on all 4 nodes. No other non-SPICE VMs crash, on any node, regardless of workload.

The VM that crashes does NOT use virgl. I have a few VMs with virgl, but they are very lightly used, so I cannot say one way or another if they are crash-prone.

Worth noting, I believe one of the other replies to this post was experiencing this on a Windows VM, whereas mine is Debian, so it looks like the guest OS is not a factor.

fiona · Apr 23, 2024

Hi,

qwaz said:
what I noticed in the logs after the crash of the virtual machine (debug mode is enabled, as mentioned above)

these are OOM (out-of-memory) debug messages. How much memory do you have assigned in the display settings? You could try setting more and see if the crashes get rarer. That would be a hint that it's really related to the OOM-Handling for SPICE.

I'm able to get lots and lots of OOM messages from SPICE when configuring the minimum amount of display memory the UI allows, i.e. 4 MiB. But I was still not able to reproduce a crash, even with videos running and tab-switching in Firefox.
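
For reference, the display memory can also be checked and changed from the CLI. This is only a sketch, assuming VM ID 100 and a plain qxl display; multi-monitor setups use a different display type, so keep whatever type `qm config` reports. The new value should take effect the next time the VM is started.

```shell
# Hypothetical VM ID; replace it with the ID of the affected VM.
VMID=100

# Show the current display setting (no output means the default is in use).
qm config "$VMID" | grep '^vga'

# Assign 32 MiB of display memory to a qxl display (type and size are examples).
qm set "$VMID" --vga qxl,memory=32
```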

qwaz · Apr 23, 2024

With a memory capacity of more than 16 MB, the picture simply does not display.
I'll try to set 32 MiB for all users and keep an eye on system crashes.
My current settings are as follows:

qwaz · Apr 23, 2024

Failures occur in random order on various virtual machines.

garbled · Apr 23, 2024

My setup for the VM that keeps crashing (also, the only one I heavily use)

Curious, is everyone having this problem on multi-monitor? All my other SPICE VMs are single monitor.

DeeMaas · May 22, 2024

I updated all three nodes to the latest kernel and PVE, and I see that Thread 1 is always the same in all of the latest coredumps (unlink_chunk and qxl_cursor).

First

Code:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa6faa3cfcd in unlink_chunk (p=p@entry=0x7fa4dc292330, av=0x7fa4dc000030) at ./malloc/malloc.c:1628
1628    ./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7fa4e6bff6c0 (LWP 668375))]

Thread 1 (Thread 0x7fa4e6bff6c0 (LWP 668375)):
#0  0x00007fa6faa3cfcd in unlink_chunk (p=p@entry=0x7fa4dc292330, av=0x7fa4dc000030) at ./malloc/malloc.c:1628
#1  0x00007fa6faa3ff4d in _int_malloc (av=av@entry=0x7fa4dc000030, bytes=bytes@entry=4112) at ./malloc/malloc.c:4201
#2  0x00007fa6faa416e2 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at ./malloc/malloc.c:3674
#3  0x00007fa6fc2846d1 in g_malloc0 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x0000558bee8abcb0 in cursor_alloc (width=<optimized out>, height=<optimized out>) at ../ui/cursor.c:103
#5  0x0000558beeb2384b in qxl_cursor (group_id=1, cursor=0x7fa4d325b7c8, qxl=0x558bf25ccda0) at ../hw/display/qxl-render.c:252
#6  qxl_render_cursor (qxl=qxl@entry=0x558bf25ccda0, ext=ext@entry=0x7fa4e6bf9f70) at ../hw/display/qxl-render.c:333
#7  0x0000558beeb225ab in interface_get_cursor_command (sin=0x558bf25cd868, ext=0x7fa4e6bf9f70) at ../hw/display/qxl.c:821
#8  0x00007fa6fc97e1bc in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#9  0x00007fa6fc97ecac in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#10 0x00007fa6fc27e7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#11 0x00007fa6fc27ea38 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x00007fa6fc27ecef in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x00007fa6fc97dfa9 in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#14 0x00007fa6faa31134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007fa6faab17dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Second

Code:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007e2415c35fcd in unlink_chunk (p=p@entry=0x7e21e089fd20, av=0x7e21e0000030) at ./malloc/malloc.c:1628
1628    ./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7e22038006c0 (LWP 3794221))]

Thread 1 (Thread 0x7e22038006c0 (LWP 3794221)):
#0  0x00007e2415c35fcd in unlink_chunk (p=p@entry=0x7e21e089fd20, av=0x7e21e0000030) at ./malloc/malloc.c:1628
#1  0x00007e2415c38dcd in _int_malloc (av=av@entry=0x7e21e0000030, bytes=bytes@entry=6412) at ./malloc/malloc.c:4303
#2  0x00007e2415c3a6e2 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at ./malloc/malloc.c:3674
#3  0x00007e24175326d1 in g_malloc0 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00006212b2da1cb0 in cursor_alloc (width=<optimized out>, height=<optimized out>) at ../ui/cursor.c:103
#5  0x00006212b301984b in qxl_cursor (group_id=1, cursor=0x7e21e91022c8, qxl=0x6212b5dd4ef0) at ../hw/display/qxl-render.c:252
#6  qxl_render_cursor (qxl=qxl@entry=0x6212b5dd4ef0, ext=ext@entry=0x7e22037faf70) at ../hw/display/qxl-render.c:333
#7  0x00006212b30185ab in interface_get_cursor_command (sin=0x6212b5dd59b8, ext=0x7e22037faf70) at ../hw/display/qxl.c:821
#8  0x00007e2417c2e1bc in red_process_cursor (ring_is_empty=0x7e22037fafd4, worker=0x6212b5f0cbd0) at ../server/red-worker.cpp:117
#9  red_process_cursor (worker=0x6212b5f0cbd0, ring_is_empty=0x7e22037fafd4) at ../server/red-worker.cpp:105
#10 0x00007e2417c2ecac in worker_source_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../server/red-worker.cpp:923
#11 0x00007e241752c7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x00007e241752ca38 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x00007e241752ccef in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#14 0x00007e2417c2dfa9 in red_worker_main (arg=0x6212b5f0cbd0) at ../server/red-worker.cpp:1021
#15 0x00007e2415c2a134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007e2415caa7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Third

Code:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000077b8fdeb6fcd in unlink_chunk (p=p@entry=0x77b6c8116cf0, av=0x77b6c8000030) at ./malloc/malloc.c:1628
1628    ./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x77b6eb8006c0 (LWP 1812778))]

Thread 1 (Thread 0x77b6eb8006c0 (LWP 1812778)):
#0  0x000077b8fdeb6fcd in unlink_chunk (p=p@entry=0x77b6c8116cf0, av=0x77b6c8000030) at ./malloc/malloc.c:1628
#1  0x000077b8fdeb9dcd in _int_malloc (av=av@entry=0x77b6c8000030, bytes=bytes@entry=6412) at ./malloc/malloc.c:4303
#2  0x000077b8fdebb6e2 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at ./malloc/malloc.c:3674
#3  0x000077b8ff7b36d1 in g_malloc0 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00005c728c429cb0 in cursor_alloc (width=<optimized out>, height=<optimized out>) at ../ui/cursor.c:103
#5  0x00005c728c6a184b in qxl_cursor (group_id=1, cursor=0x77b6d31a9fe0, qxl=0x5c7290c15ca0) at ../hw/display/qxl-render.c:252
#6  qxl_render_cursor (qxl=qxl@entry=0x5c7290c15ca0, ext=ext@entry=0x77b6eb7faf70) at ../hw/display/qxl-render.c:333
#7  0x00005c728c6a05ab in interface_get_cursor_command (sin=0x5c7290c16768, ext=0x77b6eb7faf70) at ../hw/display/qxl.c:821
#8  0x000077b8ffeaf1bc in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#9  0x000077b8ffeafcac in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#10 0x000077b8ff7ad7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#11 0x000077b8ff7ada38 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x000077b8ff7adcef in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x000077b8ffeaefa9 in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#14 0x000077b8fdeab134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x000077b8fdf2b7dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
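
To compare backtraces like the three above without reading them frame by frame, the function names can be extracted and counted. A rough sketch, assuming each backtrace was saved to its own text file; the file names bt1.txt, bt2.txt and bt3.txt and the helper name bt_functions are hypothetical.

```shell
# Print the function name of every frame in a gdb backtrace read from stdin.
# Handles both "#0  0xADDR in func (...)" and "#6  func (...)" frame lines.
bt_functions() {
    sed -n -E 's/^#[0-9]+ +(0x[0-9a-f]+ in )?([^ ]+) \(.*/\2/p'
}

# Count in how many of the saved backtraces each function appears.
# bt1.txt, bt2.txt and bt3.txt are hypothetical file names.
for f in bt1.txt bt2.txt bt3.txt; do
    if [ -r "$f" ]; then
        bt_functions < "$f" | sort -u
    fi
done | sort | uniq -c | sort -rn
```

Functions that show up in every file, such as unlink_chunk and qxl_cursor here, end up at the top of the list. Frames without symbols are reported as `??`, so installing the debug symbols for libspice-server first makes the comparison more useful.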

DeeMaas · May 22, 2024

Full debug

DeeMaas · May 22, 2024

New crash. It's different.

Only libspice-server.so.1 very often.

Code:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  _int_malloc (av=av@entry=0x7dffec000030, bytes=bytes@entry=240) at ./malloc/malloc.c:4004
4004    ./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7e000fc006c0 (LWP 3999183))]
(gdb) bt
#0  _int_malloc (av=av@entry=0x7dffec000030, bytes=bytes@entry=240) at ./malloc/malloc.c:4004
#1  0x00007e01a23b76e2 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at ./malloc/malloc.c:3674
#2  0x00007e01a3caf6d1 in g_malloc0 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007e01a43990fc in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#4  0x00007e01a43aba2c in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#5  0x00007e01a43abcb7 in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#6  0x00007e01a3ca97a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x00007e01a3ca9a38 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#8  0x00007e01a3ca9cef in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#9  0x00007e01a43aafa9 in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#10 0x00007e01a23a7134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x00007e01a24277dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

freddy77 · May 23, 2024

I am/was a SPICE developer, still following the project.
It's hard to understand these issues; as already said in some posts here, heap corruptions are reported after the fact. I can see that most of the stack traces refer to memory allocations, but some also to memory releases. It seems all stack traces have SPICE in the list, so it should be something close to that code. As far as I understand, this issue appeared after some upgrade. Can we detect which upgrade caused the issue? SPICE has been stable (in a good and bad sense) for a while, so I would expect that something external changed (not saying the bug is not there, but maybe some external condition makes the bug reveal itself).

One useful thing would be to dump some of the memory around the corrupted chunk and try to understand what data was there, possibly a dangling pointer in some code.
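
A possible way to do such a dump from a coredump, as a sketch only. It assumes the crash is in unlink_chunk with glibc debug symbols available (so that the argument `p` is visible in frame 0, as in the backtraces posted earlier). The QEMU binary path and the core file path are assumptions; adjust both.

```shell
# Assumed location of the QEMU binary on the node; verify before use.
QEMU_BIN=/usr/bin/qemu-system-x86_64
# Hypothetical path to the extracted core file.
CORE=/path/to/core

# Select the crashing frame, show its arguments, then dump 64 eight-byte
# words in hex, starting 256 bytes before the chunk pointer p.
gdb --batch \
    -ex 'frame 0' \
    -ex 'info args' \
    -ex 'x/64gx (char *)p - 256' \
    "$QEMU_BIN" "$CORE"
```

Recognizable strings or pixel-like patterns in the dump may hint at which code wrote over the chunk header.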

Another thing to try would be to change the QEMU memory allocator, for instance to jemalloc or tcmalloc. Usually you can override the default (which in your case seems to be the system one from glibc) using the LD_PRELOAD technique. This will change the error pattern, hopefully helping to narrow down the source of the issue. There are also allocators with various debugging additions. For a full virtual machine they can be quite expensive, although some can report the exact point of the corruption. In the past I used ElectricFence, but that's way too heavy.
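
A sketch of the LD_PRELOAD approach, under several assumptions: that the environment of the shell is passed on to the QEMU process started by `qm start`, that jemalloc is installed from Debian's libjemalloc2 package at the path shown, and that the VM ID is 100. The helper name run_with_allocator is made up for this example.

```shell
# Run a command with an alternative allocator preloaded. Refuse to start if
# the library is missing, because the dynamic loader would only print a
# warning and silently fall back to the glibc allocator.
run_with_allocator() {
    lib=$1
    shift
    if [ ! -r "$lib" ]; then
        echo "allocator library not found: $lib" >&2
        return 1
    fi
    LD_PRELOAD="$lib" "$@"
}

# Example call (library path and VM ID are assumptions):
#   run_with_allocator /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 qm start 100
#
# Check that the running QEMU process really mapped the library
# (the pidfile location is an assumption):
#   grep -c jemalloc "/proc/$(cat /var/run/qemu-server/100.pid)/maps"
```

Note that the library is also preloaded into the `qm` tool itself, which should be harmless but is worth keeping in mind if anything behaves oddly.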

DeeMaas · May 23, 2024

We installed Proxmox from proxmox-ve_8.0-2.iso on 11/30/2023.
It has been in active use since January, and the errors began on February 1, 2024. I wrote to the forum when we had version 8.1.4. Now we have 8.2.2. I uploaded a file listing all our updates between January and February 1st. We have been actively filling the cluster with virtual machines since February, so I do not know how much we can rely on this data.
