PVE Non-responsive after file upload

MarkLFT

New Member
Aug 31, 2022
Four times in the last two days I have tried to upload an img file to the PVE local storage. Each time the copy seems to complete, and immediately afterwards the browser reports a communication error. We also lose SSH and ping contact with the server. I have left it in this state for 12 hours, and it does not recover.

The file is 15 GB, but the LAN is 1 Gbps and does not seem to be the bottleneck. The drive is an NVMe, and I/O performance during the copy does not seem to be a problem either.

The syslog reports

Sep 29 22:25:33 pve pvedaemon[1241]: <root@pam> successful auth for user 'root@pam'
Sep 29 22:27:09 pve pvedaemon[1240]: <root@pam> successful auth for user 'root@pam'
Sep 29 22:29:01 pve kernel: usb 1-7: new full-speed USB device number 108 using xhci_hcd
Sep 29 22:29:01 pve kernel: usb 1-7: device descriptor read/64, error -71
Sep 29 22:29:32 pve pveproxy[10772]: multipart upload complete (size: 15032385536 time: 213s rate: 67.00MiB/s md5sum: f913e09caa5f3e6c82dd0d8496fe8514)
Sep 29 22:29:32 pve unknow:
-- Reboot --
Sep 29 22:54:06 pve kernel: Linux version 5.15.60-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.60-1 (Mon, 19 Sep 2022 17:53:17 +0200) ()
Sep 29 22:54:06 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro quiet

I do not believe the USB error is the cause, as this device has been connected for several months without problems, whereas the crashes have only happened in the last two days.
As you can see from the syslog, the upload seems to complete, then pve logs an "unknow:" entry. After that I must perform a hardware reset to get the server online again.
Does anyone have any ideas why this might be?
There were only two small VMs running at the time. CPU usage was below 5% and memory usage was 6 GB of 32 GB.
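
As a quick sanity check on the network claim, the figures in the pveproxy line of the syslog above can be converted into a link utilisation. This is only a sketch: the regular expression and the function name `upload_rate` are illustrative and not part of any Proxmox tool, and it assumes `size` is in bytes and `time` is in whole seconds.

```python
import re

# The pveproxy line copied from the syslog above.
LINE = ("Sep 29 22:29:32 pve pveproxy[10772]: multipart upload complete "
        "(size: 15032385536 time: 213s rate: 67.00MiB/s "
        "md5sum: f913e09caa5f3e6c82dd0d8496fe8514)")


def upload_rate(line):
    """Return (size_bytes, seconds, MiB_per_s, Mbit_per_s) for an upload line.

    Hypothetical helper; assumes 'size' is bytes and 'time' is seconds.
    """
    m = re.search(r"size: (\d+) time: (\d+)s", line)
    if m is None:
        raise ValueError("not a 'multipart upload complete' line")
    size, seconds = int(m.group(1)), int(m.group(2))
    mib_per_s = size / seconds / 2**20       # binary megabytes per second
    mbit_per_s = size * 8 / seconds / 1e6    # decimal megabits per second
    return size, seconds, mib_per_s, mbit_per_s


size, seconds, mib_s, mbit_s = upload_rate(LINE)
print(f"{size / 2**30:.2f} GiB in {seconds} s")
print(f"{mib_s:.1f} MiB/s, about {mbit_s:.0f} Mbit/s")
```

For this line the result is 14 GiB at roughly 67 MiB/s, or about 565 Mbit/s, a little over half of a 1 Gbps link. That is consistent with the network not being saturated during the upload.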
 
Unfortunately, I don't have a good idea for why your server crashes. Though, such a sudden and fatal crash might indicate hardware failure. You could try to get an idea of what the kernel is doing right before becoming unresponsive via a Kernel Crash Trace log.
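
Before setting up a crash trace, it may be worth checking what the existing log already recorded just before each hang. The sketch below assumes a plain-text syslog export that contains `-- Reboot --` separator lines like the excerpt in the first post; the function name `lines_before_reboots` is made up for illustration.

```python
from collections import deque

REBOOT_MARKER = "-- Reboot --"  # separator line as it appears in the excerpt


def lines_before_reboots(lines, context=5):
    """Yield, for each reboot marker, the last `context` lines before it."""
    recent = deque(maxlen=context)  # sliding window over the current boot
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == REBOOT_MARKER:
            yield list(recent)
            recent.clear()
        else:
            recent.append(line)


# A few lines taken from the excerpt in the first post (not the full log).
sample = [
    "Sep 29 22:29:01 pve kernel: usb 1-7: new full-speed USB device number 108 using xhci_hcd",
    "Sep 29 22:29:01 pve kernel: usb 1-7: device descriptor read/64, error -71",
    "Sep 29 22:29:32 pve unknow:",
    "-- Reboot --",
    "Sep 29 22:54:06 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro quiet",
]

for tail in lines_before_reboots(sample, context=2):
    print("\n".join(tail))
```

With a real log, an open file object can be passed instead of the `sample` list. If the last lines before every reboot look the same across all four crashes, that pattern is worth posting.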

Some more information about your system would be helpful, i.e. which version are you running (pveversion -v) and what does your storage configuration look like? Are you, e.g., using ZFS? Another thing to try would be whether you are able to upload a smaller file...
 
My Proxmox is 7.2-11. I was using the 5.15 kernel, but today I tried updating to the 5.19 kernel, and the result is still the same.
The storage configuration is two physical drives: a main 1 TB NVMe for the system and VMs, and a 512 GB SATA SSD for templates etc.
I am not using ZFS. For storage I have local and local-lvm, with the VMs in local-lvm. The copy was to the local storage. There is also an NFS share on a NAS that I use for backups. The SSD is also an LVM storage.

I have tried copying to all three storages from two different clients, and it always ends the same way.
I have run hardware diagnostics, and no errors were logged. When I am not copying files the system is stable with no apparent problems, so I am pretty sure there is no hardware issue.

With regard to setting up a Crash Trace log, I do not currently have another server I can use, only my development desktop.
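
A development desktop may be enough for this, assuming the crash trace is captured with the kernel's netconsole module, which sends kernel messages as UDP packets to a remote host. The sketch below is a minimal receiver for the desktop side. Port 6666 is netconsole's default target port, while the function names and the output file name are hypothetical. The PVE host would still need netconsole configured to point at the desktop's IP address.

```python
import socket
from datetime import datetime


def format_message(data, sender_ip, now=None):
    """Turn one received UDP payload into a timestamped log line."""
    now = now or datetime.now()
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{now:%Y-%m-%d %H:%M:%S} {sender_ip} {text}"


def listen(port=6666, logfile="netconsole.log"):
    """Append every packet received on `port` to `logfile` until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a", encoding="utf-8") as out:
        sock.bind(("0.0.0.0", port))
        while True:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            line = format_message(data, sender_ip)
            print(line)
            out.write(line + "\n")
            out.flush()  # keep the file current; the sender may die at any moment


# To start receiving, call listen() and stop it with Ctrl+C.
```

Writing and flushing each line immediately matters here, because the interesting messages are the last ones sent before the server stops responding.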
 
