Proxmox VM with yellow exclamation mark and unresponsive! Please help!

barrynza

Hi All,

I'm running a 3-host cluster that worked fine until a recent Proxmox upgrade from kernel "5.13.19-4-pve #1 SMP PVE 5.13.19-9" to 5.15. Since then I've had these odd issues where a machine shows an exclamation mark and freezes. Luckily I didn't upgrade all 3 servers, so my 3rd node still works fine. Can someone shed some light on what is happening?
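(For comparison, the kernel and package versions on each node can be listed with something like the commands below; this is just a generic check, not output from my hosts.)

Code:
# show the installed PVE packages and kernel on this node
pveversion -v | head -n 5
# show the currently running kernel
uname -r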
Here is the status of the machine:
NOTE: The machine's HDD points to a file server on the network, which is a RAID1 WD NAS.

~# qm status 100 --verbose
balloon: 2147483648
ballooninfo:
actual: 2147483648
max_mem: 2147483648
blockstat:
ide2:
account_failed: 0
account_invalid: 0
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
flush_operations: 0
flush_total_time_ns: 0
idle_time_ns: 63582752539
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
rd_bytes: 90
rd_merged: 0
rd_operations: 5
rd_total_time_ns: 241895
timed_stats:
unmap_bytes: 0
unmap_merged: 0
unmap_operations: 0
unmap_total_time_ns: 0
wr_bytes: 0
wr_highest_offset: 0
wr_merged: 0
wr_operations: 0
wr_total_time_ns: 0
scsi0:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 1
flush_operations: 0
flush_total_time_ns: 0
idle_time_ns: 61312376769
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
rd_bytes: 0
rd_merged: 0
rd_operations: 0
rd_total_time_ns: 0
timed_stats:
unmap_bytes: 0
unmap_merged: 0
unmap_operations: 0
unmap_total_time_ns: 0
wr_bytes: 1318912
wr_highest_offset: 6666747904
wr_merged: 0
wr_operations: 33
wr_total_time_ns: 4158811418
cpus: 2
disk: 0
diskread: 90
diskwrite: 1318912
maxdisk: 10737418240
maxmem: 2147483648
mem: 1499032768
name: pfsense
netin: 13496872
netout: 13468018
nics:
tap100i0:
netin: 213892
netout: 13270534
tap100i1:
netin: 13282870
netout: 197184
tap100i2:
netin: 110
netout: 300
pid: 11530
proxmox-support:
pbs-dirty-bitmap: 1
pbs-dirty-bitmap-migration: 1
pbs-dirty-bitmap-savevm: 1
pbs-library-version: 1.3.1 (4d450bb294cac5316d2f23bf087c4b02c0543d79)
pbs-masterkey: 1
query-bitmap-info: 1
qmpstatus: io-error
running-machine: pc-i440fx-6.1+pve0
running-qemu: 7.0.0
status: running
uptime: 93
vmid: 100
 

Attachments

  • Screenshot 2022-08-26 083859.png (2.5 KB)
qmpstatus: io-error
This shows you that QEMU had problems with I/O. What exactly happened you can probably get out of the syslog (e.g. with journalctl), but most often it's a storage that is full.
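For example, something along these lines would show storage usage and any recent I/O-related log entries for the VM (the VMID 100 is simply taken from the status output above):

Code:
# check usage and free space of all configured storages
pvesm status
# scan the journal since boot for I/O errors or QMP messages about the VM
journalctl -b | grep -Ei 'i/o error|qmp|vm 100'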
 
This shows you that QEMU had problems with I/O. What exactly happened you can probably get out of the syslog (e.g. with journalctl), but most often it's a storage that is full.
Just checked, and there is 476 GB of free space, so that isn't the issue. I think this version of Proxmox is buggy, but I'm still waiting for more suggestions or help.
 
I think this version of Proxmox is buggy, but I'm still waiting for more suggestions or help.
I'm trying to help, so please check the syslog as I already wrote...
 
I'm trying to help, so please check the syslog as I already wrote...
Thanks mate. All right, I have checked the syslog but can't see what exactly is causing this freezing. Below I filtered the syslog for VM 1000, which is currently sitting on the yellow exclamation mark; see the last two lines.
Funny enough, when I migrate the same machine to my 3rd host, which is on a slightly older kernel version, it works fine with no issues. This VM's virtual disk points to the NAS, hence migration can be done. So I'm not sure why my first and second hosts, which are on the newest PVE kernel version, are showing this yellow exclamation mark. :)

Line 671: Aug 26 08:13:53 Glide2 qm[187792]: <root@pam> starting task UPID:Glide2:0002DD91:0040E59E:6307D801:qmstart:1000:root@pam:
Line 672: Aug 26 08:13:53 Glide2 qm[187793]: start VM 1000: UPID:Glide2:0002DD91:0040E59E:6307D801:qmstart:1000:root@pam:
Line 673: Aug 26 08:13:53 Glide2 systemd[1]: Started 1000.scope.
Line 676: Aug 26 08:13:53 Glide2 kernel: device tap1000i0 entered promiscuous mode
Line 681: Aug 26 08:13:53 Glide2 kernel: vmbr0: port 3(fwpr1000p0) entered blocking state
Line 682: Aug 26 08:13:53 Glide2 kernel: vmbr0: port 3(fwpr1000p0) entered disabled state
Line 683: Aug 26 08:13:53 Glide2 kernel: device fwpr1000p0 entered promiscuous mode
Line 684: Aug 26 08:13:53 Glide2 kernel: vmbr0: port 3(fwpr1000p0) entered blocking state
Line 685: Aug 26 08:13:53 Glide2 kernel: vmbr0: port 3(fwpr1000p0) entered forwarding state
Line 686: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 1(fwln1000i0) entered blocking state
Line 687: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 1(fwln1000i0) entered disabled state
Line 688: Aug 26 08:13:53 Glide2 kernel: device fwln1000i0 entered promiscuous mode
Line 689: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 1(fwln1000i0) entered blocking state
Line 690: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 1(fwln1000i0) entered forwarding state
Line 691: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 2(tap1000i0) entered blocking state
Line 692: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 2(tap1000i0) entered disabled state
Line 693: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 2(tap1000i0) entered blocking state
Line 694: Aug 26 08:13:53 Glide2 kernel: fwbr1000i0: port 2(tap1000i0) entered forwarding state
Line 695: Aug 26 08:13:54 Glide2 qm[187792]: <root@pam> end task UPID:Glide2:0002DD91:0040E59E:6307D801:qmstart:1000:root@pam: OK
Line 847: Aug 26 08:37:48 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
Line 859: PHY 1000BASE-T Status <3800>
Line 864: Aug 26 08:37:50 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
Line 876: PHY 1000BASE-T Status <3800>
Line 879: Aug 26 08:37:52 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
Line 891: PHY 1000BASE-T Status <3800>
Line 908: Aug 26 08:37:54 Glide2 kernel: NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
Line 911: Aug 26 08:37:54 Glide2 kernel: rc_core zcommon(PO) znvpair(PO) acpi_pad mac_hid spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio nct6775 hwmon_vid coretemp drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme ahci crc32_pclmul xhci_pci libahci lpc_ich i2c_i801 i2c_smbus nvme_core xhci_pci_renesas ehci_pci ehci_hcd e1000e xhci_hcd video
Line 924: Aug 26 08:37:54 Glide2 kernel: CR2: 00007fbfd6d4a000 CR3: 00000003bc810005 CR4: 00000000003726e0
Line 956: Aug 26 08:37:54 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Reset adapter unexpectedly
Line 966: Aug 26 08:37:57 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Line 1018: Aug 26 08:55:40 Glide2 pvestatd[186716]: VM 1000 qmp command failed - VM 1000 qmp command 'query-proxmox-support' failed - got timeout
 
Please post the complete journal since the last boot (the log file doesn't necessarily have to contain the VMID, depending on what's wrong).
You can redirect it to a file and upload it, for example:
Code:
journalctl -b > /tmp/journal.log
 
Please post the complete journal since the last boot (the log file doesn't necessarily have to contain the VMID, depending on what's wrong).
You can redirect it to a file and upload it, for example:
Code:
journalctl -b > /tmp/journal.log
Here it is!
 

Attachments

  • syslog.txt (326.8 KB)
OK, according to the syslog, your NIC hangs/resets:

Code:
Aug 26 08:37:48 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:             
                                 TDH                  <41>                                           
                                 TDT                  <5c>                                           
                                 next_to_use          <5c>                                           
                                 next_to_clean        <40>                                           
                               buffer_info[next_to_clean]:                                           
                                 time_stamp           <100a68ff4>                                    
                                 next_to_watch        <41>                                           
                                 jiffies              <100a69268>                                    
                                 next_to_watch.status <0>                                            
                               MAC Status             <80083>                                        
                               PHY Status             <796d>                                         
                               PHY 1000BASE-T Status  <3800>                                         
                               PHY Extended Status    <3000>                                         
                               PCI Status             <10>                                           
Aug 26 08:37:50 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:             
                                 TDH                  <41>                                           
                                 TDT                  <5c>                                           
                                 next_to_use          <5c>                                           
                                 next_to_clean        <40>                                           
                               buffer_info[next_to_clean]:                                           
                                 time_stamp           <100a68ff4>                                    
                                 next_to_watch        <41>                                           
                                 jiffies              <100a69458>                                    
                                 next_to_watch.status <0>                                            
                               MAC Status             <80083>                                        
                               PHY Status             <796d>                                         
                               PHY 1000BASE-T Status  <3800>                                         
                               PHY Extended Status    <3000>                                         
                               PCI Status             <10>                                           
Aug 26 08:37:52 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:             
                                 TDH                  <41>                                           
                                 TDT                  <5c>                                           
                                 next_to_use          <5c>                                           
                                 next_to_clean        <40>                                           
                               buffer_info[next_to_clean]:                                           
                                 time_stamp           <100a68ff4>                                    
                                 next_to_watch        <41>                                           
                                 jiffies              <100a69650>                                    
                                 next_to_watch.status <0>                                            
                               MAC Status             <80083>                                        
                               PHY Status             <796d>                                         
                               PHY 1000BASE-T Status  <3800>                                         
                               PHY Extended Status    <3000>                                         
                               PCI Status             <10>

Code:
Aug 26 08:37:54 Glide2 kernel: e1000e 0000:00:19.0 enp0s25: Reset adapter unexpectedly
which means that if your disk is on network storage, QEMU cannot read or write anymore.

I'd check whether the hardware is OK and whether there is any firmware update available.
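For example (assuming the interface name enp0s25 and PCI address 00:19.0 from the log above), the driver and firmware in use could be checked with something like:

Code:
# show driver, driver version and firmware version of the NIC
ethtool -i enp0s25
# show which kernel driver is bound to the PCI device
lspci -vnnk -s 00:19.0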
 
OK, according to the syslog, your NIC hangs/resets [...] I'd check whether the hardware is OK and whether there is any firmware update available.
Thanks mate. The issue is the same problem I have with my first host that got upgraded to the latest PVE version; since the upgrade I'm having these issues. It does sound like a bug in this PVE version, as I never had problems before. The NIC is integrated on the motherboard for both hosts, which are Intel-based NUC PCs.
 
Hi,
you can also try the workaround suggested here.
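A commonly discussed mitigation for e1000e "Detected Hardware Unit Hang" messages (only a sketch, and not necessarily the exact workaround linked above) is to disable segmentation offloading on the affected NIC, e.g.:

Code:
# disable TCP/generic segmentation offload on the onboard NIC
# (not persistent across reboots; interface name assumed from the log)
ethtool -K enp0s25 tso off gso off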
Hi Fiona, I am not sure about the workaround; it looks too risky to try and I'm afraid I will break the cluster. But the findings from deepak above do make sense: the NIC is hanging, and that's why I get problems with VMs, since their disks are retrieved from LAN storage. These issues only started to happen since I upgraded to the latest version of PVE; the reason I can tell is that my 3rd host isn't upgraded and has no issues at all. So can you please check why Intel-based NICs are hanging on the latest version of PVE?
 
Since getting to version 5.15.39-4-pve #1 SMP PVE 5.15.39-4, I started to have these problems.
 
Tried it, but it's not working like the suggested solution on that forum.
 
