Proxmox VE 7.1 released!

I have another machine (different Xeon CPU, different vendor, etc.) where I installed 7.1, and the issue shows up there too. My method of reproduction is to start a kernel build; I/O hangs quite quickly with that, within 1-2 minutes of building.

or changing the Async IO mode away from io_uring, as described in a post above for some other issue (not sure if directly related)

That's it. With the default "io_uring" it hangs; with "threads" I'm already in the middle of the kernel build with no problem. Now, how can I track what QEMU is actually doing at that "hang" moment?
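For reference, that override can be applied per disk roughly like this -- a sketch assuming a hypothetical VMID 100 and a virtio0 disk on a storage named local-lvm; check your own volume name with qm config first:
Code:
# show the current disk line (volume name and options)
qm config 100 | grep virtio0
# re-specify the same volume with the async mode overridden
qm set 100 --virtio0 local-lvm:vm-100-disk-0,aio=threads
# the change only applies after a full power-cycle of the VM
qm stop 100 && qm start 100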

Hmm, quite an old CPU (released Q1'2010). Does the system have the newest firmware version installed?

Latest available, yes, which means it is also old. But I don't think the CPU matters for this.

Edit: would be nice to split these posts into separate thread, if possible.
 
Hi,
I have just upgraded my test cluster, but now disk access is incredibly slow and the "migrate" and "ha stop" commands have no effect.
Restarted the servers and Ceph without improvement.
 
Hi,
I have just upgraded my test cluster, but now disk access is incredibly slow and the "migrate" and "ha stop" commands have no effect.
Restarted the servers and Ceph without improvement.
Maybe you are affected by the same problem as me.
I had to edit the VM configs to set aio=native on all VM disks and switch the VirtIO disks to SCSI.

check : https://forum.proxmox.com/threads/s...-guest-raw-on-lvm-on-top-of-drbd.21051/page-2

You can validate this theory by checking for I/O error log messages INSIDE your VMs (e.g. /var/log/syslog)
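For the record, a rough sketch of what that kind of config edit can look like (hypothetical VMID 100 and storage name local-lvm; the VM needs a full stop/start afterwards, and the boot order has to point at the new bus):
Code:
# /etc/pve/qemu-server/100.conf
# before:
#   virtio0: local-lvm:vm-100-disk-0,size=32G
# after: same volume moved to the SCSI bus with native async IO
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-100-disk-0,size=32G,aio=native
boot: order=scsi0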
 
I just upgraded my cluster to 7.1-5 and, man oh man, nothing but problems when it comes to Windows machines altogether. Some machines won't start, others freeze up, and the GUI performance on the VNC console is terrible. DO NOT UPDATE! You have been warned! I have plenty of backups of all of these VMs, and even restores from backups do not work.
 
I just upgraded my cluster to 7.1-5 and, man oh man, nothing but problems when it comes to Windows machines altogether. Some machines won't start, others freeze up, and the GUI performance on the VNC console is terrible. DO NOT UPDATE! You have been warned! I have plenty of backups of all of these VMs, and even restores from backups do not work.
After changing my disks and CD-ROM from SATA to SCSI, it works fine again.
 
I just upgraded my cluster to 7.1-5 and, man oh man, nothing but problems when it comes to Windows machines altogether. Some machines won't start, others freeze up, and the GUI performance on the VNC console is terrible. DO NOT UPDATE! You have been warned! I have plenty of backups of all of these VMs, and even restores from backups do not work.
I got my problem resolved by changing my hard disk's Async IO to native and then changing the cache to the default (no cache) option. Now the Windows machine boots properly. Hours of struggling to resolve this.
 
I think I'll be holding off on upgrading to 7.1 with all these guest/disk issues.

Have any of these been identified as bugs, or are they all customer-hardware-related issues?
 
I think I'll be holding off on upgrading to 7.1 with all these guest/disk issues.

Have any of these been identified as bugs, or are they all customer-hardware-related issues?
There seems to be an actual bug in the io_uring-related stack of the 5.13 kernel that triggers with certain specific setups, namely using SATA as the VM disk bus with Windows as the guest OS, or some seemingly specific workloads on VirtIO Block disks with Linux.
VirtIO-SCSI seems unaffected; nothing could be reproduced with that generally recommended configuration.

We're currently actively investigating this issue. Until it's tackled, it can be worked around by either keeping the 5.11 kernel booted or overriding the async mode as described by me in this thread here.
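For completeness, the kernel side of that workaround can look roughly like this -- a sketch assuming the host boots via proxmox-boot-tool; on plain GRUB installs, the 5.11 entry can simply be picked under "Advanced options" in the boot menu:
Code:
# make sure a 5.11 kernel is (still) installed
apt install pve-kernel-5.11
# list the kernel versions the boot loader knows about
proxmox-boot-tool kernel list
# then pick the 5.11 entry in the boot menu on the next reboot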
 
Hi,
I have just upgraded my test cluster, but now disk access is incredibly slow and the "migrate" and "ha stop" commands have no effect.
Restarted the servers and Ceph without improvement.
After many reboots I have been able to re-enable migration and shutdown. One of the servers was still on 5.11; perhaps that was the problem.
Anyway, there is a general problem of sluggishness.
Please note that I use VirtIO SCSI on all VMs, Linux and Windows.
I am playing with io_uring and so on, but anyway I notice that:
- even SPICE seems slow
- backups hang the servers completely
- I/O delay of 40%

The 7.0 update was flawless; 7.1 is terrible.
 
EDIT: added syslog lines - the VM-hosting drive is NVMe with ext4 (defaults, noatime)

---------------------------------------------------------------------------------------------------------------------------------

Hi folks,

unfortunately I had the same Windows-OS-related freezes after updating - I managed to keep the VMs responding by booting into kernel 5.11 instead of 5.13. No backup was running at the time of the crashes.

When on 5.13, I tried multiple settings WITHOUT success, like:
- CPU type (usually I am on "host", tried kvm64)
- qemu guest agent on/off
- tablet pointer on/off
- ACPI on/off
- CPU freeze on start on/off
- OS Type (only for new Win11 option)


On 5.11 everything works as expected. All Windows machines died on 5.13 almost immediately - configured with
- matching OS Type
- CPU=host (multiple cores, one socket)
- SATA0 as boot drive (discard, backup, SSD emulation, no cache, qcow2)
- Intel e1000 as eth device
- protection off
- firewall off
- ballooning device
- NUMA disabled
- i440fx 5.1
- KVM hardware virtualization
( - Win 8.1 and Win10 with SeaBIOS)
(- Win11 with OVMF on EFI & TPM)


Attached: the configurations of Win10 & Win11 as well as the pveversion output.
Happy to help if you need any additional input.

Right now, stay on PVE 7.0 or boot kernel 5.11 if you use Windows machines - Linux VMs work fine #whoamitojudge ;)

Best,
oernst
 

Attachments

  • config.txt (2.5 KB)
  • syslog.txt (5 KB)
Just to add a setup having issues with io_uring async IO:
Dell R710 with a PERC H700 RAID controller, 6 SAS disks in RAID 10, older CPUs (2x E5640).
About 15 VMs running CentOS 8 Stream, VirtIO SCSI controller, raw disks, no cache, XFS filesystem.
After the latest Proxmox update, VMs sporadically became unresponsive with "task blocked for more than 120 seconds" errors, e.g.:

Code:
kernel: INFO: task ftdc:1660 blocked for more than 120 seconds.
kernel:      Not tainted 4.18.0-348.2.1.el8_5.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:ftdc            state:D stack:    0 pid: 1660 ppid:     1 flags:0x00000080
kernel: Call Trace:
kernel: __schedule+0x2c4/0x700
kernel: schedule+0x37/0xa0
kernel: schedule_timeout+0x274/0x300
kernel: ? xfs_trans_read_buf_map+0x20c/0x360 [xfs]
kernel: __down+0x9b/0xf0
kernel: ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
kernel: down+0x3b/0x50
kernel: xfs_buf_lock+0x33/0xf0 [xfs]
kernel: xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
kernel: xfs_buf_get_map+0x4c/0x320 [xfs]
kernel: xfs_buf_read_map+0x53/0x310 [xfs]
kernel: ? xfs_da_read_buf+0xcf/0x120 [xfs]
kernel: xfs_trans_read_buf_map+0x124/0x360 [xfs]
kernel: ? xfs_da_read_buf+0xcf/0x120 [xfs]
kernel: xfs_da_read_buf+0xcf/0x120 [xfs]
kernel: ? mls_range_isvalid+0x41/0x50
kernel: xfs_dir3_block_read+0x35/0xc0 [xfs]
kernel: ? _cond_resched+0x15/0x30
kernel: xfs_dir2_block_lookup_int+0x4a/0x1d0 [xfs]
kernel: xfs_dir2_block_lookup+0x35/0x120 [xfs]
kernel: ? xfs_dir2_isblock+0x34/0xc0 [xfs]
kernel: xfs_dir_lookup+0x1a1/0x1c0 [xfs]
kernel: xfs_lookup+0x58/0x120 [xfs]
kernel: xfs_vn_lookup+0x70/0xa0 [xfs]
kernel: ? security_inode_create+0x37/0x50
kernel: path_openat+0x878/0x14f0
kernel: ? mem_cgroup_write+0x36/0x190
kernel: ? mod_objcg_state+0x10d/0x250
kernel: ? __switch_to_asm+0x41/0x70
kernel: do_filp_open+0x93/0x100
kernel: ? getname_flags+0x4a/0x1e0
kernel: ? __check_object_size+0xa8/0x16b
kernel: do_sys_open+0x184/0x220

After switching from io_uring to native, the errors stopped.

If there is any information that could help, I would be glad to try to provide it. Thank you for Proxmox and all your work.
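For anyone who wants to double-check which async IO mode a disk actually ends up with, a quick sketch (hypothetical VMID 101):
Code:
# per-disk options in the VM config (aio=... only shows up when set explicitly)
qm config 101 | grep -E 'virtio|scsi|sata|ide'
# effective QEMU command line, including the aio= value of each drive
qm showcmd 101 --pretty | grep aio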
 
I've been having VM disk issues since upgrading to 7.1. I tried loading kernel 5.11.22-7-pve, but the issues still remain. I've now tried using SCSI instead of VirtIO Block, and the issue seems to go away, at least for now.

I also tried using "threads" as the Async IO mode for the VirtIO Block disk, but it didn't help.

I can easily reproduce the problem by running a high disk load like `dd if=/dev/urandom bs=1M count=4000 of=/root/test`. The guest OS is Debian Buster 10.11, but I'm also having the issue on a Fedora guest.

I'm using ZFS mirrors; the issue happens on the NVMe mirrored disks and the HDD mirrored disks. The virtual machines have a simple swap and ext4 partition setup, no LVM at the guest level.

I originally thought it was an issue with my NVMe drives, so those have been replaced with new ones and a fresh install of 7.1 was done; the issue still remains.

CPU: Intel(R) Xeon(R) E-2236 CPU
Memory: 64GB
NVME1: WDC CL SN720 SDAQNTW-512G-2000
NVME2: WDC CL SN720 SDAQNTW-512G-2000
HDD1: HGST_HUS726T4TALA6L1
HDD2: HGST_HUS726T4TALA6L1

Code:
# zpool status
  pool: hdd
 state: ONLINE
  scan: scrub repaired 0B in 00:23:40 with 0 errors on Sat Nov 20 18:02:32 2021
config:

        NAME                                   STATE     READ WRITE CKSUM
        hdd                                    ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-HGST_HUS726T4TALA6L1_V6H29J7S  ONLINE       0     0     0
            ata-HGST_HUS726T4TALA6L1_V6H2GB8S  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:31 with 0 errors on Sat Nov 20 14:05:32 2021
config:

        NAME                                                 STATE     READ WRITE CKSUM
        rpool                                                ONLINE       0     0     0
          mirror-0                                           ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b48b7a1e0-part3  ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b467d6748-part3  ONLINE       0     0     0

errors: No known data errors
 
I know others are having issues, and this is not to diminish those concerns, but I just did an upgrade from 6.4 -> 7.1 and everything is working really great. Good job, Proxmox team!
 
Thank you for the report! There is indeed a (hopefully mostly cosmetic) issue when the pvescheduler service is restarted while a replication is running (as happens when upgrading the pve-manager package). What happens is that the script handling the replication is terminated (which is why it shows as an error in the UI and the log stops), but I think the actual replication should still be running in the background. We'll make sure to fix this.


At that time the replication might still have been running. If it happens again, please check with ps aux | grep pvesm.
Hi Fabian, I have encountered the same issue after the update from 7.0-14 to 7.1-5; some of the replication jobs failed. I also tried to remove and delete the vDisks where necessary, and then tried to rebuild the replications. The replication jobs completed and said OK, but the replicated vDisk size is 0B ... I checked with "ps aux | grep pvesm" on the nodes:
Code:
root@node1:~# ps aux | grep pvesm
root     2428020  0.0  0.0   3836  2712 ?        S    15:47   0:00 /bin/bash -c set -o pipefail && pvesm export pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=CC15' root@192.168.33.27 -- pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root     2428021  0.3  0.0 301300 86808 ?        S    15:47   0:00 /usr/bin/perl /sbin/pvesm export pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__
root     2428022 22.6  0.0  10572  7276 ?        S    15:47   0:36 /usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=CC15 root@192.168.33.27 -- pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root     2852215  0.0  0.0   6180   732 pts/0    S+   15:49   0:00 grep pvesm

root@node2:~# ps aux | grep pvesm
root       30613  0.4  0.0 301304 86472 ?        Ss   15:47   0:00 /usr/bin/perl /usr/sbin/pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root      780319  0.0  0.0   6180   720 pts/0    S+   15:49   0:00 grep pvesm

And sometimes the replication job's status shows:
Code:
2021-11-22 16:00:13 112-0: start replication job
2021-11-22 16:00:13 112-0: guest => VM 112, running => 0
2021-11-22 16:00:13 112-0: volumes => pool1:vm-112-disk-0
2021-11-22 16:00:15 112-0: create snapshot '__replicate_112-0_1637568013__' on pool1:vm-112-disk-0
2021-11-22 16:00:15 112-0: using secure transmission, rate limit: 15 MByte/s
2021-11-22 16:00:15 112-0: full sync 'pool1:vm-112-disk-0' (__replicate_112-0_1637568013__)
2021-11-22 16:00:15 112-0: using a bandwidth limit of 15000000 bps for transferring 'pool1:vm-112-disk-0'
2021-11-22 16:00:16 112-0: full send of pool1/vm-112-disk-0@__replicate_112-0_1637564821__ estimated size is 29.9G
2021-11-22 16:00:16 112-0: send from @__replicate_112-0_1637564821__ to pool1/vm-112-disk-0@__replicate_112-0_1637567762__ estimated size is 386M
2021-11-22 16:00:16 112-0: send from @__replicate_112-0_1637567762__ to pool1/vm-112-disk-0@__replicate_112-0_1637568013__ estimated size is 624B
2021-11-22 16:00:16 112-0: total estimated size is 30.3G
2021-11-22 16:00:16 112-0: volume 'pool1/vm-112-disk-0' already exists
2021-11-22 16:00:16 112-0: 2231304 B 2.1 MB 0.92 s 2433690 B/s 2.32 MB/s
2021-11-22 16:00:16 112-0: write: Broken pipe
2021-11-22 16:00:16 112-0: warning: cannot send 'pool1/vm-112-disk-0@__replicate_112-0_1637564821__': signal received
2021-11-22 16:00:16 112-0: warning: cannot send 'pool1/vm-112-disk-0@__replicate_112-0_1637567762__': Broken pipe
2021-11-22 16:00:16 112-0: warning: cannot send 'pool1/vm-112-disk-0@__replicate_112-0_1637568013__': Broken pipe
2021-11-22 16:00:17 112-0: cannot send 'pool1/vm-112-disk-0': I/O error
2021-11-22 16:00:17 112-0: command 'zfs send -Rpv -- pool1/vm-112-disk-0@__replicate_112-0_1637568013__' failed: exit code 1
2021-11-22 16:00:17 112-0: delete previous replication snapshot '__replicate_112-0_1637568013__' on pool1:vm-112-disk-0
2021-11-22 16:00:17 112-0: end replication job with error: command 'set -o pipefail && pvesm export pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637568013__ | /usr/bin/cstream -t 15000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=node2' root@192.168.33.27 -- pvesm import pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637568013__ -allow-rename 0' failed: exit code 255

root@node1:~# ps aux | grep pvesm
root     3244834  0.0  0.0   3836  2648 ?        S    15:56   0:00 /bin/bash -c set -o pipefail && pvesm export pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637567762__ | /usr/bin/cstream -t 15000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=node2' root@192.168.33.27 -- pvesm import pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637567762__ -allow-rename 0
root     3244835  0.1  0.0 301268 86696 ?        S    15:56   0:00 /usr/bin/perl /sbin/pvesm export pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637567762__
root     3244837  5.2  0.0  10164  7148 ?        S    15:56   0:19 /usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=node2 root@192.168.33.27 -- pvesm import pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637567762__ -allow-rename 0
root     3369126  0.0  0.0   6180   664 pts/0    S+   16:02   0:00 grep pvesm

root@node2:~# ps aux | grep pvesm
root     1630575  0.1  0.0 301272 86372 ?        Ss   15:56   0:00 /usr/bin/perl /usr/sbin/pvesm import pool1:vm-112-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_112-0_1637567762__ -allow-rename 0
root     2081832  0.0  0.0   6180   664 pts/0    S+   16:06   0:00 grep pvesm

What can I do in this situation? Thanks.
 
Hi Fabian, I have encountered the same issue after the update from 7.0-14 to 7.1-5; some of the replication jobs failed. I also tried to remove and delete the vDisks where necessary, and then tried to rebuild the replications. The replication jobs completed and said OK, but the replicated vDisk size is 0B ... I checked with "ps aux | grep pvesm" on the nodes:
Code:
root@node1:~# ps aux | grep pvesm
root     2428020  0.0  0.0   3836  2712 ?        S    15:47   0:00 /bin/bash -c set -o pipefail && pvesm export pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=CC15' root@192.168.33.27 -- pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root     2428021  0.3  0.0 301300 86808 ?        S    15:47   0:00 /usr/bin/perl /sbin/pvesm export pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__
root     2428022 22.6  0.0  10572  7276 ?        S    15:47   0:36 /usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=CC15 root@192.168.33.27 -- pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root     2852215  0.0  0.0   6180   732 pts/0    S+   15:49   0:00 grep pvesm

root@node2:~# ps aux | grep pvesm
root       30613  0.4  0.0 301304 86472 ?        Ss   15:47   0:00 /usr/bin/perl /usr/sbin/pvesm import pool1:vm-104-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_104-1_1637567222__ -allow-rename 0
root      780319  0.0  0.0   6180   720 pts/0    S+   15:49   0:00 grep pvesm

What can I do in this situation? Thanks.
There should be no need to re-create the jobs. As said, this is an issue with the UI/job states. You can see that the actual replication is still running and should complete normally (if there is no other error).

If you deleted the disks, the ongoing job might've gotten confused. I'd just wait for the pvesm processes to finish.
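A quick way to keep an eye on that without re-creating anything (pvesr is the replication CLI; the 104-1 job ID is taken from the process list above, adjust to your own jobs):
Code:
# state of all replication jobs on this node
pvesr status
# is the export/import pipeline still running?
ps aux | grep pvesm
# configuration of a single job
pvesr read 104-1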
 
There should be no need to re-create the jobs. As said, this is an issue with the UI/job states. You can see that the actual replication is still running and should complete normally (if there is no other error).

If you deleted the disks, the ongoing job might've gotten confused. I'd just wait for the pvesm processes to finish.
Hi Fabian, thanks very much. I waited a long time, but it did not help. I had limited the bandwidth to 15 MB/s, but this vDisk is only 30G in size, and it was working well before. I noticed that the error status is due to the replication job rerunning automatically (I set up a two-hour schedule, but it reruns within several minutes). I will try removing the bandwidth limit and see if that improves things. Thanks.
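In case it helps, the per-job rate limit can also be changed from the CLI; a sketch assuming the 112-0 job ID from the log above and that your pvesr version accepts these options (double-check with man pvesr):
Code:
# raise the limit to 50 MB/s ...
pvesr update 112-0 --rate 50
# ... or remove the rate limit entirely
pvesr update 112-0 --delete rate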
 
On 5.11 everything works as expected. All Windows machines died on 5.13 almost immediately - configured with
- matching OS Type
- CPU=host (multiple cores, one socket)
- SATA0 as boot drive (discard, backup, SSD emulation, no cache, qcow2)
- Intel e1000 as eth device
- protection off
- firewall off
- ballooning device
- NUMA disabled
- i440fx 5.1
- KVM hardware virtualization
( - Win 8.1 and Win10 with SeaBIOS)
(- Win11 with OVMF on EFI & TPM)
Hi,
do not use SATA to boot Windows VMs.
Please change to IDE (Microsoft's recommendation) or to VirtIO (my recommendation).

Best regards,
Falk
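For anyone attempting that switch on an existing Windows guest: the VirtIO/SCSI drivers from the virtio-win ISO need to be present in Windows before the boot disk is moved, otherwise it will blue-screen at boot. A rough sketch of one common approach (hypothetical VMID 102 and storage local-lvm):
Code:
# 1. attach a small temporary disk on the target bus so Windows
#    loads a driver for it (install drivers from the virtio-win ISO)
qm set 102 --scsi1 local-lvm:1
# 2. after a reboot with the driver installed, remove the temp disk,
#    then move the boot disk from sata0 to scsi0 in the VM config
#    (/etc/pve/qemu-server/102.conf) and update the boot order
qm set 102 --boot order=scsi0
# 3. full stop/start for the bus change to take effect
qm stop 102 && qm start 102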
 
