[PVE6] ZFS issue (freeze i/o r/w)

elmemis

Hi,

In my PVE 6 installation I have an issue with the ZFS RAID 1: at times all I/O on the disks stalls, nothing is read or written, and the VMs freeze.
The issue affects the sda, sdb, sdc and sdd disks. All four are Seagate BarraCuda ZA2000CM10002 SSDs.
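While it hangs, per-device activity can be watched to confirm whether the whole pool stalls, for example:
Code:
# per-vdev ZFS I/O, refreshed every second
zpool iostat -v ssd-zfs 1
# kernel-level per-device stats (utilization, wait times; from the sysstat package)
iostat -x 1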

zpool status output:
Code:
# zpool status -t
  pool: nvme-zfs
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:36:05 with 0 errors on Sun Mar  8 01:00:07 2020
config:

    NAME                                                STATE     READ WRITE CKSUM
    nvme-zfs                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        nvme-SAMSUNG_MZQLB3T8HALS-000AZ_S3VJNF0K700531  ONLINE       0     0     0  (untrimmed)
        nvme-eui.33564a304b7005300025384600000001       ONLINE       0     0     0  (untrimmed)

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:02:33 with 0 errors on Sun Mar  8 00:26:36 2020
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        ata-HP_SSD_S700_500GB_HBSA39194101315-part3  ONLINE       0     0     0  (untrimmed)
        ata-HP_SSD_S700_500GB_HBSA39194102188-part3  ONLINE       0     0     0  (untrimmed)

errors: No known data errors

  pool: ssd-zfs
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:17 with 0 errors on Sun Mar  8 00:34:21 2020
config:

    NAME        STATE     READ WRITE CKSUM
    ssd-zfs     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0  (untrimmed)
        sdb     ONLINE       0     0     0  (untrimmed)
      mirror-1  ONLINE       0     0     0
        sdc     ONLINE       0     0     0  (untrimmed)
        sdd     ONLINE       0     0     0  (untrimmed)

errors: No known data errors
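
As an aside, the -t flag shows every device as (untrimmed); on ZFS 0.8 the pool can be trimmed manually or continuously, which may be worth doing on consumer SSDs (not necessarily related to the freeze):
Code:
# one-off TRIM of the SSD pool; progress shows up in 'zpool status -t'
zpool trim ssd-zfs
# or let ZFS issue TRIMs continuously
zpool set autotrim=on ssd-zfs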

Package versions:
Code:
# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

[Attachments: 1586536917288.png, 1586537134465.png]

At all of the points shown in the attachments the system froze. With the NVMe mirror I have no problems.

Any suggestions?
 
Nothing in /var/log/syslog or /var/log/kern.log?

How are the drives sda-sdd connected? All on the same controller?
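
To see whether they share a controller, something like this works:
Code:
# HCTL = Host:Channel:Target:LUN; the same host number usually means the same controller
lsblk -o NAME,HCTL,MODEL,SERIAL /dev/sd[a-d]
# full PCI path of each disk
ls -l /sys/block/sd[a-d]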
 
Syslog fragment:
Code:
tail -300 /var/log/syslog | grep -v 'VE replication'
Apr 10 12:41:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:42:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:43:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:44:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:45:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:45:01 pve-us CRON[14416]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 12:46:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:46:18 pve-us systemd[1]: Starting Daily apt download activities...
Apr 10 12:46:19 pve-us systemd[1]: apt-daily.service: Succeeded.
Apr 10 12:46:19 pve-us systemd[1]: Started Daily apt download activities.
Apr 10 12:47:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:48:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:49:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:50:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:51:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:52:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:53:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:54:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:54:36 pve-us pvedaemon[3313]: <root@pam> successful auth for user 'egonzalez@pve'
Apr 10 12:55:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:55:01 pve-us CRON[3391]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 12:56:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:57:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:57:32 pve-us pveproxy[17061]: worker exit
Apr 10 12:57:32 pve-us pveproxy[5435]: worker 17061 finished
Apr 10 12:57:32 pve-us pveproxy[5435]: starting 1 worker(s)
Apr 10 12:57:32 pve-us pveproxy[5435]: worker 19054 started
Apr 10 12:58:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 12:59:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:00:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:01:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:02:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:03:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:04:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:05:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:05:01 pve-us CRON[36812]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 13:06:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 97
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 13:07:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:08:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:09:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:10:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:11:00 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:12:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:12:48 pve-us pvedaemon[3313]: <root@pam> successful auth for user 'egonzalez@pve'
Apr 10 13:13:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:14:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:15:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:15:01 pve-us CRON[18282]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 13:16:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:17:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:17:01 pve-us CRON[10303]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 10 13:18:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:18:06 pve-us pveproxy[17063]: worker exit
Apr 10 13:18:06 pve-us pveproxy[5435]: worker 17063 finished
Apr 10 13:18:06 pve-us pveproxy[5435]: starting 1 worker(s)
Apr 10 13:18:06 pve-us pveproxy[5435]: worker 26257 started
Apr 10 13:19:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:20:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:21:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:22:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:23:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:24:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:25:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:25:01 pve-us CRON[15848]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 13:26:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:27:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:27:48 pve-us pvedaemon[16990]: <root@pam> successful auth for user 'egonzalez@pve'
Apr 10 13:28:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:29:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:29:41 pve-us pveproxy[17060]: worker exit
Apr 10 13:29:41 pve-us pveproxy[5435]: worker 17060 finished
Apr 10 13:29:41 pve-us pveproxy[5435]: starting 1 worker(s)
Apr 10 13:29:41 pve-us pveproxy[5435]: worker 35371 started
Apr 10 13:30:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:31:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:32:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:33:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:34:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:35:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:35:01 pve-us CRON[36328]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 13:36:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 98
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 98
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 94
Apr 10 13:37:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:38:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:38:31 pve-us systemd[1]: Started Session 33377 of user root.
Apr 10 13:39:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:40:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:40:38 pve-us pvedaemon[30714]: starting vnc proxy UPID:pve-us:000077FA:0916BECB:5E90AF96:vncproxy:139:egonzalez@pve:
Apr 10 13:40:38 pve-us pvedaemon[16990]: <egonzalez@pve> starting task UPID:pve-us:000077FA:0916BECB:5E90AF96:vncproxy:139:egonzalez@pve:
Apr 10 13:41:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:42:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:42:49 pve-us pvedaemon[6933]: <root@pam> successful auth for user 'egonzalez@pve'
Apr 10 13:43:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:44:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:44:56 pve-us pvedaemon[16990]: <egonzalez@pve> end task UPID:pve-us:000077FA:0916BECB:5E90AF96:vncproxy:139:egonzalez@pve: OK
Apr 10 13:45:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:45:01 pve-us CRON[14576]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 10 13:46:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:47:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:48:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:49:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:50:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:51:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:52:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:53:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:54:01 pve-us systemd[1]: pvesr.service: Succeeded.
Apr 10 13:55:01 pve-us systemd[1]: pvesr.service: Succeeded.

kern.log fragment:
Code:
tail /var/log/kern.log
Apr  5 02:17:11 pve-us kernel: [1051858.446888] audit: type=1400 audit(1586067430.999:185): apparmor="STATUS" operation="profile_load" label="lxc-142_</var/lib/lxc>//&:lxc-142_<-var-lib-lxc>:unconfined" name="man_filter" pid=5460 comm="apparmor_parser"
Apr  5 02:20:02 pve-us kernel: [1052029.975424] device veth142i0 left promiscuous mode
Apr  5 23:25:01 pve-us kernel: [1127930.107462] audit: type=1400 audit(1586143501.921:193): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-214_</var/lib/lxc>" name="/tmp/" pid=34311 comm="(sh)" flags="rw, remount, bind"
Apr  7 17:27:06 pve-us kernel: [1279255.681521] device tap121i0 entered promiscuous mode
Apr  7 17:27:06 pve-us kernel: [1279255.708623] vmbr0: port 47(tap121i0) entered blocking state
Apr  7 17:27:06 pve-us kernel: [1279255.714812] vmbr0: port 47(tap121i0) entered blocking state
Apr  7 18:32:44 pve-us kernel: [1283193.920701] vmbr0: port 47(tap121i0) entered disabled state
Apr 10 11:56:13 pve-us kernel: [1518605.513490] fwbr102i0: port 2(veth102i0) entered disabled state
Apr 10 11:56:15 pve-us kernel: [1518607.052333] vmbr0: port 6(fwpr102p0) entered disabled state
Apr 10 11:56:15 pve-us kernel: [1518607.057715] fwbr102i0: port 1(fwln102i0) entered disabled state
 
While not perfectly correlated time-wise, this is still interesting:

Code:
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 97
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96

For SSDs to be running close to 100 °C is a lot. I would not be surprised if they throttle themselves quite a bit to keep the temperature down, which may even show up as that freeze.

As a first guess I would try to improve the cooling of the SSDs and see if that helps.
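
To double-check, the drive temperatures can be read directly; smartctl prints the raw value (in degrees) at the end of the attribute line:
Code:
# print the temperature attribute lines for all four drives
for d in /dev/sd{a..d}; do echo "== $d =="; smartctl -A "$d" | grep -i temperature; done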
 
I think it is a bug in the firmware. It sends that message every 30 minutes, and the SSDs are not hot (in the smartctl output further down, the raw value of attribute 194 is 31 °C; the 9x figures smartd logs are the normalized values, not degrees).

Code:
Apr 10 12:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 95 to 97
Apr 10 12:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 98
Apr 10 12:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 93
Apr 10 12:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 96
Apr 10 12:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 99
Apr 10 12:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 93
Apr 10 12:36:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 98
Apr 10 12:36:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 98
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 97
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 13:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 98
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 98
Apr 10 13:36:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 94
Apr 10 14:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 97
Apr 10 14:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 96
Apr 10 14:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 96
Apr 10 14:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 97
Apr 10 14:36:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 94
Apr 10 15:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 98
Apr 10 15:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 98
Apr 10 15:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 93
Apr 10 15:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 97
Apr 10 15:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 100
Apr 10 15:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 100
Apr 10 15:36:34 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 93 to 95
Apr 10 15:36:34 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 99
Apr 10 16:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 100 to 97
Apr 10 16:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 100 to 98
Apr 10 16:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 95 to 94
Apr 10 16:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 99 to 94
Apr 10 16:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 95
Apr 10 16:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 97
Apr 10 16:36:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 96
Apr 10 16:36:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 96
Apr 10 17:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 95 to 98
Apr 10 17:06:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 94
Apr 10 17:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 97
Apr 10 17:06:33 pve-us smartd[4841]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 96 to 97
Apr 10 17:36:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 97
Apr 10 17:36:33 pve-us smartd[4841]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 98
Apr 10 17:36:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 94
Apr 10 18:06:33 pve-us smartd[4841]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 97 to 98
Apr 10 18:06:33 pve-us smartd[4841]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 94 to 97
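
For what it's worth, the 30-minute cadence matches smartd's default check interval (1800 seconds), so periodic change reports at that rhythm are expected behavior. If the messages are only noise, tracking of attribute 194 can be disabled in /etc/smartd.conf, e.g. (a sketch; merge into the existing DEVICESCAN line):
Code:
# ignore changes of attribute 194 (temperature) when tracking
DEVICESCAN -a -I 194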
 
smartctl output for the sda device:
Code:
smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Seagate BarraCuda SSD ZA2000CM10002
Serial Number:    7M2007VB
LU WWN Device Id: 5 000c50 0bb01b8f9
Firmware Version: STAS1024
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Apr 10 18:11:54 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   30) seconds.
Offline data collection
capabilities:              (0x79) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (   2) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5312
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       13
 16 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       197
 17 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       197
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
170 Unknown_Attribute       0x0003   100   100   010    Pre-fail  Always       -       844
173 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       8591769654
174 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       7
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       5
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0023   097   090   057    Pre-fail  Always       -       31 (Min/Max 20/38)
218 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
231 Temperature_Celsius     0x0013   100   100   000    Pre-fail  Always       -       109951162777698
232 Available_Reservd_Space 0x0013   100   100   000    Pre-fail  Always       -       438086664192
233 Media_Wearout_Indicator 0x000b   100   100   000    Pre-fail  Always       -       58803
235 Unknown_Attribute       0x000b   100   100   000    Pre-fail  Always       -       123320202208
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       20220
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       7728

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Have you checked whether firmware updates are available for the SSDs? Otherwise, if you can spare one SSD, I would run some benchmarks and write tests on it to see whether similar behavior can be reproduced.
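
Since the self-test log is empty, running the drive's built-in test could also be informative, e.g.:
Code:
# start an extended self-test (runs in the background on the drive)
smartctl -t long /dev/sda
# once the recommended polling time has passed, read the result
smartctl -l selftest /dev/sda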
 
Yes, they have the latest firmware.

Code:
# fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sda
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2714KiB/s][w=678 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=19517: Sun Apr 12 16:59:52 2020
  write: IOPS=671, BW=2688KiB/s (2752kB/s)(157MiB/60001msec); 0 zone resets
    clat (nsec): min=0, max=36147k, avg=1485733.48, stdev=445332.15
     lat (nsec): min=0, max=36148k, avg=1486113.09, stdev=445336.03
    clat percentiles (usec):
     |  1.00th=[ 1434],  5.00th=[ 1434], 10.00th=[ 1450], 20.00th=[ 1450],
     | 30.00th=[ 1450], 40.00th=[ 1467], 50.00th=[ 1467], 60.00th=[ 1483],
     | 70.00th=[ 1483], 80.00th=[ 1500], 90.00th=[ 1516], 95.00th=[ 1532],
     | 99.00th=[ 1565], 99.50th=[ 1598], 99.90th=[ 2573], 99.95th=[ 4490],
     | 99.99th=[35914]
   bw (  KiB/s): min=    0, max= 2730, per=100.00%, avg=2687.50, stdev=62.81, samples=120
   iops        : min=    0, max=  682, avg=671.85, stdev=15.70, samples=120
  lat (msec)   : 2=99.88%, 4=0.05%, 10=0.02%, 20=0.03%, 50=0.01%
  cpu          : usr=0.39%, sys=1.51%, ctx=100492, majf=1, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,40318,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2688KiB/s (2752kB/s), 2688KiB/s-2688KiB/s (2752kB/s-2752kB/s), io=157MiB (165MB), run=60001-60001msec

Disk stats (read/write):
  sda: ios=9/87205, merge=0/0, ticks=6/64182, in_queue=512, util=99.25%


# fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sdb
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2710KiB/s][w=677 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=10675: Sun Apr 12 17:01:20 2020
  write: IOPS=668, BW=2676KiB/s (2740kB/s)(157MiB/60002msec); 0 zone resets
    clat (nsec): min=0, max=78126k, avg=1492206.14, stdev=667485.31
     lat (nsec): min=0, max=78127k, avg=1492570.49, stdev=667487.85
    clat percentiles (usec):
     |  1.00th=[ 1434],  5.00th=[ 1434], 10.00th=[ 1450], 20.00th=[ 1450],
     | 30.00th=[ 1450], 40.00th=[ 1467], 50.00th=[ 1467], 60.00th=[ 1467],
     | 70.00th=[ 1483], 80.00th=[ 1500], 90.00th=[ 1516], 95.00th=[ 1532],
     | 99.00th=[ 1582], 99.50th=[ 1631], 99.90th=[10421], 99.95th=[11600],
     | 99.99th=[35914]
   bw (  KiB/s): min=    0, max= 2736, per=100.00%, avg=2675.60, stdev=69.14, samples=120
   iops        : min=    0, max=  684, avg=668.87, stdev=17.28, samples=120
  lat (msec)   : 2=99.74%, 4=0.12%, 10=0.03%, 20=0.10%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.53%, sys=1.49%, ctx=98975, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,40141,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2676KiB/s (2740kB/s), 2676KiB/s-2676KiB/s (2740kB/s-2740kB/s), io=157MiB (164MB), run=60002-60002msec

Disk stats (read/write):
  sdb: ios=97/87068, merge=0/0, ticks=41/64145, in_queue=672, util=99.05%


# fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sdc
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2726KiB/s][w=681 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=35809: Sun Apr 12 17:02:30 2020
  write: IOPS=675, BW=2704KiB/s (2768kB/s)(158MiB/60001msec); 0 zone resets
    clat (nsec): min=0, max=36235k, avg=1476982.89, stdev=421301.37
     lat (nsec): min=0, max=36236k, avg=1477344.83, stdev=421303.90
    clat percentiles (usec):
     |  1.00th=[ 1434],  5.00th=[ 1434], 10.00th=[ 1434], 20.00th=[ 1450],
     | 30.00th=[ 1450], 40.00th=[ 1467], 50.00th=[ 1467], 60.00th=[ 1467],
     | 70.00th=[ 1483], 80.00th=[ 1483], 90.00th=[ 1500], 95.00th=[ 1532],
     | 99.00th=[ 1565], 99.50th=[ 1582], 99.90th=[ 1745], 99.95th=[ 2040],
     | 99.99th=[35914]
   bw (  KiB/s): min=    0, max= 2736, per=100.00%, avg=2702.88, stdev=60.10, samples=120
   iops        : min=    0, max=  684, avg=675.67, stdev=15.02, samples=120
  lat (msec)   : 2=99.94%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.41%, sys=1.66%, ctx=98748, majf=0, minf=15
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,40554,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2704KiB/s (2768kB/s), 2704KiB/s-2704KiB/s (2768kB/s-2768kB/s), io=158MiB (166MB), run=60001-60001msec

Disk stats (read/write):
  sdc: ios=11/87967, merge=0/0, ticks=7/64198, in_queue=252, util=99.56%


# fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sdd
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=2732KiB/s][w=683 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=21779: Sun Apr 12 17:03:47 2020
  write: IOPS=675, BW=2702KiB/s (2767kB/s)(158MiB/60001msec); 0 zone resets
    clat (nsec): min=0, max=77767k, avg=1477969.53, stdev=560611.78
     lat (nsec): min=0, max=77768k, avg=1478364.18, stdev=560617.05
    clat percentiles (usec):
     |  1.00th=[ 1434],  5.00th=[ 1434], 10.00th=[ 1434], 20.00th=[ 1450],
     | 30.00th=[ 1450], 40.00th=[ 1450], 50.00th=[ 1467], 60.00th=[ 1467],
     | 70.00th=[ 1483], 80.00th=[ 1483], 90.00th=[ 1500], 95.00th=[ 1532],
     | 99.00th=[ 1565], 99.50th=[ 1582], 99.90th=[ 1745], 99.95th=[ 1926],
     | 99.99th=[35914]
   bw (  KiB/s): min=    0, max= 2736, per=100.00%, avg=2701.59, stdev=66.26, samples=119
   iops        : min=    0, max=  684, avg=675.38, stdev=16.59, samples=119
  lat (msec)   : 2=99.96%, 4=0.02%, 20=0.01%, 50=0.01%, 100=0.01%
  cpu          : usr=0.38%, sys=1.72%, ctx=98705, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,40528,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2702KiB/s (2767kB/s), 2702KiB/s-2702KiB/s (2767kB/s-2767kB/s), io=158MiB (166MB), run=60001-60001msec

Disk stats (read/write):
  sdd: ios=97/87809, merge=0/0, ticks=31/64133, in_queue=376, util=99.40%
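
These are short sequential sync-write runs; to try to reproduce the stall, a longer random-write run at higher queue depth may be more telling. A sketch with illustrative parameters (note that writing to the raw device destroys its contents, so only use a drive pulled from the pool):
Code:
fio --name=stress --filename=/dev/sda --ioengine=libaio --iodepth=32 \
    --rw=randwrite --bs=4k --direct=1 --size=20G \
    --time_based --runtime=600 --group_reporting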