ZFS: task txg_sync blocked for more than 300 seconds

casalicomputers · Jul 16, 2015

Hello,
I'm experiencing random system lockups with the following error message:

task txg_sync blocked for more than 300 seconds

The last system lockup occurred a couple of hours ago, and I figured out the problem has somewhat to do with disk I/O.
The only thing I tried to do, is to tweak /etc/sysctl.conf (see at the bottom of this post).

I have Proxmox 3.4-6/102d4547 on a Dell PowerEdge T110 II equipped with:

16GB RAM
2x 500GB SATA Disks (Seagate ST500NM0011)
2x 250GB SSD Drives (Samsung SSD 850 EVO).

SATA disks are directly connected to onboard controller (Intel C200), configured as ZFS mirror (RAID1) and contains the whole system and data, while SSDs are used as cache devices having 2 partitions 30GB mirrored (SLOG) and 2x 200GB (CACHE) each.
No more disks can be added.

Code:

pool                                                     alloc   free   read  write   read  write
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                     114G   350G    300     66  1.24M   316K
  mirror                                                  114G   350G    300     62  1.17M   204K
    ata-ST500NM0011_Z1M0QP69-part3                           -      -    106     12   931K   340K
    ata-ST500NM0011_Z1M0PYK0-part3                           -      -    105     12   897K   340K
logs                                                         -      -      -      -      -      -
  mirror                                                 1.99M  29.7G      0      3  74.6K   112K
    ata-Samsung_SSD_850_EVO_250GB_S21PNXAG406075Z-part1      -      -      0      3  75.0K   112K
    ata-Samsung_SSD_850_EVO_250GB_S21PNXAG406071R-part1      -      -      0      3  75.0K   112K
cache                                                        -      -      -      -      -      -
  ata-Samsung_SSD_850_EVO_250GB_S21PNXAG406075Z-part2     709M   202G      0      2  2.81K   322K
  ata-Samsung_SSD_850_EVO_250GB_S21PNXAG406071R-part2     778M   202G      0      3  3.18K   354K
-------------------------------------------------------  -----  -----  -----  -----  -----  -----

There are 2 VMs on this systems:

1x Windows Server 2003 - RAM: 2GB
1x Linux CentOS - RAM: 4GB

So there are ~10GB RAM free for ZFS use.

Here's my /etc/modprobe.d/zfs.conf

Code:

# MIN: 4Gb
options zfs zfs_arc_min=4294967296
# MAX: 8Gb
options zfs zfs_arc_max=8589934592
# L2ARC tuning
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=26214400
options zfs l2arc_write_boost=52428800

and my /etc/sysctl.conf:

Code:

net.ipv4.tcp_syncookies=1
# -- added after first lockup
vm.swappiness=0
# -- added after second lockup
kernel.panic = 5
kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
# -- added after third lockup
# see: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
vm.dirty_ratio = 5
vm.dirty_background_ratio = 10

pveperf

Code:

CPU BOGOMIPS:      24742.36
REGEX/SECOND:      1377638
HD SIZE:           327.50 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     294.61
DNS EXT:           76.29 ms
DNS INT:           72.69 ms

I need your help to avoid this to happen again.

Thanks.

casalicomputers · Jul 18, 2015

Nobody helps?

spirit · Jul 18, 2015

Hi, I'm not sure with zfs on linux, but that could mean that zfs can flush the slog datas to disks fast enough. They are an interval time of flush time config, which is configurable, don't remember exactly.

Also be carefull with samsung evo, the are pretty shitty with sync write (I have seen a lof of report bug with ceph, with evos are journal)

casalicomputers · Jul 20, 2015

Hello,
thanks for the reply.

So do you think that the SLOG device could be the problem?
I'll try to disable it to see if something changes.

Anyway what SSD do you suggest to use for caching, as an alternative to Intel whose, from what I see, are the best choice?
Maybe Samsung PRO series could be better than EVOs?

Thanks

mir · Jul 20, 2015

Samsung 850 PRO is a cheap alternative to Intel x3[5,7]00. I also have good experience with Corsair Force GT or Corsair Neutron GT. The top dog is still Zeusram.

spirit · Jul 20, 2015

I can reach 20k iops 4k sync write with intel s3500 , 40k iops sync write with intel s3700.

(intel 3500 don't have good write endurance, I'll prefer to choose small (100GB) intel s3700 for this)

casalicomputers · Jul 20, 2015

Ok thank you for the suggestions.
I will keep them in mind for the next deployment.

Meanwhile I just disabled the SLOG device, and I'll wait until the issue appears again (hopely not

)

casalicomputers · Aug 25, 2015

Hi,
I disabled SLOG device, and I'm not experiencing txg_sync lockups like before, but sometimes it happens that the server reboots without any apparent reason.
Probably something this is due to a kernel panic which triggers an automatic reboot (kernel.panic = 5 - /etc/sysctl.conf).

Unfortunately I have no idea of which is the cause since syslog does not give any clue, just some SMART messages:

Code:

Aug 25 13:38:19 pve01 rrdcached[3080]: started new journal /var/lib/rrdcached/journal/rrd.journal.1440502699.405782
Aug 25 13:38:19 pve01 rrdcached[3080]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1440495499.405766
Aug 25 13:45:01 pve01 /USR/SBIN/CRON[860656]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 13:55:02 pve01 /USR/SBIN/CRON[861763]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:05:01 pve01 /USR/SBIN/CRON[862870]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:08:21 pve01 smartd[3162]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Aug 25 14:15:01 pve01 /USR/SBIN/CRON[863978]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:17:01 pve01 /USR/SBIN/CRON[864203]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 25 14:25:01 pve01 /USR/SBIN/CRON[865085]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:35:01 pve01 /USR/SBIN/CRON[866188]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:38:19 pve01 rrdcached[3080]: flushing old values
Aug 25 14:38:19 pve01 rrdcached[3080]: rotating journals
Aug 25 14:38:19 pve01 rrdcached[3080]: started new journal /var/lib/rrdcached/journal/rrd.journal.1440506299.405797
Aug 25 14:38:19 pve01 rrdcached[3080]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1440499099.405808
Aug 25 14:38:20 pve01 smartd[3162]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Aug 25 14:45:01 pve01 /USR/SBIN/CRON[867296]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 14:55:01 pve01 /USR/SBIN/CRON[868401]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:05:01 pve01 /USR/SBIN/CRON[869504]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:08:21 pve01 smartd[3162]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Aug 25 15:08:21 pve01 smartd[3162]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 68
Aug 25 15:15:01 pve01 /USR/SBIN/CRON[870604]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:17:01 pve01 /USR/SBIN/CRON[870828]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 25 15:25:01 pve01 /USR/SBIN/CRON[871717]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:35:01 pve01 /USR/SBIN/CRON[872822]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:38:19 pve01 rrdcached[3080]: flushing old values
Aug 25 15:38:19 pve01 rrdcached[3080]: rotating journals
Aug 25 15:38:19 pve01 rrdcached[3080]: started new journal /var/lib/rrdcached/journal/rrd.journal.1440509899.405796
Aug 25 15:38:19 pve01 rrdcached[3080]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1440502699.405782
Aug 25 15:45:01 pve01 /USR/SBIN/CRON[873933]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 15:55:01 pve01 /USR/SBIN/CRON[875038]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:05:01 pve01 /USR/SBIN/CRON[876147]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:08:21 pve01 smartd[3162]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
Aug 25 16:08:21 pve01 smartd[3162]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69
Aug 25 16:15:01 pve01 /USR/SBIN/CRON[877255]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:17:01 pve01 /USR/SBIN/CRON[877479]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 25 16:25:01 pve01 /USR/SBIN/CRON[878365]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:35:01 pve01 /USR/SBIN/CRON[879475]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:38:19 pve01 rrdcached[3080]: flushing old values
Aug 25 16:38:19 pve01 rrdcached[3080]: rotating journals
Aug 25 16:38:19 pve01 rrdcached[3080]: started new journal /var/lib/rrdcached/journal/rrd.journal.1440513499.405798
Aug 25 16:38:19 pve01 rrdcached[3080]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1440506299.405797
Aug 25 16:38:20 pve01 smartd[3162]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
Aug 25 16:38:20 pve01 smartd[3162]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 116 to 117
Aug 25 16:45:01 pve01 /USR/SBIN/CRON[880581]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 16:55:01 pve01 /USR/SBIN/CRON[881686]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:05:01 pve01 /USR/SBIN/CRON[882793]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:15:01 pve01 /USR/SBIN/CRON[883908]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:17:01 pve01 /USR/SBIN/CRON[884133]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 25 17:25:01 pve01 /USR/SBIN/CRON[885014]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:35:01 pve01 /USR/SBIN/CRON[886118]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:38:19 pve01 rrdcached[3080]: flushing old values
Aug 25 17:38:19 pve01 rrdcached[3080]: rotating journals
Aug 25 17:38:19 pve01 rrdcached[3080]: started new journal /var/lib/rrdcached/journal/rrd.journal.1440517099.405890
Aug 25 17:38:19 pve01 rrdcached[3080]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1440509899.405796
Aug 25 17:38:20 pve01 smartd[3162]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69
Aug 25 17:45:01 pve01 /USR/SBIN/CRON[887256]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 17:55:01 pve01 /USR/SBIN/CRON[888420]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 25 18:09:44 pve01 kernel: imklog 5.8.11, log source = /proc/kmsg started.
---- SERVER REBOOTED HERE ----

Do you have any idea of what's going on?
Maybe there's something wrong with my hardware configuration?
Are there known issues with cheap Seagate ST500NM0011 disks? If so, which model of disks would you suggest?

Any help is appreciated.

mir · Aug 25, 2015

HDD: HGST or WD Red. A HBA could also give some performance and stability especially if you a using SATA ports on a consumer grade motherboard.

mozp · Aug 26, 2015

Hello,

casalicomputers said:
Hello,
I'm experiencing random system lockups with the following error message:
task txg_sync blocked for more than 300 seconds
[...]
I need your help to avoid this to happen again.

You should post your kernel version and your zfs version.

As a pointer I am experiencing similar problems since an upgrade to kernel 2.6.32-40-pve and a newer zfs version

Code:

dmesg | grep -E 'SPL:|ZFS:'
SPL: Loaded module v0.6.4-358_gaaf6ad2
ZFS: Loaded module v0.6.4.1-1099_g7939064, ZFS pool version 5000, ZFS filesystem version 5

See this thread for more: "After Upgrade ZFS pool gone".

My lockups reproducible happen during scrubs of a larger zpool, but it may also be the high IO activity during the scrub.

best regards

casalicomputers · Aug 26, 2015

Here's my ZFS version:

Code:

SPL: Loaded module v0.6.4.2-1
ZFS: Loaded module v0.6.4.2-1, ZFS pool version 5000, ZFS filesystem version 5

and kernel version:

Code:

Linux pve01 2.6.32-40-pve #1 SMP Fri Jul 24 11:16:05 CEST 2015 x86_64 GNU/Linux

mozp · Aug 26, 2015

casalicomputers said:

Here's my ZFS version:

Code:

SPL: Loaded module v0.6.4.2-1
ZFS: Loaded module v0.6.4.2-1, ZFS pool version 5000, ZFS filesystem version 5

Linux pve01 2.6.32-40-pve #1 SMP Fri Jul 24 11:16:05 CEST 2015 x86_64 GNU/Linux

Do you use the enterprise or the no-subscription update channel?
Because on the latter channel I only have this quite old zfs/spl version "SPL: Loaded module v0.6.4-358_gaaf6ad2".

best regards

casalicomputers · Aug 27, 2015

I'm using the enterprise repository.

nixcz · Aug 28, 2015

Is it possible the machine is rebooting due to a heat issue? If the SMART logs are to be believed, it seems to me that 66-69 degrees Celsius is very high for hard drives, and the temperature can be significantly higher for the CPU/RAM/motherboard.

casalicomputers · Aug 28, 2015

Reading around, it seems that most hard drives have 60°C set as the maximum working temperature, so maybe due to high I/O especially during backups (and poor airflow), they overheated leading to some disk malfunction which resulted in kernel panic and consequent reboot. It sounds reasonable, even if I think the system would poweroff itself on a overheat condition...

I'll try improve system cooling and see if the issue persists.

Thanks

mozp · Aug 28, 2015

casalicomputers said:
Reading around, it seems that most hard drives have 60°C set as the maximum working temperature, so maybe due to high I/O especially during backups (and poor airflow), they overheated leading to some disk malfunction which resulted in kernel panic and consequent reboot. It sounds reasonable, even if I think the system would poweroff itself on a overheat condition...

I'll try improve system cooling and see if the issue persists.
Thanks

Check the SMART attributes table of your drives (smartctl -a /dev/sdc) if attribute 190 is really the temperature attribute and if the reported values are really the temperature in (decimal) Celsius. (look at column RAW_VALUE)

Because, as you figured out yourself, a drive temperature of 69°C is way too much! As a rule of thumb it should never exceed 40°C, beyond that life expectancy decreases rapidly. Furthermore no system compontents temperature should exceed 40°C. In a cool room with proper ventilation it would stay considerably lower than that.
AFAIK most/all systems will only power off if CPU (and/or MB) sensors show an overheat condition

regards

casalicomputers · Sep 2, 2015

It happened again

Kernel panic: task txg_syng blocked for 120 seconds. (I just set kernel.hung_task_timeout_secs back to the original value = 120)
System did not reboot automatically, and needed to hold down power button to shut it off.

Code:

Sep  2 03:09:48 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 113 to 114
Sep  2 03:15:01 pve01 /USR/SBIN/CRON[189136]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 03:17:01 pve01 /USR/SBIN/CRON[189369]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep  2 03:25:01 pve01 /USR/SBIN/CRON[190311]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 03:35:01 pve01 /USR/SBIN/CRON[191507]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 03:39:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 71 to 72
Sep  2 03:39:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 107 to 108
Sep  2 03:45:01 pve01 /USR/SBIN/CRON[192677]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 03:55:01 pve01 /USR/SBIN/CRON[193833]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:05:01 pve01 /USR/SBIN/CRON[195011]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:09:46 pve01 rrdcached[3079]: flushing old values
Sep  2 04:09:46 pve01 rrdcached[3079]: rotating journals
Sep  2 04:09:46 pve01 rrdcached[3079]: started new journal /var/lib/rrdcached/journal/rrd.journal.1441159786.036264
Sep  2 04:09:46 pve01 rrdcached[3079]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1441152586.036252
Sep  2 04:09:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 72 to 73
Sep  2 04:09:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 108 to 109
Sep  2 04:09:47 pve01 smartd[3130]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Sep  2 04:15:01 pve01 /USR/SBIN/CRON[196197]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:17:01 pve01 /USR/SBIN/CRON[196430]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep  2 04:25:01 pve01 /USR/SBIN/CRON[197358]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:35:01 pve01 /USR/SBIN/CRON[198537]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:45:01 pve01 /USR/SBIN/CRON[199692]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 04:55:01 pve01 /USR/SBIN/CRON[200841]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:05:01 pve01 /USR/SBIN/CRON[202003]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:09:46 pve01 rrdcached[3079]: flushing old values
Sep  2 05:09:46 pve01 rrdcached[3079]: rotating journals
Sep  2 05:09:46 pve01 rrdcached[3079]: started new journal /var/lib/rrdcached/journal/rrd.journal.1441163386.036284
Sep  2 05:09:46 pve01 rrdcached[3079]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1441156186.036261
Sep  2 05:09:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 74
Sep  2 05:09:47 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 109 to 110
Sep  2 05:15:01 pve01 /USR/SBIN/CRON[203177]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:17:01 pve01 /USR/SBIN/CRON[203401]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep  2 05:25:01 pve01 /USR/SBIN/CRON[204342]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:35:01 pve01 /USR/SBIN/CRON[205503]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:39:48 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Sep  2 05:39:48 pve01 smartd[3130]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 110 to 111
Sep  2 05:45:01 pve01 /USR/SBIN/CRON[206676]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 05:55:01 pve01 /USR/SBIN/CRON[207856]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:05:01 pve01 /USR/SBIN/CRON[209004]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:09:46 pve01 rrdcached[3079]: flushing old values
Sep  2 06:09:46 pve01 rrdcached[3079]: rotating journals
Sep  2 06:09:46 pve01 rrdcached[3079]: started new journal /var/lib/rrdcached/journal/rrd.journal.1441166986.036288
Sep  2 06:09:46 pve01 rrdcached[3079]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1441159786.036264
Sep  2 06:09:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Sep  2 06:09:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Sep  2 06:15:01 pve01 /USR/SBIN/CRON[210173]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:17:02 pve01 /USR/SBIN/CRON[210405]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep  2 06:25:01 pve01 /USR/SBIN/CRON[211330]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:25:01 pve01 /USR/SBIN/CRON[211331]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Sep  2 06:25:10 pve01 pvepw-logger[47718]: received terminate request (signal)
Sep  2 06:25:10 pve01 pvepw-logger[47718]: stopping pvefw logger
Sep  2 06:25:10 pve01 pvepw-logger[211422]: starting pvefw logger
Sep  2 06:25:11 pve01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2876" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep  2 06:27:09 pve01 pvedailycron[211578]: <root@pam> starting task UPID:pve01:00033B0B:03DE6B0A:55E67A9C:aptupdate::root@pam:
Sep  2 06:27:18 pve01 pvedailycron[211723]: update new package list: /var/lib/pve-manager/pkgupdates
Sep  2 06:27:22 pve01 pvedailycron[211578]: <root@pam> end task UPID:pve01:00033B0B:03DE6B0A:55E67A9C:aptupdate::root@pam: OK
Sep  2 06:35:01 pve01 /USR/SBIN/CRON[212648]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:39:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 78 to 79
Sep  2 06:39:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 65 to 64
Sep  2 06:39:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 35 to 36
Sep  2 06:39:47 pve01 smartd[3130]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 114 to 115
Sep  2 06:45:01 pve01 /USR/SBIN/CRON[213810]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 06:55:01 pve01 /USR/SBIN/CRON[214964]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 07:05:01 pve01 /USR/SBIN/CRON[216114]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep  2 07:57:32 pve01 kernel: imklog 5.8.11, log source = /proc/kmsg started. <-- SERVER REBOOTED HERE

I have no idea of what is going on, since there's no clear evidence of what is failing.
I don't think it's a temperature problem for different reasons:

this server was a plain linux box which didn't have such problems before migrating to PVE
the server, this time, hanged during non-working hours (around 7.05AM today) so there were not much I/O and enviroment temperature was likely not too high
SMART should come up with some "failing" alert if temperature was so high
I grepped the whole syslog for Airflow_Temperature_Cel value and found that the average value is around 65, so it seems to be a normal value.

Any idea?

nixcz · Sep 3, 2015

No additional ideas other than that your motherboard components might be going south on you. I just wanted to clear up my earlier assertion. I realized from your latest logs that I was looking at the wrong attribute.

SMART attribute 190 is the difference from 100C, while attribute 194 is the actual temperature. You can see from some lines there that the actual temperature appears to be in the mid to high 30s. Sorry for the misdirection.

casalicomputers · Sep 30, 2015

Hello,
I finally found the cause of the issue!
I just disabled scheduled backups of the two VMs, and the host stayed up for 2 weeks without interruptions.

Now, I'm looking for an alternative solution to vzdump+lzo backups ....

vkhera · Oct 2, 2015

I don't use compression on my backups. One one of my vm's which is too large for vzdump to complete backing up to an NFS mount (in under 24 hours!) I use zfs send/receive with a simple script that does basically "qm snapshot" followed by zfs send/receive followed by "qm delsnapshot" of the prior snapshot and deleting of the old snapshots on the remote zfs volume.

ZFS: task txg_sync blocked for more than 300 seconds

Renowned Member

Renowned Member

Distinguished Member

Renowned Member

Famous Member

Distinguished Member

Renowned Member

Renowned Member

Famous Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

Member

We value your privacy