Chronicle of a Death Foretold

A couple of days ago, PostgreSQL started complaining about failures during its cleanup run.

ERROR:  could not read block 40 of relation "pg_proc_proname_args_nsp_index": Input/output error
\connect: FATAL:  could not read block 0 of relation "pg_class": Input/output error

That was on July 13.

By the 14th, smartd had sent two emails offsite:

From:     root <root@xxx>
Sent:     Thursday, July 14, 2005 09:56:11 p.m.
To:     vicm3@xxxx
Subject:     SMART error (OfflineUncorrectableSector) detected on host: xxxxx

This email was generated by the smartd daemon running on:

host name: xxxx
DNS domain: xxxx
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, 1 Offline uncorrectable sectors

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

From:     root <root@xxxx>
Sent:     Thursday, July 14, 2005 09:56:12 p.m.
To:     vicm3@xxxx
Subject:     SMART error (SelfTest) detected on host: xxxx

This email was generated by the smartd daemon running on:

host name: xxxx
DNS domain: xxxx
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, Self-Test Log error count increased from 0 to 1

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

I got on the machine in question right away, ran smartctl, and went looking through /var/log:

smartctl -a /dev/hda
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV1204H
Serial Number:    0450J1ETA00312
Firmware Version: RK100-13
User Capacity:    120,060,444,672 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 1
Local Time is:    Thu Jul 14 22:31:26 2005 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 112) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (3840) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       407
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   253   253   009    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   098   098   000    Old_age   Always       -       11376h+27m
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
194 Temperature_Celsius_x10 0x0022   128   121   000    Old_age   Always       -       41.9
197 Current_Pending_Sector  0x0033   253   253   009    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   100   100   009    Pre-fail  Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       40

SMART Error Log Version: 1
ATA Error Count: 580 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 580 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 10 40 4e e9 ec  Error: UNC 16 sectors at LBA = 0x0ce94e40 = 216616512

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 10 40 4e e9 ec 00  49d+15:55:04.296  READ DMA
ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA
c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA

Error 579 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 18 38 4e e9 ec  Error: UNC 24 sectors at LBA = 0x0ce94e38 = 216616504

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 18 38 4e e9 ec 00  49d+15:54:57.296  READ DMA
ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA
c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA

Error 578 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 20 30 4e e9 ec  Error: UNC 32 sectors at LBA = 0x0ce94e30 = 216616496

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA
ca 00 10 00 56 ec ec 00  49d+15:54:49.296  WRITE DMA
ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA

Error 577 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 38 56 fe ec  Error: UNC 8 sectors at LBA = 0x0cfe5638 = 217994808

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 38 56 fe ec 00  49d+15:54:41.296  READ DMA
c8 00 20 30 56 fc ec 00  49d+15:54:18.296  READ DMA
c8 00 20 f0 57 01 ec 00  49d+15:54:18.296  READ DMA
c8 00 20 00 56 fe ec 00  49d+15:54:18.296  READ DMA
c8 00 20 20 56 fe ec 00  49d+15:54:18.296  READ DMA

Error 576 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 10 30 56 fe ec  Error: UNC 16 sectors at LBA = 0x0cfe5630 = 217994800

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 10 30 56 fe ec 00  49d+15:54:33.296  READ DMA
c8 00 20 30 56 fc ec 00  49d+15:54:18.296  READ DMA
c8 00 20 f0 57 01 ec 00  49d+15:54:18.296  READ DMA
c8 00 20 00 56 fe ec 00  49d+15:54:18.296  READ DMA
c8 00 20 20 56 fe ec 00  49d+15:54:18.296  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1  Extended offline    Completed: read failure       00%     11375 -
# 2  Short offline       Completed without error       00%     11374 -

Device does not support Selective Self Tests/Logging
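
As a side note, the LBA that smartctl prints on each "Error: UNC" line can be recomputed by hand from the register dump right above it. Here is a minimal Python sketch, assuming 28-bit LBA addressing (the low four bits of DH carry LBA bits 24 to 27, followed by CH, CL and SN). The function names are made up for illustration and are not part of smartmontools.

```python
def parse_register_row(row):
    """Parse the 'ER ST SC SN CL CH DH' hex bytes of a smartctl error-log row."""
    names = ("er", "st", "sc", "sn", "cl", "ch", "dh")
    values = [int(tok, 16) for tok in row.split()[:7]]
    return dict(zip(names, values))


def lba28_from_registers(regs):
    """Assemble a 28-bit LBA: DH[3:0] | CH | CL | SN (assumes LBA28 mode)."""
    return ((regs["dh"] & 0x0F) << 24) | (regs["ch"] << 16) | (regs["cl"] << 8) | regs["sn"]


# Register row copied from "Error 580" in the output above.
regs = parse_register_row("40 51 10 40 4e e9 ec")
lba = lba28_from_registers(regs)
sectors = regs["sc"]                 # SC holds the sector count of the failed read
is_unc = bool(regs["er"] & 0x40)     # bit 6 of the error register is UNC

print(hex(lba), lba, sectors, is_unc)  # 0xce94e40 216616512 16 True
```

The result matches the "UNC 16 sectors at LBA = 0x0ce94e40 = 216616512" line that smartctl printed for error 580.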

In fact, going through the logs, the disk has been having trouble reading all day today:

Jul 14 10:02:35 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=207027771, sector=30682192

That line repeats 90 times!
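
Counting those repeats by eye gets old fast. A rough Python sketch of how the tally could be automated, assuming the kernel messages keep exactly the format shown above and that the disk is hda; the regular expression and the function name are my own guesses for illustration, not something taken from the kernel or from smartmontools:

```python
import re
from collections import Counter

# Matches lines such as:
# "... kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=..., sector=..."
LINE_RE = re.compile(
    r"hda: dma_intr: error=0x(?P<err>[0-9a-fA-F]+) \{ (?P<flags>[^}]*)\}, "
    r"LBAsect=(?P<lba>\d+), sector=(?P<sector>\d+)"
)


def count_uncorrectable(lines):
    """Return a Counter mapping LBAsect -> number of UncorrectableError lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "UncorrectableError" in match.group("flags").split():
            counts[int(match.group("lba"))] += 1
    return counts


# Tiny in-memory sample; in practice the lines would come from the syslog file.
sample = [
    "Jul 14 10:02:35 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, "
    "LBAsect=207027771, sector=30682192",
] * 3 + ["Jul 14 10:02:36 kernel: some unrelated message"]

counts = count_uncorrectable(sample)
for lba, hits in counts.most_common():
    print(lba, hits)
```

Grouping by LBAsect shows whether the errors pile up on a handful of sectors or are spread across the disk.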

So our machine is offline until further notice. :(

Did I mention that I managed to run the backup script, and that it finished writing to the secondary disk, before the machine stopped responding? My congratulations and compliments to the smartmontools team http://smartmontools.sourceforge.net/ this is about the second time they have warned me early enough to get a backup done!

In the last two weeks I have seen more machines fail than in all the time I have been a system administrator... the rains, blackouts and power fluctuations have been brutal :(
Keep up the good work!

