A couple of days ago, PostgreSQL started complaining about failures during its cleanup run.
ERROR: could not read block 40 of relation "pg_proc_proname_args_nsp_index": Input/output error
\connect: FATAL: could not read block 0 of relation "pg_class": Input/output error
That was on July 13.
By the 14th, smartd had sent 2 emails offsite:
From: root <root@xxx>
Sent: Thursday, July 14, 2005 09:56:11 p.m.
To: vicm3@xxxx
Subject: SMART error (OfflineUncorrectableSector) detected on host: xxxxx

This email was generated by the smartd daemon running on:

   host name: xxxx
  DNS domain: xxxx
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, 1 Offline uncorrectable sectors

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

From: root <root@xxxx>
Sent: Thursday, July 14, 2005 09:56:12 p.m.
To: vicm3@xxxx
Subject: SMART error (SelfTest) detected on host: xxxx

This email was generated by the smartd daemon running on:

   host name: xxxx
  DNS domain: xxxx
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, Self-Test Log error count increased from 0 to 1

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.
I went straight to the machine in question, ran smartctl and looked through /var/log:
smartctl -a /dev/hda
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV1204H
Serial Number:    0450J1ETA00312
Firmware Version: RK100-13
User Capacity:    120,060,444,672 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 1
Local Time is:    Thu Jul 14 22:31:26 2005 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 112) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (3840) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       407
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   253   253   009    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   098   098   000    Old_age   Always       -       11376h+27m
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
194 Temperature_Celsius_x10 0x0022   128   121   000    Old_age   Always       -       41.9
197 Current_Pending_Sector  0x0033   253   253   009    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   100   100   009    Pre-fail  Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       40

SMART Error Log Version: 1
ATA Error Count: 580 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 580 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 40 4e e9 ec  Error: UNC 16 sectors at LBA = 0x0ce94e40 = 216616512

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 40 4e e9 ec 00  49d+15:55:04.296  READ DMA
  ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
  ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA
  c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
  ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA

Error 579 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 38 4e e9 ec  Error: UNC 24 sectors at LBA = 0x0ce94e38 = 216616504

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 18 38 4e e9 ec 00  49d+15:54:57.296  READ DMA
  ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
  ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA
  c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
  ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA

Error 578 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 30 4e e9 ec  Error: UNC 32 sectors at LBA = 0x0ce94e30 = 216616496

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 30 4e e9 ec 00  49d+15:54:49.296  READ DMA
  ca 00 08 f4 09 8a e0 00  49d+15:54:49.296  WRITE DMA
  ca 00 10 00 56 ec ec 00  49d+15:54:49.296  WRITE DMA
  ca 00 20 c0 56 eb ec 00  49d+15:54:49.296  WRITE DMA
  ca 00 20 80 56 eb ec 00  49d+15:54:49.296  WRITE DMA

Error 577 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 56 fe ec  Error: UNC 8 sectors at LBA = 0x0cfe5638 = 217994808

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 38 56 fe ec 00  49d+15:54:41.296  READ DMA
  c8 00 20 30 56 fc ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 f0 57 01 ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 00 56 fe ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 20 56 fe ec 00  49d+15:54:18.296  READ DMA

Error 576 occurred at disk power-on lifetime: 11374 hours (473 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 30 56 fe ec  Error: UNC 16 sectors at LBA = 0x0cfe5630 = 217994800

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 30 56 fe ec 00  49d+15:54:33.296  READ DMA
  c8 00 20 30 56 fc ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 f0 57 01 ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 00 56 fe ec 00  49d+15:54:18.296  READ DMA
  c8 00 20 20 56 fe ec 00  49d+15:54:18.296  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        00%            11375  -
# 2  Short offline       Completed without error        00%            11374  -

Device does not support Selective Self Tests/Logging
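The LBA that smartctl prints next to each error can be recomputed from the register dump, which is a handy sanity check when reading these logs. A minimal sketch in Python, assuming 28-bit LBA addressing (the failing command is READ DMA, c8), where the low nibble of DH carries bits 24-27. The function name lba28_from_registers is my own and is not part of smartmontools:

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the ATA task-file registers.

    Assumes LBA28 addressing: SN holds bits 0-7, CL bits 8-15,
    CH bits 16-23 and the low nibble of DH bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values (SN, CL, CH, DH) copied from errors 580 and 577 in the log.
errors = [
    ("580", (0x40, 0x4E, 0xE9, 0xEC)),
    ("577", (0x38, 0x56, 0xFE, 0xEC)),
]

for number, regs in errors:
    lba = lba28_from_registers(*regs)
    print(f"Error {number}: LBA = 0x{lba:08x} = {lba}")
```

For error 580 this gives 0x0ce94e40 = 216616512, the same value smartctl reports, so the five logged errors sit in two small clusters of neighbouring sectors.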
In fact, going through the logs, the disk has been having trouble reading today:
Jul 14 10:02:35 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=207027771, sector=30682192
That line repeats 90 times!
So our machine is offline until further notice. :(
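Counting those repeats by hand gets old quickly. A rough sketch in Python of how the kernel messages could be tallied per device and sector. The regular expression is an assumption built from the single line quoted above (other kernels or drivers may word the message differently), and count_uncorrectable is a hypothetical helper name:

```python
import re
from collections import Counter

# Pattern modelled on the one kernel line quoted in this post (assumption).
PATTERN = re.compile(
    r"kernel: (?P<dev>hd[a-z]): dma_intr: error=0x[0-9a-f]+ "
    r"\{ UncorrectableError \}, LBAsect=(?P<lba>\d+)"
)


def count_uncorrectable(lines):
    """Count UncorrectableError messages per (device, LBA sector)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("lba")))] += 1
    return counts


# Stand-in data: the quoted line repeated three times. On the real machine
# the lines would come from the syslog file instead.
sample = [
    "Jul 14 10:02:35 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, "
    "LBAsect=207027771, sector=30682192",
] * 3

for (dev, lba), n in count_uncorrectable(sample).items():
    print(f"{dev} LBA {lba}: {n} times")
```

Grouping by sector shows at a glance whether the errors come from one bad spot or are spreading across the disk.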
Did I mention that I managed to run the backup script, and that it finished writing to the secondary disk before the machine stopped responding? My congratulations and compliments to the smartmontools team, http://smartmontools.sourceforge.net/. This is about the second time they have warned me with enough time to make a backup!
In the last two weeks I have seen more machines fail than in all my time as a system administrator... the rains, blackouts and power fluctuations have been brutal :(
Keep up the good work!