
A DegradedArray event had been detected


bendevos
24/09/2014, 09h38
Thanks for your replies.

I was able to rebuild the RAID (and understand what was going on).
All that remains is to request the replacement.

PS: Very well-made tutorial!

Nowwhat
21/09/2014, 18h51
Even though it is showing signs of wear, I'd advise you to put it back in the array: http://denisrosenkranz.com/tuto-mdad...raid-logiciel/ - chapter "Simuler une panne de disque" ("Simulating a disk failure").
Making backups is always a good idea.

If OVH agrees to replace the sdb disk, in principle you won't need your backups.


The tutorial also shows how to prepare a disk (sdb) for removal, and then how to activate a new disk: http://denisrosenkranz.com/tuto-mdad...ue-defectueux/
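
For reference, the usual fail / remove / re-add cycle can be sketched as below. This is a minimal sketch and not the tutorial's own script: the helper name `mdadm_replace_plan` is hypothetical, the mapping of arrays to partitions is taken from the layout posted in this thread (md1 on sda1/sdb1, md2 on sda2/sdb2), and the function only builds command strings without running anything. The `sfdisk` step assumes an MBR partition table, which the thread does not confirm.

```python
def mdadm_replace_plan(arrays, good_disk="/dev/sda", bad_disk="/dev/sdb"):
    """Return (before_swap, after_swap) lists of shell command strings.

    arrays maps an md device to the partition number it uses on each disk,
    e.g. {"/dev/md1": 1, "/dev/md2": 2}. Nothing is executed here.
    """
    before, after = [], []
    for md, part in sorted(arrays.items()):
        member = f"{bad_disk}{part}"
        # Mark the member as faulty, then detach it from the array.
        # If the member has already dropped out of the array,
        # these two steps are not needed.
        before.append(f"mdadm --manage {md} --fail {member}")
        before.append(f"mdadm --manage {md} --remove {member}")
    # After the physical swap: copy the partition layout from the good disk
    # (assumption: MBR partition table), then add each partition back.
    after.append(f"sfdisk -d {good_disk} | sfdisk {bad_disk}")
    for md, part in sorted(arrays.items()):
        after.append(f"mdadm --manage {md} --add {bad_disk}{part}")
    return before, after


before, after = mdadm_replace_plan({"/dev/md1": 1, "/dev/md2": 2})
print("\n".join(before + after))
```

Printing the plan first makes it easy to review each command before typing it on a production server; the resync progress can then be followed in /proc/mdstat.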

bendevos
21/09/2014, 18h48
Ah yes, thanks, it's already less of a hassle if there's no need to restore the backup...


OK, I'll double-check the backups and have OVH do the replacement.
Thanks for your insights.

derinhger
21/09/2014, 15h58
Quote: Originally posted by bendevos
So if I try to summarize the situation:
I have a single hardware disk that was configured in software RAID (sda and sdb are mirrors)
You have two physical disks, sda and sdb, configured in RAID1.
Quote: Originally posted by bendevos
there were errors and sdb dropped out of the RAID (probably after a partition filled up and a reboot)
I'd say it is more likely because sdb is showing errors.
Quote: Originally posted by bendevos
the hardware disk has to be replaced by OVH (because?)
Because it is reporting errors: Reallocated_Sector_Ct has a raw value of 184.
Quote: Originally posted by bendevos
I will lose all the data after this swap, so everything will have to be copied back from an external backup.
No, normally there is no risk of data loss, since your sda is in good condition. The backup could still save your life if, by bad luck, sda were to die before or during the RAID resync.
Quote: Originally posted by bendevos
is that right? no way to rebuild the software RAID without replacing the disk?
Yes, you can always try to resync the RAID, but you might as well replace the failing disk first. For one thing, there is no guarantee the sync will work properly; for another, you will have to redo it anyway when you replace sdb, since that disk is not in great shape.

bendevos
21/09/2014, 15h43
So if I try to summarize the situation:
I have a single hardware disk that was configured in software RAID (sda and sdb are mirrors)
there were errors and sdb dropped out of the RAID (probably after a partition filled up and a reboot)
the hardware disk has to be replaced by OVH (because?)
I will lose all the data after this swap, so everything will have to be copied back from an external backup.

Is that right? No way to rebuild the software RAID without replacing the disk?

derinhger
21/09/2014, 14h43
SDB:
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 197191688
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 184

The little guy is not in great shape. First, make sure you have up-to-date backups stored off the machine.
Then replace the disk and resync the RAID.
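
The check made by eye above can be sketched in Python. This is only an illustration: the function name `worrying_attributes` and the set of watched attributes are assumptions, not a smartmontools feature. Raw_Read_Error_Rate is deliberately not flagged, because the healthy sda in this thread also shows a large raw value for it.

```python
# Attributes whose non-zero raw value is treated as a warning sign here
# (assumption: this particular selection is illustrative, not exhaustive).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def worrying_attributes(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


# Rows copied from the sdb output posted in this thread.
sample = """\
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 197191688
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 184
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
"""
print(worrying_attributes(sample))  # {'Reallocated_Sector_Ct': 184}
```

Note that the overall self-assessment still says PASSED for this disk, so the raw reallocated count is the more useful signal here.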

fritz2cat
21/09/2014, 14h19
We need to find out why your sdb dropped out of the RAID.
For now you are running on one leg only, and that is risky.

bendevos
21/09/2014, 12h42
Does this help? (I am still receiving these messages...) Thanks.

bendevos
10/09/2014, 10h56
I got caught out too ;-)
Here is the output:

smartctl -a /dev/sdb

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.23-xxxx-grs-ipv6-64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST1000DM003-1CH162
Serial Number: S1D549VR
LU WWN Device Id: 5 000c50 05aea644b
Firmware Version: CC43
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Sep 10 11:53:59 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 117) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 197191688
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 184
7 Seek_Error_Rate 0x000f 083 060 030 Pre-fail Always - 206123641
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16511
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 13
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 070 054 045 Old_age Always - 30 (Min/Max 29/35)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 030 046 000 Old_age Always - 30 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 203267917234303
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21059368030
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 45385793234

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9 -
# 2 Short offline Completed without error 00% 5 -
# 3 Short offline Completed without error 00% 5 -
# 4 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

fritz2cat
10/09/2014, 10h39
Aaargh! I got fooled by your two smartctl outputs for the same disk!!!

And what does smartctl -a /dev/sdb give?
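
One way to avoid this trap is to compare the "Serial Number" lines: the reports posted for /dev/sda1 and /dev/sda2 both show S1D54921, so they describe the same physical drive, while the /dev/sdb report posted in this thread shows S1D549VR. A minimal Python sketch, with hypothetical helper names and excerpts trimmed from the outputs in this thread:

```python
import re


def serial_number(smartctl_text):
    """Return the 'Serial Number' field of a smartctl report, or None if absent."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_text, re.MULTILINE)
    return match.group(1) if match else None


def same_physical_disk(report_a, report_b):
    """True when both reports carry the same, non-missing serial number."""
    a, b = serial_number(report_a), serial_number(report_b)
    return a is not None and a == b


# Trimmed excerpts of the three reports posted in the thread.
sda1_report = "Device Model: ST1000DM003-1CH162\nSerial Number: S1D54921\n"
sda2_report = "Device Model: ST1000DM003-1CH162\nSerial Number: S1D54921\n"
sdb_report = "Device Model: ST1000DM003-1CH162\nSerial Number: S1D549VR\n"

print(same_physical_disk(sda1_report, sda2_report))  # True
print(same_physical_disk(sda1_report, sdb_report))   # False
```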

bendevos
10/09/2014, 10h27
Hello, and thank you for taking the time to reply.

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931.5G 0 disk
├─sda1 8:1 0 20G 0 part
│ └─md1 9:1 0 20G 0 raid1 /
├─sda2 8:2 0 911G 0 part
│ └─md2 9:2 0 911G 0 raid1 /home
└─sda3 8:3 0 513.2M 0 part [SWAP]
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 20G 0 part
├─sdb2 8:18 0 911G 0 part
└─sdb3 8:19 0 513.2M 0 part [SWAP]

fritz2cat
10/09/2014, 10h12
Your disks do not look faulty, but your two md devices, md1 and md2, are no longer protected as RAID1 mirrors.

You should give more information about your configuration (partition tables, etc.), because there is still something odd: sda1 on one side and sda2 on the other end up on their own.
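
In /proc/mdstat, `[2/1]` means the array expects 2 devices and only 1 is active, and `[U_]` shows one member up and one missing. A small Python sketch of that reading follows; the function name `degraded_arrays` is hypothetical and the sample text is the mdstat excerpt from the alert mail quoted in this thread.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {md_name: (expected, active)} for arrays that are missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


mdstat = """\
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0]
20971456 blocks [2/1] [U_]

md2 : active raid1 sda2[0]
955260864 blocks [2/1] [U_]
"""
print(degraded_arrays(mdstat))  # {'md1': (2, 1), 'md2': (2, 1)}
```

On a live server the same function could be fed the contents of /proc/mdstat read from disk; a healthy two-disk mirror would show `[2/2] [UU]` and return an empty dict.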

bendevos
10/09/2014, 07h55
Does nobody have an opinion on this question? I keep receiving alert messages.


Thanks in advance!

bendevos
20/08/2014, 21h11
Hello,

For the past two days the software RAID monitoring script has been sending me the error below.
With smartctl, however, I do not see anything "broken".

A partition on the disk was full for a moment; could that be the cause of the problem? Is there an operation to run to "repair" the software RAID?

Thanks in advance

A DegradedArray event had been detected on md device /dev/md2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0]
20971456 blocks [2/1] [U_]

md2 : active raid1 sda2[0]
955260864 blocks [2/1] [U_]

unused devices: <none>

smartctl -a -d ata /dev/sda2

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.23-xxxx-grs-ipv6-64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST1000DM003-1CH162
Serial Number: S1D54921
LU WWN Device Id: 5 000c50 05aea8c87
Firmware Version: CC43
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Aug 20 21:44:22 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 111) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 97126320
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 083 060 030 Pre-fail Always - 209348773
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16016
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 13
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 067 051 045 Old_age Always - 33 (Min/Max 32/35)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 033 049 000 Old_age Always - 33 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 99651831217808
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21131626294
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 47052302379

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9 -
# 2 Short offline Completed without error 00% 5 -
# 3 Short offline Completed without error 00% 5 -
# 4 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -d ata /dev/sda1

smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.23-xxxx-grs-ipv6-64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST1000DM003-1CH162
Serial Number: S1D54921
LU WWN Device Id: 5 000c50 05aea8c87
Firmware Version: CC43
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Aug 20 21:42:17 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 111) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 97117664
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 083 060 030 Pre-fail Always - 209348617
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16016
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 13
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 067 051 045 Old_age Always - 33 (Min/Max 32/35)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 033 049 000 Old_age Always - 33 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 127736622366352
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21131622286
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 47052301915

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9 -
# 2 Short offline Completed without error 00% 5 -
# 3 Short offline Completed without error 00% 5 -
# 4 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.