
Software RAID 1 recovery time and performance


Patricec
07/03/2014, 17h09
Heartfelt thanks for this confirmation.
sda shows no errors and seems to be holding up. My backups are practically up to date.
I've just opened a ticket...
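(For reference, a quick way to keep an eye on sda while the ticket is being handled; the grep pattern is only a suggestion:)

# Health verdict plus the attributes that matter most on a failing disk
smartctl -H -A /dev/sda | grep -Ei 'overall|Realloc|Pending|Uncorrect'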

Athar
07/03/2014, 16h48
Yeah, I was going to say: sda is missing from your output. If it's a bad batch, it may be in just as rough shape.

1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 134862176
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 353788000
Nice numbers in any case... impressive that it's still spinning xD
On my home NAS I had lower values, yet the disk was beyond dead, with 65% of the data on it lost.

Nico94
07/03/2014, 15h33
sdb is dead; have it replaced.
Now cross your fingers that the other one stays healthy... but do make sure your backups are running properly anyway.
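(For reference, a typical preparation before the swap, assuming the md1/md2 layout shown further down the thread: mark sdb's partitions as failed and pull them from the arrays.)

mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md2 --fail /dev/sdb2 --remove /dev/sdb2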

Patricec
07/03/2014, 15h29
I changed speed_limit_min (and speed_limit_max) without any effect on the recovery throughput, which is still hovering around 200K/s.

I also saw that some people recommend changing the read-ahead size. I set it to 65536 with no apparent effect.
There is also the option of enabling a bitmap (mdadm --grow --bitmap=internal /dev/mdx), but it seems this parameter cannot be adjusted while a recovery is in progress.
One last thing I saw, but have not tried, is disabling NCQ on each disk.
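(The commands behind these tunings would look roughly like this; a sketch only, with device names and values assumed from the thread:)

# Resync/recovery speed floor and ceiling, in KB/s
echo 50000 > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max
# Read-ahead on the array device, in 512-byte sectors
blockdev --setra 65536 /dev/md2
# Write-intent bitmap (refused while a recovery is running)
mdadm --grow --bitmap=internal /dev/md2
# Disabling NCQ amounts to forcing the queue depth to 1 on each member disk
echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth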

As for each disk's SMART diagnostics, here is the output:

root@ns353067:~# smartctl -a /dev/sdb
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: ST1000DM003-9YN162
Serial Number: W1D14D4W
Firmware Version: CC4H
User Capacity: 1 000 204 886 016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Mar 7 14:27:17 2014 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 108) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 134862176
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 353788000
9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15076
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 091 091 000 Old_age Always - 9
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 64425492495
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 057 045 Old_age Always - 29 (Lifetime Min/Max 23/43)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 17
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 30
194 Temperature_Celsius 0x0022 029 043 000 Old_age Always - 29 (0 20 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 198397424319202
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 5921659531336
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 192777722350774

SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9 occurred at disk power-on lifetime: 14990 hours (624 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 c0 f3 3f 01

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 00 f0 3f 41 00 26d+05:51:29.910 READ FPDMA QUEUED
60 00 80 00 f4 3f 41 00 26d+05:51:29.909 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 26d+05:51:29.909 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 26d+05:51:29.909 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 26d+05:51:29.908 IDENTIFY DEVICE

Error 8 occurred at disk power-on lifetime: 14990 hours (624 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 c0 f3 3f 01

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 08 ff ff ff 4f 00 26d+05:51:26.089 WRITE FPDMA QUEUED
61 00 10 ff ff ff 4f 00 26d+05:51:26.089 WRITE FPDMA QUEUED
61 00 20 ff ff ff 4f 00 26d+05:51:26.089 WRITE FPDMA QUEUED
61 00 10 ff ff ff 4f 00 26d+05:51:26.089 WRITE FPDMA QUEUED
61 00 08 ff ff ff 4f 00 26d+05:51:26.089 WRITE FPDMA QUEUED

Error 7 occurred at disk power-on lifetime: 14855 hours (618 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 ff ff ff 4f 00 20d+14:56:25.953 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 20d+14:56:25.953 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 20d+14:56:25.814 FLUSH CACHE EXT
ea 00 00 00 00 00 a0 00 20d+14:56:25.583 FLUSH CACHE EXT
60 00 10 ff ff ff 4f 00 20d+14:56:23.828 READ FPDMA QUEUED

Error 6 occurred at disk power-on lifetime: 13597 hours (566 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 38 2b 40 01

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 81 27 40 41 00 17d+21:49:48.038 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 17d+21:49:48.037 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 17d+21:49:48.037 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 17d+21:49:48.037 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 17d+21:49:48.037 SET FEATURES [Set transfer mode]

Error 5 occurred at disk power-on lifetime: 13597 hours (566 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 38 2b 40 01

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 81 27 40 41 00 17d+21:49:45.179 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 17d+21:49:45.178 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 17d+21:49:45.178 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 17d+21:49:45.178 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 17d+21:49:45.178 SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Self-test routine in progress 90% 15027 -
# 2 Extended offline Interrupted (host reset) 00% 14906 -
# 3 Short offline Aborted by host 90% 14838 -
# 4 Short offline Completed without error 00% 44 -
# 5 Short offline Completed without error 00% 39 -
# 6 Short offline Completed without error 00% 39 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Nico94
07/03/2014, 15h26
+1

A RAID 1 rebuild whose projected duration is over 40 days is highly suspect. It smells like a disk problem.
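(The arithmetic checks out against the mdstat output further down the thread, where finish=64042.2min:)

awk 'BEGIN { printf "%.1f days\n", 64042.2 / 60 / 24 }'    # -> 44.5 days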

Athar
07/03/2014, 14h02
Personally I would check the disks' SMART status; as far as I'm concerned, a disk doesn't drop out of a RAID on its own (unless it was an automatic "check").
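(A quick way to do that, for reference:)

# Overall health plus the device error log for both members
for d in /dev/sda /dev/sdb; do smartctl -H -l error "$d"; done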

starouille
07/03/2014, 13h14
Of course there's an impact.

Personally, I pull my production servers out of service during a rebuild and fail over to a standby server.

And I crank the rebuild rate all the way up. (With software RAID, an "echo 5000000 > /proc/sys/dev/raid/speed_limit_min" should help quite a bit, but your server will be maxed out on I/O.)
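(And to watch whether the rebuild rate actually picks up, something like:)

watch -d -n 60 cat /proc/mdstat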

Patricec
07/03/2014, 12h47
Hello,

My dedicated server has been very slow for a few days, with unresponsive websites that frequently time out. After checking the status of Apache, MySQL and so on, I came to suspect a disk performance problem.

For a reason I haven't figured out, the 1 TB RAID 1 on my SuperLoad Mini 60 server went into recovery three days ago; it is at 6% as of today, at a very low speed of around 200K/s.
/var# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdb1[1] sda1[0]
10485696 blocks [2/2] [UU]

md2 : active raid1 sdb2[1] sda2[0]
965746624 blocks [2/2] [UU]
[=>...................] check = 6.2% (60154432/965746624) finish=64042.2min speed=235K/sec

unused devices: <none>

At this rate, the recovery will finish in more than 1,000 hours!!!
The disk write speed seems fine, however:

dd if=/dev/zero of=test.data bs=4k count=128k
131072+0 records in
131072+0 records out
536870912 bytes (537 MB) copied, 0.317581 s, 1.7 GB/s
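(Worth noting: without a flush, dd mostly measures the page cache, which is where the 1.7 GB/s comes from. A variant that forces the data to disk before reporting would be:)

dd if=/dev/zero of=test.data bs=4k count=128k conv=fdatasync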

But an hdparm -tT on sdb never completes; after at least half an hour it just ends with an "Alarm clock" message.

Hence my question: does the RAID being in recovery state degrade my disks' I/O performance enough to make the server this slow, or is there a write problem on one (or both) of the RAID disks that caused the desynchronization and triggered the recovery in the first place?

Thanks in advance for your remarks and suggestions.
Regards,
Patricec