Recovering RAID array on Synology DiskStation DS413j
I received a dead Synology DiskStation DS413j with some corporate files on it. Should I mention that those files were important?! And yeah, they didn't have backups ...
Initially, after plugging in the Ethernet cable and power cord, I wasn't able to boot it up. It was stuck with a blue ring of death (the main blue LED kept flashing all the time, the hard disk LEDs stayed off). Hm ... I logged in to my home router to see if there was a DHCP lease for the DiskStation, but nothing. It didn't look promising ...
Afterwards I decided to pull the disks out of the station and boot it up without them. In the meantime, I found information on the official site that there is a find.synology.com URL which finds your device on your local network. Wow, it did the trick: the DiskStation was found and almost booted up. Synology Web Assistant complained about the no-disk configuration and required the device to be filled with disks so that DiskStation Manager (DSM) could be installed ... Chicken and egg ... Argh :/
Regardless of that, I tried to recover the data by plugging the disks into a spare machine with enough available SATA ports. The DiskStation was delivered to me with 3 x 2TB disks. After plugging in the disks, the spare machine was booted with a Debian live distribution from a USB stick.
Some additional packages needed to be installed.
root@debian:~# apt-get update
root@debian:~# apt-get install lvm2 mdadm smartmontools hdparm
Whenever I plugged new hardware into a machine, I usually ran the dmesg command.
root@debian:~# dmesg
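The full dmesg output is quite noisy, so as a small convenience (not something I captured from the original session) the disk detection lines can be filtered out, for example:

root@debian:~# dmesg | grep -iE 'sd[a-z]|ata[0-9]'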
At this point we knew that the disks were in a RAID configuration, and we wanted to know more about it.
root@debian:~# cat /proc/mdstat
root@debian:~# lsblk
There was an inactive md2 device (md - multiple device, also known as a Linux RAID device) and two disks, sda and sdb ... Err, where was the third one? sdc, where were you?
After 30 minutes of trying to connect/reconnect it to a few other computers/controllers, I gave up ... R.I.P. sdc! Anyway, two disks should have been enough to recover the RAID array, except one of them was failing ...
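For reference, mdadm can also dump each member's RAID superblock, which shows the array level, how many devices it expects and which role each member plays. This is a sketch I'd use rather than something from the original session - the sdX3 partition numbers are an assumption (Synology typically puts the data array on the third partition of each disk):

# sdX3 is an assumption - examine whichever partitions lsblk listed as RAID members
root@debian:~# mdadm --examine /dev/sda3
root@debian:~# mdadm --examine /dev/sdb3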
OK. So far so good. We were focused on the md2 device.
First try to access/mount the md2 device:
root@debian:~# mdadm -Asf
root@debian:~# vgchange -ay
Ahh, there was an I/O error on sdb at sector 9453290 ...
* Unfortunately, at that moment I didn't copy/paste those command outputs from the spare machine because I had no idea that a blog post would be made in the end. But I took a picture with my phone (for the customer report).
It looked like we had some errors ... Then came smartmontools. From their site - "The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI harddisks. In many cases, these utilities will provide advanced warning of disk degradation and failure". Exactly what we were looking for.
To show the disk information we used:
root@debian:~# smartctl -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z1E3X2B2
LU WWN Device Id: 5 000c50 05074dd05
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec 21 14:37:04 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 592) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 228) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       218846716
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9900
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       12894131789
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       19150
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   091   091   020    Old_age   Always       -       9504
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       468
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   047   045    Old_age   Always       -       26 (Min/Max 23/26)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   096   096   000    Old_age   Always       -       9319
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       11542
194 Temperature_Celsius     0x0022   026   053   000    Old_age   Always       -       26 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19102h+56m+20.786s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       663488264
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       9144747847

SMART Error Log Version: 1
ATA Error Count: 960 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 960 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:03.258  READ DMA
  c8 00 08 00 00 90 e0 00      00:02:03.250  READ DMA
  27 00 00 00 00 00 e0 00      00:02:03.249  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:02:03.242  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:02:03.233  SET FEATURES [Set transfer mode]

Error 959 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:03.258  READ DMA
  c8 00 08 00 00 90 e0 00      00:02:03.250  READ DMA
  27 00 00 00 00 00 e0 00      00:02:03.249  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:02:03.242  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:02:03.233  SET FEATURES [Set transfer mode]

Error 958 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:02.864  READ DMA
  c8 00 08 e0 3e 90 e0 00      00:02:02.842  READ DMA
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  c8 00 08 00 00 00 e0 00      00:02:02.841  READ DMA

Error 957 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:02.864  READ DMA
  c8 00 08 e0 3e 90 e0 00      00:02:02.842  READ DMA
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  c8 00 08 00 00 00 e0 00      00:02:02.841  READ DMA

Error 956 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:01:51.614  READ DMA
  27 00 00 00 00 00 e0 00      00:01:51.613  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:01:51.605  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:51.597  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:51.597  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
There really was an error at sector 9453290:
40 51 00 ea 3e 90 00 Error: UNC at LBA = 0x00903eea = 9453290
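Just as a sanity check (my own, not from the original session), the hex LBA from the error log converts to the same decimal sector number:

root@debian:~# printf '%d\n' 0x00903eea
9453290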
If we take a deeper look at the smartctl output, we can see that the disk had more bad sectors:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
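A quick way to keep an eye on just those two attributes (a convenience sketch, not part of the original session) is to filter the attribute table:

root@debian:~# smartctl -A /dev/sdb | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'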
With hdparm I was able to read specific sector(s).
root@debian:~# hdparm --read-sector 9453290 /dev/sdb
/dev/sdb:
reading sector 9453290: FAILED: Input/output error
Then we wanted to see if there were more bad sectors close to sector 9453290.
root@debian:~# hdparm --read-sector 9453289 /dev/sdb
/dev/sdb:
reading sector 9453289: succeeded
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
This one looked OK.
root@debian:~# hdparm --read-sector 9453291 /dev/sdb
/dev/sdb:
reading sector 9453291: FAILED: Input/output error

root@debian:~# hdparm --read-sector 9453292 /dev/sdb
/dev/sdb:
reading sector 9453292: FAILED: Input/output error

root@debian:~# hdparm --read-sector 9453293 /dev/sdb
/dev/sdb:
reading sector 9453293: FAILED: Input/output error

root@debian:~# hdparm --read-sector 9453294 /dev/sdb
/dev/sdb:
reading sector 9453294: FAILED: Input/output error

root@debian:~# hdparm --read-sector 9453295 /dev/sdb
/dev/sdb:
reading sector 9453295: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453296 /dev/sdb
/dev/sdb:
reading sector 9453296: succeeded
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
We found exactly six bad sectors (9453290-9453295).
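Instead of probing sector by sector, the same scan could have been scripted; a minimal sketch (the range below is just the neighbourhood around the first bad sector, and it relies on the FAILED message hdparm prints for an unreadable sector):

# sketch only: probe a small range and report sectors that fail to read
root@debian:~# for s in $(seq 9453285 9453300); do hdparm --read-sector "$s" /dev/sdb 2>&1 | grep -q FAILED && echo "bad sector: $s"; done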
Because we are root and we have power, we reallocated the bad sectors by simply issuing a write to each of them (writing to an unreadable sector gives the drive a chance to remap it).
root@debian:~# hdparm --write-sector 9453290 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453290: succeeded

root@debian:~# hdparm --write-sector 9453291 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453291: succeeded

root@debian:~# hdparm --write-sector 9453292 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453292: succeeded

root@debian:~# hdparm --write-sector 9453293 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453293: succeeded

root@debian:~# hdparm --write-sector 9453294 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453294: succeeded

root@debian:~# hdparm --write-sector 9453295 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb:
re-writing sector 9453295: succeeded
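The same six writes could of course be done in one loop; a minimal sketch (destructive, so only run it against sectors already confirmed unreadable):

# sketch only: rewrite the confirmed-bad range 9453290-9453295
root@debian:~# for s in $(seq 9453290 9453295); do hdparm --write-sector "$s" --yes-i-know-what-i-am-doing /dev/sdb; done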
We tried to access our md2 device again:
root@debian:~# mdadm -Asf && vgchange -ay
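To double-check that the array really came up (degraded, with two of the three members), something like this could be used - I didn't capture it at the time:

root@debian:~# cat /proc/mdstat
root@debian:~# mdadm --detail /dev/md2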
And finally we were able to mount it (read-only, to be safe):
root@debian:~# mount -o ro /dev/md2 /mnt
root@debian:~$ ls /mnt
@appstore     aquota.user    @database  homes  @S2S    @syncd.core   @tmp
aquota.group  @cloudstation  @eaDir     share  @spool  synoquota.db
Well, that was it! I was able to copy/rsync the restored data to some temporary storage and tell our customer the good news :)
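For completeness, a minimal sketch of the kind of rsync command I'd use for such a copy (the destination path is hypothetical):

# /media/tempstorage is a hypothetical mount point - use whatever temporary storage you have
root@debian:~# rsync -avH --progress /mnt/ /media/tempstorage/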
Afterwards, the DiskStation was reset and filled with three new drives plus the old one (smartmontools didn't complain about sda, so we kept it). The new RAID array was configured in RAID6 mode to survive a two-disk failure scenario (4 x 2TB = 3.73TB usable).
Comments
Eric
Tue, 07/12/2016 - 21:01
Sector 9453290
I've had two incidents this month with my Syno 1513+, both losing 2 drives at the same time.
Checking bad sectors on all of these drives indicates first LBA is the same sector as your issue - 9453290.
Awfully strange if it is coincidence. I'm seeing other references to the same sector in other searches - not sure if it is the WD Red head parking issue or a Synology issue yet.