Recovering RAID array on Synology DiskStation DS413j

I received a dead Synology DiskStation DS413j with some corporate files on it. Should I mention that those files were important?! And yeah, they didn't have backups...

Initially, after plugging in the Ethernet cable and power cord, I wasn't able to boot it up. It got stuck with a blue ring of death (the main blue LED kept flashing the whole time, and the hard disk LEDs stayed off). Hm... I logged in to my home router to see if there was a DHCP lease for the DiskStation, but found nothing. It didn't look promising...

Afterwards I decided to pull the disks out of the station and boot it up without them. In the meantime, I found on the official Synology site that the find.synology.com URL locates your device on the local network. Wow, it did the trick: the DiskStation was found and almost booted up. Synology Web Assistant complained about the no-disk configuration and required the device to be filled with disks so that DiskStation Manager (DSM) could be installed... Chicken and egg... Argh :/

Regardless of that, I tried to recover the data by plugging the disks into a spare machine with enough free SATA ports. The DiskStation had been delivered to me with 3 x 2TB disks. After plugging in the disks, I booted the spare machine from a Debian live distribution on a USB stick.

Some additional packages needed to be installed.

root@debian:~# apt-get update
root@debian:~# apt-get install lvm2 mdadm smartmontools hdparm

Whenever I plug new hardware into a machine, I usually run the dmesg command.

root@debian:~# dmesg

At this point we knew that the disks were in a RAID configuration, and we wanted to know more about it.

root@debian:~# cat /proc/mdstat
root@debian:~# lsblk

There was an inactive md2 device (md stands for multiple device, also known as a Linux software RAID device) and two disks, sda and sdb... Err, where was the third one... sdc, where were you?
After 30 minutes of connecting and reconnecting it to a few other computers and controllers, I gave up... R.I.P. sdc! Anyway, two disks should have been enough to recover the RAID array, except that one of them was failing...
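
To pick the inactive arrays out of /proc/mdstat without reading it by eye, something like the sketch below can help. The function name inactive_arrays and the sample text are made up for illustration (the real output from this machine was not saved); the only assumption is the usual "mdN : inactive ..." line layout.

```shell
# Print the names of md arrays that the kernel reports as inactive.
# Reads mdstat-formatted text on stdin, e.g.: inactive_arrays < /proc/mdstat
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Hypothetical sample, NOT the output captured during this recovery.
sample='Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : inactive sda3[0](S) sdb3[1](S)
unused devices: <none>'

printf '%s\n' "$sample" | inactive_arrays
```

On the sample this prints md2, the only array whose state field reads "inactive".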

OK. So far so good. We focused on the md2 device.

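First try to access/mount the md2 device (mdadm -Asf is short for --assemble --scan --force; vgchange -ay activates any LVM volume groups found on top of it):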

root@debian:~# mdadm -Asf
root@debian:~# vgchange -ay

Ahh, there was an I/O error on disk sdb at sector 9453290...

* Unfortunately, at that moment I didn't copy/paste the command output from the spare machine, because I had no idea that a blog post would come out of it. But I took a picture with my phone (for the customer report).

It looked like we had some errors... Then came smartmontools. From their site: "The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART) built into most modern ATA and SCSI harddisks. In many cases, these utilities will provide advanced warning of disk degradation and failure". Exactly what we were looking for.

To show disk information we used:

root@debian:~# smartctl -a /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z1E3X2B2
LU WWN Device Id: 5 000c50 05074dd05
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec 21 14:37:04 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  592) seconds.
Offline data collection
capabilities:              (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 228) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x3085)    SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       218846716
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9900
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       12894131789
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       19150
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   091   091   020    Old_age   Always       -       9504
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       468
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   047   045    Old_age   Always       -       26 (Min/Max 23/26)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   096   096   000    Old_age   Always       -       9319
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       11542
194 Temperature_Celsius     0x0022   026   053   000    Old_age   Always       -       26 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19102h+56m+20.786s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       663488264
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       9144747847

SMART Error Log Version: 1
ATA Error Count: 960 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 960 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:03.258  READ DMA
  c8 00 08 00 00 90 e0 00      00:02:03.250  READ DMA
  27 00 00 00 00 00 e0 00      00:02:03.249  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:02:03.242  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:02:03.233  SET FEATURES [Set transfer mode]

Error 959 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:03.258  READ DMA
  c8 00 08 00 00 90 e0 00      00:02:03.250  READ DMA
  27 00 00 00 00 00 e0 00      00:02:03.249  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:02:03.242  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:02:03.233  SET FEATURES [Set transfer mode]

Error 958 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:02.864  READ DMA
  c8 00 08 e0 3e 90 e0 00      00:02:02.842  READ DMA
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  c8 00 08 00 00 00 e0 00      00:02:02.841  READ DMA

Error 957 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:02:02.864  READ DMA
  c8 00 08 e0 3e 90 e0 00      00:02:02.842  READ DMA
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  25 00 08 ff ff ff ef 00      00:02:02.841  READ DMA EXT
  c8 00 08 00 00 00 e0 00      00:02:02.841  READ DMA

Error 956 occurred at disk power-on lifetime: 19150 hours (797 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e8 3e 90 e0 00      00:01:51.614  READ DMA
  27 00 00 00 00 00 e0 00      00:01:51.613  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:01:51.605  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:51.597  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:51.597  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

There really was an error at sector 9453290.

40 51 00 ea 3e 90 00  Error: UNC at LBA = 0x00903eea = 9453290
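
The LBA in that line can be rebuilt by hand from the register dump. In this 28-bit LBA format, SN, CL and CH hold the low, middle and high bytes of the address, and the low four bits of DH hold the top bits. A small sketch (lba_from_regs is a name I made up here):

```shell
# Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes (hex, no 0x).
lba_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from the error entry: ER ST SC SN CL CH DH = 40 51 00 ea 3e 90 00
lba_from_regs ea 3e 90 00    # prints 9453290
```

That is 0x90 * 65536 + 0x3e * 256 + 0xea = 9437184 + 15872 + 234 = 9453290, the same sector the kernel complained about.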

If we take a deeper look at the smartctl output, we can see that the disk had more bad sectors:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
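
To pull just these two counters out of a long report, a filter along the following lines should do. It is only a sketch: bad_sector_counts is a made-up name, and it assumes the attribute table keeps the column layout shown above (ID in column 1, name in column 2, raw value in column 10).

```shell
# Print "name raw_value" for the pending/uncorrectable sector attributes.
# Reads the attribute table on stdin, e.g.: smartctl -A /dev/sdb | bad_sector_counts
bad_sector_counts() {
    awk '$1 == 197 || $1 == 198 { print $2, $10 }'
}

# The two rows quoted above, fed in as a here-document.
bad_sector_counts <<'EOF'
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
EOF
```

A non-zero raw value on either attribute means the drive knows about sectors it could not read.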

With hdparm I was able to read specific sector(s).

root@debian:~# hdparm --read-sector 9453290 /dev/sdb 

/dev/sdb:
reading sector 9453290: FAILED: Input/output error

Then we wanted to see if there were more bad sectors close to sector 9453290.

root@debian:~# hdparm --read-sector 9453289 /dev/sdb 

/dev/sdb:
reading sector 9453289: succeeded
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff

This one looked OK.

root@debian:~# hdparm --read-sector 9453291 /dev/sdb

/dev/sdb:
reading sector 9453291: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453292 /dev/sdb

/dev/sdb:
reading sector 9453292: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453293 /dev/sdb

/dev/sdb:
reading sector 9453293: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453294 /dev/sdb

/dev/sdb:
reading sector 9453294: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453295 /dev/sdb

/dev/sdb:
reading sector 9453295: FAILED: Input/output error
root@debian:~# hdparm --read-sector 9453296 /dev/sdb 

/dev/sdb:
reading sector 9453296: succeeded
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff

We found exactly 6 bad sectors (9453290 - 9453295), which matches the raw value of 6 that smartctl reported for attributes 197 and 198.
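
Typing one hdparm call per sector gets old quickly, so the same probing can be sketched as a loop. Everything named here is my own invention for illustration: the probe_sectors function and the HDPARM variable, which only exists so the loop can be tried against a stand-in command. It relies on hdparm printing "succeeded" for a good read, as in the output above.

```shell
# Probe every sector from $2 to $3 on device $1 and report which ones read back.
# HDPARM is a hypothetical override so the loop can be tried against a stand-in.
probe_sectors() {
    dev=$1 lba=$2 last=$3
    while [ "$lba" -le "$last" ]; do
        # hdparm prints "reading sector N: succeeded" on a good read.
        if "${HDPARM:-hdparm}" --read-sector "$lba" "$dev" 2>&1 | grep -q 'succeeded'; then
            echo "$lba ok"
        else
            echo "$lba FAILED"
        fi
        lba=$((lba + 1))
    done
}

# Usage on the real disk (as root): probe_sectors /dev/sdb 9453289 9453296
```

The block only defines the function, so nothing touches a disk until you call it yourself.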

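Because we are root and we have the power, we forced the drive to deal with each bad sector (rewrite it in place or remap it to a spare) by simply issuing a write command to it. Keep in mind that hdparm --write-sector overwrites the sector with zeros, so whatever was stored in those six sectors is gone for good.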

root@debian:~# hdparm --write-sector 9453290 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453290: succeeded
root@debian:~# hdparm --write-sector 9453291 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453291: succeeded
root@debian:~# hdparm --write-sector 9453292 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453292: succeeded
root@debian:~# hdparm --write-sector 9453293 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453293: succeeded
root@debian:~# hdparm --write-sector 9453294 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453294: succeeded
root@debian:~# hdparm --write-sector 9453295 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 9453295: succeeded
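
Since a sector write destroys whatever was stored there, a careful way to handle a range is to generate the commands first and only run them after reading the list. The sketch below does exactly that and nothing more; print_rewrite_cmds is a hypothetical helper name, and the flags are the two real hdparm options used above.

```shell
# Print (do not run) one hdparm write command per sector in the range $2..$3.
# Review the list, then run the lines in a root shell only when you are sure.
print_rewrite_cmds() {
    dev=$1 lba=$2 last=$3
    while [ "$lba" -le "$last" ]; do
        echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $dev"
        lba=$((lba + 1))
    done
}

print_rewrite_cmds /dev/sdb 9453290 9453295
```

This prints six command lines, one per bad sector, and never calls hdparm itself.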

We tried to access our md2 device again

root@debian:~# mdadm -Asf && vgchange -ay

And finally we were able to mount it

root@debian:~# mount -o ro /dev/md2 /mnt
root@debian:~# ls /mnt
@appstore     aquota.user    @database  homes  @S2S    @syncd.core   @tmp
aquota.group  @cloudstation  @eaDir   share  @spool  synoquota.db

Well, that was it! I was able to copy/rsync the restored data to some temporary storage and give our customer the good news :)

Afterwards, the DiskStation was reset and filled with 3 new drives plus one old one (smartmontools didn't complain about sda, so we kept it). The new array was configured in RAID 6 mode to survive a two-disk failure scenario (4x2TB = usable 3.73TB).
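
For the curious, the usable size can be sanity-checked with a bit of shell arithmetic. This is a rough sketch under two assumptions of mine: all four disks have the same capacity smartctl reported for sdb, and the space DSM reserves for its own system partitions is ignored. The function name raid6_usable_bytes is made up.

```shell
# Usable bytes of a RAID 6 array: two disks' worth of space goes to parity.
raid6_usable_bytes() {
    disks=$1 bytes_per_disk=$2
    echo $(( (disks - 2) * bytes_per_disk ))
}

# 4 disks of 2,000,398,934,016 bytes each (the capacity smartctl reported for sdb).
usable=$(raid6_usable_bytes 4 2000398934016)
echo "$usable bytes"                      # 4000797868032 bytes
echo "$(( usable / 1073741824 )) GiB"     # 3726 GiB
```

So the 3.73 figure most likely comes from those roughly 3726 GiB, i.e. the two data disks' worth of space counted in binary units.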


Comments

I've had two incidents this month with my Syno 1513+, both losing 2 drives at the same time. 

Checking bad sectors on all of these drives indicates first LBA is the same sector as your issue - 9453290. 

Awfully strange if it is coincidence. I'm seeing other references to the same sector in other searches - not sure if it is the WD Red head parking issue or a Synology issue yet. 
