HDD diagnosis and recovery tools

Linux offers, for free, all the tools needed to diagnose a failing hard drive and recover data from it, as long as it can still be read. In this post I'll cover, in a general way, some tooling options for repairing or extracting data from a problematic hard drive.

For those who cannot be bothered to roll their own custom recovery image, SystemRescueCd has most of the needed tools built in. Note that there are other equally capable custom distro images available for this very purpose. A spare USB key with enough storage to hold the distro's image and a quick use of dd are all that is required to get going.
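As a rough sketch of that dd step, the helper below wraps the write in a function with a dry-run switch so the command can be inspected before anything gets overwritten. The run and write_image names, the DRY_RUN variable and the rescue.iso and /dev/sdX paths are placeholders of mine, and status=progress assumes GNU dd.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# write_image IMAGE DEVICE
# Writes IMAGE byte-for-byte onto DEVICE, destroying whatever was on it.
# conv=fsync flushes the data to the key before dd exits.
write_image() {
    run dd "if=$1" "of=$2" bs=4M conv=fsync status=progress
}

# Usage (as root; identify the key with lsblk first):
#   DRY_RUN=1 write_image rescue.iso /dev/sdX   # show the command only
#   write_image rescue.iso /dev/sdX             # actually write
```

Double-check the target device name before the real run: dd will happily overwrite the wrong disk.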

For this post I'm assuming the reader is terminal-literate and has at least some understanding of the Linux filesystem. As most examples are generic, they will need to be adapted where appropriate to fit the situation.

1. Health of the drive (S.M.A.R.T.)

The first thing to do on a seemingly failing hard drive is to check if it actually is. Thankfully, on modern-ish drives, it is now possible to do just that by looking up the S.M.A.R.T. info of the device which indicates its general health and usage stats.

Enter smartctl (part of the smartmontools package) which can print all that information for you.

smartctl -a /dev/device

The resulting output should give a clue as to what the problem is. If the device's health looks good then the problems will likely stem from data corruption (see next section).

If a high number of reallocated sectors is reported, it would be wise to replace the drive preemptively. It's usually a sign that the drive is on its way out.
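To pull just the reallocation count out of the report, something like the following could work. It is a minimal sketch that assumes an ATA drive and the usual ten-column attribute table printed by smartctl -A (NVMe drives report differently); the reallocated_sectors name and /dev/sdX are placeholders.

```shell
# Print the raw value of the Reallocated_Sector_Ct attribute (ID 5)
# from `smartctl -A` output supplied on stdin.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
reallocated_sectors() {
    awk '$1 == 5 && $2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (as root; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | reallocated_sectors
#   smartctl -H /dev/sdX     # the drive's own overall health verdict
```

A raw value that keeps growing between checks is the thing to watch for.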

2. Good drive, Bad data

2.1 File system checker

The fsck (file system consistency check) utility can identify and correct integrity errors in Unix/Linux file systems. Just make sure the target partition is NOT mounted when subjecting it to fsck.

If the drive/partition is encrypted, it will need to be decrypted first (see section 2.4.1).

fsck -CVr /dev/device
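Since running fsck on a mounted partition is the main thing to avoid, a small guard can be put in front of it. This is only a sketch: is_mounted is a name I made up, it compares the device path literally against the first field of /proc/mounts, and so it may miss devices mounted under another name (e.g. a /dev/mapper or by-uuid path).

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds when DEVICE appears as a mount source in MOUNTS_FILE
# (defaults to /proc/mounts, where the first field is the source device).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage (as root; /dev/sdX1 is a placeholder):
#   if is_mounted /dev/sdX1; then
#       echo "unmount /dev/sdX1 first" >&2
#   else
#       fsck -CVr /dev/sdX1
#   fi
```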

2.2 Recovery from a formatted hard drive

So you formatted your drive by mistake or something obliterated the partition table... The good news is that it can be recovered from in most cases (unless data was written on top).

For this predicament testdisk can come to the rescue. It is an interactive utility that offers many useful things (excerpt taken from the cgsecurity site):

  • Fix partition table, recover deleted partition,
  • Recover FAT32 boot sector from its backup,
  • Rebuild FAT12/FAT16/FAT32 boot sector,
  • Fix FAT tables,
  • Rebuild NTFS boot sector,
  • Recover NTFS boot sector from its backup,
  • Fix MFT using MFT mirror,
  • Locate ext2/ext3/ext4 Backup SuperBlock,
  • Undelete files from FAT, exFAT, NTFS and ext2 filesystem,
  • Copy files from deleted FAT, exFAT, NTFS and ext2/ext3/ext4 partitions.

Without going too much into details this saved my arse a few years ago when I accidentally started formatting one of my archival drives. Definitely a tool worth knowing about and learning how to use!

2.3 Data recovery tools

Aside from testdisk mentioned above, there are some other tools available that cover different domains.

foremost is a utility that can recover files based on their header/footer and internal structure (a.k.a.: "file carving"). It supports a number of file types out-of-the-box but more can be added.

A file header/footer is just a set of bytes identifying the file as being of a certain type. This information is useful when trying to find files in a raw data source that carries no information about the structure of its content. E.g.: 'jpg' files start with ff d8 ff e0 and end with ff d9.
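The header idea can be tried by hand on a single file. The sketch below checks only the first three bytes (ff d8 ff), since the fourth byte differs between JPEG flavours (e0 for JFIF, e1 for Exif); is_jpeg is a hypothetical helper name and the paths in the comments are placeholders.

```shell
# is_jpeg FILE
# Succeeds when FILE starts with the JPEG signature bytes ff d8 ff.
is_jpeg() {
    # Dump the first three bytes as hex and squeeze out the whitespace,
    # e.g. " ff d8 ff" becomes "ffd8ff".
    magic=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "ffd8ff" ]
}

# Usage:
#   is_jpeg photo.jpg && echo "looks like a JPEG"
#
# foremost applies the same idea across a whole image or device, e.g.:
#   foremost -t jpg -i /mnt/ext-drive/hdd.img -o /mnt/ext-drive/carved
```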

scalpel, a Linux and Mac file system file recovery utility originally based on foremost, offers an alternative with its own set of advanced features.

If one of the tools doesn't return much of the lost data from a device/image it is worth trying the other ones as well as they may be more successful or help complete the recovered set.

2.4 Accessing an encrypted volume

2.4.1 Decrypting a partition/drive

If the partition or the entire drive to be checked is encrypted (let's assume with LUKS since this is for Linux) it will need to be unlocked (decrypted) first.

cryptsetup luksOpen /dev/device luks_volume

2.4.2 Mounting to access content

If the partition/drive is set up with LVM then the volumes within the encrypted partition/drive will need to be mapped. To find out if this is the case just use the vgscan command and, to activate the volume(s) so they can be mounted, vgchange -aly.

Otherwise, if LVM is *not* used, just mount /dev/mapper/luks_volume as normal using the mount command.
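Put together, the LVM-inside-LUKS case might look like the sketch below. The function names, the DRY_RUN switch, the vg0 volume group and all paths are assumptions for illustration; it uses the plain vgchange -ay activation flag and mounts read-only so nothing gets modified while poking around.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# unlock_and_mount DEVICE MAPPER_NAME LOGICAL_VOLUME MOUNTPOINT
unlock_and_mount() {
    run cryptsetup luksOpen "$1" "$2" &&   # prompts for the passphrase
    run vgscan &&                          # look for volume groups inside
    run vgchange -ay &&                    # activate their logical volumes
    run mount -o ro "$3" "$4"
}

# lock_again VOLUME_GROUP MAPPER_NAME MOUNTPOINT
lock_again() {
    run umount "$3" &&
    run vgchange -an "$1" &&               # deactivate before closing LUKS
    run cryptsetup luksClose "$2"
}

# Usage (as root):
#   unlock_and_mount /dev/sdX2 luks_volume /dev/vg0/root /mnt/recovery
#   lock_again vg0 luks_volume /mnt/recovery
```

Run lvs after activation to see which logical volumes are actually available to mount.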

3. Bad drive, Good data

3.1 Checking for bad blocks

On modern drives, S.M.A.R.T. can passively identify bad blocks on the device, but only when an error is actually reported to it. To actively check for bad blocks the badblocks utility can be used in either a non-destructive or destructive scenario.

The non-destructive method tests each block by writing to it but, on each iteration, backs up the block's data first. This way the block's data can be put back after its test is completed:

badblocks -nsv /dev/device

To create a list of all the blocks identified as bad just append -o /save/path/to/badblocks.txt to the options in the arguments.
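The saved list holds one block number per line, so a quick tally is easy to get. A small sketch, with count_bad_blocks being a helper name of my own:

```shell
# count_bad_blocks FILE
# Counts the entries in a list written by badblocks -o
# (one block number per line).
count_bad_blocks() {
    grep -c '^[0-9][0-9]*$' "$1"
}

# Usage (as root for the scan itself):
#   badblocks -nsv -o /save/path/to/badblocks.txt /dev/device
#   count_bad_blocks /save/path/to/badblocks.txt
```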

The destructive way will effectively wipe all data on the drive as it tests each block. This is great for testing new drives but not so much for ones in use. When taking this option, a full backup is required to restore everything back to its original state once the test has completed.

badblocks -wsv /dev/device

3.2 Backing up the data

So the device looks to be failing hard. What now? If a backup has not been done very recently it might be a good idea to try doing one if the content can still be accessed... This is where a spare drive comes into play.

A spare drive with at least as much free space as the size of the dying drive will be required.

Either rsync or dd could be used but they are not really the best tools for the job when the drive might just die on you mid-transfer and/or the data therein potentially has errors. A much better option for the purpose at hand is ddrescue.

ddrescue copies the data but, when read errors occur, also keeps trying to rescue the affected areas instead of giving up. Specifying a map file enables the process to be resumed if interrupted and further read attempts to be made on problematic areas of the failing device.

When dealing with a failing device, the aim should be to get as much of the good/uncorrupted data as possible in one go. Once all that data is copied safely the attention can be focused on the parts of the device that caused read errors on the first pass.

3.2.1 Cloning from failing device to new device

Make sure neither the failing device nor the new device is mounted and that the target device for the cloning is at least the same size as the failing one.

Direct device-to-device cloning should be considered when the failing device is to be replaced with an identical copy of its data, and both the data and the failing device are in good enough condition to do a full 1-1 copy without too many crippling errors.

ddrescue -df /dev/failing-device /dev/new-device /mnt/ext-usb/failing-device-map.log

3.2.2 Imaging failing device to a recovery file

Make sure that the file system where the image file is to be saved has more free space available than the size of the failing device.
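That space check can be scripted before starting a long imaging run. A minimal sketch: has_room is a hypothetical helper, the paths are placeholders, and the df --output option in the usage comments assumes GNU coreutils.

```shell
# has_room SOURCE_BYTES AVAILABLE_BYTES
# Succeeds when the free space is strictly larger than the source device.
has_room() {
    [ "$2" -gt "$1" ]
}

# Usage (as root):
#   src=$(blockdev --getsize64 /dev/sdX)
#   avail=$(df --output=avail -B1 /mnt/ext-drive | tail -n 1 | tr -d ' ')
#   has_room "$src" "$avail" || echo "not enough space on /mnt/ext-drive" >&2
```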

Cloning to an image is used for recovery purposes rather than just replacing a failing device with a fresh one. If a lot of errors are reported by ddrescue it may not be worth cloning the image into a new device. Targeted individual file/folder recovery from the image would then be a better option.

Pass #1: copying the good data:

ddrescue -d /dev/device /mnt/ext-drive/hdd.img /mnt/ext-drive/hdd-map.log

Pass #2: bad data recovery (3x attempts for each bad sector: -r3):

ddrescue -d -r3 /dev/device /mnt/ext-drive/hdd.img /mnt/ext-drive/hdd-map.log
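Between passes it can be handy to see how much has been rescued so far. The map file is plain text: comment lines start with #, the first remaining line holds the current position and status, and every line after that is "position size status" in hex, with + marking a finished block. The sketch below sums those; rescued_bytes is a name of my own, and it assumes the GNU ddrescue map file layout just described.

```shell
# rescued_bytes MAPFILE
# Sums the sizes of all blocks marked "+" (finished) in a ddrescue map file.
rescued_bytes() {
    total=0
    past_status_line=0
    while read -r pos size status rest; do
        case "$pos" in
            ''|'#'*) continue ;;       # skip blank and comment lines
        esac
        if [ "$past_status_line" -eq 0 ]; then
            past_status_line=1         # first data line: current position/status
            continue
        fi
        if [ "$status" = "+" ]; then
            total=$((total + $size))   # sizes are hex (0x...), which $(( )) accepts
        fi
    done < "$1"
    echo "$total"
}

# Usage:
#   rescued_bytes /mnt/ext-drive/hdd-map.log
```

The ddrescuelog tool that ships with GNU ddrescue can print a fuller summary of the same file (ddrescuelog -t).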

3.2.3 Mounting a recovery image

To mount a recovery image into your system (e.g.: the /mnt/recovery-img directory) there are different approaches depending on what the image is.

Mounting an image of a partition

That's the easiest. Note that the ro read-only option is passed so that the image does not get modified by mistake.

mount -o loop,ro /mnt/ext-drive/hdd.img /mnt/recovery-img

Mounting an image of a drive

When there are multiple partitions in the image of the failing drive, the offset of each must be worked out before they can be mounted. This is where the parted utility becomes useful.

parted /mnt/ext-drive/hdd.img

Then type the following in the interactive prompt to get the start/end/size of each partition in the image, in bytes:

unit B print

With this information a partition can be mounted by passing its 'start' as the $OFFSET value and its 'size' as its $SIZELIMIT value. If the partition is the last on the imaged device then the sizelimit argument can be omitted.

mount -o loop,ro,offset=$OFFSET,sizelimit=$SIZELIMIT /mnt/ext-drive/hdd.img /mnt/recovery-partX
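The offset bookkeeping can also be scripted using parted's machine-readable output (-m), where each partition is printed as number:start:end:size:... with a trailing B on the byte values. This is a sketch under that assumption; partition_mount_opts is a hypothetical helper name.

```shell
# Reads `parted -m -s IMAGE unit B print` output on stdin and prints, for
# each partition, its number followed by the matching mount option string.
partition_mount_opts() {
    awk -F: '$1 ~ /^[0-9]+$/ {
        start = $2; size = $4
        sub(/B$/, "", start)     # strip the trailing unit suffix
        sub(/B$/, "", size)
        printf "%s offset=%s,sizelimit=%s\n", $1, start, size
    }'
}

# Usage (as root for the mount itself):
#   parted -m -s /mnt/ext-drive/hdd.img unit B print | partition_mount_opts
#   mount -o loop,ro,offset=...,sizelimit=... /mnt/ext-drive/hdd.img /mnt/recovery-partX
```

On systems with a recent util-linux, losetup -fP --show -r /mnt/ext-drive/hdd.img is an alternative that exposes each partition as its own read-only loop device, which avoids the arithmetic altogether.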

4. Dead drive, Unknown data

If the drive is not getting recognised by any system then it might be time to pour out a drink and ponder the value of backups. I shan't be judging as I've been in that place before.

The only recourse left for a drive that is not recognised is to send it off to your nearest friendly data recovery place that has a "clean room". If it's a mechanical drive you may still have a chance, but with a solid-state drive it's another matter...

When the controller on an SSD goes you might as well just bin the thing or, as a last-ditch attempt, try the power cycle method (I've had very limited success with it but your experience may vary). If that somehow works through some miracle, it is advisable to use that opportunity to do a full backup before swapping the device for a new one.