Saturday, January 29, 2011

How does RAID detect a faulty HD?

I have been reading about RAID levels for the past three days and weighing up the pros and cons of hardware versus software RAID controllers. I understand that RAID is not a backup solution, and I'm perfectly fine with that, though one question still remains.

How does a RAID controller, at any level from RAID 1 to RAID 6, actually detect that a hard disk drive is failing? The research I have done shows that most hard disk drive manufacturers use ECC in their drive designs, which is supposed to protect against 1-bit failures and, to an extent, 3-bit failures.

Thinking about this, let's say you have RAID 1 with two identical hard disk drives. Data is read from drive 0 and, at the same time, from drive 1, but drive 1 reports an ECC read failure to the RAID controller.

Now this is the big question: with hardware RAID, what would the RAID controller do? It has received a signal from the hard disk that the read failed. It could report the hard disk drive as faulty and in need of replacement.

Or does the RAID controller seek to a different hard disk drive for the data until it gets a successful read? (Yes, a drive can report a correct read while the data is still corrupted, and RAID does not check parity or ECC on read.)

  • The answer to the question is going to depend greatly on the RAID controller manufacturer and how they implemented error/failed drive detection.

    Chad : The trouble is, I cannot find proper documentation for any RAID controller on how it does this! It's really frustrating trying to find documentation on their error recovery procedures. I think they do not openly release it because of trade secrets. If you know of any RAID controllers that do tell you what happens, I would love to read their docs. For example, the hard drives I use are server grade and report read failures, but who reads this information, and how and why, is a mystery.
    womble : If they're trade secrets, then publishing them on this site would be a very stupid thing to do.
    Zypher : The exact algorithms they use are definitely going to be "secret sauce".
    Bart Silverstrim : Publishing secrets doesn't matter. If a competitor wants to know about it they just reverse engineer it. Pepsi knows very well how to make Coke, and Coke can make Pepsi if they wanted. They don't because there's no point in Coke making an exact duplicate of Pepsi. Same with RAID cards...why make a card that's already out there? Make your own and make it perform better and be more reliable. If it can be proven they reverse engineered anything or "stole" code (which would probably be copyrighted), they'd be sued into oblivion and skewered in public opinion anyway.
    From Zypher
  • There are various methods by which a RAID implementation can assess the "health" of a disk (SMART, SCSI "Check Condition" and "Sense Key" messages), but I'm not aware of any published "standard" as to how RAID implementations should act on them. The specific steps that each make and model of RAID controller firmware (or, for that matter, a software RAID implementation in an OS) uses are going to vary depending on the manufacturer's design.

    All hard disk drives use error correcting codes (ECC) today. At the data densities we're working at, bit errors are just a fact of life. Unrecoverable read errors are what matter to a RAID controller. At the level you're interested in, you'd have to have the design specs for both the RAID controller and the drive firmware to really understand how media errors would be reported up the device stack to the OS, and ultimately the user.

  • Implementation is entirely up to the manufacturer. They could use any mix of tools: calculating parity as data is written to the drive and flagging a possible issue if it's wrong, watching hard disk status if there's onboard SMART, reading errors straight from the drive, checking whether multiple errors point to a particular drive, and so on.

    I've had a controller that didn't KNOW there was an issue with a drive. We had a three-drive RAID 5 where one disk completely failed. We installed a new drive, and in the process of rebuilding, one of the good disks upchucked an unrecoverable read error, which is more and more of an issue as drives get bigger and manufacturers allow a certain number of these in the manufacturing process. End result? Rebuild from bare metal backup. So when you ask how the controller "knows" the drive is bad, it doesn't necessarily know.

    In other words, RAID controllers just do the best they can. They still fail.

    The end result is that RAID controllers usually simplify your setup by abstracting the work from the software, they offload processing power to dedicated hardware, and they add (usually) some better support for telling the end user which drive is bad (through software tools and/or blinky lights) so you don't have to guess which one is bad.

    Software RAID is integrated with the OS, it's far far cheaper, and it's just about as reliable now (if you're talking about Linux especially) and nearly as speedy (in some cases, faster). It also doesn't need special drivers unlike many controllers. If you use a high-end card it'll probably perform better but for most home-grade RAID they tend to be comparable in speed.

    If you're talking about motherboard RAID, it's not really RAID. It is a crappy version of software RAID, and it makes it nigh impossible to recover data if your motherboard goes south because often they're vendor-specific in how they mess with data on the drive. I've had cases where a system failed and you couldn't take the drive from the array to another system to recover data from.

    Overall, unless you're talking RAID for servers in a business or have really specialized needs, software RAID is probably on par with hardware RAID for 90% of what home users would use it for.
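
To put a rough number on the rebuild failure in the three-drive RAID 5 story above, here is a minimal back-of-the-envelope sketch. Everything numeric in it is an assumption rather than something stated in the thread: the unrecoverable read error (URE) rate of one per 10^14 bits is a figure often quoted on consumer drive spec sheets, the 2 TB drive size is hypothetical, and treating bit errors as independent is a simplification. The function name `rebuild_failure_probability` is made up for illustration.

```python
import math


def rebuild_failure_probability(bytes_to_read: float,
                                ure_rate_bits: float = 1e14) -> float:
    """Chance of at least one unrecoverable read error while reading
    `bytes_to_read` bytes, assuming independent bit errors at a rate of
    one per `ure_rate_bits` bits (an assumed spec-sheet style figure)."""
    bits = bytes_to_read * 8
    # P(no error) = (1 - 1/rate) ** bits; log1p/expm1 keep precision
    # when the per-bit error probability is tiny.
    return -math.expm1(bits * math.log1p(-1.0 / ure_rate_bits))


# Hypothetical array: three 2 TB drives in RAID 5, one has failed.
# The rebuild must read both surviving drives in full (4 TB).
surviving_bytes = 2 * 2e12
print(f"{rebuild_failure_probability(surviving_bytes):.1%}")
```

Under those assumptions the chance of hitting a URE somewhere during the rebuild comes out at roughly 27%, and it grows with drive size, which is in line with the point that this becomes more of an issue as drives get bigger.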

  • I asked a NetApp engineer who was giving us a talk this very question. His answer, more or less, was:

    Nobody reads the checksums on reads. There's no point. Reading a checksum means you have to read the entire slice plus checksum, then compute the checksum to verify you have the correct data. Plus the orthogonal checksum if you are running RAID-6 or whatever. It is a total performance killer because it breaks the ability to randomly seek to totally different sectors on different disks at the same time. Similarly, almost nobody reads both sides of a mirror in RAID-1 because if you only read one side you can alternate which side of the mirror you read from so that you get faster throughput, and if you suddenly have a mismatch, which disk do you take as correct and which do you take as broken? All modern RAID systems depend on the on-disk controllers to signal the RAID controller that they are in distress (through SMART or the like), at which point that disk is almost always kicked out of the array. Checksums are used for rebuilding arrays, not for read-verification.

    Chad : Thanks, exactly what I needed to know. RAID is not a real backup solution, so the only conclusion I can come to is to use RAID purely for the performance boost and not for data recovery or fault tolerance. If it does recover from a hard disk drive failure, great, but a pure backup and restore solution is better for fault tolerance.
    John Gardeniers : Wrong, Chad. The real purpose of RAID is to provide redundancy. Redundant Array of Independent (originally it was Inexpensive) Disks. Performance boosts are an added bonus for some configurations.
    Chad : John, there is no point talking about redundancy if you cannot guarantee that the data you're reading from the drive is correct. I see no redundancy in RAID: if it does recover then it's all good, but you have no validation of whether the data you're reading is corrupt or has been silently corrupted by a faulty hard disk drive. So what I'm saying is that redundancy is useless if you cannot guarantee integrity, something that RAID does not provide. So you're better off using it for performance and using backups that provide integrity.
    Chad : Though correct me if I'm wrong: you can have all the redundancy in the world, but it could all be corrupt, so it becomes worthless.
    Helvick : It could always be corrupt but the redundancy gives you additional confidence that you can support service continuity to meet a target SLA. Nobody will give you 100% guarantees, what you get are better guarantees for higher costs and you need to balance those. No matter what your business continuity targets and mechanisms are you will still need a disaster recovery plan and a mechanism to selectively restore (or recover from archive) that can deal with the situation(s) when redundancy isn't enough or doesn't provide the service you need.
    David Mackintosh : Chad, you can never make guarantees. There are always possible causes of errors -- cosmic rays, phantom writes, meteor showers, maintenance staff plugging vacuums into the wrong circuits, whatever. What RAID-1, -5, -6, -10 do is increase the possibility that for some errors you will be able to recover from it without losing data and (depending on the controller) without any downtime. RAID is not a backup.
    Chad : Thank you Helvick and David Mackintosh for leaving the comments. You two are correct, and I was wrong. I rang up a server professional to discuss RAID and he said exactly the same thing: RAID is redundancy as a fallback that keeps the system running, as a slower service, while you properly restore it. It was a larger discussion than just RAID, though, and went into backup and restore and long storage durations.
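
The behaviour described in the last answer (read only one side of a mirror, rely on the drive to report its own read errors, kick the drive out when it does, and use parity only for rebuilding) can be sketched roughly as follows. This is a toy in-memory model, not how any particular controller works: the classes `Disk` and `Mirror`, the exception `DriveReadError`, and the helper `xor_blocks` are all hypothetical names, and real firmware is, as the answers above note, vendor-specific and mostly undocumented.

```python
from functools import reduce


class DriveReadError(Exception):
    """Stands in for a drive reporting an unrecoverable read error."""


class Disk:
    """Toy disk. `bad_blocks` simulates sectors the drive's own ECC cannot fix."""

    def __init__(self, blocks, bad_blocks=()):
        self.blocks = list(blocks)
        self.bad_blocks = set(bad_blocks)

    def read(self, lba):
        if lba in self.bad_blocks:
            raise DriveReadError(lba)
        return self.blocks[lba]  # returned as-is; nothing verifies the content


class Mirror:
    """RAID-1 style reads: one side per read, fall back only on a reported error."""

    def __init__(self, disks):
        self.active = list(disks)
        self.kicked = []
        self._next = 0

    def read(self, lba):
        while self.active:
            disk = self.active[self._next % len(self.active)]
            try:
                data = disk.read(lba)
            except DriveReadError:
                # The drive signalled distress itself: drop it and try another.
                self.active.remove(disk)
                self.kicked.append(disk)
                continue
            self._next += 1  # alternate sides on the next read
            return data
        raise DriveReadError(f"no healthy mirror member could read block {lba}")


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Parity is computed on write and only needed again to rebuild a lost block.
d0, d1 = b"RAID", b"data"
parity = xor_blocks([d0, d1])
rebuilt_d1 = xor_blocks([d0, parity])  # d1 recovered without the failed disk
```

Note what the sketch does not do: `Mirror.read` never compares the two sides, so a disk that returns wrong data without raising an error goes unnoticed. That is the silent-corruption gap debated in the comments, and it is why the thread ends up at "RAID is not a backup".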
