Monday, July 13, 2009

Thinking of RAID on a System?

Over the years I've played with several different RAID configurations in a couple different environments.

Being paranoid about having my system available to me, I've used hardware RAID mirroring (RAID 1) on my desktop, while at work I've had the honor of using Dell PERC controllers for RAID 5 arrays (three drives striped with parity into one volume).

For a quick overview on RAID, check out this link. Basically, it's a way of keeping your computer running, without losing data, when a hard disk fails.

Only it doesn't always work this way.

See, back when I first started playing with RAID there was a debate between software RAID (where drivers in the OS handled the redundancy) and hardware RAID (where you purchased a controller card and installed it in your computer so that the operating system...Windows, Linux...wasn't aware of the RAID setup at all beyond what brand of controller you had; you went into a boot menu to configure the array, let the card do its thing to the drives, then installed your operating system).
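
Today on Linux the software side is handled by mdadm. Here's a rough sketch of what setting up a mirror looks like (the device names and mount point are examples, not gospel...adjust for your own system, and note that the create step wipes the drives):

    # Build a two-disk RAID 1 mirror out of sdb and sdc
    # (WARNING: this destroys any existing data on both drives)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # Then treat /dev/md0 like any single drive
    mkfs.ext3 /dev/md0
    mount /dev/md0 /mnt/mirror

No controller boot menu required; the kernel just presents the array as one more block device.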

The debate was over speed and reliability. Software RAID was evolving. Hardware was more reliable.

Today the landscape has evolved.

In my home systems I've used 3Ware cards for mirroring drives. They added a LOT to the cost of my computers, but I've not had any problems with them in use; in fact, the first one I had managed to save a computer so old I had relegated it to my daughter's use (and later to test-server duty). My daughter didn't even realize a drive had failed...I don't know how many weeks (or months) the array ran crippled before I was in her room and saw an error come up about the volume being degraded.

The biggest con against using them? Cost. Adding these cards can cost hundreds of dollars more on a system. Check out the cost of a 3Ware card sometime to find out.

At work we have a lot of Dell servers with PERC cards. These are rack-mounted machines built for server duty; you would expect them to perform very well. Here's a fun story to share.

We had a server that hundreds of users relied on as a file server (and home directory server). It used RAID 5, meaning three drives appear as one volume with parity spread across them for redundancy...check out the previous link for more information, but it basically means that when one drive goes bad, an alarm goes off and the server keeps chugging away on the remaining two drives. You pull the bad drive, install a new one, and the RAID card rebuilds the data on the fly. You never even turn off the server and the users are never aware of the problem...you literally swap the dead drive for a new disk and the server is supposed to recover without a hiccup.
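
For comparison, the same swap-and-rebuild dance on Linux software RAID looks roughly like this (a sketch with example device names; a hardware controller does the equivalent behind its own menus):

    # Tell the array the dying drive has failed, then remove it
    mdadm --manage /dev/md0 --fail /dev/sdc
    mdadm --manage /dev/md0 --remove /dev/sdc

    # After physically swapping the disk, add the replacement;
    # the rebuild kicks off automatically
    mdadm --manage /dev/md0 --add /dev/sdc

    # Watch the rebuild progress
    cat /proc/mdstat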

I say supposed to because we had a drive fail and it didn't recover.

Dell provides tools (for Windows) that let you monitor the status of the RAID array. We put in the new drive; the rebuild would start, then stop with an error.

After many retries (and much anxiety...remember, if one more drive died, we'd lose the server) we found out, using the tools on the RAID card itself (meaning we had to shut down the server and use the controller's boot menu to check the disks), that one of the other drives had a single bad block on it that neither the controller nor the operating system had ever flagged.

In other words, we had drives A, B, and C. C failed. We replaced C with a blank drive, and in the process of rebuilding data from A and B onto drive C we discovered that drive B had one bad block on it that nothing had noticed until now.

And because of that little problem the array couldn't be rebuilt.

Crap crap crap.

In the end we replaced both B and C, then rebuilt the whole server bare-metal from a backup that was a few days old. Fortunately we lost very little data, but we did lose quite a bit of time and left a number of people upset because they couldn't get to their home directories for a couple of days. Ouch!
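
The moral: bad blocks can lurk unread for months and only surface when a rebuild forces every sector to be read. On Linux software RAID you can force that full read yourself on a schedule (a "scrub") so problems turn up while you still have redundancy left to repair them. A sketch, assuming an array at /dev/md0:

    # Kick off a full read/verify pass over the whole array
    echo check > /sys/block/md0/md/sync_action

    # Watch progress, then see whether any inconsistencies turned up
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt

(Debian and Ubuntu's mdadm packages even ship a cron job that runs a check like this monthly.)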

So anyone telling you hardware RAID is a panacea...they're lying.

Some other considerations regarding RAID...
  1. If a computer breaks (and it's not the drive electronics), you can sometimes get data off by sticking the drive into another computer. Hardware RAID sometimes throws a wrench into the mix because the controller adds a signature to the drives so it knows which one is which when you swap drives (hot-swap or not). That means you can't just pull a drive and stick it into another computer to get data off, unless you have identical hardware available to stick the drives into. This is obviously true if your data is striped (like in RAID 5); you'd think that with mirroring (RAID 1) you could do it. Not always.
  2. Some systems today come with "hardware RAID" on the motherboard. They're crap. It's not really hardware-based so much as software; it's a chip with some firmware that handles a pseudo-RAID implementation while offloading the parity and checksum work onto your computer's main processor. See this article and this one for a few other opinions. These systems with RAID on the motherboard are also referred to as "fake RAID".
  3. "Fake RAID" also can be system-dependent. In other words, if you're running RAID with the onboard RAID chip and the motherboard dies, you lose your data because of the way the drives are formatted to work with that motherboard. You need to replace it with an identical (or fortunately similar) motherboard to access the data again.
  4. RAID is not a backup. It's there to protect access to a system in the event of drive failure. You still need to back up important data to another set of media.
  5. Software RAID in Windows and Linux increases your odds of recovering data if there's an issue with the system, while hardware RAID will often tie your drives to that computer (there's a sketch of moving a software array to a new machine after this list). Today the performance of software RAID can meet or exceed hardware performance in many real-world use cases; see this article for some older (2004) numbers comparing Linux software RAID against a 3Ware controller.
  6. RAID means moving a single point of failure from one device to another. People seem to forget that. RAID protects you from a hard disk failure, but if your RAID controller dies, your data dies (unless you have a good backup). The only way to stop that is to have two RAID controllers. Then your motherboard becomes a point of failure.
  7. RAID is useless without monitoring tools so you know whether the blasted thing is working. Here's another tradeoff: tools for Linux software RAID tend to still be cryptic, but if you're setting up an administrative environment, you can usually configure them to email or alert you if there's a problem (see the monitoring sketch after this list). But it's cryptic. Not user friendly. Takes time to learn the ins and outs. Windows software RAID is as cryptic as Windows usually is. Hardware solutions usually mean running a proprietary tool specific to your controller from that manufacturer...more vendor lock-in. If you want flexibility you'll probably not get the pretty tools.
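
On point 5, the portability is real because a Linux md array writes its own metadata onto the member drives. Moving the disks to a new machine looks something like this (again a sketch; the device names are examples):

    # On the new machine: peek at the RAID metadata on the drives...
    mdadm --examine /dev/sdb /dev/sdc

    # ...then let mdadm find and assemble the array on its own
    mdadm --assemble --scan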
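
And on point 7, here's roughly what the cryptic-but-workable monitoring looks like for Linux software RAID (the email address is a placeholder):

    # Quick health check: every array, its state, rebuild progress
    cat /proc/mdstat

    # Run mdadm as a monitoring daemon that mails you on failures
    mdadm --monitor --scan --daemonise --mail=admin@example.com
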
My next system for home data storage will probably use software RAID. I just can't justify the cost of hardware RAID anymore.

I have 3Ware-type cards in some of my systems now, but I've been bitten by the proprietary nature of the vendor. I loved the cards; they worked well. But that really old system I mentioned? I recently installed Ubuntu 8.10 on it, then set about installing the 3Ware monitoring tool, a little web-based thing that gives you the graphical status of your card's logs and configuration. I can't get it to work because they don't support the newer kernels, and because the card is old (but perfectly usable!) they aren't planning on adding support either. I'd need a brand new card, which is complete overkill for this old thing.

I've also been bitten while repurposing an old Dell server with a PERC card, running FreeBSD and later Linux; I can't get any tools to monitor the RAID status. I have to either reboot into the card's controller or wait until the status light on the front of the server starts going nuts.

Hardware RAID can give high-end configurations a bit of an edge in performance and, properly used, a definite edge in monitoring: it's nice to have the drives physically numbered on the cable ports so you know which is drive 1 when drive 1 fails, or to have blinking status lights on the controller or disk telling you, "Hey! I'm broken!!" But the extra cost isn't really worth it for most people's situations.

Software RAID is, to be sure, not easy for a beginner to implement (at least not on Linux), as you can imagine from this example. Once running, though, it should be relatively portable, recoverable, and at least usable.

The good news is that these tools continue to evolve and become friendlier. Today's Ubuntu installer walks you through configuring software RAID at setup time, making it far easier than before for people who aren't deep in the arcana of system administration to add that kind of support. Okay, maybe it's still not all that friendly. But as a longtime Linux user, believe me when I say it's come leaps and bounds in friendliness.

Are there any other points I'm missing that should be included here regarding RAID? Feel free to comment and let me know...
