Saturday, July 18, 2009

High Availability Through System Redundancy

The title of this post is quite a mouthful. If it didn't scare you away and you're something of a geek at heart though, read on.

If you're a regular follower of the blog you know that I had two types of side projects floating in my head. One was writing a story; the other, creating a saleable program.

I had started on the program but shelved it temporarily to work on some stories starting with The Writing Show's Halloween contest entries. But that didn't mean I stopped thinking about the software aspects of my interests.

I was mulling over what I would want to do for a fledgling business using a web service as its means of income. Part of that...a big one...was having the systems be available for use as much as possible, since downtime can kill your business.

The problem is that bootstrapping a business means budget...low budget. Such a small shoestring that we're talking Velcro for the shoes.

Part of that mulling is what led to my article a few days ago about RAID. It's not a total solution for high availability. What I want is for a separate computer to be ready on standby to take over if one system craps the bed, so to speak. RAID simply can't protect against a motherboard or controller failure; it's not meant to.

Solutions from big companies...if you deal with technology you've probably heard of them...can solve this problem for you. As long as you have a couple hundred thousand in your account to cover your first year.

But that got me thinking about Google. Many may not know this, but Google's hardware is all commodity parts. They just built a staggering number of systems and concentrated on tying them together with high speed switches and custom software for the filesystem and the data shuffling that occurs on their systems behind the scenes. The systems themselves aren't all that much different from what you can buy from any computer retailer.

Services can be offered using inexpensive hardware.

I did a little more digging and found the DRBD project. It's a mature Linux project that basically creates a RAID 1 array...mirroring...across the network.

What does that mean? Here's the use case I'd look at. Buy two computers. You can get decent systems for...let's say a thousand dollars. Put a small hard drive and a larger data drive in each one.

On the smaller drive, install Linux. Configure the OS, networking, etc.

On the larger drive you place your data...your web server, your database, etc.

Now install DRBD to "mirror" the large drive on machine 1 to the large drive on machine 2. Using a "primary-primary" configuration, the data drives on the computers are kept in sync with each other.
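I haven't set this up yet, but from the documentation a DRBD resource definition for that data drive would look roughly like this. The hostnames, device names, and addresses are placeholders I invented for the sake of the example:

    # /etc/drbd.conf (rough sketch) -- mirror the big data partition between the two boxes
    resource data {
        protocol C;                    # synchronous: a write isn't "done" until both nodes have it

        net {
            allow-two-primaries;       # only needed for a primary-primary setup
        }

        on server1 {                   # hostname of the first machine
            device    /dev/drbd0;      # the mirrored block device you actually mount
            disk      /dev/sdb1;       # the large data drive backing it
            address   10.0.0.1:7789;   # replication link address and port
            meta-disk internal;
        }

        on server2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7789;
            meta-disk internal;
        }
    }

One caveat I've picked up from the reading: running both nodes as primary at the same time apparently requires a cluster-aware filesystem on top of the mirrored device, so a simpler primary-secondary setup, where the standby only gets promoted on failure, may be the saner place to start.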

If there's a failure on the primary system where all your data is being accessed, you can shut it down and your data will be intact on the second system, ready to take over. In the abstract, you have two servers ready to work for you in case there's a problem with any component in your server.

Unfortunately there's more to it than that. You can implement this, sure, but it means that if your server dies, someone has to be onsite (or able to remotely access server 2) to redirect traffic manually to the new system.

To fix that you need heartbeat software. This is software that runs on the two computers and constantly chatters back and forth, usually every few seconds, just to see if the computer's "mate" is still alive. If not, the heartbeat software runs a script that tells the second server to take over, alters its IP address on the network so it is now the master system, and makes any other changes necessary in order to take over the role of the primary computer.
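For the classic Heartbeat package that seems to boil down to a couple of small text files. My rough understanding of what they'd look like, with the node names, interface, and floating IP made up for the example:

    # /etc/ha.d/ha.cf (sketch) -- how the two nodes keep tabs on each other
    keepalive 2              # send a heartbeat every 2 seconds
    deadtime 30              # declare the partner dead after 30 seconds of silence
    bcast eth1               # chatter over the dedicated second network card
    auto_failback on         # hand services back when the original node recovers
    node server1 server2     # the two machines in the pair

    # /etc/ha.d/haresources (sketch) -- what the survivor takes over
    # "server1 normally owns the floating IP and the web server; if it dies, server2 grabs them"
    server1 192.168.1.100 apache2

The 192.168.1.100 address is the "public" face of the service; clients only ever talk to that address, so they don't care which physical box happens to be answering at the moment.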

Of course this would mean that RAIDing a lot of data over the network could take a toll on performance. To that end I think this commodity hardware would need a second gigabit network card in each machine to connect the two computers with a crossover cable so they could just talk to each other; a dedicated hotline just to exchange data.
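On a Debian-style system that dedicated link could be as simple as giving the second card in each machine a private address of its own and pointing DRBD and the heartbeat traffic at it. Something like this on the first machine (addresses invented; the second machine would get 10.0.0.2):

    # /etc/network/interfaces (sketch) -- second NIC as a private replication link
    auto eth1
    iface eth1 inet static
        address 10.0.0.1
        netmask 255.255.255.0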

I've also thought about management of the system: easy backup, easy archiving, etc. I've long been a fan of virtualization for this. I can create a generic virtual computer with its own virtual hard disk device (on your computer it looks like one giant file; when you boot the virtual computer it uses this giant file as if it were a hard disk). Everything is sandboxed and secure on that fake computer.

I've loved it because once I get a computer configured to provide a specific function, for example a web server for an in-house bulletin board system, I can back it up by shutting down the virtual machine, copying the giant drive file to another location, then firing the virtual machine back up. We had a system fail, and to bring up the virtual server I just copied that backup file to another computer and fired up VMware on that system. No reconfiguration necessary and minimal downtime.

In other words that generic computer can be copied to another device to run in a pinch. If I didn't have that system virtualized I would have had to reinstall an operating system on a dedicated piece of hardware, configured exactly for that hardware (device drivers are a pain sometimes!), and it would have taken more time and hence more downtime. A virtual machine is the same no matter where you're running it...as long as you have your virtualization software installed and space on that system's drive (and memory) you're good to go.
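The whole backup routine amounts to a couple of commands along these lines. The paths and the VM name are whatever you picked, and I'm writing the vmware-cmd calls from memory, so treat this as a sketch rather than a recipe:

    # Back up a VMware Server guest living in /vms/bbs (names made up)
    vmware-cmd /vms/bbs/bbs.vmx stop soft          # shut the guest down cleanly
    cp -a /vms/bbs /backup/bbs-$(date +%Y%m%d)     # copy the whole VM: config plus the giant disk file
    vmware-cmd /vms/bbs/bbs.vmx start              # bring it back up

Restoring onto another machine is just copying that directory over and pointing whatever VMware product is installed there at the .vmx file.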

Hmm...I thought about it and wondered: what if I ran virtualization software and mirrored the virtual hard drive file from computer 1 to computer 2 using DRBD?

VMware has a solution like that. As long as you're running hardware they approve of, with a lot of beef, plus VMware's additional enterprise tools...for lots of money...they can give you the ability to sync up and share virtual computers between two or more servers.

The lots of money part would be the tough part, though.

Digging some more...there is a Linux solution called Xen. There are other virtualization solutions under Linux, but this one is rather mature and fast; plus it keeps coming up in Google hits coupled with DRBD. Hmm...

Xen should give me the ability to install virtual Linux machines on a server, and because it's tied into the Linux High Availability (Linux-HA) project, there are scripts and hooks built into the heartbeat software and DRBD that will let me take two machines running DRBD and, when one machine fails, have the second one automatically take over the virtual machine. Nice!
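As I understand it, the trick is that the guest's disk just points at the DRBD device, so whichever physical machine is currently primary can boot the exact same virtual machine. Here's a sketch of what the Xen guest config might look like, with the name, memory, and kernel paths invented for the example:

    # /etc/xen/webvm.cfg (sketch) -- a guest whose disk lives on the mirrored DRBD device
    name    = "webvm"
    memory  = 512                              # MB of RAM for the guest
    kernel  = "/boot/vmlinuz-2.6-xen"          # paravirtualized guest kernel
    ramdisk = "/boot/initrd-2.6-xen.img"
    disk    = [ "phy:/dev/drbd0,xvda,w" ]      # the DRBD-mirrored drive is the guest's hard disk
    vif     = [ "bridge=xenbr0" ]              # guest networking through the Xen bridge
    root    = "/dev/xvda ro"

On whichever node is primary you'd start it with something like "xm create webvm.cfg"; the heartbeat failover script would be what runs that on the survivor after a failure.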

This would also simplify things like hardware upgrades or maintenance or other issues. Best of all, aside from the cost of time and hardware, the software to do this is free.

I know that this is all theoretical for me at this point. I've only been exposed to very basic virtualization software and basic hardware for redundancies like RAID cards and the like. I've definitely never done anything like working on a home-grown cluster. Fortunately I'm so far behind on having a working project to sell that this isn't a big issue right now.

Periodically I poke around and see what hardware is available or being thrown out by friends that I could procure for use in creating a very basic testbed, if for nothing else than to test performance on trash hardware. I'm also not fully certain how well any of what I've found so far would actually work in the real world; there could be other setups that work better. I'm still reading through material from the clustering and high-availability sites in my copious free time to see if there's some other solution that would match my proposed needs better.

If anyone out there knows more please feel free to leave comments. I'm open to ideas. In the meantime I'm probably going to continue with some writing and researching and eventually continue on with the programming...hopefully...

For others reading this who are techie hobbyists, I hope you found this at least a little interesting. If you are a techie with a home "data center" this may be the type of project that could boost your geek cred substantially. It would be cheaper than hardware RAID and more expensive than a big system with software RAID, but properly configured it would be safer than either one because it eliminates the computer itself as a single point of failure...instead that title would go to your network infrastructure (router and/or switch) or your "datacenter" (if both systems are in your house and you have a house fire...).

Interestingly enough, if performance is decent with compression, and depending on load, you might even be able to eliminate part of that by locating the two servers in separate geographic locations and connecting them with fiber or a dedicated VPN. But again I'd need to test performance for that with some testbed computers.

What do you think? Interesting? Anyone have ideas to share?
