If you work around computer systems with any regularity, you’ll hear the phrase “fault-tolerance” quite a bit. You realize it’s got to be a good thing, but what does it mean? Is it something you need for your systems?
Fault tolerance is the general term for how well an important application can withstand a hardware failure, software problems, or damaged data, and how quickly it can recover. There are hardware solutions – think multiple servers or hard disks – and software solutions – think automated backups or copies of your data stored in at least two locations. Many fault-tolerant systems combine hardware and software strategies to build a system where a serious problem can happen without affecting the users in any way. Well-designed systems let managers repair the problem and bring failed components back online while the users continue to work away happily.
Computers are clearly much more reliable than they used to be, but in most people's experience things still happen and computers still fail (usually at the worst possible time, of course!).
Whether you need your systems to be fault tolerant depends on how important your application is and how long you can wait to regain access to it in the case of a hardware problem. If your users can go hours or days without using an application, the only fault tolerance you might need is a tested and verified backup strategy that will allow you to eventually rebuild your system from the ground up. Of course, there aren’t a lot of systems out there these days with users that would be happy to hear it would be days before they could get back in!
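As a toy illustration of what "tested and verified" might mean for a backup strategy, here is a short sketch that archives a directory and then checks the archive's integrity before you rely on it. The paths and filenames are hypothetical, and a real strategy would also include restore drills and off-site copies.

```python
# Illustrative sketch only: a backup is only useful if it is verified.
# Paths and filenames here are hypothetical.
import hashlib
import tarfile
from pathlib import Path

def back_up(source_dir: str, archive_path: str) -> str:
    """Create a compressed archive and return its SHA-256 checksum."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=Path(source_dir).name)
    return hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()

def verify(archive_path: str, expected_checksum: str) -> bool:
    """Confirm the archive is intact and readable."""
    actual = hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()
    if actual != expected_checksum:
        return False
    with tarfile.open(archive_path, "r:gz") as tar:
        return len(tar.getnames()) > 0  # archive opens and has contents

checksum = back_up("/srv/app-data", "/backups/app-data.tar.gz")
assert verify("/backups/app-data.tar.gz", checksum)
```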
If that’s the case for you, then you need to look at more robust solutions. Fault tolerance can mean one of two things: either the storage is fault-tolerant, so you can keep working even if a hard drive fails, or the whole system is fully fault-tolerant, meaning entire server computers can fail in addition to hard drives and the users are never disturbed as they work.
Fault-tolerant storage uses drive arrays that spread the stored data across multiple disks, with hardware that lets you replace a single failed drive while the rest of the drives stay online and then rebuilds the data onto the new drive.
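To make the idea concrete, here is a minimal sketch of how parity lets an array reconstruct a failed drive's contents, using the XOR scheme behind RAID 5-style arrays. The drive contents here are made up, and real arrays do this in hardware or firmware rather than in Python.

```python
# Illustrative sketch only: XOR parity as used in RAID 5-style arrays.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Data striped across three "drives", parity stored on a fourth.
drive_a = b"first block of data "
drive_b = b"second block of data"
drive_c = b"third block of data "
parity  = xor_blocks(drive_a, drive_b, drive_c)

# If drive_b fails, its data can be rebuilt from the surviving drives:
rebuilt_b = xor_blocks(drive_a, drive_c, parity)
assert rebuilt_b == drive_b
```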
Fault-tolerant storage is a good first step and protects you from drive failures, but healthy drives won’t help you if you have a problem with other hardware in your server or with the OS that runs it. So the next step is a fully fault-tolerant system, where you run redundant servers, all configured to support your applications and all keeping copies of your data. There are various architectures to accomplish this, but what they have in common is enough hardware that any individual component can fail and the users can keep working while an administrator replaces the failed components. Some fully fault-tolerant schemes are more efficient and economical than others.
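As one simple illustration of the principle (a sketch, not any particular vendor's architecture), here is client-side failover across redundant servers: requests go to whichever server currently passes its health check, so users keep working if one machine fails. The hostnames and the /health endpoint are hypothetical.

```python
# Illustrative sketch only: fail over to whichever redundant server is healthy.
import urllib.request
import urllib.error

SERVERS = ["https://app1.example.com", "https://app2.example.com"]

def healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers its health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_server() -> str:
    """Use the first healthy server; skip any that have failed."""
    for server in SERVERS:
        if healthy(server):
            return server
    raise RuntimeError("No healthy servers available")

# Application traffic goes to the currently healthy server.
active = pick_server()
```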
Most mission-critical applications really need to be fully fault-tolerant once you consider how much money you lose every hour such a system is down! That is why most of our license sales, and all of our hosted digital asset management and e-Discovery systems, are fully fault-tolerant: we give our customers a hosted Service Level Agreement that obliges us to refund them hard cash if they go down, and we’d rather not pay that money out!
Posted by Jennifer Cox