Last time we discussed security: how you can be sure your digital assets aren’t available for unauthorized eyes to see. This week we’ll discuss how the technology ensures that, even during a hardware failure, you never lose access to the files in your digital asset management system.
I really need to feel I can depend on a digital asset management system. How do I know it will be there for me at any time of day or night? What if something breaks?
We all know that hardware can fail, and it usually does so at the worst possible time (“Stuff Happens”). So to ensure very high availability, we need to make sure we do not have a single point of failure: there can be no single system that can fail and bring our whole system down. Everything has to be redundant. A system that can survive one or more hardware failures is called “Fault-Tolerant”.
What would that look like? (We are omitting the firewalls to keep the diagrams clear)
Here we show two complete, separate systems, one in the left stack and one in the right stack (by the way, one of the beauties of such an architecture is that the two stacks can even be located in different geographic places, in case you want to be flood- or earthquake-proof). As you can see (black arrows), the two Database Servers are connected by a process that replicates information between them, as are the two File Servers: at any point in time they have identical contents, even though they are separate boxes. Also, either Web Server can connect to either Database Server or either File Server (even if they are geographically dispersed).
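If you like to see things in code, here is a toy sketch of the replication idea: every write goes to both copies before it is considered done, so either box can answer a read. (The class and names are purely illustrative, not from any real product.)

```python
# Toy sketch of synchronous replication between two paired servers.

class ReplicatedStore:
    """Keeps two stores in lockstep, like the paired Database/File Servers."""

    def __init__(self, primary, replica):
        self.stores = [primary, replica]

    def write(self, key, value):
        # A write only "counts" once BOTH copies have it,
        # so the two boxes always have identical contents.
        for store in self.stores:
            store[key] = value

    def read(self, key):
        # Either copy can answer -- they are identical.
        return self.stores[0][key]

db1, db2 = {}, {}
store = ReplicatedStore(db1, db2)
store.write("asset-42", "photo.tif")
assert db1 == db2  # both boxes hold the same data
```

Real replication engines are far more sophisticated (they handle conflicts, lag, and recovery), but the contract is the same: after every write, both boxes agree.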
At the top, users come into the system, and the first thing they reach is a pair (since EVERYTHING has to be redundant!) of Load Balancers, each of which decides which of the two Web Servers is less busy and sends your request to that one. If the Load Balancer detects that one of the Web Servers has failed, of course it is smart enough to send your request to the surviving Web Server! So far, so good: we have made it into the system, and even if one of the Web Servers has failed, we are still “talking” to a live system. In this diagram, let’s say Web Server 1 on the left side has failed, and Web Server 2, at the top of the right stack, is handling our request.
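The Load Balancer’s job boils down to two rules: skip any server that fails its health check, then pick the least busy survivor. A minimal sketch (all names and numbers are made up for illustration):

```python
# Toy load-balancer decision: healthy servers only, least busy wins.

def pick_server(servers):
    """servers: list of dicts like {"name": ..., "healthy": ..., "load": ...}"""
    alive = [s for s in servers if s["healthy"]]
    if not alive:
        raise RuntimeError("no Web Servers available")
    return min(alive, key=lambda s: s["load"])  # least busy survivor

web_servers = [
    {"name": "Web Server 1", "healthy": False, "load": 0.2},  # failed box
    {"name": "Web Server 2", "healthy": True,  "load": 0.7},
]
print(pick_server(web_servers)["name"])  # prints "Web Server 2"
```

Even though Web Server 1 reports a lighter load, it fails the health check, so the request goes to Web Server 2, exactly as in our diagram.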
Now Web Server 2 figures out who we are, what we are entitled to see and do, and parses our request into a query for the Database Server. Normally it would send its query to Database Server 2, right below it, but in this example, let’s say that system has failed too. Web Server 2 is smart enough to notice that it has gotten no response from Database Server 2 (this takes milliseconds to figure out), and it sends its query across the stack to Database Server 1, on the left side! We are still in good shape: our user (you) still has no idea that two hardware boxes have failed, and you can still do your work.
Web Server 2 decides which documents it needs to fetch to build the web page you need to get back, and, as luck would have it, File Server 2 has ALSO failed (this stuff is having a very bad day). No worries: Web Server 2 just reaches out to File Server 1 for the documents it needs, finishes building the web page you want, and shoots it off to you! You get back to reviewing digital assets, or otherwise working on your project. You have no idea there have been any issues on the server farm.
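The failover logic in these last two steps, first for the Database Servers and then for the File Servers, is the same pattern each time: try the nearby box, and if it doesn’t answer, fall back to its twin across the stack. A hedged sketch (the function names are hypothetical, and a real system would use a network timeout of milliseconds rather than a plain exception):

```python
# Toy failover: try the primary, fall back to its replicated twin.

def query_with_failover(primary, secondary, request):
    for server in (primary, secondary):
        try:
            return server(request)   # in real life: a network call with a
        except Exception:            # timeout of a few milliseconds
            continue                 # this box is down -- try its twin
    raise RuntimeError("both replicas failed")

def dead_server(request):
    raise ConnectionError("no response")  # stands in for a failed box

def live_server(request):
    return f"results for {request}"

# Database Server 2 is down; the query crosses over to Database Server 1.
print(query_with_failover(dead_server, live_server, "asset list"))
# prints "results for asset list"
```

Because the replicas have identical contents, the fallback answer is just as good as the one the failed box would have given, which is why you never notice the failure.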
Net result: THREE hardware platforms have failed, and you can still work away undisturbed. Meanwhile technicians can work on the failed boxes and bring your system back to fault-tolerant status without troubling you. Pretty cool, and this is why a fault-tolerant system can contractually guarantee you high availability.
Posted by David Tenenbaum
Flickr photo by torkildr