What is a cluster?
A cluster is a group of independent computers working together as a single system to ensure that mission-critical applications and resources are as highly available as possible. The group of computers is managed as a single system, shares a common namespace, and is specifically designed to tolerate component failures. A cluster supports the addition or removal of components in a way that's transparent to users. Clustered applications have several advantages, including fault tolerance, high availability, scalability, simplified management, and support for rolling upgrades and other planned maintenance activities.
There are two different types of cluster models in the industry: the shared device model and the shared nothing model.
In the shared device model, applications running within a cluster can access any hardware resource connected to any node in the cluster. As a result, access to the data must be synchronized. In many such implementations, a special component called a Distributed Lock Manager (DLM) is used for this purpose. A DLM is a service that manages access to cluster hardware resources. When multiple applications access the same resource, the DLM resolves any conflicts that might arise. Along with this sophistication and complexity, a DLM adds significant overhead to the cluster. Most of the performance loss shows up as additional traffic between nodes; however, performance also suffers because access to hardware resources must be serialized.
By default, Microsoft Cluster Server and the Windows Cluster Service use the shared nothing model. Because this model does not use a DLM, it does not have the overhead incurred by using such a service. In the shared nothing model, only one node can own and access a single hardware resource at any given time. When failure occurs, a surviving node can take ownership of the failed node's resources and make them available to users.
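The ownership rule described above can be illustrated with a minimal sketch. This is a toy model, not how Microsoft Cluster Server or the Windows Cluster Service is implemented; the class name, method names, node names, and resource name are all hypothetical.

```python
class SharedNothingCluster:
    """Toy model: each resource has exactly one owning node at a time."""

    def __init__(self, nodes):
        self.alive = {node: True for node in nodes}
        self.owner = {}  # resource name -> owning node

    def assign(self, resource, node):
        if not self.alive.get(node, False):
            raise ValueError(f"{node} is not an active cluster node")
        self.owner[resource] = node

    def access(self, resource, node):
        # Only the current owner may use the resource, so no lock manager is needed.
        return self.owner.get(resource) == node and self.alive[node]

    def fail(self, failed_node):
        """Mark a node as failed and hand its resources to a surviving node."""
        self.alive[failed_node] = False
        survivors = [n for n, up in self.alive.items() if up]
        if not survivors:
            raise RuntimeError("no surviving node to take ownership")
        for resource, node in self.owner.items():
            if node == failed_node:
                self.owner[resource] = survivors[0]


cluster = SharedNothingCluster(["node-a", "node-b"])  # hypothetical node names
cluster.assign("disk-q", "node-a")
print(cluster.access("disk-q", "node-b"))  # node-b does not own the disk yet
cluster.fail("node-a")
print(cluster.access("disk-q", "node-b"))  # node-b took ownership after the failure
```

Because a resource never has two owners, there is nothing to arbitrate, which is why this model avoids the DLM overhead.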
Although both Microsoft Cluster Server and the Windows Cluster Service use the shared nothing model by default, they can also use the shared device model, but only if the clustered application supplies its own DLM.
Why should organizations use clusters?
Generally speaking, hardware failure is not the predominant cause of downtime for applications. The leading causes of downtime are typically related to events that are external to the system, such as misconfiguration, power outages, security breaches, and so forth. Clustering cannot help you solve those types of problems. In addition, a cluster cannot protect you from software incompatibilities, corrupt databases, viruses, catastrophes, or mistakes. Clustering is best implemented when a substantial proportion of your server downtime is caused by hardware failure, patching, and upgrades. If your organization’s leading cause of downtime is the result of failures in administration, software, or infrastructure, an investment in clustering technology may not reduce your downtime.
You need to assess the reasons for server downtime in your organization. Look at the problems that clustering solves, and then make a business decision as to whether clustering is an appropriate solution. The primary focus of clustering is solving problems that arise from hardware failure, such as a blown CPU, bad memory, or the loss of an entire server, and from the downtime associated with patching and upgrading. In addition, clustering allows you to continue providing resources during planned outages that would otherwise cause downtime for users. A cluster system can allow resources to be manually moved (failed over) to one server while the other is brought down to perform a rolling upgrade, a configuration change, or other maintenance.
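One rough way to start that assessment is to tally past outages by cause and see how much of the total falls into the categories clustering addresses. The sketch below assumes you keep an outage log of (cause, minutes) pairs; the cause labels and the sample figures are invented purely for illustration.

```python
from collections import defaultdict

# Causes the text says clustering can help with (hardware failure, patching, upgrades).
CLUSTER_ADDRESSABLE = {"hardware", "patching", "upgrade"}


def clustering_share(outages):
    """Return the fraction of total downtime whose cause clustering can address.

    `outages` is an iterable of (cause, minutes_down) pairs.
    """
    minutes_by_cause = defaultdict(float)
    for cause, minutes in outages:
        minutes_by_cause[cause] += minutes
    total = sum(minutes_by_cause.values())
    if total == 0:
        return 0.0
    addressable = sum(
        m for c, m in minutes_by_cause.items() if c in CLUSTER_ADDRESSABLE
    )
    return addressable / total


# Hypothetical outage log, not real data.
sample_log = [
    ("hardware", 120),
    ("misconfiguration", 300),
    ("patching", 60),
    ("power", 120),
]
print(f"{clustering_share(sample_log):.0%} of downtime is cluster-addressable")
```

A low share suggests that money spent on clustering would be better spent on process, software, or infrastructure improvements.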
A rolling upgrade is the process of applying a service pack or other hardware or software update to each node in the cluster, one node at a time, while the other nodes continue providing service. A rolling upgrade typically proceeds in stages:
- Move the groups from the node to be upgraded to another node.
- Take the node to be upgraded offline.
- Install or upgrade the software or hardware on the offline node.
- Bring the upgraded node online.
- Move the groups back to the upgraded node.
Then, repeat this process on each node in the cluster until the entire cluster is upgraded. Rolling upgrades are very attractive from a server management standpoint because services are unavailable only during the time it takes to move resources from one node to the other.
By design, clusters help increase uptime, which means reducing both planned and unplanned downtime. When any mission-critical system fails, the consequences can include lost revenue, interruption of services to customers, and knowledge workers sitting idle. In organizations of all sizes, failures incur costs in many areas. Hidden costs often include damage to your reputation among customers, suppliers, and end users, and the perception that your organization isn’t able to satisfy customer needs.
Understanding the limitations of clustering is just as important as understanding the benefits. While clustering protects against the failure of a node in the cluster, it does not provide any protection against other problems, such as network failures, database corruption, loss of shared storage, or disasters.
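The rolling upgrade stages listed earlier can be sketched as a simple loop. This is an illustrative model only: the function name, the `upgrade` callback, and the node and group names are hypothetical, and a real cluster would use its own management tools to move groups and take nodes offline.

```python
def rolling_upgrade(groups_by_node, upgrade):
    """Upgrade every node in turn while its groups run on another node.

    `groups_by_node` maps node name -> list of resource groups it hosts.
    `upgrade` is a callable applied to the offline node (hypothetical hook).
    Returns a log of the steps taken.
    """
    nodes = list(groups_by_node)
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least two nodes")
    log = []
    for i, node in enumerate(nodes):
        standby = nodes[(i + 1) % len(nodes)]
        moved = groups_by_node[node]
        # Stage 1: move the groups off the node to be upgraded.
        groups_by_node[standby] = groups_by_node[standby] + moved
        groups_by_node[node] = []
        log.append(f"moved {moved} from {node} to {standby}")
        # Stages 2-4: take the node offline, upgrade it, bring it back online.
        log.append(f"{node} offline")
        upgrade(node)
        log.append(f"{node} online")
        # Stage 5: move the groups back to the upgraded node.
        groups_by_node[standby] = [
            g for g in groups_by_node[standby] if g not in moved
        ]
        groups_by_node[node] = moved
        log.append(f"moved {moved} back to {node}")
    return log


cluster = {"node-a": ["sql"], "node-b": ["file-share"]}  # hypothetical layout
upgraded = []
for step in rolling_upgrade(cluster, upgraded.append):
    print(step)
```

In this model the only service interruptions are the two moves per node, which matches the point that services are unavailable only while resources are being moved.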
Before implementing a cluster in your environment, you should evaluate whether this solution really solves enough of your problems to justify its cost. Clustering adds complexity to your environment and administration. Therefore, it is important that you understand and evaluate this technology in relation to your overall goals and the needs of your network.
What is meant by fault tolerance?
Fault tolerance is the ability of a system to continue functioning when part of it fails (i.e., experiences a fault). The term is used to describe disk subsystems (e.g., RAID), symmetric multiprocessing (SMP), redundant power supplies (with separate power sources), uninterruptible power supplies, redundant network adapters, etc. Fault tolerance is designed to alleviate the problems caused by component failures, power outages, or other like occurrences. Computers contain many moving parts, and moving parts will eventually fail.
Disk subsystems that use RAID, which stands for Redundant Array of Inexpensive Disks (or Redundant Array of Independent Disks, or Redundant Array of Inexpensive Devices, depending on who you ask), are considered fault tolerant. RAID refers to the grouping of individual hard disks in a way that provides continued operation in the event of a disk failure. There is both hardware RAID (i.e., a RAID controller is used) and software RAID (i.e., the functionality is provided by an operating system or application). There are many forms (levels) of RAID.
There are other implementations of RAID, such as RAID-0+1 and the closely related RAID-10 (both combine striping with mirroring, in opposite order), RAID-2, RAID-3, etc., but support for these varies among hardware manufacturers. Many vendors have added other twists to disk fault tolerance. For example, some vendors offer multiple parity disks, which allow multiple disks to fail while the data remains available.
Similar fault-tolerant technology has been applied to the main memory in high-end computers, which allows memory chips to fail without the computer missing a step.
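To show how parity makes continued operation possible after a disk failure, here is a minimal sketch of single-parity reconstruction using XOR, the technique behind parity-based RAID levels such as RAID-5. The block contents and the helper name `xor_blocks` are illustrative; real controllers also handle striping, parity rotation, and rebuild scheduling, none of which is modeled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_disks = [b"ABCD", b"EFGH", b"IJKL"]  # illustrative data blocks
parity = xor_blocks(data_disks)           # what the parity disk would store

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])
```

A single parity block can recover exactly one missing block. Tolerating several simultaneous failures, as with the multiple-parity designs mentioned above, requires additional, independently computed parity.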
What is meant by High Availability?
In a nutshell, High Availability is the combination of well-defined, planned, tested, and implemented processes, software, and fault-tolerant hardware focused on supplying and maintaining application availability.
As a high-level example, consider messaging in an organization.
BAD - A poor implementation of Exchange is usually slapped together by purchasing a server that the administrator feels is about the right size and installing Exchange Server 2003 on it. Messaging clients are installed on network-connected desktops and profiles are created. The Exchange server might even be successfully configured to connect to the Internet. It is very possible to install an Exchange messaging environment over a short business week, and even overnight in some cases. It is easy to do it fast and get it done, but lots of important details are missed.
GOOD - In an HA environment, the deployment is well designed. Administrators research organizational messaging requirements. Users are brought into discussions along with administrators and managers. Messaging is considered as a possible solution to many company ills. Research may go on for an extended period as consultants are brought in to help build a design and review the designs of others. Vendors are brought in to discuss how their products (antivirus and content management solutions, for example) are going to help keep the messaging environment available and not waste messaging resources processing spam and spreading viruses (or is that virii?). Potential 3rd party software is tested and approved after a large investment of administrator and end user time.
Hardware is sized and evaluated based on performance requirements and expected loads. Hardware is also sized and tested for disaster recovery and to meet service level agreements for both performance and time to recovery in the case of a disaster. The hardware selected will often contain fault-tolerant components such as redundant memory, drives, network connections, cooling fans, power supplies, and so on. An HA environment will incorporate lots of design, planning, and testing. It will often, but not always, include additional features such as server clustering, which decreases downtime by allowing for rolling upgrades and a preplanned response to failures.
A top-notch HA messaging environment will also consider the messaging client software and the configurations that lead to increased availability for users. For example, Outlook 2003 offers a cache mode configuration that allows users to create new messages, respond to existing mail in their inbox, and manage their calendars (among many other tasks) without having to maintain a constant connection to the Exchange server. Cache mode allows users to continue working even though the Exchange server might be down for a short time, and it also allows for more efficient use of bandwidth.
The Goal - All critical business systems have to be analyzed to understand the cost of them being unavailable. If there is a significant cost, then the organization should take steps to minimize downtime. Taking this view to the extreme, the goal is really to provide continuous availability (CA) of applications and resources for the organization. Doesn't everyone want email to always be available processing messaging traffic and helping the people in the organization collaborate? Of course that is what we want. We want applications and their entire environment to continue running forever.
We strive for CA and we settle for HA.
"In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." A widely-held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability."
Source: http://searchcio.techtarget.com/sDefinition/0,,sid19_gci761219,00.html
Obviously, "continuously operational" just isn't possible over extremely long periods of time. Hardware will always fail; it is just a matter of when. Software becomes obsolete over time, too. We all need to understand that HA includes not just the hardware and software solution, but also the backup/restore solution and failover processing. Most HA experts will also add that a true HA environment includes a well-documented development, test, and production migration process for any and all changes to be made in production environments. There is much to achieving HA, but it simply comes down to achieving high levels of application availability through well designed, planned, tested, and implemented processes, software, and hardware.
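To put the "five 9s" figure quoted above in concrete terms, the short calculation below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and counts all downtime, planned or unplanned; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At 99.999 percent, the budget works out to roughly 5.26 minutes per year, so a handful of failovers that each take a minute or more would use up the entire allowance.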
As another example, suppose you use network load balancing (NLB) to provide application availability to your users over the Internet for your web-based app. NLB helps keep the application available to your users. The same can be said for server clustering; however, you need to take into account the non-availability during the actual failover of your application in the event of hardware or software failures. Sometimes failover is a matter of seconds; in other cases it can take several minutes. Either way, a clustering solution can significantly reduce non-availability and increase the uptime of the application running on your servers.
Many experts state that, for any application or system to be highly available, the parts need to be designed around availability and the individual parts need to be tested before being put into production. For example, if you are using 3rd party products with your Exchange environment that have not been properly tested, you may find that they are a weak link that results in loss of availability. Implementing a cluster will not necessarily result in HA if there are problems with the software, as was discussed previously. HA is so much more than just slapping a couple of servers together in a cluster. Please keep in mind all of the details behind a top-notch HA environment.
What are the different types of clustering?