July 27, 2005 - Costs of High Availability - Clustering Windows Server 2003 Russ Kaufmann
NOTE: For anyone looking for an actual cost, sorry, there isn't anything in this blog entry about the actual dollars needed.
I am having a flashback today; it must be the new medication. :)
The cost of HA seems to be a normal topic of discussion when a company looks into clustering and gets sticker shock. I can't stress enough that clustering is not the end-all solution. Please take a quick look at my blog entry on my definition of HA.
I was just talking to a client about how much clustering costs and how much the services cost to implement clustering. Yep, it isn't the same as just pricing a standard server and multiplying that by the number of nodes. Servers with large hard drives, lots of RAM and multiple processors have come down a great deal in price in the last couple of years. What used to be about the cost of a 700 series BMW is now about the cost of a Chrysler 300. Really. However, when you start talking about HA, you have much more than the costs of individual nodes in a cluster.
The main cost issue with clustering is the cost of the additional components that are needed above and beyond the nodes themselves. For example, I keep hearing the term "disk is cheap" bandied (I love that word) about in meetings. It isn't true in all cases. Yes, a large hard drive is not that expensive. A LUN on a high-end SAN is expensive. It is even more expensive when you consider the initial costs of building the infrastructure to host that LUN.
OK, so back to the discussion of cost. Yes, clustering is costly, because it requires:
Windows Server 2003, Enterprise Edition, which costs a good bit more than Standard Edition
Host Bus Adapters (two per server for redundancy) for the fiber fabric (yes, there are other less costly alternatives, but let's stick to mainstream right now) and the software for the HBAs
Fiber switches
SAN devices (or NAS depending on the certification of the hardware)
Experienced administrators (if you want it done right) to design and configure it
A 24/7 team for maintaining it (remember HA is not just clustering)
Significant documentation (in case the administrator gets hit by a bus)
Tried and tested processes
To achieve High Availability, an organization must implement well-defined, planned, and tested processes, software, and fault-tolerant hardware. The focus is application availability. Yes, this costs money.
My favorite salesperson used to ask this question a great deal when we would talk to clients and potential clients about HA: "How much does it cost for the application to be down?" If it doesn't cost much, implementing clustering and instilling an HA attitude just might not be worth it. If they say it costs a fortune, then the response is simple: "If it costs you so much to be down, why are you sweating this relatively small amount to do the best job possible of keeping it up?"
I hate to think about how many organizations out there are gambling (yep, that is what it is) with their IT assets and the businesses that run on them. If your company will go out of business if an application fails, don't you owe it to the owners to protect that application?
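The salesperson's question above can be turned into rough arithmetic. The sketch below is only an illustration: the function name and both dollar figures are made up for the example (this entry deliberately gives no real prices), and it assumes downtime cost scales linearly with hours down.

```python
def breakeven_downtime_hours(ha_cost: float, downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime at which the HA spend pays for itself.

    Both inputs are assumptions supplied by the reader; nothing here
    comes from real pricing.
    """
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return ha_cost / downtime_cost_per_hour


# Hypothetical figures, purely for illustration:
# a 250,000 HA project protecting an application that loses 50,000 per hour down.
hours = breakeven_downtime_hours(250_000, 50_000)
print(f"HA pays for itself after {hours:.1f} hours of avoided downtime")
```

If the break-even point is a handful of hours, the "why are you sweating this" response applies. If it is thousands of hours, clustering may not be worth it for that application.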
July 26 - WSRM for Microsoft Clusters Rodney R. Fournier
Recently Microsoft released a white paper (Using WSRM and Scripts to Manage Clusters - http://www.microsoft.com/downloads/details.aspx?familyid=ba2559e6-dd23-41a6-9efb-1d90f8f1fc17&displaylang=en) on how to configure and use Windows System Resource Manager to manage Clusters.
For those of you not familiar with WSRM, it's a free product that comes with Windows Server 2003, Enterprise Edition or Datacenter Edition, both of which you can run a cluster on today. http://www.microsoft.com/windowsserver2003/techinfo/overview/wsrmfastfacts.mspx.
Here are a few features of WSRM:
Basically you ensure with WSRM that your clustered application gets the resources it needs and so does your base OS. This way Exchange or SQL gets everything it can without impacting normal operations.
The article correctly states that WSRM is not cluster-aware. It will monitor individual computers in a cluster :) I would follow the best practice of configuring each clustered node with WSRM and identical resource allocation policies, process matching criteria, and other components of WSRM. Scripting the process is an excellent way to configure WSRM, as the article's title suggests.
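Because WSRM is configured per node, one simple safeguard is to check that every node really does carry the same settings. The sketch below is a generic comparison, not part of WSRM or the white paper: it assumes you have already exported each node's WSRM configuration to text by whatever means you use (the export step is not shown), and the function names and node names are hypothetical.

```python
import hashlib
from typing import Dict, List


def policy_fingerprint(policy_text: str) -> str:
    """Hash an exported configuration, ignoring leading/trailing whitespace per line."""
    normalized = "\n".join(line.strip() for line in policy_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def nodes_out_of_sync(exports: Dict[str, str], reference_node: str) -> List[str]:
    """Return the nodes whose exported configuration differs from the reference node."""
    expected = policy_fingerprint(exports[reference_node])
    return sorted(
        node for node, text in exports.items()
        if policy_fingerprint(text) != expected
    )


# Hypothetical exports keyed by node name.
exports = {
    "NODE1": "policy A\n",
    "NODE2": "policy A",
    "NODE3": "policy B",
}
print(nodes_out_of_sync(exports, "NODE1"))
```

Any node the check reports should be reconfigured from the same script used on the reference node before it is allowed to host a clustered application.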
July 22 - Moving a Cluster to a New SAN Russ Kaufmann
A fairly common scenario for a cluster administrator is to move a cluster from one SAN to another as SAN equipment is replaced with newer/faster SANs or the old SAN's lease is up and a new one is being brought in.
The easiest way that I have found to do this is to use these steps (this is from memory, so let me know if I missed one or two):
Super High Level Steps:
Slightly More Detailed Steps:
Again, these are basic steps. Some of the individual steps will require lots of work. I have now done this several times and am very happy with the results.
July 21 - Surviving the Windows Server 2003 Cluster Bomb Review Rodney R. Fournier
The following is an article recently published about clustering and recovering from a failure:
I would like to point out a few things. First go read the article (leave it open, so you can flip back to it below). Then come back here for my review.
Page 1:
First mistake - Running an Active/Active (A/A) cluster is a very bad idea. Period! Since the cluster is on Windows Server 2003, a third node that can handle a failure of either Exchange or SQL should be added. As is, this implementation is not worthy of a mission-critical workload. The biggest issue here would be memory fragmentation. Windows learns over time how to page and react to running your applications, so when one A/A server fails, the remaining server takes a very big performance hit.
Second mistake - Much like the first mistake: installing the Exchange and SQL bits (binaries) on the same machine. Yes, I know that Microsoft does this with Small Business Server, but they tweak things to allow them to work nicely together. Never do this for any reason. With a third node, both would be installed on it, but only one would be running at a time.
Third mistake - Performing a major (or minor) outage without a fully tested restore procedure and backups. On page 2 it states the current backups were not getting everything. They only figured this out during this fire drill. Shame on them for not getting exactly what was required or, heaven forbid, actually testing a restore.
Ka . . . . boom? What does that mean? What really happened? What lesson was learned from the Ka . . . . boom? Can you prevent it from happening again? Sure, you can restore/rebuild, but can you prevent it? You should always learn something from a disaster or you are doomed to repeat it.
Option 1 - Several freely available tools were not mentioned. http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/TechRef/7e782055-450b-46dd-a0a4-164eebf2ae18.mspx lists one of my favorites, the Server Cluster Recovery Utility.
Page 2:
10 hours to get SQL back up and running? What the heck!!! That would be an install and attach method. If you have a proper backup (and a tested restore), it is pretty easy. Restore SQL and the System State and use it :) The whole process will not take that long!
Again, I feel it's important to note that with the Cluster Recovery Utility you can get the signature back again. The process takes seconds.
4th paragraph - Having a dead cluster database on one node DOES NOT mean it is dead on the others. The information is stored in the registry of each node and within the quorum. I have seen the quorum corrupted. I have not seen the cluster hive of the registry corrupted; I have only seen it become out of date (as in, a server was turned off when something happened).
You can rebuild the quorum. http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/ServerHelp/c9fe11a9-97c0-496a-9223-ed4b77786368.mspx lists the 4 areas that need to be backed up. You do not have to use Automated System Recovery (ASR) to back up and restore your cluster. It is nice and works great, but it is not the only way. Here is a third-party article on recovery with normal System State backups (assuming total hardware failure): http://seer.support.veritas.com/docs/262709.htm.
July 21 - High Availability - How Not to Get There Russ Kaufmann
I am almost done laughing. Rod Fournier posted an entry about a really bad cluster recovery article. I was just dumbfounded.
I have written about high availability and some of the philosophy around making sure you do everything the right way whenever possible. The author of the offending article seems to have completely forgotten about how to treat a high availability environment.
Some basic mistakes that I hope you all will learn from reading the article:
There were several technical mistakes made in the article, but I won't go through them. I just don't have that much time. :) Rod does address some of them, but not all of them.
July 17 - Expand your SAN partition on your Windows 2000 or Windows Server 2003 Cluster Shared Disk Rodney R. Fournier
Question:
How can we expand a couple of volumes in our SAN infrastructure that are used for our Clustering solution?
I have Knowledge Base article 304736 - http://support.microsoft.com/default.aspx?scid=kb;en-us;304736. How much downtime am I looking at?
Answer:
Great question! I did this not that long ago... The diskpart part takes seconds. And I mean seconds. Let's break down the Q article steps (Microsoft steps in italics, my comments in bold):
First you have to prepare the SAN, which should not involve any downtime for the SAN or your cluster. This of course assumes that your SAN allows this action (contact the hardware vendor). Depending on the amount of space added, this step could take from minutes to weeks.
Excellent idea. You just never know :)
Yes, I mean completely power off nodes 2-8 (if you have that many).
This is where the outage starts. Make sure you are on the controlling node (the only one booted at this point) when you do this step.
NOTE: If you have any disk or Host Bus Adapter (HBA) utilities that access the disk, you may need to quit them or stop the services so that they will release any handles to the disk.
Good advice.
I have not had to do anything with this step, but you might. Again, check with your hardware vendor before proceeding.
You might have to Rescan to see the new space. You may already have a name for the partition; if so, you don't have to give it a new name (as long as the first name is unique).
NOTE: If you encounter any problems with the preceding two steps while you are extending the drive, contact your hardware vendor for assistance.
Dang, that is easy! Yes, it only takes a few seconds to extend. Can I bill the customer for a 3 minute job? Of course I can and will :)
Don't hold your breath, this step is simple.
Tip: turn the rest of the nodes on before you try to move the group to them :)
As you can see, it's really not that hard! The last time I did it on a two-node cluster, the outage was under 20 seconds, though I still performed it during a maintenance window :) The whole process took under 3 minutes with proper preparation.
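Part of "proper preparation" can be having the diskpart commands written out before the maintenance window starts. The sketch below only builds the text of a diskpart script; it does not run anything. The helper name and the volume number are hypothetical, and you would confirm the real volume number with `list volume` on the controlling node first.

```python
def build_extend_script(volume_number: int) -> str:
    """Return diskpart script text that extends one volume into adjacent free space.

    The volume number is an assumption supplied by the caller; verify it
    on the controlling node before using the script.
    """
    if volume_number < 0:
        raise ValueError("volume_number must be non-negative")
    commands = [
        "rescan",                           # pick up the space added on the SAN
        "list volume",                      # record the layout before the change
        f"select volume {volume_number}",   # focus on the shared disk volume
        "extend",                           # no size given: use all contiguous free space
        "list volume",                      # record the layout after the change
    ]
    return "\n".join(commands) + "\n"


# Hypothetical example: volume 3 is the cluster shared disk.
print(build_extend_script(3), end="")
```

Saved to a file, the output can be passed to `diskpart /s <file>` on the controlling node once the other nodes are powered off. Running it by hand the first time is still a good idea, since the vendor-specific steps above may apply.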