Making high availability more cost-effective in the cloud

Editor's note: The following is a guest article from David Bermingham, technical evangelist at SIOS Technology Corp.

Mission-critical database applications are often the most costly to run in the public cloud for a variety of reasons. They need to deliver high throughput performance and they must run 24x7 without fail.

As a result, they require redundancy, which involves replicating the data and provisioning standby server instances. Data replication requires real-time data movement, including across the WAN. And the high-availability provisions themselves incur a cost, whether to license commercial software or to customize open source software.

There are, of course, ways to make public cloud services equally, if not more, cost-effective than a high-availability, high-performance private cloud. Achieving the best results requires choosing the most appropriate option and then carefully managing how the public cloud services are utilized by each application.

The four HA options

Provisioning resources for high availability in a way that does not sacrifice security or performance has never been easy — or cheap. The challenge is especially difficult in a hybrid cloud environment where the private and public cloud infrastructures can differ significantly.

Potential solutions vary by application, operating system and cloud service provider (CSP). And the inherently complex configurations can be difficult to test and maintain, with the result that failover provisions often fail when they are actually needed.

The HA options that might appear to be the easiest to implement are those specifically designed for the various applications. A good example is Microsoft’s SQL Server database with its carrier-class Always On Availability Groups feature.

But there are two disadvantages to this approach. The higher licensing fees, in this case for the Enterprise Edition, can make it prohibitively expensive for many needs. And having different HA solutions for different applications makes ongoing management a constant (and costly) struggle.

Another appealing option involves the operating system's own HA provisions. Windows Server Failover Clustering, for example, is a powerful and proven feature that is integral to the OS. But on its own, WSFC does not provide a complete HA solution because it lacks a data replication feature.

Implementing robust data replication in the cloud, therefore, requires using separate commercial or custom-developed software.

For Linux, which lacks a feature like WSFC, the need for additional HA provisions and/or custom development is even greater. Using open source software requires integrating multiple capabilities that, at a minimum, must include data replication, server clustering, a heartbeat monitor and resource management.
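To make one of those building blocks concrete, here is a minimal sketch of a heartbeat monitor in Python. It is an illustration only, not the implementation used by any particular open source or commercial HA stack. The class name, the node names and the three-second timeout are all hypothetical choices for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags nodes that go silent."""
    timeout_s: float = 3.0  # hypothetical threshold, not a recommended value
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat message arrives from a node.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        # A node is presumed failed once its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.5)   # node-a keeps reporting; node-b goes silent
suspects = monitor.failed_nodes(now=4.0)
print(suspects)  # ['node-b']
```

A real stack would still have to pair this with clustering, replication and resource management, which is where most of the integration effort described above goes.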

But because getting the full HA stack to work well for every application is extraordinarily difficult, only very large organizations have the wherewithal needed to even consider taking on the task.

All CSPs offer at least some HA capabilities, presenting a third option. But these all have some limitations that are often unacceptable for the most critical applications.

Examples include: failovers normally being triggered only by zone outages and not by many other common problems; master instances only being able to create a single failover replica; and the use of event logs to replicate data, which creates a "replication lag" that results in temporary outages during a failover.

Of course, the issues and limitations involved are not insurmountable — with a sufficiently large budget. The challenge, therefore, is to find a "universal" approach capable of working cost-effectively for all applications running on either Windows or Linux across public, private and hybrid clouds.

Among the most versatile and affordable of such solutions is the fourth and final option: the purpose-built failover cluster. These HA solutions are implemented entirely in software that is designed specifically to create, as implied by the name, a cluster of servers and storage with automatic failover to assure high availability at the application level.

Most of these solutions provide a combination of real-time block-level data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more robust ones also offer various advanced capabilities such as:

Choice of block-level synchronous or asynchronous replication
Ease of configuration and operation
Support for the less expensive Standard Edition of SQL Server
WAN optimization to maximize performance while minimizing bandwidth utilization
Manual switchover of primary and secondary server assignments to facilitate planned maintenance
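The first item in the list, the choice between synchronous and asynchronous replication, involves a trade-off that a toy model can illustrate. The Python sketch below is a simplified simulation under stated assumptions, not how any specific product replicates blocks. The `ReplicatedVolume` class and its method names are invented for the example.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume replicating block writes to one standby."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = {}        # block number -> data on the primary
        self.standby = {}        # block number -> data on the standby
        self.pending = deque()   # writes queued for the standby (async mode only)

    def write(self, block: int, data: bytes) -> None:
        self.primary[block] = data
        if self.synchronous:
            # Synchronous: the write is acknowledged only after the standby has it.
            self.standby[block] = data
        else:
            # Asynchronous: acknowledge immediately and ship the block later.
            self.pending.append((block, data))

    def drain(self, max_writes: int) -> None:
        # Simulates the network delivering up to max_writes queued blocks.
        for _ in range(min(max_writes, len(self.pending))):
            block, data = self.pending.popleft()
            self.standby[block] = data

    def blocks_lost_on_failover(self) -> int:
        # Blocks the standby would be missing if the primary died right now.
        return sum(1 for b, d in self.primary.items() if self.standby.get(b) != d)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    for i in range(5):
        vol.write(i, b"payload-%d" % i)
async_vol.drain(max_writes=3)   # only three of five writes reached the standby

print(sync_vol.blocks_lost_on_failover(), async_vol.blocks_lost_on_failover())  # 0 2
```

In this model the synchronous volume never risks losing acknowledged writes but pays the replication round trip on every write. The asynchronous volume acknowledges faster, which matters over a WAN, at the cost of a window in which recent writes exist only on the primary.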

Although these general-purpose solutions are typically storage-agnostic, which lets them work with storage area networks, shared-nothing SANless failover clusters are usually preferred because they eliminate potential single points of failure.

Cost-effective, carrier-class protection

The diagram below shows how a three-node SANless failover cluster is able to handle two concurrent failures with minimal or no downtime. The basic operation is the same in the LAN and/or WAN for Windows or Linux in a private, public or hybrid cloud.

Server No. 1 is initially the primary or active instance that replicates data continuously to both servers No. 2 and No. 3. When it experiences a problem, an automatic failover to server No. 2 is triggered, and server No. 2 becomes the primary, replicating data to server No. 3.

In this situation, the IT department would likely begin diagnosing and repairing whatever problem caused server No. 1 to fail.

Once fixed, it could be restored as the primary, or server No. 2 could continue in that capacity replicating data to servers No. 1 and No. 3. Should server No. 2 fail before server No. 1 is returned to operation, as shown, a failover would be triggered to server No. 3.
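The sequence just described can be sketched as a small state model in Python. This is a simplified illustration under stated assumptions, not the logic of any actual clustering product. The `FailoverCluster` class, the server names and the fixed failover preference order are hypothetical.

```python
class FailoverCluster:
    """Toy cluster: one primary, with the remaining healthy nodes acting as replicas."""

    def __init__(self, nodes):
        self.order = list(nodes)      # failover preference order (an assumption)
        self.healthy = set(nodes)
        self.primary = self.order[0]

    def replicas(self):
        # Healthy nodes other than the primary receive replicated data.
        return [n for n in self.order if n in self.healthy and n != self.primary]

    def fail(self, node):
        self.healthy.discard(node)
        if node == self.primary:
            candidates = self.replicas()
            if not candidates:
                self.primary = None
                raise RuntimeError("no healthy node left to take over")
            self.primary = candidates[0]   # automatic failover to the next node

    def restore(self, node, make_primary=False):
        # A repaired node rejoins as a replica unless explicitly made primary.
        self.healthy.add(node)
        if make_primary or self.primary is None:
            self.primary = node


cluster = FailoverCluster(["server1", "server2", "server3"])
cluster.fail("server1")
after_first = (cluster.primary, cluster.replicas())    # ('server2', ['server3'])
cluster.fail("server2")
after_second = (cluster.primary, cluster.replicas())   # ('server3', [])
cluster.restore("server1")
after_repair = (cluster.primary, cluster.replicas())   # ('server3', ['server1'])
print(after_first, after_second, after_repair)
```

After the second failure, the model shows server No. 3 running as primary with no replica until a failed server is repaired, which is why prompt diagnosis and repair still matter even in a three-node cluster.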

In clusters like these, failovers are normally configured to occur automatically, and both failovers and failbacks can be controlled manually (with appropriate authorization, of course). Three-node clusters also facilitate planned hardware and software maintenance for all three servers while continuously providing high availability for the application and its data.

With a purpose-built solution, carrier-class high availability need not mean paying a carrier-like high cost.

Failover clusters that make effective and efficient use of compute, storage and network resources, and that are easy to implement and operate, minimize both capital and operational expenditures. The result is high availability that is more affordable for more applications than ever before.