10 tips for surviving an IT outage with IT resilience

The following is a guest article from Todd Scallan, VP of product and engineering at Axcient.

Growing business demands and shrinking budgets continually burden IT professionals. The result is increased pressure on their ability to respond to a crisis, from breaches and ransomware attacks to natural disasters and data loss due to user error.

That pressure often translates into prolonged downtime that can cost a business dearly, in many cases to the point of closure. That’s why reducing downtime from days to hours, or even minutes, helps keep your business running and protects its bottom line.

In October 2016, a survey called "The State of IT Resilience," published by Axcient, found that of the 500 IT professionals surveyed at VMworld, nearly 80% said it would take them anywhere from more than an hour to several days to get their systems fully up and running after a catastrophic event. In addition, 72% indicated they use three or more solutions for data protection and disaster recovery, bogging them down with redundant infrastructure and multiple copies of data to manage. The result is an inability to recover fast enough from an IT outage.

It’s not a question of if, but when, an IT outage will occur. When it does, swift recovery is critical. While disaster recovery aims to do just that (recover), IT resilience is intended to protect against and prevent downtime, as well as to speed recovery after an incident. Put your plan in place with these 10 tips to build a robust IT resilience strategy so you can survive, and even thrive, during and after a crisis.

1. Think through all possible downtime scenarios.

When planning for a crisis, most businesses assume a catastrophic natural disaster-type event. In actuality, a business is more likely to be hit by a ransomware attack, server failure, power outage or accidental deletion of data by a user. Any of these incidents can bring down all or part of a business operation. Knowing your vulnerabilities is the first step in being able to protect against and recover from them.

2. Ask the hard questions about how well your business can withstand a crisis.

For instance, how much data loss and downtime can your business tolerate for each of your critical and noncritical systems? What would be the impact of being down for an hour, a day or a week? What would a potential outage cost in terms of lost revenue, lost customers and decreased employee productivity? Answers to these questions will help inform how to defend against and prepare for an outage.
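To put rough numbers on these questions, the following is a minimal sketch of an outage cost estimate. The function name, its parameters and every figure in it are hypothetical assumptions for illustration, not data from the survey. It covers only lost revenue and lost productivity, leaves out harder-to-quantify losses such as lost customers, and treats every hour of the outage as a working hour.

```python
def estimate_outage_cost(hours_down, revenue_per_hour, employees_affected,
                         cost_per_employee_hour, productivity_loss=1.0):
    """Rough outage cost: lost revenue plus lost employee productivity.

    All inputs are illustrative assumptions. productivity_loss is the
    fraction of working time lost during the outage (0.0 to 1.0).
    """
    if hours_down < 0:
        raise ValueError("hours_down must be non-negative")
    lost_revenue = hours_down * revenue_per_hour
    lost_productivity = (hours_down * employees_affected
                         * cost_per_employee_hour * productivity_loss)
    return lost_revenue + lost_productivity


# Hypothetical figures for a small business: compare an hour, a day and a week.
# Simplification: every hour of downtime is counted as a working hour.
for label, hours in [("1 hour", 1), ("1 day", 24), ("1 week", 168)]:
    cost = estimate_outage_cost(hours, revenue_per_hour=2000,
                                employees_affected=25,
                                cost_per_employee_hour=40,
                                productivity_loss=0.5)
    print(f"{label}: ${cost:,.0f}")
```

Running the same estimate separately for critical and noncritical systems gives a first indication of where a faster recovery solution pays for itself.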

3. Make sure your entire perimeter is secure.

Beyond a properly configured firewall and antivirus software for your data center, make sure you understand what your organization’s perimeter actually is, from on-premises servers and desktops to mobile devices and cloud-based servers and applications.

From there, consider how intrusion prevention systems and breach detection solutions can protect your IT environment. In a BYOD world, MAC filtering and other methods can further lock down your perimeter.

4. Implement a recovery solution that works in minutes, not hours or days.

Relying on IT staff to manually recover systems will typically prolong downtime and hurt a business. Instead, find a data protection solution that enables quick recovery using point-in-time snapshots of entire systems, including the operating system, data and application state.

5. Understand how point-in-time snapshots are critical to resuming operations.

Snapshots let you go back to precisely how systems and data existed at a specific point in the past. Beyond being handy for use cases like archival and restoring corrupted files or emails, point-in-time snapshots allow for the rapid resumption of operations after a ransomware attack. Instead of being held hostage until the ransom is paid, you can go back to a point in time before the attack and resume normal operations. This has the added benefit of discouraging the bad actors who want to profit from their evil deed.
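The idea of going back to a point before the attack can be sketched as follows. This is only an illustration, assuming snapshots are identified by their timestamps and that the time of the attack is known. The function name and the sample schedule are hypothetical, and a real data protection product would expose its own interface for this.

```python
from datetime import datetime


def latest_clean_snapshot(snapshot_times, attack_time):
    """Return the most recent snapshot taken strictly before attack_time.

    Returns None if no snapshot predates the attack.
    """
    candidates = [t for t in snapshot_times if t < attack_time]
    return max(candidates) if candidates else None


# Hypothetical schedule: one snapshot every six hours on a single day.
snapshots = [datetime(2017, 3, 1, hour) for hour in (0, 6, 12, 18)]
attack = datetime(2017, 3, 1, 14, 30)

restore_point = latest_clean_snapshot(snapshots, attack)
print(restore_point)            # 2017-03-01 12:00:00
print(attack - restore_point)   # 2:30:00 of work created after the snapshot is lost
```

The gap between the restore point and the attack is the data loss the business has to tolerate, so the snapshot interval should be chosen with that tolerance in mind.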

6. Leverage virtualization to save time, effort and expense.

The value of virtualization for production workloads is well understood. However, physical servers and infrastructure remain present in many IT environments. To increase resilience and reduce downtime, virtualization can be extended to protect a business from physical system outages. Look for a data protection solution that supports failing over a physical server to a VM locally or in the cloud. This obviates the need for redundant physical hardware when responding to a server failure.

7. Harness the power of the cloud.

Disaster Recovery as a Service (DRaaS) offers a cost-effective, highly resilient means of failing over part or all of a data center without the need for duplicate hardware. It uses the scalable, on-demand nature of the cloud to deliver levels of IT resilience previously available only to the largest enterprises.

8. Evaluate which protection solution is right for your business.

Cost is not the only consideration when evaluating data protection solutions. Also consider how seamless the solution needs to be from an IT administrative and recoverability perspective. From there, you can decide whether a single-vendor solution or aggregating solutions from multiple suppliers is appropriate.

Another fundamental consideration is whether to replicate data and systems to secondary hardware, or shift to the on-demand computing and operating-cost model of the cloud. Finally, try to eliminate reliance on manual steps during protection and recovery operations. Otherwise, you risk not achieving your recovery objectives.

9. Document recovery procedures ahead of time.

A prepared IT organization has a clearly documented protection and recovery plan with detailed, current and frequently practiced procedures. This enables swift and direct action, and less interruption to the business. Figuring out disaster recovery protocols mid-crisis is another disaster waiting to happen.

10. Test the solution often and look for gaps.

Recovery protocols should be run regularly to verify adequate breadth and depth of coverage, and to uncover any gaps or technical issues. Changes in primary infrastructure will likely impact your solution, so ensure readiness through frequent testing.
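One way to look for gaps after a test run is to compare measured recovery times against the recovery objectives set for each system. The sketch below assumes both are recorded in minutes per system. The function name, system names and numbers are all hypothetical.

```python
def find_recovery_gaps(drill_results, objectives_minutes):
    """Compare measured recovery times with per-system objectives.

    Returns a dict of systems that were not tested or that missed
    their objective, with a short description of the gap.
    """
    gaps = {}
    for system, objective in objectives_minutes.items():
        measured = drill_results.get(system)
        if measured is None:
            gaps[system] = "not tested"
        elif measured > objective:
            gaps[system] = f"recovered in {measured} min, objective {objective} min"
    return gaps


# Hypothetical objectives and results from the latest recovery test.
objectives = {"email": 60, "file-server": 30, "crm": 120}
results = {"email": 45, "file-server": 50}

for system, problem in find_recovery_gaps(results, objectives).items():
    print(f"{system}: {problem}")
```

Systems added to the primary infrastructure since the last test show up as "not tested" once they have an objective, which is a simple way to catch coverage that has drifted.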