It's late on a Saturday night, and an enterprise system crucial to company operations just went down. Technology executives fear this scenario, but with the right strategies and procedures, a business can turn it from an operational threat into a minor hiccup.
IT system failures have consequences beyond the stress they cause executives. For global businesses, the consequences can mean supply chain disruptions or a degraded customer experience.
More than 3 in 5 IT leaders say system outages mean lost productivity, according to data from software firm LogicMonitor. Another 2 in 5 decision makers said outages hurt revenue too.
Reliability starts in the design phase: leaders must build failover mechanisms into the initial setup of software and IT infrastructure.
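As a rough illustration of what designing for failover can look like, the Python sketch below tries a primary system first and falls back to a redundant one if the call fails. It is a minimal sketch under stated assumptions, not a production pattern: the `primary` and `standby` functions and the `call_with_failover` helper are hypothetical names invented for this example.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when neither the primary nor any redundant endpoint responds."""


def call_with_failover(endpoints: Sequence[Callable[[], str]]) -> str:
    """Try each endpoint in order and return the first successful result."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllEndpointsFailed(f"{len(errors)} endpoint(s) failed")


def primary() -> str:
    # Hypothetical primary system, simulated here as being down.
    raise ConnectionError("primary unavailable")


def standby() -> str:
    # Hypothetical redundant system that takes over.
    return "served by standby"


result = call_with_failover([primary, standby])
```

In this simulation the caller never sees the primary's error; the request is simply answered by the standby.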
"You want to engineer for reliability, or you want to engineer your application or your workload in such a way that it can either automatically failover to redundant systems or it can survive some sort of expected system degradation," said John Annand, research director at Info-Tech Research Group.
Technology leaders under pressure to keep systems running off-hours also have a number of tools to turn to, including software providers such as PagerDuty, OpsGenie and xMatters. These tools can connect with collaboration platforms and execute a communication strategy prepared ahead of time.
"One of the struggle points is scheduling who's going to be on call," said Forrester Senior Analyst Julie Mohr. But now tools have evolved to ensure on-call schedules go straight into productivity software such as Outlook, letting team members know how time off will impact their on-call schedule assignments.
Setting up an on-call IT system
For enterprise leaders hoping to ensure their companies can respond to outages whenever they happen, there are some essential tactics to execute, according to Mukesh Ranjan, VP at Everest Group.
- Enabling self-service for commonly occurring issues: Leaders can create marketplace portals, one-click resolutions, FAQs and do-it-yourself videos contextual to company needs.
- Deploying chatbots with embedded robotic process automation (RPA): These can address key workflows and use cases such as internet issues.
- Making resources available during weekends and graveyard shifts to respond to critical outages.
- Adopting a follow-the-sun model: Creating rotation schedules to ensure round-the-clock resolution.
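A follow-the-sun rotation can be pictured as a lookup from the current time to whichever regional team is in its working day. The Python sketch below assumes three hypothetical regions with eight-hour shifts expressed in UTC; the region names, shift boundaries and the `on_call_region` function are illustrative assumptions, not taken from any vendor's product.

```python
from datetime import datetime, timezone

# Hypothetical regional shifts in UTC hours [start, end); together they cover 24 hours.
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def on_call_region(moment: datetime) -> str:
    """Return the region whose shift covers the given timezone-aware moment."""
    hour = moment.astimezone(timezone.utc).hour
    for region, start, end in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError("shift table does not cover this hour")


# A Saturday late-night outage at 23:30 UTC lands with the team that is awake.
saturday_night = datetime(2024, 1, 6, 23, 30, tzinfo=timezone.utc)
region = on_call_region(saturday_night)
```

Because each shift falls in a region's daytime, no one has to be woken up as long as the shift table covers all 24 hours.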
AI, often touted as an accelerant of efficiency for organizations, can help, too. By triaging and automating some system resiliency tasks through AIOps tools, companies can avoid waking up on-call engineers.
These tools feed off large amounts of data and can generate immediate solutions to common reliability issues.
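One way to picture this kind of triage is a policy that pages an engineer for an issue until a fix has worked often enough, then lets the fix run automatically. The Python sketch below is a simplified assumption of how such a gate might work, not a description of any AIOps product; the `RemediationPolicy` class, the issue labels and the threshold of three are all hypothetical.

```python
from collections import Counter

APPROVALS_BEFORE_AUTOMATION = 3  # hypothetical trust threshold


class RemediationPolicy:
    """Decide whether a known issue is fixed automatically or escalated."""

    def __init__(self, threshold: int = APPROVALS_BEFORE_AUTOMATION) -> None:
        self.threshold = threshold
        self.successes: Counter = Counter()

    def record_success(self, issue_type: str) -> None:
        # Called each time a fix for this issue type is confirmed to have worked.
        self.successes[issue_type] += 1

    def decide(self, issue_type: str) -> str:
        # Automate only once the fix has a track record; otherwise escalate.
        if self.successes[issue_type] >= self.threshold:
            return "auto_remediate"
        return "page_engineer"


policy = RemediationPolicy()
first_decision = policy.decide("disk_full")
for _ in range(3):
    policy.record_success("disk_full")
later_decision = policy.decide("disk_full")
```

Here the first occurrence of the hypothetical `disk_full` issue pages a person, while later occurrences are handled without waking anyone.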
"After so many times you see it correcting a situation you become much more trusting, and you're more likely to allow the auto correction to occur," Mohr said.
But beyond the technical complexities, setting up a resilient technical backbone also means taking talent management into account, especially as IT workers continue to have their pick of places to work.
"Off-hours on-call responsibilities can sometimes be overwhelming," Ranjan said. "Providing an ear to listen to challenges by people and proactively trying to address those creates a good culture and more ownership."