System downtime affects organizations of all sizes and across every industry. Every once in a while, some of the world’s largest and most popular websites make headlines following a major outage. If cash-rich enterprises such as Facebook can experience downtime, no institution is immune.
One survey showed that 9 in 10 organizations encounter system downtime each year and 1 in 3 do so once a month on average. Losing access to systems including enterprise applications, backup processors and email servers diminishes employee productivity and frustrates customers.
In addition, data, security infrastructure and finances are all at risk during a system glitch. Although it is not possible to completely prevent an outage, there are ways business executives and IT leaders can minimize the risk. We look at a number of them.
1. Calculate the Cost of Downtime
One of the main reasons businesses fail to plan for downtime is that the impact goes unquantified. With such ambiguity and a lack of solid analysis, business executives may not see the rationale for investing in preventative measures.
Estimating the true cost of downtime includes calculating losses due to a drop in staff productivity, the halting of goods production or service provision, the man hours necessary to restore systems, the cost of repairing damaged equipment or recovering lost data, and the resources necessary to re-establish the reputation and the brand.
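The components above can be combined into a back-of-envelope estimate. The sketch below is illustrative: the cost categories follow the list in this section, but the function name, parameters, and example figures are assumptions, not numbers from any survey.

```python
# Hypothetical downtime cost estimator. The inputs and rates below are
# illustrative assumptions, not real benchmarks.

def downtime_cost(hours_down, employees_idle, hourly_wage,
                  revenue_per_hour, recovery_cost):
    """Estimate the total cost of an outage from its main components."""
    productivity_loss = hours_down * employees_idle * hourly_wage
    revenue_loss = hours_down * revenue_per_hour
    return productivity_loss + revenue_loss + recovery_cost

# Example: a 4-hour outage idling 50 staff at $40/hour, halting
# $5,000/hour in sales, plus $8,000 to repair equipment and restore data.
cost = downtime_cost(4, 50, 40, 5000, 8000)
print(f"Estimated outage cost: ${cost:,.0f}")
```

Even a rough model like this gives executives a concrete figure to weigh against the cost of preventative measures; reputational damage is harder to quantify and is deliberately left out here.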
2. Check SLAs of Key Partners and Vendors
Carefully studying the service level agreements (SLAs) of all key partners and vendors your organization works with is vital. Any system worth its salt will come with a robust SLA that clearly states the expected level of availability. If the SLA’s standards are not aligned with the business’ objectives, it may be time to explore other options.
Nowadays, businesses are keen on achieving 99.999% availability (also referred to as the ‘five nines’), which means no more than about 5.26 minutes of downtime each year. This, however, will be difficult to attain when using systems whose SLAs guarantee just 90% uptime.
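The arithmetic behind these figures is simple enough to check directly. The short helper below (a sketch, using a 365-day year) converts an availability percentage into an annual downtime budget:

```python
def annual_downtime_seconds(availability_pct):
    """Seconds of allowed downtime per 365-day year at a given availability."""
    seconds_per_year = 365 * 24 * 3600
    return seconds_per_year * (1 - availability_pct / 100)

# 'Five nines' allows roughly 315 seconds (~5.26 minutes) per year.
print(round(annual_downtime_seconds(99.999)))
# A 90% uptime guarantee allows about 876 hours of downtime per year.
print(round(annual_downtime_seconds(90) / 3600))
```

The gap is stark: a vendor promising 90% uptime can be down for more than a month each year and still be within its SLA.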
3. Use Active-Active Clusters
For years, many enterprises relied on active-passive clustering. This was not only an expensive approach due to the amount of hardware required, but also involved a less efficient model whereby redundant servers came online only when the primary server failed.
Active-active clustering helps balance server loads across multiple networks. This minimizes server and network overloads, thus reducing overall downtime. Active-active clusters can also be more resilient: one study found they experience 40% less data loss than active-passive setups.
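The core idea can be sketched in a few lines: every node serves traffic all the time, and a failed node is simply skipped rather than triggering a failover to a cold standby. The class, node names, and health model below are illustrative assumptions, not any particular clustering product.

```python
# Minimal sketch of active-active routing: all nodes serve requests,
# and an unhealthy node is skipped instead of failed over to.
from itertools import cycle

class ActiveActivePool:
    def __init__(self, nodes):
        self.healthy = {node: True for node in nodes}
        self._rotation = cycle(nodes)

    def mark_down(self, node):
        self.healthy[node] = False

    def route(self):
        """Round-robin across healthy nodes; raise if none remain."""
        for _ in range(len(self.healthy)):
            node = next(self._rotation)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes available")

pool = ActiveActivePool(["app-1", "app-2", "app-3"])
pool.mark_down("app-2")
print([pool.route() for _ in range(4)])  # "app-2" is never selected
```

Contrast this with active-passive: there, the redundant server sits idle until the primary fails, so its capacity is wasted in normal operation and the failover itself becomes a moment of risk.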
4. Buy the Right Capacity
When buying technology infrastructure, business and IT leaders unsurprisingly aim to buy the equipment they need at the lowest price. After all, the cost savings on the technology purchase would then be applied to other aspects of the business. Yet, trying to save on the upfront cost of technology can have disastrous results in the medium and long-term.
For instance, if an organization buys servers with lower capacity than is required, the excessive load on them will increase the frequency of outages. Whatever savings were made from buying an undersized server will be dwarfed by the lost opportunity and employee idle time caused by regular outages.
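A simple sizing calculation makes the trade-off concrete. The sketch below assumes a common rule of thumb: provision enough servers that peak load stays at or below a target utilization (70% here), leaving headroom for spikes. The traffic figures are made up for illustration.

```python
import math

def servers_needed(peak_requests_per_sec, capacity_per_server,
                   target_utilization=0.7):
    """Servers required so peak load stays at/below target utilization."""
    effective_capacity = capacity_per_server * target_utilization
    return math.ceil(peak_requests_per_sec / effective_capacity)

# 900 requests/sec at peak, 200 requests/sec per server:
# 5 servers could handle it flat out, but with 70% headroom we need 7.
print(servers_needed(900, 200))
```

Buying five servers instead of seven saves money on day one, but it means running at full tilt with no headroom, which is exactly the overload scenario described above.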
5. Real-time Automated Monitoring
Usually, if you are keen enough, you can pick out the early warning signs of an impending system outage. Applications frequently maxing out server resources, file servers approaching their storage limit or routine operations taking longer to complete are just some examples of a technology environment headed for trouble.
For small organizations, it may be possible to assign one person to manually check key system indicators regularly. However, in the increasingly complex and interconnected technology ecosystem of medium to large organizations, manual monitoring is slow and impractical.
Automated tools such as Stackify application monitoring are more efficient and effective than manual review. They’re also scalable, meaning they work well in both small and large enterprise environments.
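Under the hood, the simplest form of automated monitoring is a threshold check over sampled metrics. The toy example below is not Stackify's API; the metric names and thresholds are illustrative assumptions chosen to mirror the warning signs mentioned above.

```python
# Toy threshold-based monitoring check. Metric names and limits are
# illustrative assumptions, not any vendor's defaults.
THRESHOLDS = {"cpu_pct": 85, "disk_used_pct": 90, "response_ms": 500}

def check_metrics(sample):
    """Return a warning for each metric that crosses its threshold."""
    return [f"{name} at {sample[name]} exceeds limit {limit}"
            for name, limit in THRESHOLDS.items()
            if sample.get(name, 0) > limit]

sample = {"cpu_pct": 92, "disk_used_pct": 40, "response_ms": 710}
for warning in check_metrics(sample):
    print(warning)
```

Real monitoring platforms add collection agents, alert routing and historical baselines on top of this, but the principle is the same: catch the trend before it becomes the outage.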
6. Avoid Procedural Single Points of Failure
In small, medium and even some large organizations, it is not unusual for there to be a procedural single point of failure. A business may install a robust system with very high availability, but fail to align procedures with the system.
An example of a single point of failure is where only one IT person knows the administrative passwords to critical routers and servers or knows where physical backups are located. Except for a one-person business, under no circumstances should resolution of system downtime be dependent on one person who holds all requisite information. This can severely curtail the speed of recovery if the person in question is unreachable.
7. Update System Documentation
Updating system documentation, including the policy and procedures that govern it, is a simple but effective way of reducing downtime. Develop a routine to review and update all documentation at least once a year, but preferably every quarter. Frequent review means you need to make far fewer changes than you would if you were to wait, say, two years to do so.
System documents should be within easy reach in the event of an outage as this will make it easier to address issues quickly. Updating documentation may be fairly simple, but is one of the hardest things to inculcate in employees. Create incentives and reward mechanisms that will encourage staff to care about keeping policy, procedures and system documents current.
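One way to support that review routine is to automate the nagging. The sketch below walks a documentation folder and flags files untouched for longer than the review interval; the directory path and the 90-day cutoff are illustrative assumptions.

```python
# Sketch of a documentation staleness check: flags files that have not
# been modified within the review interval. Path and cutoff are
# illustrative assumptions.
import os
import time

def stale_docs(root, max_age_days=90):
    """Return paths under `root` not modified in `max_age_days` days."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)

# Example (hypothetical docs location):
# for path in stale_docs("/srv/docs"):
#     print("needs review:", path)
```

Wiring a script like this into a scheduled job and posting the results to the team creates exactly the kind of routine, visible prompt that makes documentation upkeep stick.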
Following these seven tips will help keep your systems productive, reliable and resilient.