You want to listen to music, but your Spotify app isn’t working. You want to shop, but Amazon is down. Aren’t these some of the most frustrating experiences as a customer? A single app outage can overwhelm your entire digital marketing effort and bring the whole IT department to a standstill, and there are bound to be repercussions. But what exactly causes these outages and downtimes? Though we may never know the exact reason for the recent Facebook, Instagram and WhatsApp outages, we can outline a few of the leading causes of massive outages.
1. Runtime/Code Errors: Code errors account for almost 45% of total app outages in general. Runtime errors happen when developers overlook edge cases or implement faulty logic.
In 2008, Terminal 5 of the UK’s Heathrow airport was disrupted for ten days straight because of the mayhem caused by its new baggage handling system. Reportedly, even after testing the system with 12,000 pieces of luggage, the engineers had overlooked one minor detail: the system became confused when a passenger manually removed his bags from the conveyor belt. As a result, about 42,000 bags were delayed and over 500 flights were cancelled.
Another interesting example is the Year 2000 glitch (the Y2K glitch), also known as the Millennium bug, which cost the world billions of dollars.
Until the 1990s, many computer programs abbreviated four-digit years to two digits. So, for example, computers would use the number 78 to represent the year 1978, 63 for the year 1963 and so on. This worked without any hiccups up to the 31st of December, 1999. But once the clock struck midnight on the 1st of January, 2000, any program that had not been fixed would read ’00’ as the year 1900 instead of 2000, putting date calculations at banks, insurance companies and many other businesses at risk. An estimated $300 billion was spent on upgrading computers and application programs to be Y2K-compliant.
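To make the two-digit year problem concrete, here is a minimal Python sketch. The function names and the pivot value of 50 are assumptions chosen for illustration, and "windowing" is only one of several fixes that were used; this is not a reconstruction of any particular system.

```python
def naive_expand(two_digit_year: int) -> int:
    """Pre-Y2K assumption: every two-digit year belongs to the 1900s."""
    return 1900 + two_digit_year


def windowed_expand(two_digit_year: int, pivot: int = 50) -> int:
    """Windowing fix (pivot is an assumed value): years below the pivot
    are read as 20xx, the rest as 19xx."""
    century = 2000 if two_digit_year < pivot else 1900
    return century + two_digit_year


def account_age(opened_yy: int, today_yy: int, expand) -> int:
    """Age of an account in years, given two-digit years and an expander."""
    return expand(today_yy) - expand(opened_yy)


# Account opened in '78, checked in '99: both approaches agree.
print(account_age(78, 99, naive_expand))     # 21
# Checked in '00: the naive expansion reads 1900 and the age goes negative.
print(account_age(78, 0, naive_expand))      # -78
# With windowing, '00' is read as 2000.
print(account_age(78, 0, windowed_expand))   # 22
```

A negative account age is the kind of value that breaks interest calculations, expiry checks and sorting, which is why the fix had to be rolled out before the date change rather than after it.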
2. Noisy Neighbours: When two instances run side by side on shared infrastructure and one of them hogs the network and CPU, the other app may crash repeatedly. Spambots and bad actors can also act as noisy neighbours.
3. Network: DDoS attacks and local area network crashes can cause extended downtime. In 2020, Amazon’s AWS Shield service stopped a 2.3 Tbps (terabits per second) attack, one of the largest DDoS attacks ever recorded. According to ZDNet, the attack was carried out using hijacked CLDAP web servers (CLDAP: Connection-less Lightweight Directory Access Protocol, a protocol for connecting to, searching, and modifying shared directories on the internet). The previous record for the largest DDoS attack was a 1.7 Tbps attack on a NETSCOUT customer.
4. Infrastructure Issues: Electricity outages in the data centre or a malfunction in the central server system can also lead to sudden, unplanned downtime. Many AWS customers, including Slack and Twilio, experienced multiple hours of downtime in March 2018. The cause was a major power outage in the AWS US-East region that affected servers at its data center in Ashburn, Virginia. It reportedly also impacted the company’s own voice assistant service, Alexa.
5. Natural and Man-Made Disasters: Extreme weather such as cyclones, earthquakes and floods, and man-made disasters such as civil wars and geopolitical crises, lead to power disruptions, hardware failure, and infrastructural damage, all of which increase downtime. Take the weather-induced Microsoft Azure outage of June 2018. Ireland’s temperature had risen to a pleasant 18°C that month, causing water shortages for its residents. This left Microsoft without enough cooling capacity to operate its Dublin data center. As a result, its data center services were down for 9 hours, severely affecting Azure and Office 365 customers in Northern Europe.
If not taken care of, these outages can lead to huge losses for businesses. Companies such as Apple and Facebook have lost as much as $25 to $60 million during 12 to 14 hour outages in the past. But these are the big companies. What about startups and small and medium enterprises (SMEs)? For a small startup or SME, a 12 to 14 hour outage may be enough to make the company go under.
To keep these outages from ruining your brand’s reputation and your credibility as a service provider, prepare well beforehand:
- Avoid single points of failure: When a non-redundant part of a system malfunctions, the whole system stops working. Such a part is called a single point of failure (SPOF), and it can be avoided by introducing redundant components and replicating critical parts of the system. For example, consider a data center where a single server runs a single application. The underlying server hardware is a single point of failure for the application’s availability. If the server failed, the application would become unstable or crash entirely, preventing users from accessing it and possibly even causing some data loss. In this situation, server clustering technology would allow a duplicate copy of the application to run on a second physical server. If the first server failed, the second would take over to preserve access to the application and remove the SPOF.
- Implement robust testing: Robust testing processes that catch bugs early can prevent a large share of app outages and downtime. Tools and techniques such as JUnit, pytest, API testing, Selenium-based end-to-end testing and stress testing can be used.
- Test at realistic scale: Simulating an environment that matches production in scale and use case, and then testing the system’s functionality and robustness in it, can prevent software failures. For example, Netflix operates on the assumption that its critical software can fail at any point, so it simulates such failures and runs tests to be ready for real downtime.
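The failover idea from the first point in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `primary`, `secondary` and `call_with_failover` are hypothetical names, the "servers" are plain functions, and a real cluster would rely on health checks and a load balancer rather than a simple loop.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # This replica is unavailable: remember why and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def primary() -> str:
    # Hypothetical first server, simulated as being down.
    raise ConnectionError("primary server is down")


def secondary() -> str:
    # Hypothetical duplicate running on a second physical server.
    return "response from secondary"


print(call_with_failover([primary, secondary]))  # response from secondary
```

With only `primary` in the list, the same call raises `AllReplicasFailed`, which is exactly the single point of failure that the second server removes.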
Unless your name is Google, knowing and preparing for every case beforehand is unfeasible. Staying on constant lookout for anomalies and supervising system behaviour helps you identify and act on issues proactively.
- Implement logging: Assuming that there are many unknown unknowns, you want to stay on top of how your application performs. Logging creates an ongoing record of application events that can be used to review what happened within a system. Log data can help DevOps teams find issues by identifying which changes led to errors being reported. Various tools can help with log management, for example Logstash, Logmatic, Splunk and Graylog.
- Implement monitoring: Monitoring is an umbrella term that can include many facets of system evaluation, but here we’re referring to application performance monitoring (APM). APM is the process of using software to collect, aggregate and analyse metrics such as availability, response time, memory usage, bandwidth and CPU time, in order to evaluate how the system is being used. APM tools alert IT teams to operating anomalies across applications and cloud services. Various tools can help with monitoring, for example Nagios, CloudWatch and Datadog.
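As a small illustration of the two points above, the sketch below uses Python’s standard `logging` module to record events and to flag slow calls. The logger name `checkout`, the helper `timed_call` and the 0.5 second threshold are assumptions made for this example; a real setup would ship these records to a log management or APM tool instead of printing them.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")  # hypothetical service name

SLOW_THRESHOLD_SECONDS = 0.5  # assumed alerting threshold


def timed_call(name, func, *args, **kwargs):
    """Run func, log how long it took, and warn when it is slow."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        # log.exception records the message together with the traceback.
        log.exception("%s failed", name)
        raise
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD_SECONDS:
        log.warning("%s took %.3fs (threshold %.1fs)",
                    name, elapsed, SLOW_THRESHOLD_SECONDS)
    else:
        log.info("%s took %.3fs", name, elapsed)
    return result


# Usage: wrap any operation you want a record of.
total = timed_call("add_to_cart", lambda a, b: a + b, 2, 3)
```

The warning branch is the simplest form of monitoring: a metric (response time) compared against a threshold. Dedicated tools aggregate the same signal across many servers and time windows.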
Even after all this effort and after following the best practices, your system may still fail, often because of factors that are outside your control. For such cases, it is better to have a reaction plan ready so that the systems can be up and running again as soon as possible, with minimal loss.
- Replication: Data replication is storing the same data in multiple locations to improve data availability and accessibility. This process is widely used to prepare for disaster recovery, which ensures that a proper backup of the system exists in cases of emergency where data might be compromised.
- Backup: Today, backups are considered to be the second line of defence against data loss. If infrastructure resiliency and fault tolerance fail, there should be proper backups in place ready to be rolled out.
- Rollbacks: A software rollback returns a system to a point before the problem occurred. For a database, a rollback restores a previous operational state. To maintain database integrity, rollbacks are performed when system operations become erroneous. In worst-case scenarios, rollback strategies are used for recovery after a crash: a clean copy of the database is reinstated, bringing it back to a consistent and stable state.
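A database rollback can be demonstrated with Python’s built-in `sqlite3` module. This is a minimal sketch under assumptions: the `accounts` table, the `transfer` helper and the balances are invented for the example, and an in-memory database stands in for a production one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; undo everything if any step fails."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


# The debit would make alice's balance negative, so the CHECK constraint
# fails and the credit already applied to bob is rolled back.
print(transfer(conn, "alice", "bob", 500))  # False
print(balances(conn))                       # {'alice': 100, 'bob': 50}
```

The point of the example is that the first update had already succeeded when the second one failed; without the rollback, the database would have been left in an inconsistent state.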
All in all, this is a list of a few things to keep in mind when thinking about app and server outages in general. If you have more questions about your application infrastructure, we would be happy to get on a call with you. You can reach out to us at +918002985878, +91439857338 or mail us at email@example.com for more information.