Postmortem: What Killed My Web Server

Lawal Babatunde · Jun 14, 2023

Postmortems: the chronicles of technical triumphs and occasional tribulations. Today, dear readers, we delve into the tumultuous tale of a web stack outage that sent my budding web application into a momentary hibernation.

Gather ‘round as we recount the saga of code deployments, misconfigurations, and a feisty database server that decided it had simply had enough disk space for one lifetime (chuckles).

From the depths of downtime despair to the triumphant restoration of digital harmony, this postmortem promises to be a thrilling blend of tech wizardry and strategic foresight.

So grab your debugging capes and join us on this misadventure through the highs and lows of web stack troubleshooting.

Issue Summary:

Duration: June 12, 2023, 10:30 AM to 1:45 PM (GMT+1), approximately 3 hours and 15 minutes

Impact: The main web application was inaccessible, resulting in a service outage that affected approximately 75% of users, causing lost revenue and a negative user experience.

Timeline:

- 10:30 AM: The issue was detected when a monitoring alert indicated a sudden spike in server response time (a check of the kind sketched after this timeline).

- 10:35 AM: The engineering team received the alert and started investigating the issue.

- 10:45 AM: The initial assumption was that the database might be experiencing high load due to a recent code deployment.

- 11:00 AM: The team began analyzing the database logs and checking for any recent changes.

- 11:15 AM: It was discovered that the database server was running out of disk space, which could be causing the performance degradation.

- 11:30 AM: The team decided to scale up the database server to resolve the disk space issue.

- 12:00 PM: The database scaling process encountered unexpected errors, further delaying the resolution.

- 12:30 PM: As the issue persisted, the incident was escalated to the infrastructure team for assistance.

- 1:00 PM: The infrastructure team identified a misconfiguration in the database scaling process, which was causing the errors.

- 1:30 PM: The misconfiguration was corrected, and the database scaling was successfully completed.

- 1:45 PM: The web application was fully restored, and users regained access to the service.
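
For context on that first alert: the 10:30 AM signal was a spike in response time. As a rough illustration of the kind of check involved (this is not the actual monitoring stack; the URL and threshold below are placeholders), a minimal probe in Python could look like this:

```python
import time
import urllib.request

# Placeholder values: substitute the real health-check URL and an
# appropriate latency threshold for your application.
HEALTH_URL = "https://example.com/"
MAX_RESPONSE_SECONDS = 2.0


def probe(url: str = HEALTH_URL) -> float:
    """Fetch the URL and return the elapsed response time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = probe()
    if elapsed > MAX_RESPONSE_SECONDS:
        # A real monitoring system would page the on-call engineer here.
        print(f"ALERT: response took {elapsed:.2f}s (threshold {MAX_RESPONSE_SECONDS}s)")
    else:
        print(f"OK: response took {elapsed:.2f}s")
```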

Root Cause and Resolution:

Root Cause: The main cause of the web stack outage was the database server running out of disk space. This led to performance degradation, resulting in the web application becoming unresponsive.

Resolution: To fix the issue, the database server was scaled up to provide additional disk space. However, during the scaling process, a misconfiguration was encountered, causing errors and prolonging the downtime. The misconfiguration was identified and corrected by the infrastructure team, allowing the scaling process to be completed successfully. Once the database server had sufficient disk space, the web application was able to function properly again.
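
To make the root cause concrete: the telltale symptom is a data-directory mount sitting at or near 100% usage. A quick check of that, sketched here with Python's standard library (the path is an assumption; point it at the database server's actual data directory), is the sort of thing that surfaces this problem:

```python
import shutil

# "/" is used so the snippet runs anywhere; in practice, point this at the
# mount holding the database data directory (e.g. /var/lib/mysql) if separate.
MOUNT = "/"

usage = shutil.disk_usage(MOUNT)
percent_used = usage.used / usage.total * 100
print(f"{MOUNT}: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB "
      f"({percent_used:.1f}% used)")
```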

Corrective and Preventative Measures:

To prevent similar incidents in the future, the following measures will be implemented:

- Implement automated monitoring of disk space utilization on critical servers so that alerts fire well before space runs out (see the sketch after this list).

- Review and enhance the database scaling process to ensure correct configurations and error handling.

- Conduct regular capacity planning and performance testing to anticipate and mitigate potential resource limitations.

- Improve communication and coordination between teams to expedite issue resolution and minimize downtime.

- Establish a comprehensive incident response plan to guide the team in a systematic and efficient manner during future outages.
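
As an illustration of the first measure above, a small threshold check like the following could run on a schedule (for example, from cron) on each critical server. The mount points and the 80% threshold are assumptions for the sketch, not the configuration that was actually deployed:

```python
import shutil
import sys

# Assumed values for illustration; tune the mounts and threshold per server.
# Add the database data mount (e.g. "/var/lib/mysql") where applicable.
MOUNTS_TO_WATCH = ["/"]
ALERT_THRESHOLD_PERCENT = 80.0


def is_over_threshold(mount: str) -> bool:
    """Print the usage of a mount and return True if it exceeds the threshold."""
    usage = shutil.disk_usage(mount)
    percent_used = usage.used / usage.total * 100
    over = percent_used >= ALERT_THRESHOLD_PERCENT
    print(f"{'ALERT' if over else 'OK'}: {mount} is {percent_used:.1f}% full")
    return over


if __name__ == "__main__":
    # Exit non-zero when any watched mount is over the threshold so a
    # scheduler or monitoring wrapper can turn the failure into an alert.
    sys.exit(1 if any([is_over_threshold(m) for m in MOUNTS_TO_WATCH]) else 0)
```

Wiring the non-zero exit status into an existing alerting channel is what turns a script like this into an early warning rather than an afterthought.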

Tasks to Address the Issue:

1. Enhance the monitoring system to include disk space utilization alerts for critical servers.

2. Conduct a thorough review of the database scaling process, addressing misconfigurations and improving error handling.

3. Perform regular capacity planning exercises to identify potential resource limitations and take proactive measures.

4. Implement improved communication channels and incident escalation procedures to streamline issue resolution.

5. Develop and document an incident response plan, including clear roles, responsibilities, and steps to follow during outages.

Conclusion:

The web server outage on June 12, 2023, was primarily caused by the database server running out of disk space. The incident highlighted the importance of robust monitoring, efficient troubleshooting, and effective coordination between teams. By implementing the corrective and preventative measures outlined above, we aim to enhance the stability and reliability of our web application, ensuring a seamless experience for our users in the future.

Please Note: This article is part of a project undertaken as a student of the ALX Software Engineering programme.
