Ongoing AWS EBS (File System) Partial Outage in EU-CENTRAL-1 Region

Incident Report for Alumio

Postmortem

On August 21, 2024 (CEST), we experienced a significant incident affecting AWS Elastic Block Store (EBS) in the EU-CENTRAL-1 region, specifically in Availability Zone euc1-az2. This resulted in degraded performance and unavailability for a subset of Alumio EC2 environments that used the affected EBS volumes. AWS reported the issue as a partial outage impacting the file storage system.

Timeline of Events (AWS updates and our own monitoring):

00:05 CEST (August 21, 2024): AWS reported degraded performance affecting a subset of EBS volumes in euc1-az2.
00:37 CEST: AWS continued investigation, providing an update noting degraded performance affecting a small number of volumes.
01:32 CEST: AWS identified the root cause and began active mitigation efforts.
07:15 CEST: The EBS status escalated from degraded to critical as our monitoring detected downtime for several environments.
07:57 CEST: AWS reported slower progress than anticipated, with full recovery expected to take several more hours.
11:29 CEST: We observed that the affected EBS volumes had returned to OK status and performed a background health check to ensure that no data corruption had taken place.

Resolution: By August 21, 2024, the file storage system’s status was confirmed as OK. Initial checks indicated that all data was intact and accessible, with no signs of corruption. A comprehensive background health check confirmed that no data loss or corruption had occurred. Servers and client environments were successfully restarted, returning to normal load and operation.

Impact:

Data Access: Customers experienced reduced performance and temporary unavailability of environments.
Data Integrity: No data loss or corruption was detected.
Service Availability: Alumio environments using EBS volumes in euc1-az2 were affected and became unavailable.

Lessons Learned:

Monitoring and Communication: While timely updates were provided, ensuring even more frequent and transparent communication during critical periods can further enhance customer trust. We are planning to make the delivery of downtime notifications more accessible to our customers. We will be using https://status.alumio.com to display our service availability, and also plan to notify users pre-emptively via email in the future.
Workarounds and Contingency Planning: We will review and refine our contingency plans to minimize impact during similar events in the future.

We will continue to monitor AWS services and our own systems closely to prevent and mitigate any future issues. Additionally, we will conduct a thorough review of our incident response and communication strategies to better manage similar situations going forward.

We appreciate your patience and understanding throughout this incident. Should you have any further questions or require additional information, please do not hesitate to contact our InfoSec Team at infosec@alumio.com

Thank you for your continued support.

Posted Aug 21, 2024 - 13:13 CEST

Resolved

We are pleased to inform you that the issue with the AWS EBS has been fully resolved. All servers and client environments have returned to normal load and are operating as expected.

Our final checks confirm that data integrity remains intact and no further issues are present. We will continue to monitor the systems to ensure ongoing stability, but we anticipate no further disruptions.

If you have any additional questions or need further assistance, please feel free to reach out to our support team at support@alumio.com

Posted Aug 21, 2024 - 12:07 CEST

Monitoring

After closely monitoring the situation, we have received confirmation that AWS has resolved the issues with the file storage system, which is now back to an OK status. Initial results from the server indicate that all data is accessible and intact, with no signs of corruption. A comprehensive background health check has been completed, confirming that there has been no data loss or corruption.

Affected environments are currently being restarted and will be accessible to all customers soon. We are continuing to closely monitor the situation to ensure ongoing stability and uptime. We appreciate your patience and understanding as we work to fully restore services.

Posted Aug 21, 2024 - 11:29 CEST

Update

Last update from AWS:
[9:58 AM CEST] We continue to work toward recovery. We are seeing some improvements internally, though they may not yet be visible externally. As we recover, some volumes may experience temporary degraded performance. This is expected as part of our mitigation efforts. We continue to work toward full recovery and will share updates as we have additional information to share or by August 21 at 11:30 AM CEST.

Posted Aug 21, 2024 - 10:03 CEST

Update

Current Impact:

Some customers may be unable to access their environments.
AWS engineering teams are working to resolve the issue, but full recovery is expected to take several more hours.

Posted Aug 21, 2024 - 09:18 CEST

Identified

We are currently experiencing an issue affecting a subset of environments that are hosted in AWS' EU-CENTRAL-1 region (Availability Zone euc1-az2). This is due to degraded performance reported by AWS for a small number of Elastic Block Store (EBS) volumes in that Availability Zone.

Posted Aug 21, 2024 - 09:17 CEST

This incident affected: Alumio Cloud Network (Alumio Cloud Network [EU]).