On August 20, 2024, we experienced a significant incident affecting AWS Elastic Block Store (EBS) in the eu-central-1 region, specifically in Availability Zone euc1-az2. This resulted in degraded performance and unavailability for a subset of Alumio EC2 environments that used the affected EBS volumes. AWS reported the issue as a partial outage impacting the file storage system.
Timeline of Events (according to AWS status updates and our own monitoring):
- 00:05 CEST (August 21, 2024): AWS reported degraded performance affecting a subset of EBS volumes in euc1-az2.
- 00:37 CEST: AWS continued its investigation and posted an update noting degraded performance affecting a small number of volumes.
- 01:32 CEST: AWS identified the root cause and began active mitigation efforts.
- 07:15 CEST: The EBS status escalated from degraded to critical as our monitoring detected downtime for several environments.
- 07:57 CEST: AWS reported slower progress than anticipated, with full recovery expected to take several more hours.
- 11:29 CEST: We noticed that the affected EBS partitions had returned to OK status. We performed a background health check to ensure that no data corruption had taken place.
Resolution: By August 21, 2024, the file storage system’s status was confirmed as OK. Initial checks indicated that all data was intact and accessible, with no signs of corruption. A comprehensive background health check confirmed that no data loss or corruption had occurred. Servers and client environments were successfully restarted, returning to normal load and operation.
Impact:
- Data Access: Customers experienced reduced performance and temporary unavailability of environments.
- Data Integrity: No data loss or corruption was detected.
- Service Availability: Alumio environments that were using EBS volumes in euc1-az2 were affected and became unavailable.
Lessons Learned:
- Monitoring and Communication: While timely updates were provided, more frequent and transparent communication during critical periods can further enhance customer trust. We plan to make downtime notifications more accessible to our customers. We will use https://status.alumio.com to display our service availability, and we also plan to notify users proactively via email in the future.
- Workarounds and Contingency Planning: We will review and refine our contingency plans to minimize impact during similar events in the future.
We will continue to monitor AWS services and our own systems closely to prevent and mitigate any future issues. Additionally, we will conduct a thorough review of our incident response and communication strategies to better manage similar situations going forward.
We appreciate your patience and understanding throughout this incident. Should you have any further questions or require additional information, please do not hesitate to contact our InfoSec Team at infosec@alumio.com.
Thank you for your continued support.