Root Cause Analysis re: Miro Service Degradation (Dec 4-5, 2023)

Issue Severity: P1 – Critical
Impacted Customers: All Miro users
Primary Impact Start: 14:00 UTC, 4 Dec 2023
Primary Impact End: 20:43 UTC, 4 Dec 2023
Secondary Impact Start: 14:02 UTC, 5 Dec 2023
Secondary Impact End: 14:18 UTC, 5 Dec 2023

Executive Summary

Starting at approximately 14:00 UTC on Dec 4, Miro encountered a degradation across core services, which may have appeared to affected users as either elevated latency or an inability to access the core product.

After multiple unsuccessful attempts to restore Miro’s service by scaling server and database infrastructure, and with an observed increase in legitimate customer traffic hindering engineering teams’ ability to remediate, Response leadership called an emergency maintenance window starting at 19:00 UTC.

During the emergency maintenance window, the Miro platform was unavailable to all customers and systems were restored to a prior stable state. No customer data was lost during this restoration. The emergency maintenance window was closed and service was restored at 20:43 UTC.

Miro’s service availability was monitored through Dec 5, and investigation into the root cause continued into Dec 6, when it was discovered that a change to the Miro client, released on Nov 30, introduced a bug that caused customer browsers to reconnect to Miro servers every 10 minutes. The additional load caused by this, in conjunction with a set of performance-intensive database queries, led to the degradation of Miro’s service.

Overview of Findings

  • A combination of slower Miro API calls and elevated traffic levels led to Miro’s systems failing under load. The client bug introduced on Nov 30 was responsible for the bulk of this elevated traffic and can therefore be considered the primary trigger and root cause for this degradation.
  • Elevated traffic levels additionally hindered remediation efforts as Miro’s backend deployment systems failed under load, preventing release or roll-back of Miro systems. This led to the decision to call a maintenance window and take the Miro platform offline to enable roll-back to a previous, stable state.
  • Because the client bug introduced on Nov 30 only manifested after a period of 10 minutes, it went undetected by Miro’s pre-production testing and benchmarking systems. As a corrective action, Miro is investigating how to better identify such issues before they reach production systems and cause degradation.
  • While the client bug was released on Nov 30, the default mechanism for propagating client releases can take multiple days. This, combined with low traffic over the weekend of Dec 2 and Dec 3, meant that the additional load only reached a critical volume on Dec 4, when users returned from the weekend and automatically received the new client version, in particular as US customers came online.
  • A hot-fix released on Dec 5 addressed a heavy database query involved in new team user permission calculation, allowing Miro to maintain the stable product state that had been established during the emergency maintenance window even as traffic volumes increased during the day.
  • This hot-fix temporarily removed the ability for Team Admins to invite users to their organization. Full functionality was restored on Dec 6.
  • A separate hot-fix released on Dec 5 addressed a client bug, released on Dec 4, which introduced additional load to a single Miro endpoint, /profile.
  • During the investigation on Dec 5 and Dec 6, a number of additional opportunities to improve server performance were identified. An ongoing workstream has been established to implement these improvements and to identify further opportunities.
  • There was a 41-minute delay between the start of the degradation within Miro and customer notification via the Miro status page.
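
The durations quoted in this report can be cross-checked against the timestamps in the timeline. The following is a minimal Python sketch of that arithmetic; the helper and variable names are illustrative only and do not correspond to any Miro tooling.

```python
from datetime import datetime, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in December 2023 (all report times are UTC)."""
    return datetime(2023, 12, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the report.
primary_start = utc(4, 14, 0)
status_page_notice = utc(4, 14, 41)
primary_end = utc(4, 20, 43)
secondary_start = utc(5, 14, 2)
secondary_end = utc(5, 14, 18)

notification_delay = minutes_between(primary_start, status_page_notice)
primary_duration = minutes_between(primary_start, primary_end)
secondary_duration = minutes_between(secondary_start, secondary_end)

print(notification_delay)   # 41
print(primary_duration)     # 403, i.e. 6 h 43 min
print(secondary_duration)   # 16
```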

Corrective Actions

Completed

  • 4 Dec 2023 – Restoration of Miro to previous, stable state during emergency maintenance window.
  • 5 Dec 2023 – Release of hot-fix to remove heavy database query identified during investigation.
  • 5 Dec 2023 – Release of hot-fix to remove additional load introduced to /profile endpoint.
  • 6 Dec 2023 – Reintroduction of functionality enabling Team Admins to invite users to organizations.
  • 6 Dec 2023 – Release of hot-fix to remove additional client reconnects (degradation root cause).

Ongoing / Upcoming

  • Investigation into updating Miro’s pre-production testing and benchmarking systems to better catch client connection issues before they are rolled out.
  • Workstream to improve backend deployment systems, aimed at speeding up application roll-out and making it more resilient to traffic.
  • Workstream to implement identified improvements and further identify opportunities to improve server performance.
  • Workstream to improve customer communications and status updates during service events.

Additional Notes

We are deeply committed to providing up-to-date information to all our customers. We acknowledge that there are improvements to be made in communicating in a timely manner and being proactively transparent in these situations, particularly during the early stages of an event. As a result, we are updating our processes to ensure that our communications are stronger and your experience with Miro is improved going forward. We look forward to sharing details of this updated process with you in the coming weeks.

We understand that your organization depends on Miro and we take that responsibility very seriously. We apologize for the extended service degradation and any inconvenience it may have caused.

Timeline of Events

All times in UTC (Coordinated Universal Time)

14:00, 4 Dec 2023 Primary Impact Start. Elevated latency and error rates detected by Miro monitoring and alerting systems.
14:03, 4 Dec 2023 First reports of issues opening dashboard and boards received from Miro employees and customers.
14:10, 4 Dec 2023 An emergency response team is stood up.
14:41, 4 Dec 2023 Customers are first notified of degradation via the Miro status page.
14:10 – 18:50, 4 Dec 2023 Miro engineering teams are fully engaged in the investigation. Attempts at stabilizing systems, including scaling API server and database instances, are unsuccessful. Elevated levels of traffic are observed.
18:50, 4 Dec 2023 It is confirmed that elevated levels of traffic are impacting Miro’s deployment and scaling mechanisms and preventing Miro engineering teams from restoring system stability by rolling back to a previous, stable state. Response leadership calls an emergency maintenance window, making Miro entirely unavailable to customers in order to enable restoration without traffic.
19:00, 4 Dec 2023 The emergency maintenance window begins and customer traffic is redirected to a static error page. Without traffic, Miro engineering teams are able to roll Miro applications back to a previous, stable state.
20:43, 4 Dec 2023 Primary Impact End. The emergency maintenance window is closed and customers are able to access Miro once more.
5 Dec 2023 Miro engineering teams monitor platform health throughout the day and continue to investigate potential root causes. Hot-fixes are released for two identified issues:

  • A Miro server release on Nov 28 introduced a set of particularly heavy database queries, which led to increasing latency of core Miro API calls.
  • A client change released on Dec 4 introduced a bug leading to increased load on one particular Miro endpoint, /profile.

Additionally, it is reported that a small number of users have been unable to access Miro since the end of the emergency maintenance window.
14:02, 5 Dec 2023 Secondary Impact Start. Elevated latency and error rates detected by Miro monitoring and alerting systems.
14:18, 5 Dec 2023 Secondary Impact End. An incorrectly applied configuration change, introduced to restore service to the small number of users unable to access Miro, is rolled back.
20:07, 5 Dec 2023 Following a full day of monitoring Miro system stability, with no repeat of the symptoms observed on Dec 4, the emergency response team stands down. A change freeze for everything but emergency hot-fixes is established while root cause investigation continues.
6 Dec 2023 Miro engineering teams continue to investigate potential root causes. Service is restored for the small number of users unable to access Miro since the end of the emergency maintenance window.
14:38, 6 Dec 2023 A major bug in Miro’s client code base, introduced on Nov 30, is identified. This bug caused customer browsers to reconnect to Miro servers every 10 minutes, leading to significantly increased traffic to core Miro APIs.
15:10, 6 Dec 2023 A hot-fix to remedy the browser reconnection bug is released. Immediate reduction in traffic volumes to Miro APIs is observed.
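
To give a rough sense of why a periodic reconnect adds sustained load, the sketch below estimates the extra connection attempts per second generated when every affected client reconnects once per 10-minute interval. It is an illustration under stated assumptions, not Miro's actual model: the client count is a hypothetical figure (the report does not state real numbers), reconnects are assumed to be spread evenly over time, and the function name is invented for this example.

```python
# From the report: affected browsers reconnected every 10 minutes.
RECONNECT_INTERVAL_SECONDS = 10 * 60


def extra_reconnects_per_second(
    affected_clients: int,
    interval_seconds: int = RECONNECT_INTERVAL_SECONDS,
) -> float:
    """Average additional reconnects per second.

    Assumes each affected client reconnects exactly once per interval and
    that reconnect times are evenly spread rather than synchronized.
    """
    if affected_clients < 0:
        raise ValueError("affected_clients must be non-negative")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return affected_clients / interval_seconds


# Hypothetical client count, for illustration only.
hypothetical_clients = 600_000
print(extra_reconnects_per_second(hypothetical_clients))  # 1000.0
```

Under these assumptions the extra load grows linearly with the number of clients running the affected version, which is consistent with the report's observation that load only became critical once the new client version had propagated and weekday traffic returned.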