LastPass users experienced slow performance or were unable to log in to LastPass to access their vault. LastPass engineers detected an exponential increase in connections to the LastPass backend database due to a software defect in the planned upgrade of the LastPass Chrome browser extension. An attempted rollback of the change caused an unintended cascading failure across the customer-facing portions of the LastPass cloud infrastructure due to exceptionally high loads.
Issue Start Time (UTC): 06/06/2024 15:17
Issue End Time (UTC): 06/07/2024 00:31
Total Duration: 9 hours 14 minutes
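The stated duration can be checked from the two timestamps. The sketch below is only an illustration, and it assumes the dates are written as MM/DD/YYYY (June 6 and June 7, 2024); the variable names are ours and not part of any LastPass tooling.

```python
from datetime import datetime

# Assumption: timestamps are in MM/DD/YYYY HH:MM form, both in UTC.
FMT = "%m/%d/%Y %H:%M"

start = datetime.strptime("06/06/2024 15:17", FMT)
end = datetime.strptime("06/07/2024 00:31", FMT)

# Subtracting two datetimes yields a timedelta covering the outage window.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")
```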
As part of work being done to refactor the LastPass browser extension for Chrome to align with Google’s new Manifest V3 (MV3) requirements, we have staggered the delivery of updates to customers. The move to MV3 brings many improvements and wholesale architectural changes from Google that are intended to improve the privacy, security, and performance of extensions.
During the release of this new Chrome browser extension, engineers noticed an exponential increase in connections to the LastPass platform across all tiers, and a subsequent increase in error rates for various APIs.
We ruled out DoS/DDoS or other security issues unrelated to the update, determined that the timing of the increase correlated closely with the new extension deployment, and decided to “revert” the extension to prevent any further potential impact.
However, the rollback had an unexpected side effect: the change resulted in even more external requests from browser extensions.
As such, many thousands of Google Chrome web browsers began downloading the new extension update in a staggered manner. This is not an uncommon occurrence at our scale.
However, as these extensions began to reinitialize and attempted to authenticate, our monitoring systems crossed thresholds indicating that we were experiencing abnormal system scalability issues distributed across our Cloud availability zones as extensions “phoned home.”
During the period of instability, we saw this behavior manifest as slowness reported by customers, and our observability dashboards began indicating threshold increases, which ultimately spiked to 60X nominal load. This additional load came from roughly 414,000 Chrome browsers attempting to update (potentially multiple times each), resulting in performance degradation and service unavailability.
LastPass engineers implemented measures to alleviate the load on the system during the issue. It was determined that the most efficient and timely service restoration path was to throttle the pace of inbound requests from clients while also ensuring normalized extension versions and traffic to limit any additional waves of synchronization attempts impacting the site.
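To illustrate what throttling the pace of inbound requests can look like, here is a minimal token-bucket sketch. It is a generic example of the technique and not a description of the mechanism LastPass actually used; the class name TokenBucket and the parameters rate, capacity, and clock are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical limiter: about `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable time source (eases testing)
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A front-end tier could call allow() once per incoming request and reject or delay the request when it returns False, so that a wave of clients reconnecting at once is admitted at a bounded pace rather than all at the same moment.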
Specifically, we completed the following:
We continue to execute additional measures meant to more efficiently detect and protect against these sorts of scenarios: