April 12th: Scheduled Maintenance Incident Report

To all the Syncro partners and RepairShopr customers around the world who depend on us, I am sorry for the inconvenience caused by our maintenance on April 12th. I would like to provide an update and our key learnings.

The objective of the maintenance was to upgrade our core infrastructure by migrating our platform to a new cloud provider. This new vendor would support the growth of our customer base, including future needs for scalability, security, and reliability. The engineering team had spent the last several months planning the migration and testing it in non-production environments, and believed we had addressed all issues. Here is a timeline of the events, beginning at 8 pm PST on April 12th:

  • We started our scheduled maintenance at 8 pm PST as planned
  • We encountered several unforeseen challenges during the maintenance, which caused us to extend the window from 1 hour to 3 hours (from 8 pm PST to 11 pm PST)
  • Although we resolved several issues, the platform in the new environment remained unstable after the maintenance period ended
  • From 11 pm to 6:30 am PST, the platform was available but produced intermittent errors
  • At 6:30 am PST, we successfully rolled back to our previous environment, restoring stability to the platform

Here are our lessons learned:

  1. Prevent data availability issues: Some customers were able to access the platform and enter data after the maintenance window while we were working to stabilize it. That data became unavailable when we rolled back to the database snapshot we had taken before the maintenance window began. I deeply regret that data entered between 11 pm and 6:30 am PST was not available to customers immediately upon rollback, and that customers had to contact Support to retrieve any data created during that time. Causing increased workload for our customers is unacceptable. We have been, and still are, able to retrieve any information that you created during this time, so please contact Support if you continue to be affected. In the future, we will have improved maintenance procedures to avoid this situation.
  2. Scheduled Maintenance Windows: We have a global customer base, yet we chose a scheduled maintenance window on a weekday. The extended window was particularly disruptive to our EMEA and APAC customers. We know that our customers and partners rely on our system to run their businesses, and we take this responsibility very seriously. We will no longer schedule planned maintenance on weekdays, and we will aim to reduce the length of the maintenance.
  3. Real Time Communication: When we encountered complications that extended the maintenance period, and during the unstable period that followed, we failed to keep you updated. Many of you commented that you were unable to sign in to our Syncro Community forum during the maintenance to receive updates. We will make sure that we have a better communication plan (including tools and personnel) for future maintenance for both RepairShopr and Syncro customers, so that you are always in the loop.

This infrastructure work will not prevent any upcoming feature enhancements (OS patching, for example) from being released as planned. The team is conducting a detailed analysis to develop a plan for successfully completing this infrastructure upgrade in the future. We will keep you updated.

I truly appreciate your support and loyal partnership.

Sincerely,

Rajesh Agarwal

VP of Engineering
