Hey folks - I’d like to provide context around the incident earlier this morning. As you know, our team has been working on migrating from Heroku to AWS. I covered a lot of the reasons for this in our most recent “What’s New in Syncro” webinar, which you can find here. We have already migrated our Web and Application servers from Heroku to AWS, which is a huge milestone. The next phase is to move the database and some of the remaining microservices, like Clockwork (which is tied to some other top issues we want to resolve around scheduled jobs for scripts, SLAs, etc.).
Early this morning, we made a time-sensitive change to the Scripts table in the database to change the ID from a 32-bit integer to a 64-bit integer. We did this because we have a large volume of scripts being created and run, and the 32-bit integer would eventually (in 20 days per our calculation) run out of IDs. We came up with a solution that would have no downtime, and we tested it both in staging and on production database replicas, with no issues. However, due to a nuance related to Postgres table unions, when we shipped the change to production it resulted in database queries against the Scripts table taking much longer than expected. This led to a pileup of requests, and workers degraded - which is why users started to see downtime. Heroku has a limit on the number of connections to the database at any given time, and once we reached that limit we had to manually kill jobs in order to perform a rollback. Had we already been on AWS, we would’ve been able to roll back in seconds because AWS doesn’t have this same connection limit.
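For anyone curious how an estimate like "20 days" can be reached, here is a rough sketch of the arithmetic. The current ID and the daily creation rate below are made-up illustrative numbers, not Syncro's actual figures, and `days_until_exhaustion` is a hypothetical helper, not anything from the Syncro codebase.

```python
# Largest values for signed 32-bit and 64-bit integer ID columns.
INT32_MAX = 2**31 - 1  # 2,147,483,647
INT64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807


def days_until_exhaustion(current_max_id: int, ids_per_day: float,
                          limit: int = INT32_MAX) -> float:
    """Estimate the days left before an ID column hits its upper limit.

    Assumes IDs are handed out sequentially at a steady daily rate.
    """
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    remaining = max(limit - current_max_id, 0)
    return remaining / ids_per_day


# Hypothetical inputs, chosen only to illustrate the calculation.
current_max_id = 2_047_483_647
ids_per_day = 5_000_000

print(days_until_exhaustion(current_max_id, ids_per_day))             # 32-bit headroom
print(days_until_exhaustion(current_max_id, ids_per_day, INT64_MAX))  # 64-bit headroom
```

With the same assumed rate, the 64-bit limit leaves headroom measured in billions of years, which is why widening the column removes the deadline entirely.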
So to summarize, we tried to prevent a future time-sensitive issue from happening, tested the heck out of it, but didn’t account for a Postgres nuance, and weren’t able to roll back quickly because our database is still on Heroku. On a personal note, I’m so looking forward to being completely on AWS. Many of our recent platform issues would either not exist or would be much easier to resolve on AWS due to increased flexibility, system reliability, and control. I apologize for the downtime. I know this severely affects your ability to run a successful MSP business.
Let me know if you have any questions,
Great post-mortem, but I highly recommend real-time updates. It took 50 minutes for the status page to be updated. It took 4 days to update the status page on the WP issue after it was resolved. It seems like the “investigating” part of your process is actually the “identifying the problem” step, so the first update is coming in late. The “investigating” status should be posted immediately after a problem has been made known. Then it should move to identifying the problem. For the WP issue, there should have been a status update saying that the provider has fixed the issue and we will continue to monitor for any other issues; then the last status is the resolved status.
For clarity, this was the first status on this incident: “We are investigating reports of the site not loading. We were experiencing reports of intermittent errors from 6:15PST to 6:55PST and are currently monitoring a fix. We will update here as we get more information.” This is actually 2 separate steps rolled into one. “We are investigating reports of the site not loading.” should have been the first status and should have gone out immediately; otherwise, people are experiencing an issue while the status page says everything is fine.
What I want to know is why Syncro continually insists on making production changes without notification and outside scheduled maintenance windows. This happens too often. Please, please stop doing that.
Thank you @ian.alexander for the write-up.
It is refreshing to see such detail.
Is there some way you can please schedule changes to production servers for early Sunday mornings PST?
Might cost a bit more to pay your team to work those hours, but is likely the time in the week that is least disruptive to all regions.
Thank you for the great post-mortem. It is with a sense of deja vu that I write this post. Parts of Syncro breaking after a change has happened too frequently this year.
Even if moving to AWS allows you to roll back quicker, updates, maintenance, or anything else that could potentially cause downtime should be done on the weekend.
Your user base has pointed this out time and time again.
Here’s an example of the kind of flow that I have enjoyed from one of our vendors. We get notified as soon as there is an issue and get notified of updates along the way. GoToConnect Status - GoToConnect - Intermittent issue making outbound calls