False Online/Offline status for assets

Hi All,

On Sunday, January 21 at 8:21 AM PT, a standard certificate renewal triggered a restart of our service that controls the Online/Offline status for our assets. While we have mechanisms in place to ensure purposeful restarts of the service are graceful, redistributing agent connections across our servers takes several days to complete.

By Monday, January 22 at 7:50 AM PT, as our traffic increased with the majority of our agent fleet coming back online after the weekend, our Online/Offline service began dropping a large number of connections, causing agents to be falsely reported as offline. The problem occurred not only because the connections were still redistributing across the servers, but also because the proxy server settings were not tuned to handle the influx of traffic.

In response to the dropped connections, we temporarily disabled offline alerting in the app, adjusted our proxy server settings, and manually restarted each of our servers to force the connections to redistribute. By 11:21 AM PT on January 22, these updates appeared to have resolved the issue, as our agent fleet held presence connections as expected, so we called the all clear on the incident and re-enabled agent offline notifications.

On Tuesday, January 23 at 8:20 AM PT, as our morning traffic increased once again, our Online/Offline service was still unable to support the influx of traffic despite our changes the day prior, and agent connections began to drop. Agent offline reporting was again disabled at 8:38 AM PT, and the team continued work to optimize our server load configurations. After implementing these optimizations, we delayed reactivation of agent offline notifications until 9:00 AM PT on Wednesday, January 24, to confirm our system could reliably manage the morning traffic.

As an immediate response to this incident, and to ensure a reliable presence system for our partners as they grow their businesses in the years to come, our team is prioritizing additional optimizations of our Online/Offline system. We are testing these in an environment that simulates our current and projected traffic, so that we can gracefully handle planned or unplanned restarts of the service in both the short and long term as our agent fleet continues to grow. This work is currently in progress, and we are committed to investing whatever resources are necessary to reduce the likelihood of future incidents.

3 weeks and we’re still seeing this problem. :frowning: