Resolved issue causing RMM Agent Presence servers to restart

Hi All - Our team identified an issue that caused our RMM Agent Presence servers to restart today, May 26 at 1:12pm PST. This restart may have caused a wave of false offline alerts for certain agents. The issue self-resolved quickly and the Presence-related servers are up and running normally.


Not resolved in my instance. All my servers show offline. The workstations appear to be fine.

Please reach out to support (if you haven’t already) if you are still experiencing any issues.

I noticed a handful of servers where the agent service was stopped. But only 7 out of 1600 alerted as offline. It would be nice if we could get a text alert when the offline service goes haywire. A few times I got up at 3am because I thought my cloud was going down and got in the car to drive to my datacenter, only to find that it was just Syncro… I’m getting too old for that and would like that to not happen again.

You can subscribe to email alerts. https://www.syncrostatus.com/

You should have other remote ways beyond Syncro to verify the status of something. Our datacenter will go throw a KVM on the rack.

They are all VMs; a KVM will do nothing. Syncrostatus.com has never shown an outage when random machines just decide to show as offline. That, and it doesn’t text. That, and those Raritan iKVMs are expensive.

We have other remote ways, it’s called ConnectWise Control, but when 1,000 virtual machines are alerting as offline, that’s not going to do you much good.

Maybe you have a special setup, but we have physical hosts that they run on and that we can access, as well as other means such as IPMI.

It does when there’s a bigger scope, and this was on there. I don’t know if you’re in the States or not, but if you are, you can use your cell phone’s email-to-text address to get text messages. If not, there are other services out there that can ingest and send various notification types.

You can create groups and filter in your SC portal. We used them all the time when we had LT and SC and LT would show agents offline. You said on another post that you use it to verify, so I’m assuming you have done something like this. Beats driving somewhere when you don’t have to. You can easily restart the Syncro services in bulk in SC too.

We have IPMI, and we should be getting a Raritan iKVM. I’m actually moving one of my datacenters because it is in Philadelphia, and that requires driving on 76E, which can turn a 40-minute ride into 2+ hours. I found a carrier-neutral datacenter in the rural suburbs that is only 30 minutes from me, and they have plenty of back roads. So when we do move, I should have a much better remote setup with backup internet. I have 10Gig fiber at the DC, but despite having 99.9% uptime, it still goes down from time to time. It went down two days ago; they claimed the packet storm monitor disabled the port for using 5,000 pps. Which is stupid, I should be capable of 14,000,000 pps or at least somewhere near that. The tech also said their router’s CPU couldn’t handle that. I called my salesperson and she is making them upgrade the equipment at that DC, and they removed the limit for the moment. When I go to the new DC, which I should sign next week after our due diligence and walkthrough, my ISP already said they would have brand new equipment for me. Then I am also getting backup internet, which is only 100 Mbps but is blended through three different providers. The other two locations are run by a 3rd party that we buy from, so I can just call them and they fix it. Can’t realistically drive to Utah or Tampa to fix a server.
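For anyone checking the packet-rate figure above, here is a rough sketch of where a number near 14,000,000 pps comes from. It assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap); the function name and defaults are illustrative only, and real traffic with larger frames will top out at a much lower packet rate.

```python
# Theoretical Ethernet line rate in packets per second.
# Assumption: every frame is the same size and the link is fully saturated.

WIRE_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap


def max_packets_per_second(link_bps: int, frame_bytes: int = 64) -> int:
    """Return the maximum whole frames per second a link can carry."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps // bits_per_frame


ten_gig = max_packets_per_second(10_000_000_000)
print(f"10 Gbps, 64-byte frames: {ten_gig:,} pps")
print(f"5,000 pps is {5000 / ten_gig:.4%} of that")
```

On those assumptions a 10 Gbps port comes out a little under 15 million pps, so a 5,000 pps threshold is a tiny fraction of what the link itself can carry.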

For my work phone I have Comcast’s Xfinity Mobile because we had enough of AT&T’s just complete incompetence. They use Verizon, but email-to-text has been spotty at best. We have it set up for UptimeRobot and our SOC’s critical alerts as well as our building’s generators. Sometimes you get it, sometimes you don’t.

We do actually double-check, or I do, but at 3 am you may not be 100% awake and still get ready to roll. It takes time to boot up my laptop and log into CWC. But once or twice I didn’t have the ability to do so, and my business partner was unable to get on vDirector. So I did actually end up in the car before one of my techs was able to confirm it was just Syncro.

We have groups in CWC. They are a complete mess, but we have groups. And you can search, so I can type in “10.70.”, “10.80.” or “10.90.” and get my cloud servers. I match VLAN to IP schema and such. So VLAN 728 has a client’s cloud IP range as 10.70.28.0/24, and the external IP for their firewall is .28 (I own three entire /24 subnets). VLAN 819 is 10.80.19.x, 981 is 10.90.81.x, and so on and so forth. So I have a quick way to check a specific DC. I also use the CWC toolbox a ton. It is very useful, and I have a restart Syncro script.
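The VLAN-to-subnet scheme described above can be written down as a small helper. This is only a sketch inferred from the three examples given (728, 819, 981): it assumes three-digit VLAN IDs where the first digit picks the datacenter octet (7, 8, 9 map to 70, 80, 90) and the last two digits become the third octet. The function names are made up for illustration.

```python
import ipaddress

# Assumption: only the 10.70.x, 10.80.x and 10.90.x ranges are in use.
DATACENTER_DIGITS = {"7": 70, "8": 80, "9": 90}


def vlan_to_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a three-digit VLAN ID to its /24, e.g. 728 -> 10.70.28.0/24."""
    digits = str(vlan_id)
    if len(digits) != 3 or digits[0] not in DATACENTER_DIGITS:
        raise ValueError(f"VLAN {vlan_id} does not fit the assumed scheme")
    second_octet = DATACENTER_DIGITS[digits[0]]
    third_octet = int(digits[1:])
    return ipaddress.ip_network(f"10.{second_octet}.{third_octet}.0/24")


def search_prefix(vlan_id: int) -> str:
    """Return the per-datacenter search string, e.g. '10.70.' for VLAN 728."""
    first, second = str(vlan_to_subnet(vlan_id).network_address).split(".")[:2]
    return f"{first}.{second}."


for vlan in (728, 819, 981):
    print(vlan, vlan_to_subnet(vlan), search_prefix(vlan))
```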

I also added in UptimeRobot and EMCO Ping Monitor from multiple locations. Realistically that should cull it. We are moving to Halo anyway and we may get a new RMM as well. However, Halo may take 5-6 weeks to go live (professional onboarding with automation and better workflows takes time). RMM would take another month at least. Until then, on cloud servers I may add another uptime monitor.
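The idea of cross-checking several independent monitors before getting in the car could be sketched roughly like this. It is not tied to any real UptimeRobot, EMCO, or Syncro API; the monitor names, the `should_page` function, and the two-monitor threshold are all assumptions for illustration.

```python
# Only escalate when enough independent monitors agree a host is down,
# so one tool's false "offline" alert does not trigger a 3 am drive.

def should_page(results: dict[str, bool], min_down: int = 2) -> bool:
    """results maps monitor name -> True if that monitor sees the host as up."""
    down_count = sum(1 for is_up in results.values() if not is_up)
    return down_count >= min_down


# Hypothetical readings: the RMM says offline, the outside pings still succeed.
readings = {"rmm_agent": False, "uptime_check": True, "ping_monitor": True}
if should_page(readings):
    print("Multiple monitors agree: treat as a real outage")
else:
    print("Only one source reports down: likely a monitoring glitch")
```

The trade-off is that a higher threshold cuts false alarms but can delay a page when only one monitor can reach the affected network.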