We recently made the jump at OpenStudy from a single monolithic process to a multi-process architecture for the main site. When engineering this, Antonio essentially split our Actors (which manage much of our business logic) and our web frontend components (Comets for realtime push, session information, etc.) into two separate entities designed to run with RabbitMQ in the middle. The good part is that we can now scale out our web load horizontally. We’ve been serving the site from two web processes since we made this change, which has been great.
Naturally, any good change introduces some bugs. We have had quite a few troublesome ones, made worse by the fact that we had no good way to monitor individual web processes from CloudKick. There didn’t seem to be a published solution for monitoring individual nodes in a load-balanced array short of pulling down CloudKick’s IP addresses and allowing them through the firewall, which we hated because our firewall rules would have to change regularly and automatically. So we were stuck with an external monitor on the public URL. When one process was failing while the others were fine, HAProxy’s round-robin balancing meant that each monitor request might land on a healthy process or on the failing one, so we got a stream of alternating failure and recovery emails. Not ideal.
So, about midnight on Tuesday I had a stroke of genius. CloudKick supports plugins. Maybe I could look into those and see how hard they are to build, right? Turns out, they’re really easy and with some bash scripting magic, I was able to come up with a CloudKick plugin that will monitor a local service on whatever machine it is running on using curl. This is the latest version of what resulted from that thought:
Essentially, this script checks a URL that you specify in the plugin’s arguments on your CloudKick account. It allows 15 seconds before deciding that the service is responding too slowly. Note that CloudKick plugins are only allowed 20 seconds to execute. The number 15 is magical for us because it leaves room for a long Full GC on the process without triggering a warning in CloudKick.
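For readers who want to see the shape of such a check, here is a minimal sketch in bash. It is not the original plugin: the function names `classify` and `check_url` are made up for illustration, and the `status ok|err <message>` output lines are my assumption about the plugin output format, so check them against CloudKick's plugin documentation before relying on them. The 15-second limit comes from the description above; treating any 2xx or 3xx response as healthy is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a curl-based local health check (not the original plugin).

TIMEOUT=15  # seconds; stays under the 20-second plugin execution limit

# classify HTTP_CODE CURL_EXIT
# Prints one status line in the assumed "status <ok|err> <message>" format.
classify() {
  local http_code="$1" curl_exit="$2"
  if [ "$curl_exit" -eq 28 ]; then
    # curl exit code 28 means the operation timed out
    echo "status err timed out after ${TIMEOUT}s"
  elif [ "$curl_exit" -ne 0 ]; then
    echo "status err curl failed with exit code ${curl_exit}"
  elif [ "$http_code" -ge 200 ] && [ "$http_code" -lt 400 ]; then
    echo "status ok HTTP ${http_code}"
  else
    echo "status err HTTP ${http_code}"
  fi
}

# check_url URL
# Fetches the URL, discards the body, and classifies the result.
check_url() {
  local url="$1" http_code curl_exit
  http_code=$(curl --silent --output /dev/null --max-time "$TIMEOUT" \
    --write-out '%{http_code}' "$url")
  curl_exit=$?
  classify "$http_code" "$curl_exit"
}

# Run the check only when a URL argument is supplied.
if [ "$#" -gt 0 ]; then
  check_url "$1"
fi
```

Saved under a hypothetical name such as `local_service_check.sh`, it would be invoked with the URL as its only argument, for example `./local_service_check.sh http://localhost:8080/`. Keeping the classification separate from the curl call makes the decision logic easy to exercise without a running service.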
There are probably still some improvements to be made. I’d like to make it patient enough to wait for two or more consecutive failures before reporting an error back to CloudKick (a suggestion from the guys at the office after we’d been using it in production for a few days), but other than that I’m pretty happy with how it has worked for us.
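One way the "wait for two or more failures" idea could work is to keep a small counter in a file between plugin runs. The sketch below is only an illustration of that approach, not something we run: `record_result`, `THRESHOLD`, and the state file path are all hypothetical, and the `status warn` line assumes CloudKick accepts a warning level in plugin output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report an error only after THRESHOLD consecutive failures.

THRESHOLD=2
# Assumed location for the counter; any path writable by the agent would do.
STATE_FILE="${STATE_FILE:-/tmp/local_service_check.failures}"

# record_result RESULT   (RESULT is "ok" or "fail")
# Updates the consecutive-failure counter and prints one status line.
record_result() {
  local result="$1" failures
  failures=$(cat "$STATE_FILE" 2>/dev/null)
  failures=${failures:-0}   # missing or empty file counts as zero failures

  if [ "$result" = "ok" ]; then
    echo 0 > "$STATE_FILE"  # any success resets the streak
    echo "status ok service responding"
    return
  fi

  failures=$((failures + 1))
  echo "$failures" > "$STATE_FILE"
  if [ "$failures" -ge "$THRESHOLD" ]; then
    echo "status err ${failures} consecutive failures"
  else
    echo "status warn failure ${failures} of ${THRESHOLD}, waiting before alerting"
  fi
}
```

The trade-off is slower detection: with a threshold of two, a real outage is reported one check interval later than it would be otherwise.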
If you use CloudKick, feel free to use this. I’m releasing it under the Apache license. If you have any feedback or suggested changes, let me know.