Ask an engineer to “monitor” something and chances are you’ll get a response that resembles the stages of love. It starts with overwhelming excitement and a sense of ease: you’ll hear something like “Oh, sure, that’s easy!” Soon after come countless hours of work and an attachment to the project that defies logic, as they run through iteration after iteration of things working, not working, and being completely thrown out and rewritten, only to start the cycle again. At the end, whatever started this whole thing will be fully monitored or partially monitored, or you’ll have a heartbroken engineer who has just given up. At Daxko we’ve been through this cycle, well, let’s just say a lot.
For those unfamiliar with the Daxko TechOps team, let me provide a quick synopsis. TechOps stands for Technical Operations, and at Daxko, it’s the team responsible for the health (performance, uptime), compliance, escalated support, and operation of our production systems. Stop and think about that for a second. TechOps is responsible for ensuring that the Daxko products are 1) available, 2) not slow, 3) secure, and 4) that the data in them is consistent! That’s quite a task, and our team is able to excel at it for two main reasons. The first is Daxko’s version of DevOps: the TechOps Team and the Development Teams are closely integrated, and we work with the Development Teams to ensure they can use the tools that we adopt or build. The second is that we’ve built, and continue to iterate on, a monitoring stack that allows us to keep up with what’s going on and troubleshoot complex issues. That stack is what I’m going to delve into in this post.
Monitoring means lots of monitors.
When designing and building any type of solution, you have to stop and consider the purpose of what you’re building and any constraints that may be in place. At Daxko, we have products that run in multiple environments (both hosted and cloud-based), on multiple database engines, and on multiple technology stacks. Therefore, our core solution needs to be OS agnostic, scale easily, and be very malleable. On top of that, we have product-specific items that must be monitored and developers who want to know when things break. Great problems to have, but also lots of added complexity!
To say our stack is complete is like saying that Microsoft is done adding “features” to Windows. Our stack is in a constant state of flux, and being okay with that is part of what makes things work. Needs change all the time, and both the system and the team should be able to cope with that.
So, what is our current stack composed of?
#1: Chef: Wait! Chef? What? Yep, Chef is a critical part of our stack. Without it, we don’t have consistency in our configurations and dependencies. Without that, everything else falls apart.
#2: Sensu: The beauty of Sensu is that it works on a complete pub/sub model where clients are configured to subscribe to certain channels and the checks for the channels are published by the Sensu servers. This means if we need another server of type X then we just spin it up and its check data automagically starts flowing in. Simple, efficient, and extremely powerful (also why #1 is so critical!).
#3: Graphite + Carbonator + Grafana: I admit it. I’m a HUGE Graphite fan. I grew up in the days of MRTG and RRD and used Cacti exclusively for many years. Graphite makes capturing, storing, and analyzing LOTS of data points extremely (well, mostly) easy. It is an extremely scalable system and allows for the creation of some very complicated graphs that just seem to appear instantly. Couple that with Carbonator for capturing deep metrics on Windows systems and Grafana for building dashboards and you have a hands-down, rock-solid solution.
#4: ELK Stack: We generate billions of log entries on a monthly basis, and we rely on the ELK stack heavily for ingesting, tokenizing, searching, and alerting on those logs. Our Kibana interface is utilized by everyone from Customer Success to TechOps to the Developers themselves!
#5: Pingdom: No system is faultless, and having an outside perspective is always a good thing. Pingdom provides us with that third-party view, and we consider its data to be the gold standard for our uptime statistics. After all, if your servers are up but nobody can use them, are you really “up”?
#6: New Relic: Daxko’s application stacks range from .NET to Java to PHP. One of our goals is to have a single interface for viewing, trending, and comparing those products. New Relic allows us to do this and makes it easy to collaborate between teams.
#7: PagerDuty: Do I really need to explain this one?
#8: Daxko Monitoring Service: You won’t find this one available for download anytime soon as it’s an internally developed tool, but it’s also one of our most powerful because it fills in the gaps left by the others. The Daxko Monitoring Service started life as a simple tool for identifying and alerting on invalid data states that would crop up in our databases due to bugs in the software. Over the years, and multiple iterations, it has grown into a much more capable tool that still harks back to its origins. In its latest iteration, it is also being expanded to allow our developers (on any team) to build test cases and execute them on timed intervals. This may seem like a trivial task. However, some products have multiple databases (some have hundreds), and you need to scale these tests across all of them quickly and effortlessly while also taking some business logic into account. For that, a custom tool is the only way to go.
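To make the pub/sub idea from #2 a bit more concrete, here is a minimal Python sketch of the pattern. It is not Sensu's actual API or configuration format; the CheckBroker class, the client names, and the check name are all made up for illustration. The point is only that a check is published to a subscription rather than to a named server, so a new server starts receiving checks as soon as it subscribes.

```python
class CheckBroker:
    """Toy stand-in for the publishing side of a pub/sub monitoring setup.

    Hypothetical illustration only; this is not how Sensu is implemented.
    """

    def __init__(self):
        # subscription name -> set of client names listening on it
        self._subscribers = {}

    def register(self, client, subscriptions):
        """A client announces which subscriptions it listens on."""
        for subscription in subscriptions:
            self._subscribers.setdefault(subscription, set()).add(client)

    def publish(self, check_name, subscriptions):
        """Return (client, check_name) pairs for every client that should run the check."""
        targets = set()
        for subscription in subscriptions:
            targets |= self._subscribers.get(subscription, set())
        return [(client, check_name) for client in sorted(targets)]


broker = CheckBroker()
broker.register("web-01", ["webserver", "windows"])
broker.register("db-01", ["database"])

# The check is defined against a subscription, never against a server name.
requests_before = broker.publish("check_http", ["webserver"])

# Spinning up another web server only requires that it subscribes.
broker.register("web-02", ["webserver"])
requests_after = broker.publish("check_http", ["webserver"])
```

Notice that the check definition never changes when web-02 appears. That is also why consistent configuration (#1) matters so much: the subscriptions a server declares decide which checks it gets.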
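The Daxko Monitoring Service itself is internal, so the following is only a rough Python sketch of the general idea behind #8: define a data check once, then run it against every database a product has. Everything here is an assumption made for illustration, including the DataCheck class, the memberships table, and the use of in-memory SQLite databases as stand-ins. A real tool would also need scheduling, alerting, and the business logic mentioned above.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCheck:
    """A named test case (hypothetical shape, not the internal tool's format)."""
    name: str
    # Query returning one number: how many rows are in an invalid state.
    invalid_rows_sql: str


def run_check(check, databases):
    """Run one check against every database.

    Returns {database_name: invalid_row_count} for the databases that failed.
    """
    failures = {}
    for db_name, conn in databases.items():
        (invalid_count,) = conn.execute(check.invalid_rows_sql).fetchone()
        if invalid_count:
            failures[db_name] = invalid_count
    return failures


def make_demo_db(rows):
    """Build an in-memory database with a made-up memberships table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memberships (id INTEGER, start_date TEXT, end_date TEXT)"
    )
    conn.executemany("INSERT INTO memberships VALUES (?, ?, ?)", rows)
    return conn


# Two stand-in customer databases; customer_b contains one bad row.
databases = {
    "customer_a": make_demo_db([(1, "2015-01-01", "2015-12-31")]),
    "customer_b": make_demo_db([
        (1, "2015-06-01", "2015-01-01"),  # ends before it starts
        (2, "2015-01-01", "2015-02-01"),
    ]),
}

check = DataCheck(
    name="membership_ends_before_it_starts",
    invalid_rows_sql=(
        "SELECT COUNT(*) FROM memberships WHERE end_date < start_date"
    ),
)

failures = run_check(check, databases)
```

A scheduler would call run_check on a timed interval and hand any failures to the alerting side. Because the check is just data, the same definition can be pointed at two databases or two hundred.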
A lot of tools means a lot of interfaces and honestly, that’s just a plain pain. Compound that with lots of people wanting more visibility into what’s going on, and it becomes very overwhelming. So how do you handle that? We searched for a tool to help us solve this problem and in the end settled on a plan to extend our Monitoring Service UI to incorporate data and alerts from all of our external systems into a simple, easy-to-use interface that will give our team members the ability to subscribe to events and alerts at a product-by-product or service-by-service level. Our hope is that this will increase collaboration between teams by making it easy for team members all over Daxko to get visibility into what is happening behind the scenes, leading to faster fixes and happier customers.
Ed M. is Daxko’s Director of TechOps and Lead Infrastructure Architect who has a passion for problem solving, system architecture, and embedded electronics.