This is the latest in our “Inside the Stack” series. This week, we hear from Alexis Lê-Quôc, the CTO & Co-Founder of Datadog, the popular monitoring and analytics platform. Alexis and his team have open positions on their engineering, sales, and operations teams.
Describe Datadog in 2-3 sentences.
Datadog is a monitoring service for cloud-based infrastructures. With Datadog, you get real-time visibility into the performance of your entire stack: from the lowest layers (the hypervisor and operating system) to the highest ones (your application code), and everything in between (Docker, databases, 3rd-party services). You also get a monitoring platform that works without your constant care and feeding whether you run 50 or 50,000 hosts.
What are your primary programming languages?
- We chose Python for its productivity and its unmatched support for fast numerical processing of time series. We also use Python for our open-source agent.
- We chose Go for all long-lived stream processors that make up a lot of what Datadog does in the background. Go’s efficient runtime and elegant distributed programming model won us over right away.
Because we want to support as many modern stacks as possible, our community at large has built Datadog modules in 10 different languages.
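The kind of time-series number crunching Python makes concise can be sketched with nothing but the standard library. This is an illustrative rollup (per-interval averaging) only; the function name, interval, and metric values are invented for the example, not Datadog code:

```python
from itertools import groupby

def rollup(points, interval=60):
    """Aggregate (timestamp, value) points into per-interval averages.

    `points` must be sorted by timestamp; `interval` is in seconds.
    """
    out = []
    for bucket, group in groupby(points, key=lambda p: p[0] // interval):
        values = [v for _, v in group]
        out.append((bucket * interval, sum(values) / len(values)))
    return out

# Three points in the first minute, one in the second.
series = [(0, 1.0), (10, 2.0), (50, 3.0), (65, 4.0)]
print(rollup(series))  # [(0, 2.0), (60, 4.0)]
```

In practice this sort of aggregation is usually vectorized with NumPy or pandas, which is a big part of Python's appeal for time-series work.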
What are your primary web frameworks?
Our web layer is fairly thin compared to the rest of the product. The most interesting things in the front-end happen in D3 and React.
D3 (and its predecessor Protovis) is one of the most elegant and concise frameworks for building data visualizations. It's a framework that "feels right" thanks to its solid theoretical foundations, the clarity of its documentation, and the attention to detail in the examples.
React made sense because it's simple and very performant thanks to its virtual DOM "diffing." We also like that it forces a clear data flow and is built around components with a clear separation between template and display logic.
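React's actual reconciliation algorithm is far more sophisticated, but the core idea of virtual DOM diffing, comparing two lightweight trees and emitting only the patches needed to make one match the other, can be sketched in a few lines. The node shape and patch operations below are invented for illustration:

```python
def diff(old, new, path=()):
    """Compare two 'virtual DOM' trees and return a list of patch ops.

    Nodes are dicts like {"tag": "div", "children": [...]}; text nodes
    are plain strings. Patch ops are (kind, path, payload) tuples.
    """
    if old == new:
        return []  # identical subtrees need no work
    if (not isinstance(old, dict) or not isinstance(new, dict)
            or old.get("tag") != new.get("tag")):
        return [("replace", path, new)]  # different node type: swap wholesale
    patches = []
    old_kids = old.get("children", [])
    new_kids = new.get("children", [])
    for i in range(max(len(old_kids), len(new_kids))):
        if i >= len(old_kids):
            patches.append(("insert", path + (i,), new_kids[i]))
        elif i >= len(new_kids):
            patches.append(("remove", path + (i,), None))
        else:
            patches.extend(diff(old_kids[i], new_kids[i], path + (i,)))
    return patches

a = {"tag": "div", "children": ["hello"]}
b = {"tag": "div", "children": ["world", "!"]}
print(diff(a, b))  # [('replace', (0,), 'world'), ('insert', (1,), '!')]
```

The payoff is that the expensive real DOM is only touched at the patched paths, which is why this approach performs well.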
What are your primary databases?
As of late 2014, we process upwards of 100 billion data points per day, so the choice of databases is (a) essential to proper real-time performance, and (b) a never-ending quest as we continue to grow as a business.
We have 4 main database use cases:
- Process incoming data points (aggregate, comb for anomalies, etc.), a.k.a. the firehose.
- Store and query time series.
- Store and query events that are not represented as a time series (e.g. alerts, deployments, etc.).
- Store and query metadata (e.g. tags, users, etc.).
In terms of actual databases, we use:
- Kafka as our firehose. It’s simple and reliable, and has great performance for stream processing.
- A mix of Redis, Cassandra and S3 to store and query time series. One of the nice things about time series is that they are very close to immutable.
- Elasticsearch for events that are queried using keywords.
- PostgreSQL for metadata.
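The near-immutability of time series is what makes a tiered layout like Redis/Cassandra/S3 attractive: once an interval has passed, its points never change and can be sealed into a read-only chunk. Here is a hedged, stdlib-only sketch of that idea; the class, chunking scheme, and metric names are invented stand-ins, not how Datadog's storage actually works:

```python
from collections import defaultdict

class SeriesStore:
    """Toy tiered store: a mutable 'hot' buffer for the current hour and
    sealed, immutable chunks for past hours (stand-ins for a hot/cold
    split across something like Redis and S3)."""

    HOUR = 3600

    def __init__(self):
        self.hot = defaultdict(list)   # (metric, hour) -> mutable list
        self.cold = {}                 # (metric, hour) -> immutable tuple

    def write(self, metric, ts, value):
        self.hot[(metric, ts // self.HOUR)].append((ts, value))

    def seal(self, metric, hour):
        """Freeze a past hour; after this it is never modified again."""
        self.cold[(metric, hour)] = tuple(self.hot.pop((metric, hour), ()))

    def read(self, metric, hour):
        key = (metric, hour)
        return self.cold.get(key) or tuple(self.hot.get(key, ()))

store = SeriesStore()
store.write("cpu.user", 10, 0.4)     # lands in hour 0
store.write("cpu.user", 3700, 0.5)   # lands in hour 1
store.seal("cpu.user", 0)            # hour 0 is now read-only
print(store.read("cpu.user", 0))     # ((10, 0.4),)
```

Because sealed chunks never change, they can be replicated, cached, and compressed aggressively without any cache-invalidation headaches.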
We have spoken on this topic on a few occasions.
Which DevOps tools do you use?
Chef, Capistrano, Jenkins, and Hubot, to name a few. Beyond what we use internally, we've made sure to provide solid integrations for the major DevOps tools like Chef, Puppet, Ansible, Logstash, Splunk, etc.
The choice of tools was, in this case, partly based on history. More importantly, we have consistent tooling around configuration management, continuous integration, and continuous delivery, all tied together with our chat rooms.
Which part of your stack are you most excited about?
Personally I’m excited by our work in 3 areas:
- Scaling the intake as we keep adding more customers and bring it into the trillions of points per day.
- Representing performance data for clusters of 1,000s of hosts in ways that fit on a single screen and are easy to grok.
- Extracting more signal from the data, as fast as we can, e.g. improving our anomaly detection.
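Datadog's production anomaly detection is its own work; purely as an illustration of the general problem, here is the textbook rolling mean/standard-deviation approach in Python. The window size and threshold are invented for the example:

```python
import math

def anomalies(values, window=5, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean = sum(recent) / window
        std = math.sqrt(sum((v - mean) ** 2 for v in recent) / window)
        if std > 0 and abs(values[i] - mean) > threshold * std:
            flagged.append(i)
    return flagged

series = [10, 11, 10, 12, 11, 10, 11, 50, 10, 11]
print(anomalies(series))  # [7] -- the spike to 50
```

A fixed z-score threshold like this breaks down on seasonal or trending metrics, which is exactly why extracting more signal from the data is an open area of work.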
Visit datadog.com to read more about the company and to see their technology in action.