View original post
Daniel Kador Interview on ScaleYourCode.com
5-minute summary
Feel free to email, tweet, blog, and pass this around the web … but
please don’t alter any of its contents when you do. Thanks, and enjoy!
ScaleYourCode.com
Who is Dan Kador?
Daniel, or Dan, has been working on
Keen for about 4 years now. Before
that, he spent 5 years at Salesforce
as an engineer on their developer
platform team.
What is Keen.io?
Keen makes it easy for any software developer to collect data from anything
connected to the Internet. It could be a browser, mobile app, backend server, or a
wearable device.
Once data is collected, it lives on their servers where they handle the hard
problems of storing, processing, and securing data. Then, they make it really easy
for the owners of that data to ask questions using Keen's API.
For a basic example, you can send data on product purchases and query that data
to figure out how many sales of product X happened in the last day.
Once you have those results, you can easily share them through charts,
dashboards, or other API integrations.
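To make the product-purchase example concrete, here is a minimal sketch of that kind of question asked over a handful of in-memory events. It does not use Keen's API; the function name `count_recent_sales`, the event fields, and the sample data are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def count_recent_sales(events, product, now, window=timedelta(days=1)):
    """Count purchase events for `product` with a timestamp in (now - window, now]."""
    cutoff = now - window
    return sum(
        1
        for event in events
        if event["product"] == product and cutoff < event["timestamp"] <= now
    )


# Hypothetical sample data: four purchase events with made-up fields.
now = datetime(2015, 6, 2, 12, 0, tzinfo=timezone.utc)
purchases = [
    {"product": "X", "price": 20, "timestamp": now - timedelta(hours=3)},
    {"product": "X", "price": 20, "timestamp": now - timedelta(days=2)},
    {"product": "Y", "price": 35, "timestamp": now - timedelta(hours=1)},
    {"product": "X", "price": 20, "timestamp": now - timedelta(minutes=5)},
]

print(count_recent_sales(purchases, "X", now))
```

Two of the three "X" purchases fall inside the last day, so the count is 2. A hosted service answers the same kind of question over far more data, which is where the storage and processing tiers described later come in.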
Keen has lots of different use cases but one of the main ones is for people who
want to deliver analytics to their customers. Many of their customers rely on
Keen's service for their business to work.
What does Keen's stack look like?
Running on SoftLayer
Nginx (Load balancer)
Flask (Python, apps)
Tornado (Python, for API)
Kafka (Event queuing)
Storm (Distributed computing)
Cassandra (Data Store)
They are hosted entirely in SoftLayer today. That decision was made back in 2012
because they needed more IOPS than AWS could give them at the time.
They use Nginx as a load balancer for both the API and their website.
Webservers are Python built on top of Flask. The API is Python built on top of
Tornado.
There are a couple of other services hiding underneath:
There is a data processing tier which is where most of the action is. Kafka is what
handles their event queuing.
Then they have Storm, which handles the distributed computing for both writing
data (persisting it to disk) and querying data.
Finally, they have Cassandra which is their persistence layer.
How many requests per second? How much data are they handling?
Some of their stats aren't public, but Dan says they get thousands of requests a
second. He says that the more important stat is how much data they process—
which is trillions of events a day.
Events are rich JSON documents that potentially include hundreds of properties.
Why Cassandra?
Cassandra is distributed at its core: it was built to be a distributed
system. It has a columnar structure, which maps well to analyzing specific
properties within a JSON document. It also gives them the ability to scale by
adding more nodes.
Most customers they talk to have something like MySQL. They've tried creating
an events table, and if they achieve any reasonable amount of success, that
events table breaks their database. It's too much data for a single SQL node.
How did they tweak Cassandra to fit their needs?
They've had to do a lot of work to get it working just right for them. There are
certain JVM tweaks you can make, like heap size tuning and GC tuning.
There aren't any magic numbers here; it depends on your workload.
Daniel says that the most interesting thing they did with Cassandra was to figure
out how to store JSON documents in a columnar way. They've had to tweak things
at the data model layer like transforming groups of events into columns and then
storing those columns in a more compressed format. This makes it easier for
them to do bulk aggregations. The most important optimization has been the data
model.
If you're interested in learning how they figured out the JSON challenge, check
out slides by Josh Dzielak.
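As a rough illustration of the columnar idea described above (not Keen's actual data model, which the summary does not spell out), the sketch below flattens a batch of JSON events into one list of values per property and compresses each list separately. The helper names, the dotted-path naming for nested properties, and the choice of zlib are all assumptions.

```python
import json
import zlib


def flatten(doc, prefix=""):
    """Flatten nested dicts into {"a.b": value} pairs."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def to_columns(events):
    """Turn a batch of events into one value list per property (None where absent)."""
    rows = [flatten(event) for event in events]
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


def compress_column(values):
    """Serialize one column as JSON and compress it."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(blob):
    """Reverse compress_column."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical events with made-up properties.
events = [
    {"product": "X", "price": 20, "user": {"country": "US"}},
    {"product": "Y", "price": 35, "user": {"country": "FR"}},
    {"product": "X", "price": 20, "user": {"country": "US", "plan": "pro"}},
]
columns = to_columns(events)
stored = {name: compress_column(values) for name, values in columns.items()}

# A bulk aggregation only has to read the one column it needs.
total_revenue = sum(decompress_column(stored["price"]))
print(total_revenue)
```

The point of the layout is visible in the last step: summing `price` touches a single compressed column rather than every full document.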
How do they use Kafka?
Kafka is a core piece of their stack because it is the layer that all incoming data is
persisted to. When an event is sent to Keen, they don't consider it safe until it is
persisted to Kafka. Storm then listens to the Kafka queues and, in turn,
persists the data to Cassandra.
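The "not safe until it is persisted" rule can be sketched with a toy append-only log. Nothing here talks to a real Kafka cluster; `EventLog`, `accept_event`, and `StoreWriter` are hypothetical stand-ins for the queue, the API's write path, and the consumer that writes to the data store.

```python
class EventLog:
    """Append-only log standing in for a Kafka topic."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        """Store the event and return its offset in the log."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset):
        """Return every entry at or after `offset`."""
        return self._entries[offset:]


def accept_event(log, event):
    """Acknowledge an event only after the append has succeeded."""
    offset = log.append(event)
    return {"accepted": True, "offset": offset}


class StoreWriter:
    """Consumer standing in for the step that persists events to the data store."""

    def __init__(self, log):
        self.log = log
        self.next_offset = 0
        self.store = []

    def drain(self):
        """Copy any unread log entries into the store and advance the offset."""
        batch = self.log.read_from(self.next_offset)
        self.store.extend(batch)
        self.next_offset += len(batch)
        return len(batch)


log = EventLog()
writer = StoreWriter(log)
ack = accept_event(log, {"collection": "purchases", "product": "X"})
accept_event(log, {"collection": "purchases", "product": "Y"})
written = writer.drain()
print(ack, written, len(writer.store))
```

Because the consumer tracks its own offset, the write path and the persistence step are decoupled: events can be acknowledged even if the consumer is briefly behind, and a second `drain()` picks up only what arrived since the last one.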
How do they use Storm?
Storm is a stream processing service.
All events from all customers go to the same queue. Storm is responsible for
collating all these events and collecting them into the appropriate collections (the
equivalent to tables in MySQL).
In addition, they make sure that they are keeping data secure and not giving
access to anyone who shouldn't have access.
They also do a bunch of data processing related to what they call add-ons. These
add-ons can include IP addresses, geo information, parsing user agent strings, and
a few other options.
After that, they do storage optimizations like compression and serialization.
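The stages above can be pictured as a small pipeline: enrich each event, sort it into its collection, then serialize and compress each group. This is only a sketch in plain Python, not Storm code. The message shape, the `user_agent_addon` helper, keying groups by project and collection, and the use of JSON plus zlib are all assumptions.

```python
import json
import zlib
from collections import defaultdict


def user_agent_addon(event):
    """Toy add-on: derive a coarse browser name from a user agent string."""
    user_agent = event.get("user_agent", "")
    if "Firefox" in user_agent:
        browser = "Firefox"
    elif "Chrome" in user_agent:
        browser = "Chrome"
    else:
        browser = "unknown"
    return {**event, "browser": browser}


def process(queue, addons):
    """Enrich events, group them by (project, collection), then pack each group."""
    groups = defaultdict(list)
    for message in queue:
        event = message["event"]
        for addon in addons:
            event = addon(event)
        # Keying by project keeps one customer's events out of another's collection.
        groups[(message["project"], message["collection"])].append(event)
    return {
        key: zlib.compress(json.dumps(events).encode("utf-8"))
        for key, events in groups.items()
    }


# Hypothetical shared queue holding events from two customers.
queue = [
    {"project": "acme", "collection": "pageviews", "event": {"user_agent": "Mozilla/5.0 Firefox/38.0"}},
    {"project": "globex", "collection": "pageviews", "event": {"user_agent": "Mozilla/5.0 Chrome/43.0"}},
    {"project": "acme", "collection": "purchases", "event": {"product": "X"}},
    {"project": "acme", "collection": "pageviews", "event": {"user_agent": "curl/7.40"}},
]
packed = process(queue, [user_agent_addon])
acme_pageviews = json.loads(zlib.decompress(packed[("acme", "pageviews")]))
print([event["browser"] for event in acme_pageviews])
```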
What do they use for data visualization?
http://blog.cartodb.com/visualizing-endangered-species-trades-at-ecohacknyc/
I told Dan I've used D3.js to visualize data, and asked him what they use to create
dashboards and charts.
Keen's JavaScript SDK is built on a couple of things. When they started, they used
a wrapper around Google Charts. These days, they also have wrappers for D3.
D3 is a very powerful tool that has been used to create amazing documents.
There's an example from the New York Times that maps the decline
of "Stop-and-Frisk". It draws on thousands, if not millions, of data points, yet you can
clearly see the difference within seconds of looking at the map. Could you imagine
sifting through that data on paper instead?
What do they use to monitor?
This is incredibly important. Dan said that, in some ways, they have probably
spent more time building monitoring tools than building the core product.
They actually use Keen to monitor Keen, at least in part. Of course, that doesn't
help much if the system is down, so they also use other systems. The biggest one
they rely on is DataDog. They use it as a central aggregation point for metrics,
covering everything from frontend load balancer stats to Cassandra and the JVM.
Alerting happens through PagerDuty.
They had their first failure in 2013. What happened, and how do they handle failures since their service is critical to businesses?
In 2013, when they were still using MongoDB, they ran into an issue where a
customer sent a query that should have been killed by their rate limiting, but it
wasn't. The result?
Frozen app servers.
Unfortunately, their load balancer was configured to send a request off to
another app server if it didn't receive a response within a certain amount of time.
Nginx ended up sending the bad query to every single one of their app servers,
grinding the entire service to a halt. Ouch.
How did they fix it? They ended up doing a few things, but one of the first
changes was to configure Nginx to pass requests to another server only on error,
not on timeout. If you want more info, read the report here.
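To see why the retry policy mattered, here is a small simulation of the failure mode. It is not Nginx configuration and does not model Keen's real setup; `AppServer`, `route`, and the "poison" request are hypothetical. In Nginx itself this behaviour is governed by the `proxy_next_upstream` directive, though the summary does not give the exact settings they used.

```python
class AppServer:
    """App server that freezes for good once it receives the bad query."""

    def __init__(self, name):
        self.name = name
        self.frozen = False

    def handle(self, request):
        if self.frozen:
            return "timeout"
        if request == "poison":
            self.frozen = True
            return "timeout"
        return "ok"


def route(servers, request, retry_on):
    """Try servers in order, moving on only for outcomes listed in retry_on."""
    outcome = "error"
    for server in servers:
        outcome = server.handle(request)
        if outcome not in retry_on:
            break
    return outcome


def frozen_count(retry_on):
    """Send one bad query through a fresh pool and count the frozen servers."""
    servers = [AppServer(f"app{i}") for i in range(4)]
    route(servers, "poison", retry_on)
    return sum(server.frozen for server in servers)


print(frozen_count({"error", "timeout"}))  # retry on error or timeout
print(frozen_count({"error"}))             # retry on error only
```

With retries on timeout, the single bad query is handed to each server in turn and all four freeze. With retries on error only, the load balancer gives up after the first timeout and only one server is lost.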
They have to monitor 24/7 to be able to detect these failures and prevent them if
possible, but sometimes things just go wrong. Systems fail in really unexpected
ways, and it can be hard to know how to detect these failures. The way they
handle these situations is to get the entire engineering team on board, figure out
a protocol, and just get to work.
There was another failure early this year. What happened?
This was a failure in their Kafka tier.
This time, not all services were down. Queries were still working, but since Kafka
was down, they were rejecting new events.
The outage was caused by a maintenance event that should have been routine.
Kafka started behaving in nasty ways. To fix it, they had to shut the system down,
reconfigure it in some critical ways, and bring it back up.
How do you prevent these things from happening again?
A lot of it has to do with monitoring more things and getting more visibility into
your systems. You want to detect failures as soon as you can, or even before they
happen if possible.
Once these failures happen, take a good look at what caused them and
implement new protocols and procedures to make sure they never happen again.
Conclusion
I have to try and fit as much information in this summary as possible without
making it too long. This means I cut out some information that could really help
you out one day. I highly recommend you view the full interview if you found this
interesting, as there are more hidden gems.
You can watch it, read it, or even listen to it. There's also an option to download
the MP3 so you can listen on your way to work. Of course, the show is also on
iTunes (https://itunes.apple.com/tt/podcast/scale-your-code-podcast/id987253051?mt=2)
and Stitcher (http://www.stitcher.com/podcast/scaleyourcode?refid=stpr).
Thanks for reading, and please let me know if you have any feedback!
- Christophe Limpalair (@ScaleYourCode)